From “cheetah-noids” to humanoids

In November 2018, MIT Professor Sangbae Kim brought his mini cheetah robot onto “The Tonight Show” for its Tonight Show-botics segment. Much to the delight of host Jimmy Fallon, the mini cheetah did some yoga, got back up after falling, and executed a perfect backflip. Behind the stage, Benjamin Katz ’16, SM ’18 was remotely controlling the cheetah’s nimble maneuvers.

For Katz, waiting in the wings as the robot performed in front of a national audience was the culmination of nearly five years of work.

As an undergraduate at MIT, Katz studied mechanical engineering, opting for the flexible Course 2A degree program with a concentration in controls, instrumentation, and robotics. Toward the end of his first year, he emailed Kim to see if there were any job opportunities in Kim’s Biomimetic Robotics Lab. He then spent the summer in Kim’s lab as part of the MIT Undergraduate Research Opportunities Program (UROP). For his UROP research and undergraduate thesis, he began to look at how to utilize pieces built for the electronics hobby market in robotics. “You can find really high-performance motors built for things like remote control airplanes and drones. I basically thought you could also use these parts for robots, which is something no one was doing,” recalls Katz.

Kim was immediately impressed by Katz’s abilities as an engineer and designer.

“Ben is an extremely versatile engineer who can cover structure and mechanism design, electric motor dynamics, power electronics, and classical control, a range of expertise usually requiring four-to-five engineers to cover,” says Kim.

After deciding to pursue a master’s degree in mechanical engineering at MIT, Katz continued working in Kim’s lab and developed solutions for actuators in robotics. While working on the third iteration of Kim’s robot, known as Cheetah 3, Katz and his labmates shifted their focus to developing a smaller version of the robot.

“There are a lot of nice things about having a smaller robot: If something breaks you can easily fix it, it’s cheaper, and it’s safe enough for one person to wrangle alone,” says Katz. “Even though a small robot may not always be the most practical for real-world applications, its controllers, software, and research can be trivially ported to a big robot that can carry larger payloads.”

Drawing upon his undergraduate research, Katz and the research team used 12 motors originally designed for drones to build actuators in each joint of the small quadruped robot that would be dubbed the “mini cheetah.”

Armed with this smaller robot, Katz set out to make the mini cheetah more agile and resilient. Alongside then-EECS student Jared Di Carlo ’19, MEng ’20, Katz focused on controls related to locomotion in the mini cheetah. In class 6.832 (Underactuated Robotics), taught by Professor Russ Tedrake, the pair worked on a project that would allow the mini cheetah to safely backflip from a crouched position.

“It was basically a giant offline optimization problem to get the mini cheetah to backflip,” says Katz.

Using offline nonlinear optimization to generate the backflip trajectory, he and Di Carlo were able to program the mini cheetah to crouch and rotate 360 degrees around an axis.

While working on the cheetah, Katz was constantly pursuing other engineering projects as a hobby, including a very different rotating robot. Alongside Di Carlo, Katz used the MIT community makerspace known as MITERS to develop a robot that could solve a Rubik’s Cube in a record-breaking 0.38 seconds.

“That project was purely for fun during MIT’s Independent Activities Period,” recalls Katz. “We used custom-built actuators on each of the Rubik’s Cube’s faces alongside webcams to identify the colors and move the blocks accordingly.”

He chronicled his other pet projects on his “build-its” blog, which developed a strong following. Projects included planar magnetic headphones, a desktop Furuta pendulum, and an electric travel ukulele.

“Ben was constantly building and analyzing something along with our lab and class projects during his entire time at MIT,” says Kim. “His incessant desire to learn, build, and analyze is quite remarkable.”

After graduating with his master’s degree in 2018, Katz worked as a technical associate in Kim’s lab before accepting a position at Boston Dynamics in 2019.

As a designer at Boston Dynamics, Katz has transitioned from cheetah robots to humanoid robots, working on ATLAS, a research platform billed as the “world’s most dynamic humanoid robot.” Much like the mini cheetah, ATLAS can execute incredibly dynamic maneuvers, including backflips and even parkour.

While the mini cheetah holding yoga poses and ATLAS doing parkour seem like entertainment befitting “The Tonight Show,” Katz is quick to remind others that these robots are fulfilling a real-world need. The robots could someday maneuver in areas that are too dangerous for humans — including buildings that are on fire and disaster areas. They could open new possibilities for lifesaving disaster relief and aid first responders in emergencies.

“What we did in Sangbae’s lab is going to help make these machines ubiquitous and actually useful in the real world as viable products,” adds Katz.

Machine learning speeds up vehicle routing

Waiting for a holiday package to be delivered? There’s a tricky math problem that needs to be solved before the delivery truck pulls up to your door, and MIT researchers have a strategy that could speed up the solution.

The approach applies to vehicle routing problems such as last-mile delivery, where the goal is to deliver goods from a central depot to multiple cities while keeping travel costs down. While there are algorithms designed to solve this problem for a few hundred cities, these solutions become too slow when applied to a larger set of cities.

To remedy this, Cathy Wu, the Gilbert W. Winslow Career Development Assistant Professor in Civil and Environmental Engineering and the Institute for Data, Systems, and Society, and her students have come up with a machine-learning strategy that accelerates some of the strongest algorithmic solvers by 10 to 100 times.

The solver algorithms work by breaking up the problem of delivery into smaller subproblems to solve — say, 200 subproblems for routing vehicles between 2,000 cities. Wu and her colleagues augment this process with a new machine-learning algorithm that identifies the most useful subproblems to solve, instead of solving all the subproblems, to increase the quality of the solution while using orders of magnitude less compute.

Their approach, which they call “learning-to-delegate,” can be used across a variety of solvers and a variety of similar problems, including scheduling and pathfinding for warehouse robots, the researchers say.
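A minimal sketch of the delegation idea is below; `score_subproblem` stands in for the learned model and `local_solver` for an off-the-shelf routing solver, and both names are illustrative placeholders rather than the authors’ code.

```python
# Sketch of "learning-to-delegate": instead of re-solving every subproblem,
# a learned model scores which subproblems are most likely to improve the
# overall solution, and only the top-k are handed to the expensive solver.
# `score_subproblem` and `local_solver` are illustrative placeholders.

def improve_routes(solution, subproblems, score_subproblem, local_solver, k=10):
    # Rank subproblems by the predicted improvement from solving them.
    ranked = sorted(subproblems, key=score_subproblem, reverse=True)

    # Delegate only the k most promising subproblems to the solver.
    for sub in ranked[:k]:
        improved = local_solver(solution, sub)   # re-optimize routes inside `sub`
        if improved.cost < solution.cost:        # keep the change only if it helps
            solution = improved
    return solution
```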

The work pushes the boundaries on rapidly solving large-scale vehicle routing problems, says Marc Kuo, founder and CEO of Routific, a smart logistics platform for optimizing delivery routes. Some of Routific’s recent algorithmic advances were inspired by Wu’s work, he notes.

“Most of the academic body of research tends to focus on specialized algorithms for small problems, trying to find better solutions at the cost of processing times. But in the real-world, businesses don’t care about finding better solutions, especially if they take too long for compute,” Kuo explains. “In the world of last-mile logistics, time is money, and you cannot have your entire warehouse operations wait for a slow algorithm to return the routes. An algorithm needs to be hyper-fast for it to be practical.”

Wu, social and engineering systems doctoral student Sirui Li, and electrical engineering and computer science doctoral student Zhongxia Yan presented their research this week at the 2021 NeurIPS conference.

Selecting good problems

Vehicle routing problems are a class of combinatorial optimization problems that are typically tackled with heuristic algorithms, which find “good-enough” solutions. It’s usually not possible to come up with the one “best” answer to these problems, because the number of possible solutions is far too large.

“The name of the game for these types of problems is to design efficient algorithms … that are optimal within some factor,” Wu explains. “But the goal is not to find optimal solutions. That’s too hard. Rather, we want to find as good of solutions as possible. Even a 0.5% improvement in solutions can translate to a huge revenue increase for a company.”

Over the past several decades, researchers have developed a variety of heuristics to yield quick solutions to combinatorial problems. They usually do this by starting with a poor but valid initial solution and then gradually improving the solution — by trying small tweaks to improve the routing between nearby cities, for example. For a large problem like a 2,000-plus city routing challenge, however, this approach just takes too much time.
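One classic example of such a small tweak is the 2-opt move, which reverses a segment of a route whenever doing so shortens it. The sketch below shows the general improvement loop on a single tour with a distance matrix; it illustrates the style of heuristic described here, not the specific solvers the MIT team accelerates.

```python
def tour_length(tour, dist):
    # Total length of a closed tour given a distance matrix.
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def two_opt(tour, dist):
    # Start from any valid tour and repeatedly apply a small tweak:
    # reverse a segment whenever that shortens the route.
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j + 1][::-1] + tour[j + 1:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour
```

For a 2,000-plus city problem, sweeps like this over the whole route are exactly what becomes too slow, which is why the solvers break the problem into subproblems in the first place.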

More recently, machine-learning methods have been developed to solve the problem, but while faster, they tend to be more inaccurate, even at the scale of a few dozen cities. Wu and her colleagues decided to see if there was a beneficial way to combine the two methods to find speedy but high-quality solutions.

“For us, this is where machine learning comes in,” Wu says. “Can we predict which of these subproblems, that if we were to solve them, would lead to more improvement in the solution, saving computing time and expense?”

Traditionally, a large-scale vehicle routing heuristic might choose which subproblems to solve, and in what order, either randomly or by applying yet another carefully devised heuristic. In this case, the MIT researchers ran sets of subproblems through a neural network they created to automatically find the subproblems that, when solved, would lead to the greatest gain in quality of the solutions. This sped up the subproblem selection process by 1.5 to 2 times, Wu and colleagues found.

“We don’t know why these subproblems are better than other subproblems,” Wu notes. “It’s actually an interesting line of future work. If we did have some insights here, these could lead to designing even better algorithms.”

Surprising speed-up

Wu and colleagues were surprised by how well the approach worked. In machine learning, the idea of garbage-in, garbage-out applies — that is, the quality of a machine-learning approach relies heavily on the quality of the data. A combinatorial problem is so difficult that even its subproblems can’t be optimally solved. A neural network trained on the “medium-quality” subproblem solutions available as the input data “would typically give medium-quality results,” says Wu. In this case, however, the researchers were able to leverage the medium-quality solutions to achieve high-quality results, significantly faster than state-of-the-art methods.

For vehicle routing and similar problems, users often must design very specialized algorithms to solve their specific problem. Some of these heuristics have been in development for decades.

The learning-to-delegate method offers an automatic way to accelerate these heuristics for large problems, no matter what the heuristic or — potentially — what the problem.

Since the method can work with a variety of solvers, it may be useful for a variety of resource allocation problems, says Wu. “We may unlock new applications that now will be possible because the cost of solving the problem is 10 to 100 times less.”

The research was supported by MIT Indonesia Seed Fund, U.S. Department of Transportation Dwight David Eisenhower Transportation Fellowship Program, and the MIT-IBM Watson AI Lab.

Machine-learning system flags remedies that might do more harm than good

Sepsis claims the lives of nearly 270,000 people in the U.S. each year. The unpredictable medical condition can progress rapidly, leading to a swift drop in blood pressure, tissue damage, multiple organ failure, and death.

Prompt interventions by medical professionals save lives, but some sepsis treatments can also contribute to a patient’s deterioration, so choosing the optimal therapy can be a difficult task. For instance, in the early hours of severe sepsis, administering too much fluid intravenously can increase a patient’s risk of death.

To help clinicians avoid remedies that may potentially contribute to a patient’s death, researchers at MIT and elsewhere have developed a machine-learning model that could be used to identify treatments that pose a higher risk than other options. Their model can also warn doctors when a septic patient is approaching a medical dead end — the point when the patient will most likely die no matter what treatment is used — so that they can intervene before it is too late.

When applied to a dataset of sepsis patients in a hospital intensive care unit, the researchers’ model indicated that about 12 percent of treatments given to patients who died were detrimental. The study also reveals that about 3 percent of patients who did not survive entered a medical dead end up to 48 hours before they died.

“We see that our model is almost eight hours ahead of a doctor’s recognition of a patient’s deterioration. This is powerful because in these really sensitive situations, every minute counts, and being aware of how the patient is evolving, and the risk of administering certain treatment at any given time, is really important,” says Taylor Killian, a graduate student in the Healthy ML group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

Joining Killian on the paper are his advisor, Assistant Professor Marzyeh Ghassemi, head of the Healthy ML group and senior author; lead author Mehdi Fatemi, a senior researcher at Microsoft Research; and Jayakumar Subramanian, a senior research scientist at Adobe India. The research is being presented at this week’s Conference on Neural Information Processing Systems.  

A dearth of data

This research project was spurred by a 2019 paper Fatemi wrote that explored the use of reinforcement learning in situations where it is too dangerous to explore arbitrary actions, which makes it difficult to generate enough data to effectively train algorithms. These situations, where more data cannot be proactively collected, are known as “offline” settings.

In reinforcement learning, the algorithm is trained through trial and error and learns to take actions that maximize its accumulation of reward. But in a health care setting, it is nearly impossible to generate enough data for these models to learn the optimal treatment, since it isn’t ethical to experiment with possible treatment strategies.

So, the researchers flipped reinforcement learning on its head. They used the limited data from a hospital ICU to train a reinforcement learning model to identify treatments to avoid, with the goal of keeping a patient from entering a medical dead end.

Learning what to avoid is a more statistically efficient approach that requires fewer data, Killian explains.

“When we think of dead ends in driving a car, we might think that is the end of the road, but you could probably classify every foot along that road toward the dead end as a dead end. As soon as you turn away from another route, you are in a dead end. So, that is the way we define a medical dead end: Once you’ve gone on a path where whatever decision you make, the patient will progress toward death,” Killian says.

“One core idea here is to decrease the probability of selecting each treatment in proportion to its chance of forcing the patient to enter a medical dead-end — a property that is called treatment security. This is a hard problem to solve as the data do not directly give us such an insight. Our theoretical results allowed us to recast this core idea as a reinforcement learning problem,” Fatemi says.

To develop their approach, called Dead-end Discovery (DeD), they created two copies of a neural network. The first neural network focuses only on negative outcomes — when a patient died — and the second network only focuses on positive outcomes — when a patient survived. Using two neural networks separately enabled the researchers to detect a risky treatment in one and then confirm it using the other.

They fed each neural network patient health statistics and a proposed treatment. The networks output an estimated value of that treatment and also evaluate the probability the patient will enter a medical dead end. The researchers compared those estimates against preset thresholds to see if the situation raised any flags.

A yellow flag means that a patient is entering an area of concern while a red flag identifies a situation where it is very likely the patient will not recover.
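A minimal sketch of how such flags could be raised from the two networks’ outputs follows; the network names, the state and treatment encodings, and the thresholds are illustrative placeholders rather than the published Dead-end Discovery implementation.

```python
# Illustrative thresholds; the actual DeD thresholds come from the paper's analysis.
YELLOW, RED = -0.25, -0.75

def flag_treatment(state, treatment, death_net, recovery_net):
    """Return 'red', 'yellow', or None for a proposed treatment.

    `death_net` estimates how strongly a treatment pushes the patient toward
    a dead end (trained on negative outcomes); `recovery_net` estimates the
    value of the treatment for survival (trained on positive outcomes).
    Both are stand-ins for the paper's trained networks.
    """
    risk = death_net(state, treatment)        # value learned from non-survivors
    value = recovery_net(state, treatment)    # value learned from survivors

    # A treatment is flagged only when both networks agree it looks dangerous.
    if risk <= RED and value <= RED:
        return "red"      # very likely the patient will not recover on this path
    if risk <= YELLOW and value <= YELLOW:
        return "yellow"   # entering an area of concern
    return None
```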

Treatment matters

The researchers tested their model using a dataset of patients presumed to be septic from the Beth Israel Deaconess Medical Center intensive care unit. This dataset contains about 19,300 admissions with observations drawn from a 72-hour period centered around when the patients first manifest symptoms of sepsis. Their results confirmed that some patients in the dataset encountered medical dead ends.

The researchers also found that 20 to 40 percent of patients who did not survive raised at least one yellow flag prior to their death, and many raised that flag at least 48 hours before they died. The results also showed that, when comparing the trends of patients who survived versus patients who died, once a patient raises their first flag, there is a very sharp deviation in the value of administered treatments. The window of time around the first flag is a critical point when making treatment decisions.

“This helped us confirm that treatment matters and the treatment deviates in terms of how patients survive and how patients do not. We found that upward of 11 percent of suboptimal treatments could have potentially been avoided because there were better alternatives available to doctors at those times. This is a pretty substantial number, when you consider the worldwide volume of patients who have been septic in the hospital at any given time,” Killian says.

Ghassemi is also quick to point out that the model is intended to assist doctors, not replace them.

“Human clinicians are who we want making decisions about care, and advice about what treatment to avoid isn’t going to change that,” she says. “We can recognize risks and add relevant guardrails based on the outcomes of 19,000 patient treatments — that’s equivalent to a single caregiver seeing more than 50 septic patient outcomes every day for an entire year.”

Moving forward, the researchers also want to estimate causal relationships between treatment decisions and the evolution of patient health. They plan to continue enhancing the model so it can create uncertainty estimates around treatment values that would help doctors make more informed decisions. Another way to provide further validation of the model would be to apply it to data from other hospitals, which they hope to do in the future.

This research was supported in part by Microsoft Research, a Canadian Institute for Advanced Research Azrieli Global Scholar Chair, a Canada Research Council Chair, and a Natural Sciences and Engineering Research Council of Canada Discovery Grant.

A tool to speed development of new solar cells

In the ongoing race to develop ever-better materials and configurations for solar cells, there are many variables that can be adjusted to try to improve performance, including material type, thickness, and geometric arrangement. Developing new solar cells has generally been a tedious process of making small changes to one of these parameters at a time. While computational simulators have made it possible to evaluate such changes without having to actually build each new variation for testing, the process remains slow.

Now, researchers at MIT and Google Brain have developed a system that makes it possible not just to evaluate one proposed design at a time, but to provide information about which changes will provide the desired improvements. This could greatly accelerate the discovery of new, improved configurations.

The new system, called a differentiable solar cell simulator, is described in a paper published today in the journal Computer Physics Communications, written by MIT junior Sean Mann, research scientist Giuseppe Romano of MIT’s Institute for Soldier Nanotechnologies, and four others at MIT and at Google Brain.

Traditional solar cell simulators, Romano explains, take the details of a solar cell configuration and produce as their output a predicted efficiency — that is, what percentage of the energy of incoming sunlight actually gets converted to an electric current. But this new simulator both predicts the efficiency and shows how much that output is affected by any one of the input parameters. “It tells you directly what happens to the efficiency if we make this layer a little bit thicker, or what happens to the efficiency if we for example change the property of the material,” he says.

In short, he says, “we didn’t discover a new device, but we developed a tool that will enable others to discover more quickly other higher performance devices.” Using this system, “we are decreasing the number of times that we need to run a simulator to give quicker access to a wider space of optimized structures.” In addition, he says, “our tool can identify a unique set of material parameters that has been hidden so far because it’s very complex to run those simulations.”

While traditional approaches use essentially a random search of possible variations, Mann says, with his tool “we can follow a trajectory of change because the simulator tells you what direction you want to be changing your device. That makes the process much faster because instead of exploring the entire space of opportunities, you can just follow a single path” that leads directly to improved performance.
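To see why having gradients helps, consider a toy, made-up efficiency model with a single design parameter, the thickness of one layer. The real simulator differentiates a full device-physics model and returns exact derivatives; this sketch uses a numerical gradient instead, but the optimization loop it enables has the same shape.

```python
# Toy stand-in for the simulator: efficiency peaks at some "ideal" thickness.
def efficiency(thickness_nm):
    ideal = 320.0                       # made-up optimum, for illustration only
    return 0.22 - 1e-6 * (thickness_nm - ideal) ** 2

def gradient(f, x, h=1e-3):
    return (f(x + h) - f(x - h)) / (2 * h)   # central finite difference

thickness = 150.0                       # initial guess in nanometers
for _ in range(200):
    # Follow the direction of increasing efficiency instead of random search.
    thickness += 1e5 * gradient(efficiency, thickness)

print(f"optimized thickness ~ {thickness:.1f} nm, "
      f"efficiency ~ {efficiency(thickness):.3f}")
```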

Since advanced solar cells often are composed of multiple layers interlaced with conductive materials to carry electric charge from one to the other, this computational tool reveals how changing the relative thicknesses of these different layers will affect the device’s output. “This is very important because the thickness is critical. There is a strong interplay between light propagation and the thickness of each layer and the absorption of each layer,” Mann explains.

Other variables that can be evaluated include the amount of doping (the introduction of atoms of another element) that each layer receives, or the dielectric constant of insulating layers, or the bandgap, a measure of the energy levels of photons of light that can be captured by different materials used in the layers.

This simulator is now available as an open-source tool that can be used immediately to help guide research in this field, Romano says. “It is ready, and can be taken up by industry experts.” To make use of it, researchers would couple this device’s computations with an optimization algorithm, or even a machine learning system, to rapidly assess a wide variety of possible changes and home in quickly on the most promising alternatives.

At this point, the simulator is based on just a one-dimensional version of the solar cell, so the next step will be to expand its capabilities to include two- and three-dimensional configurations. But even this 1D version “can cover the majority of cells that are currently under production,” Romano says. Certain variations, such as so-called tandem cells using different materials, cannot yet be simulated directly by this tool, but “there are ways to approximate a tandem solar cell by simulating each of the individual cells,” Mann says.

The simulator is “end-to-end,” Romano says, meaning it computes the sensitivity of the efficiency, also taking into account light absorption. He adds: “An appealing future direction is composing our simulator with advanced existing differentiable light-propagation simulators, to achieve enhanced accuracy.”

Moving forward, Romano says, because this is an open-source code, “that means that once it’s up there, the community can contribute to it. And that’s why we are really excited.” Although this research group is “just a handful of people,” he says, now anyone working in the field can make their own enhancements and improvements to the code and introduce new capabilities.

“Differentiable physics is going to provide new capabilities for the simulations of engineered systems,” says Venkat Viswanathan, an associate professor of mechanical engineering at Carnegie Mellon University, who was not associated with this work. “The  differentiable solar cell simulator is an incredible example of differentiable physics, that can now provide new capabilities to optimize solar cell device performance,” he says, calling the study “an exciting step forward.”

In addition to Mann and Romano, the team included Eric Fadel and Steven Johnson at MIT, and Samuel Schoenholz and Ekin Cubuk at Google Brain. The work was supported in part by Eni S.p.A. and the MIT Energy Initiative, and the MIT Quest for Intelligence.

Tiny machine learning design alleviates a bottleneck in memory usage on internet-of-things devices

Machine learning provides powerful tools to researchers to identify and predict patterns and behaviors, as well as learn, optimize, and perform tasks. This ranges from applications like vision systems on autonomous vehicles or social robots to smart thermostats to wearable and mobile devices like smartwatches and apps that can monitor health changes. While these algorithms and their architectures are becoming more powerful and efficient, they typically require tremendous amounts of memory, computation, and data to train and make inferences.

At the same time, researchers are working to reduce the size and complexity of the devices that these algorithms can run on, all the way down to a microcontroller unit (MCU) that’s found in billions of internet-of-things (IoT) devices. An MCU is a memory-limited minicomputer housed in a compact integrated circuit that lacks an operating system and runs simple commands. These relatively cheap edge devices require low power, computing, and bandwidth, and offer many opportunities to inject AI technology to expand their utility, increase privacy, and democratize their use — a field called TinyML.

Now, an MIT team working in TinyML in the MIT-IBM Watson AI Lab and the research group of Song Han, assistant professor in the Department of Electrical Engineering and Computer Science (EECS), has designed a technique to shrink the amount of memory needed even further, while improving its performance on image recognition in live videos.

“Our new technique can do a lot more and paves the way for tiny machine learning on edge devices,” says Han, who designs TinyML software and hardware.

To increase TinyML efficiency, Han and his colleagues from EECS and the MIT-IBM Watson AI Lab analyzed how memory is used on microcontrollers running various convolutional neural networks (CNNs). CNNs are biologically inspired models patterned after neurons in the brain and are often applied to evaluate and identify visual features within imagery, like a person walking through a video frame. In their study, they discovered an imbalance in memory utilization, causing front-loading on the computer chip and creating a bottleneck. By developing a new inference technique and neural architecture, the team alleviated the problem and reduced peak memory usage by four-to-eight times. Further, the team deployed it on their own tinyML vision system, equipped with a camera and capable of human and object detection, creating its next generation, dubbed MCUNetV2. When compared to other machine learning methods running on microcontrollers, MCUNetV2 outperformed them with high accuracy on detection, opening the doors to additional vision applications not previously possible.

The results will be presented in a paper at the conference on Neural Information Processing Systems (NeurIPS) this week. The team includes Han, lead author and graduate student Ji Lin, postdoc Wei-Ming Chen, graduate student Han Cai, and MIT-IBM Watson AI Lab Research Scientist Chuang Gan.

A design for memory efficiency and redistribution

TinyML offers numerous advantages over deep machine learning that happens on larger devices, like remote servers and smartphones. These, Han notes, include privacy, since the data are not transmitted to the cloud for computing but processed on the local device; robustness, as the computing is quick and the latency is low; and low cost, because IoT devices cost roughly $1 to $2. Further, some larger, more traditional AI models can emit as much carbon as five cars in their lifetimes, require many GPUs, and cost billions of dollars to train. “So, we believe such TinyML techniques can enable us to go off-grid to save the carbon emissions and make the AI greener, smarter, faster, and also more accessible to everyone — to democratize AI,” says Han.

However, small MCU memory and digital storage limit AI applications, so efficiency is a central challenge. MCUs contain only 256 kilobytes of memory and 1 megabyte of storage. In comparison, mobile AI on smartphones and cloud computing may have 256 gigabytes and terabytes of storage, respectively, as well as 16,000 and 100,000 times more memory. Because memory is such a precious resource, the team wanted to optimize its use, so they profiled the MCU memory usage of CNN designs — a task that had been overlooked until now, Lin and Chen say.

Their findings revealed that the memory usage peaked in the first five convolutional blocks out of about 17. Each block contains many connected convolutional layers, which help to filter for the presence of specific features within an input image or video, creating a feature map as the output. During the initial memory-intensive stage, most of the blocks operated beyond the 256KB memory constraint, offering plenty of room for improvement. To reduce the peak memory, the researchers developed a patch-based inference schedule, which operates on only a small fraction, roughly 25 percent, of the layer’s feature map at one time, before moving on to the next quarter, until the whole layer is done. This method saved four-to-eight times the memory of the previous layer-by-layer computational method, without adding latency.
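Roughly speaking, the saving comes from never holding a full early-stage feature map in memory at once. A back-of-the-envelope comparison on made-up feature-map sizes (not MCUNetV2’s real layer dimensions) illustrates the idea:

```python
# Activation sizes (in kilobytes) of a made-up CNN's early blocks; the real
# MCUNetV2 numbers differ, but the pattern is the same: the first few blocks
# dominate peak memory on a 256 KB microcontroller.
feature_map_kb = [512, 448, 384, 256, 128, 64, 48, 32]

# Layer-by-layer inference must hold a whole input and output feature map.
peak_layerwise = max(a + b for a, b in zip(feature_map_kb, feature_map_kb[1:]))

# Patch-based inference processes ~a quarter of the memory-heavy early maps
# at a time, so only that fraction has to be resident for those blocks.
PATCH_BLOCKS, PATCH_FRACTION = 4, 0.25
patched = [kb * PATCH_FRACTION if i < PATCH_BLOCKS else kb
           for i, kb in enumerate(feature_map_kb)]
peak_patched = max(a + b for a, b in zip(patched, patched[1:]))

print(f"peak memory, layer-by-layer: {peak_layerwise:.0f} KB")
print(f"peak memory, patch-based:    {peak_patched:.0f} KB")
```

On these made-up numbers the peak drops from 960 KB to 240 KB, the same order of reduction the team reports.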

“As an illustration, say we have a pizza. We can divide it into four chunks and only eat one chunk at a time, so you save about three-quarters. This is the patch-based inference method,” says Han. “However, this was not a free lunch.” Like photoreceptors in the human eye, the network can only take in and examine part of an image at a time; this receptive field is a patch of the total image or field of view. As the size of these receptive fields (or pizza slices in this analogy) grows, the overlap between them increases, which amounts to redundant computation that the researchers found to be about 10 percent. The researchers proposed to also redistribute the neural network across the blocks, in parallel with the patch-based inference method, without losing any of the accuracy in the vision system. However, the question remained about which blocks needed the patch-based inference method and which could use the original layer-by-layer one, together with the redistribution decisions; hand-tuning all of these knobs was labor-intensive, and better left to AI.

“We want to automate this process by doing a joint automated search for optimization, including both the neural network architecture, like the number of layers, number of channels, the kernel size, and also the inference schedule including number of patches, number of layers for patch-based inference, and other optimization knobs,” says Lin, “so that non-machine learning experts can have a push-button solution to improve the computation efficiency but also improve the engineering productivity, to be able to deploy this neural network on microcontrollers.”
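A heavily simplified sketch of such a joint search might randomly sample candidate configurations over a few knobs and keep the best one that fits the memory budget; the search space, knob names, and `evaluate` function below are illustrative stand-ins, not the actual MCUNetV2 search.

```python
import random

# Made-up search space: a few architecture and inference-schedule knobs.
SEARCH_SPACE = {
    "num_layers":   [10, 14, 17],
    "width_mult":   [0.35, 0.5, 0.75],
    "kernel_size":  [3, 5, 7],
    "num_patches":  [1, 4, 9],        # 1 = plain layer-by-layer inference
    "patch_layers": [0, 3, 5],        # how many early blocks run patch-based
}

MEMORY_BUDGET_KB = 256

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def joint_search(evaluate, trials=500):
    """Pick the most accurate config that fits the MCU memory budget.

    `evaluate(config)` is a placeholder that would return (accuracy, peak_kb),
    e.g. by training a candidate network and profiling its inference schedule.
    """
    best = None
    for _ in range(trials):
        config = sample_config()
        accuracy, peak_kb = evaluate(config)
        if peak_kb <= MEMORY_BUDGET_KB and (best is None or accuracy > best[0]):
            best = (accuracy, config)
    return best
```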

A new horizon for tiny vision systems

The co-design of the network architecture with the neural network search optimization and inference scheduling provided significant gains and was adopted into MCUNetV2; it outperformed other vision systems in peak memory usage, and in image and object detection and classification. The MCUNetV2 device includes a small screen and a camera, and is about the size of an earbud case. Compared to the first version, the new version needed four times less memory for the same amount of accuracy, says Chen. When placed head-to-head against other tinyML solutions, MCUNetV2 was able to detect the presence of objects in image frames, like human faces, with an improvement of nearly 17 percent. Further, it set a record for accuracy, at nearly 72 percent, for thousand-class image classification on the ImageNet dataset, using 465KB of memory. The researchers also tested for what’s known as visual wake words, a measure of how well their MCU vision model could identify the presence of a person within an image, and even with the limited memory of only 30KB, it achieved greater than 90 percent accuracy, beating the previous state-of-the-art method. This means the method is accurate enough and could be deployed to help in, say, smart-home applications.

With the high accuracy and low energy utilization and cost, MCUNetV2’s performance unlocks new IoT applications. Due to their limited memory, Han says, vision systems on IoT devices were previously thought to be only good for basic image classification tasks, but their work has helped to expand the opportunities for TinyML use. Further, the research team envisions it in numerous fields, from monitoring sleep and joint movement in the health-care industry to sports coaching and movements like a golf swing to plant identification in agriculture, as well as in smarter manufacturing, from identifying nuts and bolts to detecting malfunctioning machines.

“We really push forward for these larger-scale, real-world applications,” says Han. “Without GPUs or any specialized hardware, our technique is so tiny it can run on these small cheap IoT devices and perform real-world applications like these visual wake words, face mask detection, and person detection. This opens the door for a brand-new way of doing tiny AI and mobile vision.”

This research was sponsored by the MIT-IBM Watson AI Lab, Samsung, and Woodside Energy, and the National Science Foundation.

Machines that see the world more like humans do

Computer vision systems sometimes make inferences about a scene that fly in the face of common sense. For example, if a robot were processing a scene of a dinner table, it might completely ignore a bowl that is visible to any human observer, estimate that a plate is floating above the table, or misperceive a fork to be penetrating a bowl rather than leaning against it.

Move that computer vision system to a self-driving car and the stakes become much higher  — for example, such systems have failed to detect emergency vehicles and pedestrians crossing the street.

To overcome these errors, MIT researchers have developed a framework that helps machines see the world more like humans do. Their new artificial intelligence system for analyzing scenes learns to perceive real-world objects from just a few images, and perceives scenes in terms of these learned objects.

The researchers built the framework using probabilistic programming, an AI approach that enables the system to cross-check detected objects against input data, to see if the images recorded from a camera are a likely match to any candidate scene. Probabilistic inference allows the system to infer whether mismatches are likely due to noise or to errors in the scene interpretation that need to be corrected by further processing.

This common-sense safeguard allows the system to detect and correct many errors that plague the “deep-learning” approaches that have also been used for computer vision. Probabilistic programming also makes it possible to infer probable contact relationships between objects in the scene, and use common-sense reasoning about these contacts to infer more accurate positions for objects.

“If you don’t know about the contact relationships, then you could say that an object is floating above the table — that would be a valid explanation. As humans, it is obvious to us that this is physically unrealistic and the object resting on top of the table is a more likely pose of the object. Because our reasoning system is aware of this sort of knowledge, it can infer more accurate poses. That is a key insight of this work,” says lead author Nishad Gothoskar, an electrical engineering and computer science (EECS) PhD student with the Probabilistic Computing Project.

In addition to improving the safety of self-driving cars, this work could enhance the performance of computer perception systems that must interpret complicated arrangements of objects, like a robot tasked with cleaning a cluttered kitchen.

Gothoskar’s co-authors include recent EECS PhD graduate Marco Cusumano-Towner; research engineer Ben Zinberg; visiting student Matin Ghavamizadeh; Falk Pollok, a software engineer in the MIT-IBM Watson AI Lab; recent EECS master’s graduate Austin Garrett; Dan Gutfreund, a principal investigator in the MIT-IBM Watson AI Lab; Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences (BCS) and a member of the Computer Science and Artificial Intelligence Laboratory; and senior author Vikash K. Mansinghka, principal research scientist and leader of the Probabilistic Computing Project in BCS. The research is being presented at the Conference on Neural Information Processing Systems in December.

A blast from the past

To develop the system, called “3D Scene Perception via Probabilistic Programming (3DP3),” the researchers drew on a concept from the early days of AI research, which is that computer vision can be thought of as the “inverse” of computer graphics.

Computer graphics focuses on generating images based on the representation of a scene; computer vision can be seen as the inverse of this process. Gothoskar and his collaborators made this technique more learnable and scalable by incorporating it into a framework built using probabilistic programming.

“Probabilistic programming allows us to write down our knowledge about some aspects of the world in a way a computer can interpret, but at the same time, it allows us to express what we don’t know, the uncertainty. So, the system is able to automatically learn from data and also automatically detect when the rules don’t hold,” Cusumano-Towner explains.

In this case, the model is encoded with prior knowledge about 3D scenes. For instance, 3DP3 “knows” that scenes are composed of different objects, and that these objects often lie flat on top of each other — but they may not always be in such simple relationships. This enables the model to reason about a scene with more common sense.

Learning shapes and scenes

To analyze an image of a scene, 3DP3 first learns about the objects in that scene. After being shown only five images of an object, each taken from a different angle, 3DP3 learns the object’s shape and estimates the volume it would occupy in space.

“If I show you an object from five different perspectives, you can build a pretty good representation of that object. You’d understand its color, its shape, and you’d be able to recognize that object in many different scenes,” Gothoskar says.

Mansinghka adds, “This is way less data than deep-learning approaches. For example, the Dense Fusion neural object detection system requires thousands of training examples for each object type. In contrast, 3DP3 only requires a few images per object, and reports uncertainty about the parts of each object’s shape that it doesn’t know.”

The 3DP3 system generates a graph to represent the scene, where each object is a node and the lines that connect the nodes indicate which objects are in contact with one another. This enables 3DP3 to produce a more accurate estimation of how the objects are arranged. (Deep-learning approaches rely on depth images to estimate object poses, but these methods don’t produce a graph structure of contact relationships, so their estimations are less accurate.)
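A minimal sketch of the kind of contact graph described here, with objects as nodes and contact relationships as edges; the data structure and the `snap_to_support` step are illustrative, not the 3DP3 codebase.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str
    pose: tuple                                    # e.g. (x, y, z, yaw) estimate
    contacts: list = field(default_factory=list)   # objects it rests on / touches

# Scene graph: nodes are objects, edges are contact relationships.
table = SceneObject("table", pose=(0.0, 0.0, 0.00, 0.0))
plate = SceneObject("plate", pose=(0.1, 0.2, 0.76, 0.0), contacts=[table])
fork  = SceneObject("fork",  pose=(0.3, 0.2, 0.76, 1.2), contacts=[table])

# Contact edges let an inference step rule out physically implausible poses,
# e.g. a plate "floating" above the table, by snapping it to its support.
def snap_to_support(obj, support_height):
    x, y, _, yaw = obj.pose
    obj.pose = (x, y, support_height, yaw)
```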

Outperforming baseline models

The researchers compared 3DP3 with several deep-learning systems, all tasked with estimating the poses of 3D objects in a scene.

In nearly all instances, 3DP3 generated more accurate poses than other models and performed far better when some objects were partially obstructing others. And 3DP3 only needed to see five images of each object, while each of the baseline models it outperformed needed thousands of images for training.

When used in conjunction with another model, 3DP3 was able to improve its accuracy. For instance, a deep-learning model might predict that a bowl is floating slightly above a table, but because 3DP3 has knowledge of the contact relationships and can see that this is an unlikely configuration, it is able to make a correction by aligning the bowl with the table.

“I found it surprising to see how large the errors from deep learning could sometimes be — producing scene representations where objects really didn’t match with what people would perceive. I also found it surprising that only a little bit of model-based inference in our causal probabilistic program was enough to detect and fix these errors. Of course, there is still a long way to go to make it fast and robust enough for challenging real-time vision systems — but for the first time, we’re seeing probabilistic programming and structured causal models improving robustness over deep learning on hard 3D vision benchmarks,” Mansinghka says.

In the future, the researchers would like to push the system further so it can learn about an object from a single image, or a single frame in a movie, and then be able to detect that object robustly in different scenes. They would also like to explore the use of 3DP3 to gather training data for a neural network. It is often difficult for humans to manually label images with 3D geometry, so 3DP3 could be used to generate more complex image labels.

The 3DP3 system “combines low-fidelity graphics modeling with common-sense reasoning to correct large scene interpretation errors made by deep learning neural nets. This type of approach could have broad applicability as it addresses important failure modes of deep learning. The MIT researchers’ accomplishment also shows how probabilistic programming technology previously developed under DARPA’s Probabilistic Programming for Advancing Machine Learning (PPAML) program can be applied to solve central problems of common-sense AI under DARPA’s current Machine Common Sense (MCS) program,” says Matt Turek, DARPA Program Manager for the Machine Common Sense Program, who was not involved in this research, though the program partially funded the study.

Additional funders include the Singapore Defense Science and Technology Agency collaboration with the MIT Schwarzman College of Computing, Intel’s Probabilistic Computing Center, the MIT-IBM Watson AI Lab, the Aphorism Foundation, and the Siegel Family Foundation.

Q&A: More-sustainable concrete with machine learning

As a building material, concrete withstands the test of time. Its use dates back to early civilizations, and today it is the most popular composite choice in the world. However, it’s not without its faults. Production of its key ingredient, cement, contributes 8-9 percent of global anthropogenic CO2 emissions and 2-3 percent of energy consumption, figures that are only projected to increase in the coming years. With United States infrastructure aging, the federal government recently passed a milestone bill to revitalize and upgrade it; combined with a push to reduce greenhouse gas emissions where possible, that puts concrete in the crosshairs for modernization, too.

Elsa Olivetti, the Esther and Harold E. Edgerton Associate Professor in the MIT Department of Materials Science and Engineering, and Jie Chen, MIT-IBM Watson AI Lab research scientist and manager, think artificial intelligence can help meet this need by designing and formulating new, more sustainable concrete mixtures, with lower costs and carbon dioxide emissions, while improving material performance and reusing manufacturing byproducts in the material itself. Olivetti’s research improves environmental and economic sustainability of materials, and Chen develops and optimizes machine learning and computational techniques, which he can apply to materials reformulation. Olivetti and Chen, along with their collaborators, have recently teamed up for an MIT-IBM Watson AI Lab project to make concrete more sustainable for the benefit of society, the climate, and the economy.

Q: What applications does concrete have, and what properties make it a preferred building material?

Olivetti: Concrete is the dominant building material globally with an annual consumption of 30 billion metric tons. That is over 20 times the next most produced material, steel, and the scale of its use leads to considerable environmental impact, approximately 5-8 percent of global greenhouse gas (GHG) emissions. It can be made locally, has a broad range of structural applications, and is cost-effective. Concrete is a mixture of fine and coarse aggregate, water, cement binder (the glue), and other additives.

Q: Why isn’t it sustainable, and what research problems are you trying to tackle with this project?

Olivetti: The community is working on several ways to reduce the impact of this material, including the use of alternative fuels for heating the cement mixture, increasing energy and materials efficiency, and carbon sequestration at production facilities, but one important opportunity is to develop an alternative to the cement binder.

While cement is 10 percent of the concrete mass, it accounts for 80 percent of the GHG footprint. This impact is derived from the fuel burned to heat and run the chemical reaction required in manufacturing, but also the chemical reaction itself releases CO2 from the calcination of limestone. Therefore, partially replacing the input ingredients to cement (traditionally ordinary Portland cement or OPC) with alternative materials from waste and byproducts can reduce the GHG footprint. But use of these alternatives is not inherently more sustainable because wastes might have to travel long distances, which adds to fuel emissions and cost, or might require pretreatment processes. The optimal way to make use of these alternate materials will be situation-dependent. But because of the vast scale, we also need solutions that account for the huge volumes of concrete needed. This project is trying to develop novel concrete mixtures that will decrease the GHG impact of the cement and concrete, moving away from the trial-and-error processes towards those that are more predictive.

Chen: If we want to fight climate change and make our environment better, are there alternative ingredients or a reformulation we could use so that less greenhouse gas is emitted? We hope that through this project using machine learning we’ll be able to find a good answer.

Q: Why is this problem important to address now, at this point in history?

Olivetti: There is urgent need to address greenhouse gas emissions as aggressively as possible, and the road to doing so isn’t necessarily straightforward for all areas of industry. For transportation and electricity generation, there are paths that have been identified to decarbonize those sectors. We need to move much more aggressively to achieve those in the time needed; further, the technological approaches to achieve that are more clear. However, for tough-to-decarbonize sectors, such as industrial materials production, the pathways to decarbonization are not as mapped out.

Q: How are you planning to address this problem to produce better concrete?

Olivetti: The goal is to predict mixtures that will both meet performance criteria, such as strength and durability, with those that also balance economic and environmental impact. A key to this is to use industrial wastes in blended cements and concretes. To do this, we need to understand the glass and mineral reactivity of constituent materials. This reactivity not only determines the limit of the possible use in cement systems but also controls concrete processing, and the development of strength and pore structure, which ultimately control concrete durability and life-cycle CO2 emissions.

Chen: We investigate using waste materials to replace part of the cement component. This is something that we’ve hypothesized would be more sustainable and economic — actually waste materials are common, and they cost less. Because of the reduction in the use of cement, the final concrete product would be responsible for much less carbon dioxide production. Figuring out the right concrete mixture proportion that makes durable concretes while achieving other goals is a very challenging problem. Machine learning is giving us an opportunity to explore the advancement of predictive modeling, uncertainty quantification, and optimization to solve the issue. What we are doing is exploring options using deep learning as well as multi-objective optimization techniques to find an answer. These efforts are now more feasible to carry out, and they will produce results with reliability estimates that we need to understand what makes a good concrete.

Q: What kinds of AI and computational techniques are you employing for this?

Olivetti: We use AI techniques to collect data on individual concrete ingredients, mix proportions, and concrete performance from the literature through natural language processing. We also add data obtained from industry and/or high throughput atomistic modeling and experiments to optimize the design of concrete mixtures. Then we use this information to develop insight into the reactivity of possible waste and byproduct materials as alternatives to cement materials for low-CO2 concrete. By incorporating generic information on concrete ingredients, the resulting concrete performance predictors are expected to be more reliable and transformative than existing AI models.

Chen: The final objective is to figure out what constituents, and how much of each, to put into the recipe for producing the concrete that optimizes the various factors: strength, cost, environmental impact, performance, etc. For each of the objectives, we need certain models: We need a model to predict the performance of the concrete (like, how long does it last and how much weight does it sustain?), a model to estimate the cost, and a model to estimate how much carbon dioxide is generated. We will need to build these models by using data from literature, from industry, and from lab experiments.

We are exploring Gaussian process models to predict the concrete strength, going forward into days and weeks. This model can give us an uncertainty estimate of the prediction as well. Such a model needs specification of parameters, for which we will use another model to calculate. At the same time, we also explore neural network models because we can inject domain knowledge from human experience into them. Some models are as simple as multi-layer perceptrons, while some are more complex, like graph neural networks. The goal here is that we want to have a model that is not only accurate but also robust — the input data is noisy, and the model must embrace the noise, so that its prediction is still accurate and reliable for the multi-objective optimization.

Once we have built models that we are confident with, we will inject their predictions and uncertainty estimates into the optimization of multiple objectives, under constraints and under uncertainties.
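For readers who want a sense of what such an uncertainty-aware strength model might look like, here is a minimal sketch using a Gaussian process regressor from scikit-learn; the feature layout, data values, and kernel choice are illustrative placeholders, not the project’s actual models or datasets.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Made-up training data: each row is a mixture (cement, water, waste-material
# fractions, curing days); each target is a measured compressive strength (MPa).
X_train = np.array([[0.15, 0.18, 0.05, 7],
                    [0.12, 0.20, 0.10, 28],
                    [0.10, 0.19, 0.15, 28],
                    [0.14, 0.17, 0.08, 14]])
y_train = np.array([32.0, 41.5, 38.2, 36.7])

# WhiteKernel models the measurement noise the interview mentions.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X_train, y_train)

# Predict strength for a candidate mixture, with an uncertainty estimate that
# a downstream multi-objective optimization can take into account.
X_new = np.array([[0.11, 0.19, 0.12, 28]])
mean, std = gp.predict(X_new, return_std=True)
print(f"predicted strength: {mean[0]:.1f} +/- {std[0]:.1f} MPa")
```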

Q: How do you balance cost-benefit trade-offs?

Chen: The multiple objectives we consider are not necessarily consistent, and sometimes they are at odds with each other. The goal is to identify scenarios where the values for our objectives cannot be further pushed simultaneously without compromising one or a few. For example, if you want to further reduce the cost, you probably have to suffer the performance or suffer the environmental impact. Eventually, we will give the results to policymakers and they will look into the results and weigh the options. For example, they may be able to tolerate a slightly higher cost under a significant reduction in greenhouse gas. Alternatively, if the cost varies little but the concrete performance changes drastically, say, doubles or triples, then this is definitely a favorable outcome.

Q: What kinds of challenges do you face in this work?

Chen: The data we get either from industry or from literature are very noisy; the concrete measurements can vary a lot, depending on where and when they are taken. There are also substantial missing data when we integrate them from different sources, so, we need to spend a lot of effort to organize and make the data usable for building and training machine learning models. We also explore imputation techniques that substitute missing features, as well as models that tolerate missing features, in our predictive modeling and uncertainty estimate.

Q: What do you hope to achieve through this work?

Chen: In the end, we are suggesting either one or a few concrete recipes, or a continuum of recipes, to manufacturers and policymakers. We hope that this will provide invaluable information for both the construction industry and for the effort of protecting our beloved Earth.

Olivetti: We’d like to develop a robust way to design cements that make use of waste materials to lower their CO2 footprint. Nobody is trying to make waste, so we can’t rely on one stream as a feedstock if we want this to be massively scalable. We have to be flexible and robust to shift with feedstock changes, and for that we need improved understanding. Our approach to develop local, dynamic, and flexible alternatives is to learn what makes these wastes reactive, so we know how to optimize their use and do so as broadly as possible. We do that through predictive model development, using software we have developed in my group to automatically extract data from literature on over 5 million texts and patents on various topics. We link this to the creative capabilities of our IBM collaborators to design methods that predict the final impact of new cements. If we are successful, we can lower the emissions of this ubiquitous material and play our part in achieving carbon emissions mitigation goals.

Other researchers involved with this project include Stefanie Jegelka, the X-Window Consortium Career Development Associate Professor in the MIT Department of Electrical Engineering and Computer Science; Richard Goodwin, IBM principal researcher; Soumya Ghosh, MIT-IBM Watson AI Lab research staff member; and Kristen Severson, former research staff member. Collaborators included Nghia Hoang, former research staff member with MIT-IBM Watson AI Lab and IBM Research, and Executive Director of MIT Climate & Sustainability Consortium Jeremy Gregory.​

This research is supported by the MIT-IBM Watson AI Lab.

Technique enables real-time rendering of scenes in 3D

Humans are pretty good at looking at a single two-dimensional image and understanding the full three-dimensional scene that it captures. Artificial intelligence agents are not.

Yet a machine that needs to interact with objects in the world — like a robot designed to harvest crops or assist with surgery — must be able to infer properties about a 3D scene from observations of the 2D images it’s trained on.     

While scientists have had success using neural networks to infer representations of 3D scenes from images, these machine learning methods aren’t fast enough to make them feasible for many real-world applications.

A new technique demonstrated by researchers at MIT and elsewhere is able to represent 3D scenes from images about 15,000 times faster than some existing models.

The method represents a scene as a 360-degree light field, which is a function that describes all the light rays in a 3D space, flowing through every point and in every direction. The light field is encoded into a neural network, which enables faster rendering of the underlying 3D scene from an image.

The light-field networks (LFNs) the researchers developed can reconstruct a light field after only a single observation of an image, and they are able to render 3D scenes at real-time frame rates.

“The big promise of these neural scene representations, at the end of the day, is to use them in vision tasks. I give you an image and from that image you create a representation of the scene, and then everything you want to reason about you do in the space of that 3D scene,” says Vincent Sitzmann, a postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.

Sitzmann wrote the paper with co-lead author Semon Rezchikov, a postdoc at Harvard University; William T. Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of CSAIL; Joshua B. Tenenbaum, a professor of computational cognitive science in the Department of Brain and Cognitive Sciences and a member of CSAIL; and senior author Frédo Durand, a professor of electrical engineering and computer science and a member of CSAIL. The research will be presented at the Conference on Neural Information Processing Systems this month.

Mapping rays

In computer vision and computer graphics, rendering a 3D scene from an image involves mapping thousands or possibly millions of camera rays. Think of camera rays like laser beams shooting out from a camera lens and striking each pixel in an image, one ray per pixel. These computer models must determine the color of the pixel struck by each camera ray.

Many current methods accomplish this by taking hundreds of samples along the length of each camera ray as it moves through space, which is a computationally expensive process that can lead to slow rendering.

Instead, an LFN learns to represent the light field of a 3D scene and then directly maps each camera ray in the light field to the color that is observed by that ray. An LFN leverages the unique properties of light fields, which enable the rendering of a ray after only a single evaluation, so the LFN doesn’t need to stop along the length of a ray to run calculations.

“With other methods, when you do this rendering, you have to follow the ray until you find the surface. You have to do thousands of samples, because that is what it means to find a surface. And you’re not even done yet because there may be complex things like transparency or reflections. With a light field, once you have reconstructed the light field, which is a complicated problem, rendering a single ray just takes a single sample of the representation, because the representation directly maps a ray to its color,” Sitzmann says.      
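To make that cost difference concrete, here is a schematic PyTorch sketch, not the authors' code: a conventional sampling-based renderer queries a network at many points along each ray, while a light-field-style renderer queries it once per ray. The function names, the sample count, and the simple averaging that stands in for volumetric compositing are illustrative assumptions.

    import torch
    import torch.nn as nn

    # Sampling-based rendering (schematic): many network queries per ray.
    def render_ray_by_sampling(field: nn.Module, origin, direction, n_samples: int = 128):
        ts = torch.linspace(0.1, 5.0, n_samples).unsqueeze(-1)   # depths along the ray
        points = origin + ts * direction                         # (n_samples, 3) sample points
        colors = field(points)                                   # n_samples network evaluations
        return colors.mean(dim=0)                                # stand-in for volumetric compositing

    # Light-field-style rendering (schematic): one network query per ray.
    def render_ray_by_light_field(light_field: nn.Module, origin, direction):
        ray = torch.cat([origin, direction], dim=-1)             # a 6-D ray parameterization
        return light_field(ray)                                  # a single evaluation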

The LFN classifies each camera ray using its “Plücker coordinates,” which represent a line in 3D space based on its direction and how far it is from its point of origin. The system computes the Plücker coordinates of each camera ray at the point where it hits a pixel to render an image.
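As a rough illustration of that parameterization, and with the caveat that the paper's exact normalization may differ, a ray's Plücker coordinates can be computed from its origin and direction as the unit direction together with a moment vector that is the same no matter which point on the line is chosen:

    import torch

    def plucker_coordinates(origin: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
        # Map a ray (origin o, direction d) to 6-D Plücker coordinates (d, m).
        # The moment m = o x d does not depend on which point of the ray is
        # taken as the origin, so every parameterization of the same line
        # yields the same coordinates.
        d = direction / direction.norm(dim=-1, keepdim=True)   # unit direction
        m = torch.cross(origin, d, dim=-1)                      # moment vector
        return torch.cat([d, m], dim=-1)                        # (..., 6)

    # An LFN-style network consumes these coordinates directly:
    # color = mlp(plucker_coordinates(o, d))   # one evaluation per ray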

By mapping each ray using Plücker coordinates, the LFN is also able to compute the geometry of the scene due to the parallax effect. Parallax is the difference in apparent position of an object when viewed from two different lines of sight. For instance, if you move your head, objects that are farther away seem to move less than objects that are closer. The LFN can tell the depth of objects in a scene due to parallax, and uses this information to encode a scene’s geometry as well as its appearance.

But to reconstruct light fields, the neural network must first learn about the structures of light fields, so the researchers trained their model with many images of simple scenes of cars and chairs.

“There is an intrinsic geometry of light fields, which is what our model is trying to learn. You might worry that light fields of cars and chairs are so different that you can’t learn some commonality between them. But it turns out, if you add more kinds of objects, as long as there is some homogeneity, you get a better and better sense of how light fields of general objects look, so you can generalize about classes,” Rezchikov says.

Once the model learns the structure of a light field, it can render a 3D scene from only one image as an input.

Rapid rendering

The researchers tested their model by reconstructing 360-degree light fields of several simple scenes. They found that LFNs were able to render scenes at more than 500 frames per second, about three orders of magnitude faster than other methods. In addition, the 3D objects rendered by LFNs were often crisper than those generated by other models.

An LFN is also less memory-intensive, requiring only about 1.6 megabytes of storage, as opposed to 146 megabytes for a popular baseline method.

“Light fields were proposed before, but back then they were intractable. Now, with these techniques that we used in this paper, for the first time you can both represent these light fields and work with these light fields. It is an interesting convergence of the mathematical models and the neural network models that we have developed coming together in this application of representing scenes so machines can reason about them,” Sitzmann says.

In the future, the researchers would like to make their model more robust so it could be used effectively for complex, real-world scenes. One way to drive LFNs forward is to focus only on reconstructing certain patches of the light field, which could enable the model to run faster and perform better in real-world environments, Sitzmann says.

“Neural rendering has recently enabled photorealistic rendering and editing of images from only a sparse set of input views. Unfortunately, all existing techniques are computationally very expensive, preventing applications that require real-time processing, like video conferencing. This project takes a big step toward a new generation of computationally efficient and mathematically elegant neural rendering algorithms,” says Gordon Wetzstein, an associate professor of electrical engineering at Stanford University, who was not involved in this research. “I anticipate that it will have widespread applications, in computer graphics, computer vision, and beyond.”

This work is supported by the National Science Foundation, the Office of Naval Research, Mitsubishi, the Defense Advanced Research Projects Agency, and the Singapore Defense Science and Technology Agency.

Read More

Generating a realistic 3D world

While standing in a kitchen, you push some metal bowls across the counter into the sink with a clang, and drape a towel over the back of a chair. In another room, it sounds like some precariously stacked wooden blocks fell over, and there’s an epic toy car crash. These interactions with our environment are just some of what humans experience on a daily basis at home, but while this world may seem real, it isn’t.

A new study from researchers at MIT, the MIT-IBM Watson AI Lab, Harvard University, and Stanford University is enabling a rich virtual world, very much like stepping into “The Matrix.” Their platform, called ThreeDWorld (TDW), simulates high-fidelity audio and visual environments, both indoor and outdoor, and allows users, objects, and mobile agents to interact like they would in real life and according to the laws of physics. Object orientations, physical characteristics, and velocities are calculated and executed for fluids, soft bodies, and rigid objects as interactions occur, producing accurate collisions and impact sounds.

TDW is unique in that it is designed to be flexible and generalizable, generating synthetic photo-realistic scenes and audio rendering in real time, which can be compiled into audio-visual datasets, modified through interactions within the scene, and adapted for human and neural network learning and prediction tests. Different types of robotic agents and avatars can also be spawned within the controlled simulation to perform, say, task planning and execution. And using virtual reality (VR), human attention and play behavior within the space can provide real-world data, for example.

“We are trying to build a general-purpose simulation platform that mimics the interactive richness of the real world for a variety of AI applications,” says study lead author Chuang Gan, MIT-IBM Watson AI Lab research scientist.

Creating realistic virtual worlds with which to investigate human behaviors and train robots has been a dream of AI and cognitive science researchers. “Most of AI right now is based on supervised learning, which relies on huge datasets of human-annotated images or sounds,” says Josh McDermott, associate professor in the Department of Brain and Cognitive Sciences (BCS) and an MIT-IBM Watson AI Lab project lead. These descriptions are expensive to compile, creating a bottleneck for research. And for physical properties of objects, like mass, which isn’t always readily apparent to human observers, labels may not be available at all. A simulator like TDW skirts this problem by generating scenes where all the parameters and annotations are known. Many competing simulations were motivated by this concern but were designed for specific applications; through its flexibility, TDW is intended to enable many applications that are poorly suited to other platforms.

Another advantage of TDW, McDermott notes, is that it provides a controlled setting for understanding the learning process and facilitating the improvement of AI robots. Robotic systems, which rely on trial and error, can be taught in an environment where they cannot cause physical harm. In addition, “many of us are excited about the doors that these sorts of virtual worlds open for doing experiments on humans to understand human perception and cognition. There’s the possibility of creating these very rich sensory scenarios, where you still have total control and complete knowledge of what is happening in the environment.”

McDermott, Gan, and their colleagues are presenting this research at the conference on Neural Information Processing Systems (NeurIPS) in December.

Behind the framework

The work began as a collaboration between a group of MIT professors along with Stanford and IBM researchers, tethered by individual research interests into hearing, vision, cognition, and perceptual intelligence. TDW brought these together in one platform. “We were all interested in the idea of building a virtual world for the purpose of training AI systems that we could actually use as models of the brain,” says McDermott, who studies human and machine hearing. “So, we thought that this sort of environment, where you can have objects that will interact with each other and then render realistic sensory data from them, would be a valuable way to start to study that.”

To achieve this, the researchers built TDW on a video game platform called Unity3D Engine and committed to incorporating both visual and auditory data rendering without any animation. The simulation consists of two components: the build, which renders images, synthesizes audio, and runs physics simulations; and the controller, which is a Python-based interface where the user sends commands to the build.

Researchers construct and populate a scene by pulling from an extensive 3D model library of objects, like furniture pieces, animals, and vehicles. These models respond accurately to lighting changes, and their material composition and orientation in the scene dictate their physical behaviors in the space. Dynamic lighting models accurately simulate scene illumination, causing shadows and dimming that correspond to the appropriate time of day and sun angle. The team has also created furnished virtual floor plans that researchers can fill with agents and avatars. To synthesize true-to-life audio, TDW uses generative models of impact sounds that are triggered by collisions or other object interactions within the simulation. TDW also simulates noise attenuation and reverberation in accordance with the geometry of the space and the objects in it.

Two physics engines in TDW power deformations and reactions between interacting objects — one for rigid bodies, and another for soft objects and fluids. TDW performs instantaneous calculations regarding mass, volume, and density, as well as any friction or other forces acting upon the materials. This allows machine learning models to learn about how objects with different physical properties would behave together.

Users, agents, and avatars can bring the scenes to life in several ways. A researcher could directly apply a force to an object through controller commands, which could literally set a virtual ball in motion. Avatars can be empowered to act or behave in a certain way within the space — e.g., with articulated limbs capable of performing task experiments. Lastly, VR head and handsets can allow users to interact with the virtual environment, potentially to generate human behavioral data that machine learning models could learn from.
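A minimal controller script, sketched below with illustrative command names rather than the verbatim TDW API, shows how those pieces fit together: the Python controller streams commands to the build, which renders the scene, runs the physics, and, in this case, nudges an object into motion.

    # Schematic controller script: the Python controller sends JSON-like
    # commands, and the build renders, simulates physics, and synthesizes audio.
    # Command names and parameters here are illustrative; consult the TDW
    # documentation for the actual command set and helper functions.
    from tdw.controller import Controller

    c = Controller()                        # launches or connects to the build
    object_id = Controller.get_unique_id()

    c.communicate([
        {"$type": "create_empty_environment"},
        {"$type": "add_object", "name": "chair", "id": object_id,
         "position": {"x": 0, "y": 0, "z": 1}},
        {"$type": "apply_force_to_object", "id": object_id,
         "force": {"x": 2, "y": 0, "z": 0}},   # push it and let physics take over
    ])
    c.communicate({"$type": "terminate"})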

Richer AI experiences

To trial and demonstrate TDW’s unique features, capabilities, and applications, the team ran a battery of tests comparing datasets generated by TDW and other virtual simulations. The team found that neural networks trained on TDW scene snapshots, captured from randomly placed camera angles, outperformed networks trained on snapshots from other simulations in image classification tests and approached the performance of systems trained on real-world images. The researchers also generated and trained a material classification model on audio clips of small objects dropping onto surfaces in TDW and asked it to identify the types of materials that were interacting. They found that TDW produced significant gains over its competitor. Additional object-drop testing with neural networks trained on TDW revealed that the combination of audio and vision together is the best way to identify the physical properties of objects, motivating further study of audio-visual integration.

TDW is proving particularly useful for designing and testing systems that understand how the physical events in a scene will evolve over time. This includes facilitating benchmarks of how well a model or algorithm makes physical predictions of, for instance, the stability of stacks of objects, or the motion of objects following a collision — humans learn many of these concepts as children, but many machines need to demonstrate this capacity to be useful in the real world. TDW has also enabled comparisons of human curiosity and prediction against those of machine agents designed to evaluate social interactions within different scenarios.

Gan points out that these applications are only the tip of the iceberg. By expanding the physical simulation capabilities of TDW to depict the real world more accurately, “we are trying to create new benchmarks to advance AI technologies, and to use these benchmarks to open up many new problems that until now have been difficult to study.”

The research team on the paper also includes MIT engineers Jeremy Schwartz and Seth Alter, who are instrumental to the operation of TDW; BCS professors James DiCarlo and Joshua Tenenbaum; graduate students Aidan Curtis and Martin Schrimpf; and former postdocs James Traer (now an assistant professor at the University of Iowa) and Jonas Kubilius PhD ‘08. Their colleagues are IBM director of the MIT-IBM Watson AI Lab David Cox; research software engineer Abhishek Bhandwaldar; and research staff member Dan Gutfreund of IBM. Additional researchers co-authoring are Harvard University assistant professor Julian De Freitas; and from Stanford University, assistant professors Daniel L.K. Yamins (a TDW founder) and Nick Haber, postdoc Daniel M. Bear, and graduate students Megumi Sano, Kuno Kim, Elias Wang, Damian Mrowca, Kevin Feigelis, and Michael Lingelbach.

This research was supported by the MIT-IBM Watson AI Lab.

Read More

Taking some of the guesswork out of drug discovery

In their quest to discover effective new medicines, scientists search for drug-like molecules that can attach to disease-causing proteins and change their functionality. It is crucial that they know the 3D shape of a molecule to understand how it will attach to specific surfaces of the protein.

But a single molecule can fold in thousands of different ways, so solving that puzzle experimentally is a time-consuming and expensive process akin to searching for a needle in a molecular haystack.

MIT researchers are using machine learning to streamline this complex task. They have created a deep learning model that predicts the 3D shapes of a molecule based solely on a 2D graph of its molecular structure, the compact form in which molecules are typically represented.

Their system, GeoMol, processes molecules in only seconds and performs better than other machine learning models, including some commercial methods. GeoMol could help pharmaceutical companies accelerate the drug discovery process by narrowing down the number of molecules they need to test in lab experiments, says Octavian-Eugen Ganea, a postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and co-lead author of the paper.

“When you are thinking about how these structures move in 3D space, there are really only certain parts of the molecule that are actually flexible, these rotatable bonds. One of the key innovations of our work is that we think about modeling the conformational flexibility like a chemical engineer would. It is really about trying to predict the potential distribution of rotatable bonds in the structure,” says Lagnajit Pattanaik, a graduate student in the Department of Chemical Engineering and co-lead author of the paper.

Other authors include Connor W. Coley, the Henri Slezynger Career Development Assistant Professor of Chemical Engineering; Regina Barzilay, the School of Engineering Distinguished Professor for AI and Health in CSAIL; Klavs F. Jensen, the Warren K. Lewis Professor of Chemical Engineering; William H. Green, the Hoyt C. Hottel Professor in Chemical Engineering; and senior author Tommi S. Jaakkola, the Thomas Siebel Professor of Electrical Engineering in CSAIL and a member of the Institute for Data, Systems, and Society. The research will be presented this week at the Conference on Neural Information Processing Systems.

Mapping a molecule

In a molecular graph, a molecule’s individual atoms are represented as nodes and the chemical bonds that connect them are edges. 
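As a concrete example of that representation, and not code from the GeoMol paper, the widely used RDKit library can turn a SMILES string into exactly this kind of graph:

    from rdkit import Chem

    # Caffeine as a 2D molecular graph: atoms become nodes, bonds become edges.
    mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

    nodes = [(atom.GetIdx(), atom.GetSymbol()) for atom in mol.GetAtoms()]
    edges = [(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(), str(bond.GetBondType()))
             for bond in mol.GetBonds()]

    print(f"{len(nodes)} heavy-atom nodes, {len(edges)} bond edges")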

GeoMol leverages a recent tool in deep learning called a message passing neural network, which is specifically designed to operate on graphs. The researchers adapted a message passing neural network to predict specific elements of molecular geometry.
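A single round of message passing, sketched generically below in PyTorch (this is not GeoMol's architecture), gives a sense of how information flows along the bonds of the graph: each atom gathers messages from its bonded neighbors and updates its own feature vector, and stacking several such rounds lets information spread across the molecule.

    import torch
    import torch.nn as nn

    class MessagePassingLayer(nn.Module):
        # One generic round of neural message passing on a molecular graph.
        def __init__(self, dim: int):
            super().__init__()
            self.message = nn.Linear(2 * dim, dim)   # builds a message from (sender, receiver) features
            self.update = nn.GRUCell(dim, dim)       # folds aggregated messages into the node state

        def forward(self, node_feats, edge_index):
            src, dst = edge_index                    # (2, num_edges) directed bond list
            msgs = self.message(torch.cat([node_feats[src], node_feats[dst]], dim=-1))
            agg = torch.zeros_like(node_feats).index_add_(0, dst, msgs)   # sum messages per receiving atom
            return self.update(agg, node_feats)      # updated node features, same shape as the input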

Given a molecular graph, GeoMol initially predicts the lengths of the chemical bonds between atoms and the angles of those individual bonds. The way the atoms are arranged and connected determines which bonds can rotate.

GeoMol then predicts the structure of each atom’s local neighborhood individually and assembles neighboring pairs of rotatable bonds by computing the torsion angles and then aligning them. A torsion angle measures the twist about the middle of three connected segments; here, those segments are three chemical bonds that connect four atoms.
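For readers unfamiliar with the term, a torsion (or dihedral) angle can be computed directly from the 3D positions of the four atoms involved. The short NumPy function below is a generic illustration rather than GeoMol code; it follows the usual construction from the two planes on either side of the middle bond.

    import numpy as np

    def torsion_angle(p0, p1, p2, p3):
        # Signed dihedral angle, in degrees, about the middle bond p1-p2:
        # the angle between the plane through (p0, p1, p2) and the plane
        # through (p1, p2, p3).
        b0, b1, b2 = p1 - p0, p2 - p1, p3 - p2
        n0 = np.cross(b0, b1)                        # normal of the first plane
        n1 = np.cross(b1, b2)                        # normal of the second plane
        m = np.cross(n0, b1 / np.linalg.norm(b1))    # frame vector for a signed angle
        return np.degrees(np.arctan2(np.dot(m, n1), np.dot(n0, n1)))

    # Butane's four carbons in the anti conformation give a torsion near 180 degrees.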

“Here, the rotatable bonds can take a huge range of possible values. So, the use of these message passing neural networks allows us to capture a lot of the local and global environments that influence that prediction. The rotatable bond can take multiple values, and we want our prediction to be able to reflect that underlying distribution,” Pattanaik says.

Overcoming existing hurdles

One major challenge to predicting the 3D structure of molecules is to model chirality. A chiral molecule can’t be superimposed on its mirror image, like a pair of hands (no matter how you rotate your hands, there is no way their features exactly line up). If a molecule is chiral, its mirror image won’t interact with the environment in the same way.

This could cause medicines to interact with proteins incorrectly, which could result in dangerous side effects. Current machine learning methods often involve a long, complex optimization process to ensure chirality is correctly identified, Ganea says.

Because GeoMol determines the 3D structure of each bond individually, it explicitly defines chirality during the prediction process, eliminating the need for after-the-fact optimization.
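Chirality itself is easy to see in code. In the illustrative RDKit snippet below, which is unrelated to GeoMol's implementation, two mirror-image forms of alanine share the same 2D graph but carry opposite stereo labels:

    from rdkit import Chem

    # Two mirror-image forms of alanine: the same molecular graph, but the
    # '@' vs. '@@' markers encode opposite arrangements around the chiral carbon.
    for label, smiles in [("enantiomer A", "C[C@H](N)C(=O)O"),
                          ("enantiomer B", "C[C@@H](N)C(=O)O")]:
        mol = Chem.MolFromSmiles(smiles)
        centers = Chem.FindMolChiralCenters(mol, includeUnassigned=True)
        print(label, centers)   # one chiral center each, with opposite R/S labels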

After performing these predictions, GeoMol outputs a set of likely 3D structures for the molecule.

“What we can do now is take our model and connect it end-to-end with a model that predicts this attachment to specific protein surfaces. Our model is not a separate pipeline. It is very easy to integrate with other deep learning models,” Ganea says.

A “super-fast” model

The researchers tested their model using a dataset of molecules and the likely 3D shapes they could take, which was developed by Rafael Gomez-Bombarelli, the Jeffrey Cheah Career Development Chair in Engineering, and graduate student Simon Axelrod.

They evaluated how many of these likely 3D structures their model was able to capture, compared with other machine learning models and methods.

In nearly all instances, GeoMol outperformed the other models on all tested metrics.

“We found that our model is super-fast, which was really exciting to see. And importantly, as you add more rotatable bonds, you expect these algorithms to slow down significantly. But we didn’t really see that. The speed scales nicely with the number of rotatable bonds, which is promising for using these types of models down the line, especially for applications where you are trying to quickly predict the 3D structures inside these proteins,” Pattanaik says.

In the future, the researchers hope to apply GeoMol to the area of high-throughput virtual screening, using the model to determine small molecule structures that would interact with a specific protein. They also want to keep refining GeoMol with additional training data so it can more effectively predict the structure of long molecules with many flexible bonds.

“Conformational analysis is a key component of numerous tasks in computer-aided drug design, and an important component in advancing machine learning approaches in drug discovery,” says Pat Walters, senior vice president of computation at Relay Therapeutics, who was not involved in this research. “I’m excited by continuing advances in the field and thank MIT for contributing to broader learnings in this area.”

This research was funded by the Machine Learning for Pharmaceutical Discovery and Synthesis consortium.

Read More