As more artificial intelligence applications move to smartphones, deep learning models are getting smaller to allow apps to run faster and save battery power. Now, MIT researchers have a new and better way to compress models.
It’s so simple that they unveiled it in a tweet last month: Train the model, prune its weakest connections, retrain the model at its faster, early training rate, and repeat until the model is as tiny as you want.
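For readers who want to see the recipe in code, here is a minimal, hedged sketch in PyTorch; the `train` function, the learning-rate schedule labels, the pruning fraction, and the number of rounds are illustrative placeholders rather than the authors’ actual settings:

```python
import torch
import torch.nn.utils.prune as prune

def prune_and_rewind(model, train, sparsity_step=0.2, rounds=5, rewind_epoch=10):
    """Iteratively prune the smallest-magnitude weights, then retrain the
    surviving weights at the faster, early-training learning rate schedule.
    Illustrative sketch only; `train` is a hypothetical training loop."""
    train(model, lr_schedule="full")                 # initial training pass
    for _ in range(rounds):
        # Remove the weakest 20% of remaining weights in each conv/linear layer.
        for module in model.modules():
            if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
                prune.l1_unstructured(module, name="weight", amount=sparsity_step)
        # "Rewind" the learning rate: retrain with the schedule restarted from
        # an early epoch, keeping the pruned model's current weights.
        train(model, lr_schedule=f"restart_from_epoch_{rewind_epoch}")
    return model
```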
“That’s it,” says Alex Renda, a PhD student at MIT. “The standard things people do to prune their models are crazy complicated.”
The search for a better compression technique grew out of Jonathan Frankle and Michael Carbin’s award-winning Lottery Ticket Hypothesis paper at ICLR last year. They showed that a deep neural network could perform with only one-tenth the number of connections if the right subnetwork was found early in training. Their revelation came as demand for the computing power and energy needed to train ever larger deep learning models was increasing exponentially, a trend that continues to this day. Costs of that growth include a rise in planet-warming carbon emissions and a potential drop in innovation as researchers not affiliated with big tech companies compete for scarce computing resources. Everyday users are affected, too: big AI models eat up mobile-phone bandwidth and battery power.
At a colleague’s suggestion, Frankle decided to see what lessons the hypothesis might hold for pruning, a set of techniques for reducing the size of a neural network by removing unnecessary connections or neurons. Pruning algorithms had been around for decades, but the field saw a resurgence after the breakout success of neural networks at classifying images in the ImageNet competition. As models got bigger, with researchers adding layers of artificial neurons to boost performance, others proposed techniques for whittling them down.
Song Han, now an assistant professor at MIT, was one pioneer. Building on a series of influential papers, Han unveiled a pruning algorithm he called AMC, or AutoML for model compression, that’s still the industry standard. Under Han’s technique, redundant neurons and connections are automatically removed, and the model is retrained to restore its initial accuracy.
In response to Han’s work, Frankle recently suggested in an unpublished paper that results could be further improved by rewinding the smaller, pruned model to its initial parameters, or weights, and retraining the smaller model at its faster, initial rate.
In the current ICLR study, the researchers realized that the model could simply be rewound to its early training rate without fiddling with any parameters. In any pruning regimen, the tinier a model gets, the less accurate it becomes. But when the researchers compared this new method to Han’s AMC or Frankle’s weight-rewinding methods, it performed better no matter how much the model shrank.
It’s unclear why the pruning technique works as well as it does. The researchers say they will leave that question for others to answer. As for those who wish to try it, the algorithm is as easy to implement as other pruning methods, without time-consuming tuning, the researchers say.
“It’s the pruning algorithm from the ‘Book,’” says Frankle. “It’s clear, generic, and drop-dead simple.”
Han, for his part, has now partly shifted focus from compressing AI models to channeling AI to design small, efficient models from the start. His newest method, Once for All, also debuts at ICLR. Of the new learning rate method, he says: “I’m happy to see new pruning and retraining techniques evolve, giving more people access to high-performing AI applications.”
Support for the study came from the Defense Advanced Research Projects Agency, Google, MIT-IBM Watson AI Lab, MIT Quest for Intelligence, and the U.S. Office of Naval Research.
Provided with genre, artist, and lyrics as input, Jukebox outputs a new music sample produced from scratch. Below, we show some of our favorite samples.
To hear all uncurated samples, check out our sample explorer.
Automatic music generation dates back more than half a century. A prominent approach is to generate music symbolically in the form of a piano roll, which specifies the timing, pitch, velocity, and instrument of each note to be played. This has led to impressive results like producing Bach chorales, polyphonic music with multiple instruments, as well as minute-long musical pieces.
But symbolic generators have limitations—they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music. A different approach[1] is to model music directly as raw audio. Generating music at the audio level is challenging since the sequences are very long. A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game. Thus, to learn the high level semantics of music, a model would have to deal with extremely long-range dependencies.
One way of addressing the long input problem is to use an autoencoder that compresses raw audio to a lower-dimensional space by discarding some of the perceptually irrelevant bits of information. We can then train a model to generate audio in this compressed space, and upsample back to the raw audio space.
We chose to work on music because we want to continue to push the boundaries of generative models. Our previous work on MuseNet explored synthesizing music based on large amounts of MIDI data. Now in raw audio, our models must learn to tackle high diversity as well as very long range structure, and the raw audio domain is particularly unforgiving of errors in short, medium, or long term timing.
Approach
Compressing music to discrete codes
Jukebox’s autoencoder model compresses audio to a discrete space, using a quantization-based approach called VQ-VAE. Hierarchical VQ-VAEs can generate short instrumental pieces from a few sets of instruments; however, they suffer from hierarchy collapse due to the use of successive encoders coupled with autoregressive decoders. A simplified variant called VQ-VAE-2 avoids these issues by using feedforward encoders and decoders only, and shows impressive results at generating high-fidelity images.
We draw inspiration from VQ-VAE-2 and apply their approach to music. We modify their architecture as follows:
To alleviate codebook collapse common to VQ-VAE models, we use random restarts where we randomly reset a codebook vector to one of the encoded hidden states whenever its usage falls below a threshold.
To maximize the use of the upper levels, we use separate decoders and independently reconstruct the input from the codes of each level.
To allow the model to reconstruct higher frequencies easily, we add a spectral loss that penalizes the norm of the difference of input and reconstructed spectrograms.
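As a rough illustration of the spectral loss mentioned above, the sketch below penalizes the difference between magnitude spectrograms of the input and its reconstruction. The FFT size, hop length, and single-resolution STFT are assumptions made for brevity, not Jukebox’s actual configuration:

```python
import torch

def spectral_loss(x, x_hat, n_fft=1024, hop_length=256):
    """Norm of the difference between input and reconstruction spectrograms.
    Minimal single-resolution sketch; the real loss may use several resolutions."""
    window = torch.hann_window(n_fft, device=x.device)
    def mag_spec(audio):
        return torch.stft(audio, n_fft, hop_length=hop_length,
                          window=window, return_complex=True).abs()
    return torch.linalg.vector_norm(mag_spec(x) - mag_spec(x_hat))
```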
We use three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level. This downsampling loses much of the audio detail, and sounds noticeably noisy as we go further down the levels. However, it retains essential information about the pitch, timbre, and volume of the audio.
Each VQ-VAE level independently encodes the input. The bottom level encoding produces the highest quality reconstruction, while the top level encoding retains only the essential musical information.
To generate novel songs, a cascade of transformers generates codes from top to bottom level, after which the bottom-level decoder can convert them to raw audio.
Generating codes using transformers
Next, we train the prior models whose goal is to learn the distribution of music codes encoded by the VQ-VAE and to generate music in this compressed discrete space. Like the VQ-VAE, we have three levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on the codes of the level above.
The top-level prior models the long-range structure of music, and samples decoded from this level have lower audio quality but capture high-level semantics like singing and melodies. The middle and bottom upsampling priors add local musical structures like timbre, significantly improving the audio quality.
We train these as autoregressive models using a simplified variant of Sparse Transformers. Each of these models has 72 layers of factorized self-attention on a context of 8192 codes, which corresponds to approximately 24 seconds, 6 seconds, and 1.5 seconds of raw audio at the top, middle and bottom levels, respectively.
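Those context lengths follow directly from the compression factors: 8,192 codes at 8x, 32x, and 128x downsampling of 44.1 kHz audio cover roughly 1.5, 6, and 24 seconds, respectively. A quick back-of-the-envelope check:

```python
SAMPLE_RATE = 44_100   # Hz
CONTEXT = 8_192        # codes per prior's attention window
for level, hop in [("bottom", 8), ("middle", 32), ("top", 128)]:
    print(f"{level:>6} level: {CONTEXT * hop / SAMPLE_RATE:5.1f} s of audio per context")
# bottom ≈ 1.5 s, middle ≈ 5.9 s, top ≈ 23.8 s
```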
Once all of the priors are trained, we can generate codes from the top level, upsample them using the upsamplers, and decode them back to the raw audio space using the VQ-VAE decoder to sample novel songs.
Dataset
To train this model, we crawled the web to curate a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki. The metadata includes artist, album genre, and year of the songs, along with common moods or playlist keywords associated with each song. We train on 32-bit, 44.1 kHz raw audio, and perform data augmentation by randomly downmixing the right and left channels to produce mono audio.
Artist and genre conditioning
The top-level transformer is trained on the task of predicting compressed audio tokens. We can provide additional information, such as the artist and genre for each song. This has two advantages: first, it reduces the entropy of the audio prediction, so the model is able to achieve better quality in any particular style; second, at generation time, we are able to steer the model to generate in a style of our choosing.
The t-SNE plot below shows how the model learns, in an unsupervised way, to cluster similar artists and genres close together, and also makes some surprising associations, like Jennifer Lopez being so close to Dolly Parton!
Lyrics conditioning
In addition to conditioning on artist and genre, we can provide more context at training time by conditioning the model on the lyrics for a song. A significant challenge is the lack of a well-aligned dataset: we only have lyrics at a song level without alignment to the music, and thus for a given chunk of audio we don’t know precisely which portion of the lyrics (if any) appear. We also may have song versions that don’t match the lyric versions, as might occur if a given song is performed by several different artists in slightly different ways. Additionally, singers frequently repeat phrases, or otherwise vary the lyrics, in ways that are not always captured in the written lyrics.
To match audio portions to their corresponding lyrics, we begin with a simple heuristic that aligns the characters of the lyrics to linearly span the duration of each song, and pass a fixed-size window of characters centered around the current segment during training. While this simple strategy of linear alignment worked surprisingly well, we found that it fails for certain genres with fast lyrics, such as hip hop. To address this, we use Spleeter to extract vocals from each song and run NUS AutoLyricsAlign on the extracted vocals to obtain precise word-level alignments of the lyrics. We chose a large enough window so that the actual lyrics have a high probability of being inside the window.
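To make the linear-alignment heuristic concrete, here is a small sketch that maps an audio chunk to a fixed-size character window, assuming the lyrics are spread uniformly over the song; the function name and window size are illustrative, not taken from the paper:

```python
def lyric_window(lyrics: str, chunk_start: float, chunk_end: float,
                 song_duration: float, window_chars: int = 512) -> str:
    """Slice of the lyrics centered on the audio chunk's midpoint, assuming the
    characters span the song's duration linearly. Hypothetical helper."""
    midpoint = 0.5 * (chunk_start + chunk_end) / song_duration   # fraction of the song
    center = int(midpoint * len(lyrics))
    start = max(0, min(center - window_chars // 2, len(lyrics) - window_chars))
    return lyrics[start:start + window_chars]
```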
To attend to the lyrics, we add an encoder to produce a representation for the lyrics, and add attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder. After training, the model learns a more precise alignment.
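Mechanically, this is standard encoder–decoder cross-attention. A minimal PyTorch sketch with hypothetical dimensions, in which the music decoder’s activations supply the queries and the lyrics encoder’s outputs supply the keys and values:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

music_states = torch.randn(1, 8192, d_model)   # decoder activations -> queries
lyric_states = torch.randn(1, 512, d_model)    # lyrics encoder outputs -> keys/values

attended, attn_weights = cross_attn(query=music_states,
                                    key=lyric_states,
                                    value=lyric_states)
# attn_weights (1 x 8192 x 512) plays the role of the lyric-music alignment
# map visualized below; the dimensions here are placeholders.
```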
Lyric–music alignment learned by an encoder–decoder attention layer. Attention progresses from one lyric token to the next as the music progresses, with a few moments of uncertainty.
Limitations
While Jukebox represents a step forward in musical quality, coherence, length of audio sample, and ability to condition on artist, genre, and lyrics, there is a significant gap between these generations and human-created music.
For example, while the generated songs show local musical coherence, follow traditional chord patterns, and can even feature impressive solos, we do not hear familiar larger musical structures such as choruses that repeat. Our downsampling and upsampling process introduces discernible noise. Improving the VQ-VAE so its codes capture more musical information would help reduce this. Our models are also slow to sample from, because of the autoregressive nature of sampling. It takes approximately 9 hours to fully render one minute of audio through our models, and thus they cannot yet be used in interactive applications. Using techniques that distill the model into a parallel sampler could significantly speed up sampling. Finally, we currently train on English lyrics and mostly Western music, but in the future we hope to include songs in other languages and from other parts of the world.
Future directions
Our audio team is continuing to work on generating audio samples conditioned on different kinds of priming information. In particular, we’ve seen early success conditioning on MIDI files and stem files. Here’s an example of a raw audio sample conditioned on MIDI tokens. We hope this will improve the musicality of samples (in the way conditioning on lyrics improved the singing), and this would also be a way of giving musicians more control over the generations. We expect human and model collaborations to be an increasingly exciting creative space. If you’re excited to work on these problems with us, we’re hiring.
As generative modeling across various domains continues to advance, we are also conducting research into issues like bias and intellectual property rights, and are engaging with people who work in the domains where we develop tools. To better understand future implications for the music community, we shared Jukebox with an initial set of 10 musicians from various genres to discuss their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations. We are connecting with the wider creative community as we think generative work across text, images, and audio will continue to improve. If you’re interested in being a creative collaborator to help us build useful tools or new works of art in these domains, please let us know!
To connect with the corresponding authors, please email jukebox@openai.com.
Timeline
Our first raw audio model, which learns to recreate instruments like piano and violin. We try a dataset of rock and pop songs, and surprisingly it works.
We collect a larger and more diverse dataset of songs, with labels for genres and artists. Model picks up artist and genre styles more consistently with diversity, and at convergence can also produce full-length songs with long-range coherence.
We scale our VQ-VAE from 22 to 44kHz to achieve higher quality audio. We also scale top-level prior from 1B to 5B to capture the increased information. We see better musical quality, clear singing, and long-range coherence. We also make novel completions of real songs.
We start training models conditioned on lyrics to incorporate further conditioning information. We only have unaligned lyrics, so model has to learn alignment and pronunciation, as well as singing.
Posted by Eric Johnson, TFRT Product Manager and Mingsheng Hong, TFRT Tech Lead/Manager
TensorFlow aims to make it easy for you to build and deploy ML models across many different devices. Yet what it means to “build and deploy ML models” is not static and continues to change with increased investment in the ML ecosystem. In the top half of the TensorFlow stack, innovation is leading to more complex models and deployment scenarios. Researchers are inventing new algorithms that require more compute, while application developers are enhancing their products with these new techniques across edge and server.
In the bottom half of the stack, the tension between increasing compute needs and rising compute costs due to the end of Moore’s law has sparked a proliferation of new hardware aimed at specific ML use cases. Traditional chip makers, startups, and software companies alike (including Google) have invested in specialized silicon.
The result is that the needs of the ML ecosystem are vastly different from what they were 4 or 5 years ago, when TensorFlow was first created. Of course, we’ve continued to iterate with the release of 2.x, but the current TensorFlow stack is optimized for graph execution and incurs non-trivial overhead when dispatching a single op. A high-performance, low-level runtime is key to enabling the trends of today and empowering the innovations of tomorrow.
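As a rough, illustrative way to see that per-op dispatch overhead (this is an ordinary TensorFlow microbenchmark, not a TFRT measurement), one can time a tiny op dispatched eagerly versus inside a traced tf.function; absolute numbers depend entirely on hardware and TensorFlow version:

```python
import time
import tensorflow as tf

x = tf.random.uniform([2, 2])

@tf.function  # graph execution: dispatch overhead is amortized after tracing
def matmul_graph(a):
    return tf.matmul(a, a)

def per_call_seconds(fn, iters=10_000):
    fn(x)  # warm-up (and trace, for the tf.function)
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) / iters

print("eager per call:      ", per_call_seconds(lambda a: tf.matmul(a, a)))
print("tf.function per call:", per_call_seconds(matmul_graph))
```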
Enter TFRT, a new TensorFlow RunTime. It aims to provide a unified, extensible infrastructure layer with best-in-class performance across a wide variety of domain specific hardware. It provides efficient use of multithreaded host CPUs, supports fully asynchronous programming models, and focuses on low-level efficiency.
TFRT will benefit a broad range of users, including:
Researchers looking for faster iteration time and better error reporting when developing complex new models in eager mode.
Application developers looking for improved performance when training and serving models in production.
Hardware makers looking to integrate edge and datacenter devices into TensorFlow in a modular way.
What is TFRT?
TFRT is a new runtime that will replace the existing TensorFlow runtime. It is responsible for efficient execution of kernels – low-level device-specific primitives – on targeted hardware. It plays a critical part in both eager and graph execution, which is illustrated by this simplified diagram of the TensorFlow training stack:
TFRT’s role in graph and eager execution within the TensorFlow training stack
Note that everything in grey is part of TFRT. In eager execution, TensorFlow APIs call directly into the new runtime. In graph execution, your program’s computational graph is lowered to an optimized target-specific program and dispatched to TFRT. In both execution paths, the new runtime invokes a set of kernels that call into the underlying hardware devices to complete the model execution, as shown by the black arrows.
Key design points
Whereas the existing TensorFlow runtime was initially built for graph execution and training workloads, the new runtime will make eager execution and inference first-class citizens, while putting special emphasis on architecture extensibility and modularity. More specifically, TFRT has the following selected design highlights:
To achieve higher performance, TFRT has a lock-free graph executor that supports concurrent op execution with low synchronization overhead, and a thin eager op dispatch stack so that eager API calls will be asynchronous and more efficient.
To make extending the TF stack easier, we decoupled device runtimes from the host runtime, the core TFRT component that drives host CPU and I/O work.
To get consistent behavior, TFRT leverages common abstractions, such as shape functions and kernels, across both eager and graph.
The power of MLIR
TFRT is also tightly-integrated with MLIR. For example:
TFRT utilizes MLIR’s compiler infrastructure to generate an optimized, target-specific representation of your computational graph that the runtime executes.
TFRT uses MLIR’s extensible type system to support arbitrary C++ types in the runtime, which removes tensor-specific limitations.
Together, TFRT and MLIR will improve TensorFlow’s unification, flexibility, and extensibility.
Initial Results
Early performance results from the inference and serving use case are encouraging. As part of a benchmarking study for TensorFlow Dev Summit 2020, we integrated TFRT with TensorFlow Serving and measured the latency of sending requests to the model and getting prediction results back. We picked a common MLPerf model, ResNet-50, and chose a batch size of 1 and a data precision of FP16 to focus our study on runtime-related op dispatch overhead. Comparing the performance of GPU inference over TFRT with the current runtime, we saw an improvement of 28 percent in average inference time. These early results are strong validation for TFRT, and we expect it to provide a big boost to performance. We hope you are as excited as we are!
What’s next
TFRT is being integrated with TensorFlow, and will be enabled initially through an opt-in flag, giving the team time to fix any bugs and fine-tune performance. Eventually, it will become TensorFlow’s default runtime. Although it is still an early stage project, we have made the GitHub repository available to the community. We are limiting contributions to begin with, but encourage participation in the form of requirements and design discussions.
To learn more, please check out our Dev Summit 2020 presentation, where we first introduced TFRT to the world, and our MLIR Open Design Deep Dive presentation, where we provided a detailed overview of TFRT’s core components, low-level abstractions, and general design principles. And finally, if you want to keep up with all things TFRT, please join our new mailing list. Thanks!
Driven by an innate curiosity, children pick up new skills as they explore the world and learn from their experience. Computers, by contrast, often get stuck when thrown into new environments.
To get around this, engineers have tried encoding simple forms of curiosity into their algorithms with the hope that an agent pushed to explore will learn about its environment more effectively. An agent with a child’s curiosity might go from learning to pick up, manipulate, and throw objects to understanding the pull of gravity, a realization that could dramatically accelerate its ability to learn many other things.
Engineers have discovered many ways of encoding curious exploration into machine learning algorithms. A research team at MIT wondered if a computer could do better, based on a long history of enlisting computers in the search for new algorithms.
In recent years, the design of deep neural networks, algorithms that search for solutions by adjusting numeric parameters, has been automated with software like Google’s AutoML and auto-sklearn in Python. That has made it easier for non-experts to develop AI applications. But while deep nets excel at specific tasks, they have trouble generalizing to new situations. Algorithms expressed as code in a high-level programming language, by contrast, have the capacity to transfer knowledge across different tasks and environments.
“Algorithms designed by humans are very general,” says study co-author Ferran Alet, a graduate student in MIT’s Department of Electrical Engineering and Computer Science and Computer Science and Artificial Intelligence Laboratory (CSAIL). “We were inspired to use AI to find algorithms with curiosity strategies that can adapt to a range of environments.”
The researchers created a “meta-learning” algorithm that generated 52,000 exploration algorithms. They found that the top two were entirely new — seemingly too obvious or counterintuitive for a human to have proposed. Both algorithms generated exploration behavior that substantially improved learning in a range of simulated tasks, from navigating a two-dimensional grid based on images to making a robotic ant walk. Because the meta-learning process generates high-level computer code as output, both algorithms can be dissected to peer inside their decision-making processes.
The paper received praise from researchers not involved in the work. “The use of program search to discover a better intrinsic reward is very creative,” says Quoc Le, a principal scientist at Google who has helped pioneer computer-aided design of deep learning models. “I like this idea a lot, especially since the programs are interpretable.”
The researchers compare their automated algorithm design process to writing sentences with a limited number of words. They started by choosing a set of basic building blocks to define their exploration algorithms. After studying other curiosity algorithms for inspiration, they picked nearly three dozen high-level operations, including basic programs and deep learning models, to guide the agent to do things like remember previous inputs, compare current and past inputs, and use learning methods to change its own modules. The computer then combined up to seven operations at a time to create computation graphs describing 52,000 algorithms.
Even with a fast computer, testing them all would have taken decades. So, instead, the researchers limited their search by first ruling out algorithms predicted to perform poorly, based on their code structure alone. Then, they tested their most promising candidates on a basic grid-navigation task requiring substantial exploration but minimal computation. If the candidate did well, its performance became the new benchmark, eliminating even more candidates.
Four machines searched over 10 hours to find the best algorithms. More than 99 percent were junk, but about a hundred were sensible, high-performing algorithms. Remarkably, the top 16 were both novel and useful, performing as well as, or better than, human-designed algorithms at a range of other virtual tasks, from landing a moon rover to raising a robotic arm and moving an ant-like robot in a physical simulation.
All 16 algorithms shared two basic exploration functions.
In the first, the agent is rewarded for visiting new places where it has a greater chance of making a new kind of move. In the second, the agent is also rewarded for visiting new places, but in a more nuanced way: one neural network learns to predict the future state while a second recalls the past, and then tries to predict the present by predicting the past from the future. If this prediction is erroneous, the agent rewards itself, as that is a sign it has discovered something it didn’t know before. The second algorithm was so counterintuitive that it took the researchers time to make sense of it.
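The first of those two functions is essentially a novelty bonus. A generic, count-based sketch of that idea (not the discovered algorithm itself, whose exact form isn’t given here) might look like this:

```python
from collections import defaultdict

class NoveltyBonus:
    """Give the agent extra reward for states it has visited less often.
    Generic count-based curiosity bonus, shown only for illustration."""
    def __init__(self, scale: float = 1.0):
        self.visit_counts = defaultdict(int)
        self.scale = scale

    def __call__(self, state) -> float:
        key = tuple(state)                       # assumes a discretizable state
        self.visit_counts[key] += 1
        return self.scale / self.visit_counts[key] ** 0.5

bonus = NoveltyBonus()
total_reward = 0.5 + bonus((3, 7))               # environment reward + curiosity bonus
```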
“Our biases often prevent us from trying very novel ideas,” says Alet. “But computers don’t care. They try, and see what works, and sometimes we get great unexpected results.”
More researchers are turning to machine learning to design better machine learning algorithms, a field known as AutoML. At Google, Le and his colleagues recently unveiled a new algorithm-discovery tool called AutoML-Zero. (Its name is a play on Google’s AutoML software for customizing deep net architectures for a given application, and DeepMind’s AlphaZero, the program that can learn to play different board games by playing millions of games against itself.)
Their method searches through a space of algorithms made up of simpler primitive operations. But rather than look for an exploration strategy, their goal is to discover algorithms for classifying images. Both studies show the potential for humans to use machine-learning methods themselves to create novel, high-performing machine-learning algorithms.
“The algorithms we generated could be read and interpreted by humans, but to actually understand the code we had to reason through each variable and operation and how they evolve with time,” says study co-author Martin Schneider, a graduate student at MIT. “It’s an interesting open challenge to design algorithms and workflows that leverage the computer’s ability to evaluate lots of algorithms and our human ability to explain and improve on those ideas.”
The research received support from the U.S. National Science Foundation, Air Force Office of Scientific Research, Office of Naval Research, Honda Research Institute, SUTD Temasek Laboratories, and MIT Quest for Intelligence.
With various physical distancing guidelines in place throughout the world as a means to curb the spread of Covid-19, the internet has experienced a dramatic spike in overall traffic. MIT Professor Tom Leighton is chief executive officer and co-founder of Akamai Technologies, a global content delivery network, cybersecurity, and cloud service company that provides web and internet security services. At MIT he specializes in applied mathematics in the Department of Mathematics and is a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The Department of Mathematics Communications spoke to Leighton about his company’s response to the world’s increased reliance on the internet during the Covid-19 pandemic.
Q: How is the pandemic changing the way people use the internet?
A: The internet has become our lifeline as we face the challenges of working remotely, distance learning, and sheltering in place. Everything has moved online: religious services, movie premieres, commerce of all kinds, and even gatherings of friends for a cup of coffee. We’ve already been doing many of these things online for years — the big difference now is that we are suddenly only doing them online.
When we’ve emerged from the pandemic, it seems quite possible that our usage of the internet for nearly every facet of our lives will have increased permanently. Many more people may be working remotely even when offices reopen; the shift to virtual meetings may become the norm even when we can travel again; a much greater share of commerce may be conducted online even when we can return to shopping malls; and our usage of social media and video streaming could well be greater than ever before, even when it’s OK to meet others in person.
Q: How much more use is the internet seeing as a result of the pandemic?
A: Akamai operates a globally distributed intelligent edge platform with more than 270,000 servers in 4,000 locations across 137 countries. From our vantage point, we can see that global internet traffic increased by about 30 percent during the past month. That’s about 10 times the normal rate of growth, which means we’ve seen an entire year’s worth of growth in internet traffic in just the past few weeks. And that’s without any live sports streaming, like the usual March Madness college basketball tournament in the United States.
Just a few weeks ago, we set a new peak record of traffic on the Akamai edge platform of 167 terabits per second. That’s more than double the peak we saw one year before. These are truly unprecedented times. The internet is being used at a scale that the world has never experienced.
Q: Can the internet keep up with the surge in traffic?
A: The answer is yes, but with many more caveats now.
Around the world, some regulators, major carriers, and content providers are taking steps to reduce load during peak traffic times in an effort to avert online gridlock. For example, European regulators have asked telecom providers and streaming platforms to switch to standard definition video during periods of peak demand. And Akamai is working with leading companies such as Microsoft and Sony to deliver software updates for e-gaming at off-peak traffic times. The typical software update uses as much traffic as about 30,000 web pages, so this makes a big difference when it comes to managing congestion.
In addition, Akamai’s intelligent edge network architecture is designed to mitigate and minimize network congestion. Because we’ve deployed our infrastructure deep into carrier networks, we can help those networks avoid overload by diverting traffic away from areas experiencing high levels of congestion.
Overall, we fully expect to maintain the integrity and reliability of website and mobile application delivery, as well as security services, for all of our customers during this time. In particular, Akamai customers across sectors such as government, health care, financial services, commerce, manufacturing, and business services should not experience any change in the performance of their services. We will continue working with governments, network operators, and our customers to minimize stress on the system. At the same time, we’ll do our best to make sure that everyone who is relying on the internet for their work, studies, news, and entertainment continues to have a high-quality, positive experience.
Developing drugs to combat Covid-19 is a global priority, requiring communities to come together to fight the spread of infection. At MIT, researchers with backgrounds in machine learning and life sciences are collaborating, sharing datasets and tools to develop machine learning methods that can identify novel cures for Covid-19.
This research is an extension of a community effort launched earlier this year. In February, before the Institute de-densified as a result of the pandemic, the first-ever AI Powered Drug Discovery and Manufacturing Conference, conceived and hosted by the Abdul Latif Jameel Clinic for Machine Learning in Health, drew attendees including pharmaceutical industry researchers, government regulators, venture capitalists, and pioneering drug researchers. More than 180 health care companies and 29 universities developing new artificial intelligence methods used in pharmaceuticals got involved, making the conference a singular event designed to lift the mask and reveal what goes on in the process of drug discovery.
As secretive as Silicon Valley seems, computer science and engineering students typically know what a job looks like when aspiring to join companies like Facebook or Tesla. But the global head of research and development for Janssen — the innovative pharmaceutical company owned by Johnson & Johnson — said it’s often much harder for students to grasp how their work fits into drug discovery.
“That’s a problem at the moment,” Mathai Mammen says, after addressing attendees, including MIT graduate students and postdocs, who gathered in the Samberg Conference Center in part to get a glimpse behind the scenes of companies currently working on bold ideas blending artificial intelligence with health care. Mammen, who is a graduate of the Harvard-MIT Program in Health Sciences and Technology and whose work at Theravance has brought five new medicines to market, with many more on the way, is here to be part of the answer to that problem. “What the industry needs to do is talk to students and postdocs about the sorts of interesting scientific and medical problems whose solutions can directly and profoundly benefit the health of people everywhere,” he says.
“The conference brought together research communities that rarely overlap at technical conferences,” says Regina Barzilay, the Delta Electronics Professor of Electrical Engineering and Computer Science, Jameel Clinic faculty co-lead, and one of the conference organizers. “This blend enables us to better understand open problems and opportunities in the intersection. The exciting piece for MIT students, especially for computer science and engineering students, is to see where the industry is moving and to understand how they can contribute to this changing industry, which will happen when they graduate.”
Over two days, conference attendees snapped photographs through a packed schedule of research presentations, technical sessions, and expert panels, covering everything from discovering new therapeutic molecules with machine learning to funding AI research. Carefully curated, the conference provided a roadmap of bold tech ideas at work in health care now and traced the path to show how those tech solutions get implemented.
At the conference, Barzilay and Jim Collins, the Termeer Professor of Medical Engineering and Science in MIT’s Institute for Medical Engineering and Science (IMES) and Department of Biological Engineering, and Jameel Clinic faculty co-lead, presented research from a study published in Cell where they used machine learning to help identify a new drug that can target antibiotic-resistant bacteria. Together with MIT researchers Tommi Jaakkola, Kevin Yang, Kyle Swanson, and the first author Jonathan Stokes, they demonstrated how blending their backgrounds can yield potential answers to combat the growing antibiotic resistance crisis.
Collins saw the conference as an opportunity to inspire interest in antibiotic research, hoping to get the top young minds involved in battling resistance to antibiotics built up over decades of overuse and misuse, an urgent predicament in medicine that computer science students might not understand their role in solving. “I think we should take advantage of the innovation ecosystem at MIT and the fact that there are many experts here at MIT who are willing to step outside their comfort zone and get engaged in a new problem,” Collins says. “Certainly in this case, the development and discovery of novel antibiotics, is critically needed around the globe.”
AIDM showed the power of collaboration, inviting experts from major health care companies and relevant organizations like Merck, Bayer, DARPA, Google, Pfizer, Novartis, Amgen, the U.S. Food and Drug Administration, and Janssen. With attendance at capacity, it also showed that people are ready to pull together and get on the same page. “I think the time is right and I think the place is right,” Collins says. “I think MIT is well-positioned to be a national, if not an international, leader in this space, given the excitement and engagement of our students and our position in Kendall Square.”
A biotech hub for decades, Kendall Square has come a long way since big data came to Cambridge, Massachusetts, forever changing life science companies based here. AIDM kicked off with Institute Professor and Professor of Biology Phillip Sharp walking attendees through a brief history of AI in health care in the area. He was perhaps the person at the conference most excited for others to see the potential, as through his long career, he’s watched firsthand the history of innovation that led to this conference.
“The bigger picture, which this conference is a major part of, is this bringing together of the life science — biologists and chemists with machine learning and artificial intelligence — it’s the future of life science,” Sharp says. “It’s clear. It will reshape how we talk about our science, how we think about solving problems, how we deal with the other parts of the process of taking insights to benefit society.”
Albert Einstein famously postulated that “the only real valuable thing is intuition,” arguably one of the most important keys to understanding intention and communication.
But intuitiveness is hard to teach — especially to a machine. Looking to improve this, a team from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) came up with a method that moves us closer to more seamless human-robot collaboration. The system, called “Conduct-A-Bot,” uses human muscle signals from wearable sensors to pilot a robot’s movement.
“We envision a world in which machines help people with cognitive and physical work, and to do so, they adapt to people rather than the other way around,” says Professor Daniela Rus, director of CSAIL, deputy dean of research for the MIT Stephen A. Schwarzman College of Computing, and co-author on a paper about the system.
To enable seamless teamwork between people and machines, electromyography and motion sensors are worn on the biceps, triceps, and forearms to measure muscle signals and movement. Algorithms then process the signals to detect gestures in real time, without any offline calibration or per-user training data. The system uses just two or three wearable sensors, and nothing in the environment — greatly lowering the barrier for casual users to interact with robots.
While Conduct-A-Bot could potentially be used for various scenarios, including navigating menus on electronic devices or supervising autonomous robots, for this research the team used a Parrot Bebop 2 drone, although any commercial drone could be used.
By detecting actions like rotational gestures, clenched fists, tensed arms, and activated forearms, Conduct-A-Bot can move the drone left, right, up, down, and forward, as well as allow it to rotate and stop.
If you gestured toward the right to your friend, they could likely interpret that they should move in that direction. Similarly, if you waved your hand to the left, for example, the drone would follow suit and make a left turn.
In tests, the drone correctly responded to 82 percent of over 1,500 human gestures when it was remotely controlled to fly through hoops. The system also correctly identified approximately 94 percent of cued gestures when the drone was not being controlled.
“Understanding our gestures could help robots interpret more of the nonverbal cues that we naturally use in everyday life,” says Joseph DelPreto, lead author on the new paper. “This type of system could help make interacting with a robot more similar to interacting with another person, and make it easier for someone to start using robots without prior experience or external sensors.”
This type of system could eventually target a range of applications for human-robot collaboration, including remote exploration, assistive personal robots, or manufacturing tasks like delivering objects or lifting materials.
These intelligent tools are also consistent with social distancing — and could potentially open up a realm of future contactless work. For example, you can imagine machines being controlled by humans to safely clean a hospital room, or drop off medications, while letting us humans stay a safe distance away.
Muscle signals can often provide information about states that are hard to observe from vision, such as joint stiffness or fatigue.
For example, if you watch a video of someone holding a large box, you might have difficulty guessing how much effort or force was needed — and a machine would also have difficulty gauging that from vision alone. Using muscle sensors opens up possibilities to estimate not only motion, but also the force and torque required to execute that physical trajectory.
For the gesture vocabulary currently used to control the robot, the movements were detected as follows:
stiffening the upper arm to stop the robot (similar to briefly cringing when seeing something going wrong): biceps and triceps muscle signals;
waving the hand left/right and up/down to move the robot sideways or vertically: forearm muscle signals (with the forearm accelerometer indicating hand orientation);
fist clenching to move the robot forward: forearm muscle signals; and
rotating clockwise/counterclockwise to turn the robot: forearm gyroscope.
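Wired up, such a vocabulary amounts to a small lookup from classified gestures to drone commands; the labels and command names below are hypothetical placeholders, not Conduct-A-Bot’s actual interface:

```python
# Hypothetical mapping from detected gesture labels to drone commands.
GESTURE_TO_COMMAND = {
    "stiffen_arm": "stop",
    "wave_left": "move_left",
    "wave_right": "move_right",
    "wave_up": "move_up",
    "wave_down": "move_down",
    "clench_fist": "move_forward",
    "rotate_cw": "turn_right",
    "rotate_ccw": "turn_left",
}

def dispatch(gesture: str) -> str:
    """Translate a classified gesture into a drone command; hover if unrecognized."""
    return GESTURE_TO_COMMAND.get(gesture, "hover")
```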
Machine learning classifiers detected the gestures using the wearable sensors. Unsupervised classifiers processed the muscle and motion data and clustered it in real time to learn how to separate gestures from other motions. A neural network also predicted wrist flexion or extension from forearm muscle signals.
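To give a loose picture of that self-calibrating clustering step, here is a toy streaming two-cluster classifier that separates “gesture” activity from rest based on a muscle-signal envelope; the features, learning rate, and initialization are made up for illustration and stand in for the paper’s actual classifiers:

```python
import numpy as np

class OnlineTwoClusterer:
    """Cluster incoming feature vectors into 'rest' vs. 'gesture' on the fly by
    nudging the nearest centroid toward each new sample. Illustrative only."""
    def __init__(self, n_features: int, lr: float = 0.05):
        self.centroids = np.zeros((2, n_features))
        self.centroids[1] += 1.0          # start the 'gesture' centroid higher
        self.lr = lr

    def update(self, features: np.ndarray) -> str:
        label = int(np.argmin(np.linalg.norm(self.centroids - features, axis=1)))
        self.centroids[label] += self.lr * (features - self.centroids[label])
        return "gesture" if label == 1 else "rest"

clf = OnlineTwoClusterer(n_features=2)
for envelope in [np.array([0.1, 0.2]), np.array([1.3, 0.9])]:   # e.g., rectified EMG levels
    print(clf.update(envelope))   # -> rest, gesture
```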
The system essentially calibrates itself to each person’s signals while they’re making gestures that control the robot, making it faster and easier for casual users to start interacting with robots.
In the future, the team hopes to expand the tests to include more subjects. And while the movements for Conduct-A-Bot cover common gestures for robot motion, the researchers want to extend the vocabulary to include more continuous or user-defined gestures. Eventually, the hope is to have the robots learn from these interactions to better understand the tasks and provide more predictive assistance or increase their autonomy.
“This system moves one step closer to letting us work seamlessly with robots so they can become more effective and intelligent tools for everyday tasks,” says DelPreto. “As such collaborations continue to become more accessible and pervasive, the possibilities for synergistic benefit continue to deepen.”
DelPreto and Rus presented the paper virtually earlier this month at the ACM/IEEE International Conference on Human Robot Interaction.