Marshaling artificial intelligence in the fight against Covid-19

Artificial intelligence could play a decisive role in stopping the Covid-19 pandemic. To give the technology a push, the MIT-IBM Watson AI Lab is funding 10 projects at MIT aimed at advancing AI’s transformative potential for society. The research will target the immediate public health and economic challenges of this moment. But it could have a lasting impact on how we evaluate and respond to risk long after the crisis has passed. The 10 research projects are highlighted below.

Early detection of sepsis in Covid-19 patients 

Sepsis is a deadly complication of Covid-19, the disease caused by the new coronavirus SARS-CoV-2. About 10 percent of Covid-19 patients get sick with sepsis within a week of showing symptoms, but only about half survive.

Identifying patients at risk for sepsis can lead to earlier, more aggressive treatment and a better chance of survival. Early detection can also help hospitals prioritize intensive-care resources for their sickest patients. In a project led by MIT Professor Daniela Rus, researchers will develop a machine learning system to analyze images of patients’ white blood cells for signs of an activated immune response against sepsis.

Designing proteins to block SARS-CoV-2

Proteins are the basic building blocks of life, and with AI, researchers can explore and manipulate their structures to address longstanding problems. Take perishable food: The MIT-IBM Watson AI Lab recently used AI to discover that a silk protein made by honeybees could double as a coating for quick-to-rot foods to extend their shelf life.

In a related project led by MIT professors Benedetto Marelli and Markus Buehler, researchers will enlist the protein-folding method used in their honeybee-silk discovery to try to defeat the new coronavirus. Their goal is to design proteins able to block the virus from binding to human cells, and to synthesize and test their unique protein creations in the lab.

Saving lives while restarting the U.S. economy

Some states are reopening for business even as questions remain about how to protect those most vulnerable to the coronavirus. In a project led by MIT professors Daron Acemoglu, Simon Johnson, and Asu Ozdaglar, researchers will model the effects of targeted lockdowns on the economy and public health.

In a recent working paper co-authored by Acemoglu, Victor Chernozhukov, Ivan Werning, and Michael Whinston, MIT economists analyzed the relative risk of infection, hospitalization, and death for different age groups. When they compared uniform lockdown policies against those targeted to protect seniors, they found that a targeted approach could save more lives. Building on this work, researchers will consider how antigen tests and contact tracing apps can further reduce public health risks.

Which materials make the best face masks?

Massachusetts and six other states have ordered residents to wear face masks in public to limit the spread of coronavirus. But apart from the coveted N95 mask, which traps 95 percent of airborne particles 300 nanometers or larger, the effectiveness of many masks remains unclear due to a lack of standardized methods to evaluate them.

In a project led by MIT Associate Professor Lydia Bourouiba, researchers are developing a rigorous set of methods to measure how well homemade and medical-grade masks do at blocking the tiny droplets of saliva and mucus expelled during normal breathing, coughs, or sneezes. The researchers will test materials worn alone and together, and in a variety of configurations and environmental conditions. Their methods and measurements will determine how well materials protect mask wearers and the people around them.

Treating Covid-19 with repurposed drugs

As Covid-19’s global death toll mounts, researchers are racing to find a cure among already-approved drugs. Machine learning can expedite screening by letting researchers quickly predict if promising candidates can hit their target.

In a project led by MIT Assistant Professor Rafael Gomez-Bombarelli, researchers will represent molecules in three dimensions to see if this added spatial information can help to identify drugs most likely to be effective against the disease. They will use supercomputers at NASA Ames and the U.S. Department of Energy's NERSC to further speed the screening process.

A privacy-first approach to automated contact tracing

Smartphone data can help limit the spread of Covid-19 by identifying people who have come into contact with someone infected with the virus, and thus may have caught the infection themselves. But automated contact tracing also carries serious privacy risks.

In collaboration with MIT Lincoln Laboratory and others, MIT researchers Ronald Rivest and Daniel Weitzner will use encrypted Bluetooth data to ensure personally identifiable information remains anonymous and secure.

Overcoming manufacturing and supply hurdles to provide global access to a coronavirus vaccine

A vaccine against SARS-CoV-2 would be a crucial turning point in the fight against Covid-19. Yet, its potential impact will be determined by the ability to rapidly and equitably distribute billions of doses globally. This is an unprecedented challenge in biomanufacturing. 

In a project led by MIT professors Anthony Sinskey and Stacy Springs, researchers will build data-driven statistical models to evaluate tradeoffs in scaling the manufacture and supply of vaccine candidates. Questions include how much production capacity will need to be added, the impact of centralized versus distributed operations, and how to design strategies for fair vaccine distribution. The goal is to give decision-makers the evidence needed to cost-effectively achieve global access.

Leveraging electronic medical records to find a treatment for Covid-19

Developed as a treatment for Ebola, the anti-viral drug remdesivir is now in clinical trials in the United States as a treatment for Covid-19. Similar efforts to repurpose already-approved drugs to treat or prevent the disease are underway.

In a project led by MIT professors Roy Welsch and Stan Finkelstein, researchers will use statistics, machine learning, and simulated clinical drug trials to find and test already-approved drugs as potential therapeutics against Covid-19. Researchers will sift through millions of electronic health records and medical claims for signals indicating that drugs used to fight chronic conditions like hypertension, diabetes, and gastric reflux might also work against Covid-19 and other diseases.

Finding better ways to treat Covid-19 patients on ventilators 

Troubled breathing from acute respiratory distress syndrome is one of the complications that brings Covid-19 patients to the ICU. There, life-saving machines help patients breathe by mechanically pumping oxygen into the lungs. But even as towns and cities lower their Covid-19 infections through social distancing, there remains a national shortage of mechanical ventilators and serious health risks of ventilation itself.

In collaboration with IBM researchers Zach Shahn and Daby Sow, MIT researchers Li-Wei Lehman and Roger Mark will develop an AI tool to help doctors find better ventilator settings for Covid-19 patients and decide how long to keep them on a machine. Shortened ventilator use can limit lung damage while freeing up machines for others. To build their models, researchers will draw on data from intensive-care patients with acute respiratory distress syndrome, as well as Covid-19 patients at a local Boston hospital.

Returning to normal via targeted lockdowns, personalized treatments, and mass testing

In a few short months, Covid-19 has devastated towns and cities around the world. Researchers are now piecing together the data to understand how government policies can limit new infections and deaths and how targeted policies might protect the most vulnerable.

In a project led by MIT Professor Dimitris Bertsimas, researchers will study the effects of lockdowns and other measures meant to reduce new infections and deaths and prevent the health-care system from being swamped. In a second phase of the project, they will develop machine learning models to predict how vulnerable a given patient is to Covid-19, and what personalized treatments might be most effective. They will also develop an inexpensive, spectroscopy-based test for Covid-19 that can deliver results in minutes and pave the way for mass testing. The project will draw on clinical data from four hospitals in the United States and Europe, including Codogno Hospital, which reported Italy’s first infection.

Read More

What can your microwave tell you about your health?

For many of us, our microwaves and dishwashers aren't the first things that come to mind when trying to glean health information, beyond that we should (maybe) lay off the Hot Pockets and empty the dishes in a timely way.

But we may soon be rethinking that, thanks to new research from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). The system, called “Sapple,” analyzes in-home appliance usage to better understand our health patterns, using just radio signals and a smart electricity meter.

Taking information from two in-home sensors, the new machine learning model examines use of everyday items like microwaves, stoves, and even hair dryers, and can detect where and when a particular appliance is being used.

For example, for an elderly person living alone, learning appliance usage patterns could help their health-care professionals understand their ability to perform various activities of daily living, with the goal of eventually helping advise on healthy patterns. These can include personal hygiene, dressing, eating, maintaining continence, and mobility.

“This system uses passive sensing data, and does not require people to change the way they live,” says MIT PhD student Chen-Yu Hsu, the lead author on a new paper about Sapple. “It has potential to improve things like energy saving and efficiency, give us a better understanding of the daily activities of seniors living alone, and provide insight into the behavioral analytics for smart environments.”

Of the two sensors, the “location sensor” uses radio signals to sense placement, and covers around 40 feet, or enough to cover a typical one-bedroom apartment. A user can walk around their apartment to set up the sensor, which allows it to understand the physical boundaries, and then the sensor can limit itself to that specified area.

The team says the system could potentially be useful during the Covid-19 pandemic, where there's an increasing interest in contactless sensing of health and behaviors. They can imagine using passive sensor data to reduce the need for caregivers to visit higher-risk populations and minimize overall in-person contact.

Sapple comes from the team’s growing body of research focused on using wireless sensing to better understand our complex human bodies — such as an in-body “GPS” sensor with the goal of tracking tumors or dispensing drugs, a wireless smart-home system for monitoring diseases and helping the elderly “age in place,” and another system for measuring gait to help monitor and diagnose various ailments.

Previous work in learning appliance usage has looked at using energy data from a utility meter. But this approach makes it challenging to tease out details, as the energy data is a mix of multiple appliances’ patterns all added together.

Unsupervised approaches — those in which training data aren’t labeled — assume patterns of individual appliances are unknown. However, since the utility meter measures the total energy used by the home, it’s really hard to learn individual appliances or detect them effectively.

Sapple stays in the unsupervised realm: It doesn’t assume we know the patterns of individual appliances, but instead uses data from a second sensor to help learn appliance usage patterns with self-supervision. For example, the location sensor captures a person’s motion as they approach a microwave, put food in it, and turn it on. The model then analyzes the data, and learns when specific appliances are turned on, and what their locations are in a home.

In addition to health, Sapple could potentially help reduce our heavy imprint on the natural world. By analyzing appliance usage patterns within homes, the system could be used to encourage energy-saving behaviors and improve forecasting and delivery for utility companies.

The team notes that their system’s approach solves some of the issues that can be tricky for in-home sensors. For example, using the location data doesn’t always imply appliance usage, as people can be next to an appliance without using it. Also, many appliances like refrigerators cycle their power and create “background events,” and there could be location data from multiple people in a home, but not all of them are related to appliance usage. Sapple solves these problems by learning when the two sensor streams become related, and uses that to discover when appliances are turned on, and their locations.
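
As a rough illustration of that idea, here is a toy Python sketch (not Sapple's actual model) that flags moments when a jump in the utility meter's power reading coincides with where a person was standing. The thresholds, sampling rate, and data layout are all assumptions made for the example; Sapple learns this relation between the streams rather than using a fixed rule.

import numpy as np

# Hypothetical inputs: per-second total home power (watts) from the utility meter,
# and per-second 2D person locations (meters) from the radio location sensor.
def coinciding_events(power, locations, min_jump_w=300.0, window_s=10):
    # Flag times where a sharp rise in total power coincides with where the
    # person lingered just before the rise. A fixed-threshold heuristic only,
    # used here to illustrate "when the two sensor streams become related".
    events = []
    for t in range(window_s, len(power)):
        if power[t] - power[t - window_s] > min_jump_w:
            spot = locations[t - window_s:t].mean(axis=0)
            events.append((t, tuple(np.round(spot, 2))))
    return events

power = np.array([100.0] * 30 + [1300.0] * 30)   # e.g. a microwave turning on at t=30
locations = np.tile([2.5, 1.0], (60, 1))         # person standing near the kitchen counter
print(coinciding_events(power, locations))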

“As indoor location-sensing starts to potentially become as common as Wi-Fi in the future, the hope is that our technology can be effortlessly applied to all places with utility meters,” says Hsu. “This could enable new applications for passive health sensing in the homes. Utility companies, for example, could reduce peak demands by providing personalized feedback, optimize energy generation and delivery, and ultimately improve energy efficiency.”

Hsu wrote the paper alongside CSAIL PhD students Abbas Zeitoun and Guang-He Lee, as well as MIT professors Dina Katabi and Tommi Jaakkola. They presented the paper virtually at the International Conference on Learning Representations.

Read More

How Hugging Face achieved a 2x performance boost for Question Answering with DistilBERT in Node.js

A guest post by Hugging Face: Pierric Cistac, Software Engineer; Victor Sanh, Scientist; Anthony Moi, Technical Lead.

Hugging Face 🤗 is an AI startup with the goal of contributing to Natural Language Processing (NLP) by developing tools to improve collaboration in the community, and by being an active part of research efforts.

Because NLP is a difficult field, we believe that solving it is only possible if all actors share their research and results. That’s why we created 🤗 Transformers, a leading NLP library with more than 2M downloads and used by researchers and engineers across many companies. It allows the amazing international NLP community to quickly experiment, iterate, create and publish new models for a variety of tasks (text/token generation, text classification, question answering…) in a variety of languages (English of course, but also French, Italian, Spanish, German, Turkish, Swedish, Dutch, Arabic and many others!) More than 300 different models are available today through Transformers.

While Transformers is very handy for research, we are also working hard on the production aspects of NLP, looking at and implementing solutions that can ease its adoption everywhere. In this blog post, we’re going to showcase one of the paths we believe can help fulfill this goal: the use of “small”, yet performant models (such as DistilBERT), and frameworks targeting ecosystems different from Python such as Node via TensorFlow.js.

The need for small models: DistilBERT

One of the areas we’re interested in is “low-resource” models that achieve results close to the state of the art while being a lot smaller and a lot faster to run. That’s why we created DistilBERT, a distilled version of BERT: it has 40% fewer parameters and runs 60% faster while preserving 97% of BERT’s performance as measured on the GLUE language understanding benchmark.

NLP models through time, with their number of parameters

To create DistilBERT, we’ve been applying knowledge distillation to BERT (hence its name), a compression technique in which a small model is trained to reproduce the behavior of a larger model (or an ensemble of models), demonstrated by Hinton et al.

In the teacher-student training, we train a student network to mimic the full output distribution of the teacher network (its knowledge). Rather than training with a cross-entropy over the hard targets (one-hot encoding of the gold class), we transfer the knowledge from the teacher to the student with a cross-entropy over the soft targets (probabilities of the teacher). Our training loss thus becomes:

With t the logits from the teacher and s the logits of the student
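
Written out (a reconstruction of the formula shown in the original figure, assuming the soft targets are obtained by applying a softmax to the logits; in practice a softmax temperature is typically used to smooth both distributions), the loss is:

$$ L_{ce} = -\sum_i \operatorname{softmax}(t)_i \,\log\bigl(\operatorname{softmax}(s)_i\bigr) $$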

Our student is a small version of BERT in which we removed the token-type embeddings and the pooler (used for the next sentence classification task). We kept the rest of the architecture identical while reducing the number of layers by taking one layer out of two, leveraging the common hidden size between student and teacher. We trained DistilBERT on very large batches leveraging gradient accumulation (up to 4000 examples per batch), with dynamic masking, and removed the next sentence prediction objective.

With this, we were then able to fine-tune our model on the specific task of Question Answering. To do so, we used the BERT-cased model fine-tuned on SQuAD 1.1 as a teacher with a knowledge distillation loss. In other words, we distilled a question answering model into a language model previously pre-trained with knowledge distillation! That’s a lot of teachers and students: DistilBERT-cased was first taught by BERT-cased, and then “taught again” by the SQuAD-finetuned BERT-cased version in order to get the DistilBERT-cased-finetuned-squad model.

This results in very interesting performance given the size of the network: our DistilBERT-cased fine-tuned model reaches an F1 score of 87.1 on the dev set, less than 2 points behind the full BERT-cased fine-tuned model (88.7 F1 score)!

If you’re interested in learning more about the distillation process, you can read our dedicated blog post.

The need for a language-neutral format: SavedModel

Using the previous process, we end up with a 240MB Keras file (.h5) containing the weights of our DistilBERT-cased-squad model. In this format, the architecture of the model resides in an associated Python class. But our final goal is to be able to use this model in as many environments as possible (Node.js + TensorFlow.js for this blog post), and the TensorFlow SavedModel format is perfect for this: it’s a “serialized” format, meaning that all the information necessary to run the model is contained in the model files. It is also a language-neutral format, so we can use it in Python, but also in JS, C++, and Go.

To convert to SavedModel, we first need to construct a graph from the model code. In Python, we can use tf.function to do so:

import tensorflow as tf
from transformers import TFDistilBertForQuestionAnswering

distilbert = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
callable = tf.function(distilbert.call)

Here we passed to tf.function the function called in our Keras model, call. What we get in return is a callable that we can in turn use to trace our call function with a specific signature and shapes thanks to get_concrete_function:

concrete_function = callable.get_concrete_function([tf.TensorSpec([None, 384], tf.int32, name="input_ids"), tf.TensorSpec([None, 384], tf.int32, name="attention_mask")])

By calling get_concrete_function, we trace-compile the TensorFlow operations of the model for an input signature composed of two Tensors of shape [None, 384], the first one being the input ids and the second one the attention mask.

Then we can finally save our model to the SavedModel format:

tf.saved_model.save(distilbert, 'distilbert_cased_savedmodel', signatures=concrete_function)

A conversion in 4 lines of code, thanks to TensorFlow! We can check that our resulting SavedModel contains the correct signature by using the saved_model_cli:

$ saved_model_cli show --dir distilbert_cased_savedmodel --tag_set serve --signature_def serving_default

Output:

The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_0'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:0
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict

Perfect! You can play with the conversion code yourself by opening this colab notebook. We are now ready to use our SavedModel with TensorFlow.js!

The need for ML in Node.js: TensorFlow.js

Here at Hugging Face we strongly believe that in order to reach its full adoption potential, NLP has to be accessible in other languages that are more widely used in production than Python, with APIs simple enough to be manipulated by software engineers without a Ph.D. in Machine Learning; one of those languages is obviously Javascript.

Thanks to the API provided by TensorFlow.js, interacting with the SavedModel we created previously in Node.js is very straightforward. Here is a slightly simplified version of the Typescript code in our NPM Question Answering package:

const model = await tf.node.loadSavedModel(path); // Load the model located in path

const result = tf.tidy(() => {
  // ids and attentionMask are of type number[][]
  const inputTensor = tf.tensor(ids, undefined, "int32");
  const maskTensor = tf.tensor(attentionMask, undefined, "int32");

  // Run model inference
  return model.predict({
    // "input_ids" and "attention_mask" correspond to the names specified in the signature passed to get_concrete_function during the model conversion
    "input_ids": inputTensor,
    "attention_mask": maskTensor
  }) as tf.NamedTensorMap;
});

// Extract the start and end logits from the tensors returned by model.predict
const [startLogits, endLogits] = await Promise.all([
  result["output_0"].squeeze().array() as Promise<number[]>,
  result["output_1"].squeeze().array() as Promise<number[]>
]);

tf.dispose(result); // Clean up memory used by the result tensor since we don't need it anymore

Note the use of the very helpful TensorFlow.js function tf.tidy, which takes care of automatically cleaning up intermediate tensors like inputTensor and maskTensor while returning the result of the model inference.

How do we know we need to use "output_0" and "output_1" to extract the start and end logits (beginning and end of the possible spans answering the question) from the result returned by the model? We just have to look at the output names indicated by the saved_model_cli command we ran previously after exporting to SavedModel.

The need for a fast and easy-to-use tokenizer: 🤗 Tokenizers

Our goal while building our Node.js library was to make the API as simple as possible. As we just saw, running model inference once we have our SavedModel is quite simple, thanks to TensorFlow.js. Now, the most difficult part is passing the data in the right format to the input ids and attention mask tensors. What we collect from a user is usually a string, but the tensors require arrays of numbers: we need to tokenize the user input.

Enter 🤗 Tokenizers: a performant library written in Rust that we’ve been working on at Hugging Face. It allows you to play with different tokenizers such as BertWordpiece very easily, and it works in Node.js too thanks to the provided bindings:

const tokenizer = await BertWordPieceTokenizer.fromOptions({
vocabFile: vocabPath, lowercase: false
});

tokenizer.setPadding({ maxLength: 384 }); // 384 matches the shape of the signature input provided while exporting to SavedModel

// Here question and context are in their original string format
const encoding = await tokenizer.encode(question, context);
const { ids, attentionMask } = encoding;

That’s it! In just 4 lines of code, we are able to convert the user input to a format we can then use to feed our model with TensorFlow.js.

The Final Result: Powerful Question Answering in Node.js

Thanks to the powers of the SavedModel format, TensorFlow.js for inference, and Tokenizers for tokenization, we’ve reached our goal to offer a very simple, yet very powerful, public API in our NPM package:

import { QAClient } from "question-answering"; // If using Typescript or Babel
// const { QAClient } = require("question-answering"); // If using vanilla JS

const text = `
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season.
The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California.
As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
`;

const question = "Who won the Super Bowl?";

const qaClient = await QAClient.fromOptions();
const answer = await qaClient.predict(question, text);

console.log(answer); // { text: 'Denver Broncos', score: 0.3 }

Powerful? Yes! Thanks to the native support of SavedModel format in TensorFlow.js, we get very good performances: here is a benchmark comparing our Node.js package and our popular transformers Python library, running the same DistilBERT-cased-squad model. As you can see, we achieve a 2X speed gain! Who said Javascript was slow?

Short texts are texts between 500 and 1000 characters, long texts are between 4000 and 5000 characters. You can check the Node.js benchmark script here (the Python one is equivalent). Benchmark run on a standard 2019 MacBook Pro running on macOS 10.15.2.

It’s a very interesting time for NLP: big models such as GPT2 or T5 keep getting better and better, and research on how to “minify” those good but heavy and costly models is also getting more and more traction, with distillation being one technique among others. Adding to the equation tools that allow big developer communities to be part of the revolution (such as TensorFlow.js with the Javascript ecosystem), only makes the future of NLP more exciting and more production-ready than ever!

For further reading, feel free to check our Github repositories:
https://github.com/huggingface

Read More

How AI could predict sight-threatening eye conditions

Age-related macular degeneration (AMD) is the biggest cause of sight loss in the UK and USA and is the third largest cause of blindness across the globe. The latest research collaboration between Google Health, DeepMind and Moorfields Eye Hospital is published in Nature Medicine today. It shows that artificial intelligence (AI) has the potential to not only spot the presence of AMD in scans, but also predict the disease’s progression. 

Vision loss and wet AMD

Around 75 percent of patients with AMD have an early form called “dry” AMD that usually has relatively mild impact on vision. A minority of patients, however, develop the more sight-threatening form of AMD called exudative, or “wet” AMD. This condition affects around 15 percent of patients, and occurs when abnormal blood vessels develop underneath the retina. These vessels can leak fluid, which can cause permanent loss of central vision if not treated early enough.

Macular degeneration mainly affects central vision, causing “blind spots” directly ahead (Macular Society).

Wet AMD often affects one eye first, so patients become heavily reliant upon their unaffected eye to maintain their normal day-to-day living. Unfortunately, 20 percent of these patients will go on to develop wet AMD in their other eye within two years. The condition often develops suddenly but further vision loss can be slowed with treatments if wet AMD is recognized early enough. Ophthalmologists regularly monitor their patients for signs of wet AMD using 3D optical coherence tomography (OCT) images of the retina.

The period before wet AMD develops is a critical window for preventive treatment, which is why we set out to build a system that could predict whether a patient with wet AMD in one eye will go on to develop the condition in their second eye. This is a novel clinical challenge, since it’s not a task that is routinely performed.

How AI could predict the development of wet AMD

In collaboration with colleagues at DeepMind and Moorfields Eye Hospital NHS Foundation Trust, we’ve developed an artificial intelligence (AI) model that has the potential to predict whether a patient will develop wet AMD within six months. In the future, this system could potentially help doctors plan studies of earlier intervention, as well as contribute more broadly to clinical understanding of the disease and disease progression. 

We trained and tested our model using a retrospective, anonymized dataset of 2,795 patients. These patients had been diagnosed with wet AMD in one of their eyes, and were attending one of seven clinical sites for regular OCT imaging and treatment. For each patient, our researchers worked with retinal experts to review all prior scans for each eye and determine the scan when wet AMD was first evident. In collaboration with our colleagues at DeepMind we developed an AI system composed of two deep convolutional neural networks, one taking the raw 3D scan as input and the other, built on our previous work, taking a segmentation map outlining the types of tissue present in the retina. Our prediction system used the raw scan and tissue segmentations to estimate a patient’s risk of progressing to wet AMD within the next six months. 

To test the system, we presented the model with a single, de-identified scan and asked it to predict whether there were any signs that indicated the patient would develop wet AMD in the following six months. We also asked six clinical experts—three retinal specialists and three optometrists, each with at least ten years’ experience—to do the same. Predicting the possibility of a patient developing wet AMD is not a task that is usually performed in clinical practice so this is the first time, to our knowledge, that experts have been assessed on this ability. 

While clinical experts performed better than chance alone, there was substantial variability between their assessments. Our system performed as well as, and in certain cases better than, these clinicians in predicting wet AMD progression. This highlights its potential use for informing studies in the future to assess or help develop treatments to prevent wet AMD progression.

Future work could address several limitations of our research. The sample was representative of practice at multiple sites of the world’s largest eye hospital, but more work is needed to understand the model performance in different demographics and clinical settings. Such work should also understand the impact of unstudied factors—such as additional imaging tests—that might be important for prediction, but were beyond the scope of this work.

What’s next 

These findings demonstrate the potential for AI to help improve understanding of disease progression and predict the future risk of patients developing sight-threatening conditions. This, in turn, could help doctors study preventive treatments.

This is the latest stage in our partnership with Moorfields Eye Hospital NHS Foundation Trust, a long-standing relationship that transitioned from DeepMind to Google Health in September 2019. Our previous collaborations include using AI to quickly detect eye conditions, and showing how Google Cloud AutoML might eventually help clinicians without prior technical experience to accurately detect common diseases from medical images. 

This is early research, rather than a product that could be implemented in routine clinical practice. Any future product would need to go through rigorous prospective clinical trials and regulatory approvals before it could be used as a tool for doctors. This work joins a growing body of research in the area of developing predictive models that could inform clinical research and trials. In line with this, Moorfields will be making the dataset available through the Ryan Initiative for Macular Research. We hope that models like ours will be able to support this area of work to improve patient outcomes. 

Read More

Making Sense of Vision and Touch: Multimodal Representations for Contact-Rich Tasks

Sound, smell, taste, touch, and vision – these are the five senses that humans use to perceive and understand the world. We are able to seamlessly combine these different senses when perceiving the world. For example, watching a movie requires constant processing of both visual and auditory information, and we do that effortlessly. As roboticists, we are particularly interested in studying how humans combine our sense of touch and our sense of sight. Vision and touch are especially important when doing manipulation tasks that require contact with the environment, such as closing a water bottle or inserting a dollar bill into a vending machine.

Let’s take closing a water bottle as an example. With our eyes, we can observe the colors, edges, and shapes in the scene, from which we can infer task-relevant information, such as the poses and geometry of the water bottle and the cap. Meanwhile, our sense of touch tells us texture, pressure, and force, which also give us task-relevant information such as the force we are applying to the water bottle and the slippage of the bottle cap in our grasp. Furthermore, humans can infer the same kind of information using either or both types of senses: our tactile senses can also give us pose and geometric information, while our visual senses can predict when we are going to make contact with the environment.

Humans use visual and tactile senses to infer task-relevant information and actions for contact-rich tasks, such as closing a bottle.

From these multimodal observations and task-relevant features, we come up with appropriate actions for the given observations to successfully close the water bottle. Given a new task, such as inserting a dollar into a vending machine, we might use the same task-relevant information (poses, geometry, forces, etc) to learn a new policy. In other words, there are certain task-relevant multimodal features that generalize across different types of tasks.

Learning features from raw observation inputs (such as RGB images and force/torque data from sensors commonly seen on modern robots) is also known as representation learning. We want to learn a representation for vision and touch, and preferably a representation that can combine the two senses together. We hypothesize that if we can learn a representation that captures task-relevant features, we can use the same representation for similar contact-rich tasks. In other words, learning a rich multimodal representation can help us generalize.

While humans interact with the world in an inherently multimodal manner, it is not clear how to combine very different kinds of data directly from sensors. RGB images from cameras are very high dimensional (often around 640 x 480 x 3 pixels). On the other hand, force/torque sensor readings only have 6 dimensions but also have the complicating quality of sometimes rapidly changing (e.g. when the robot is not touching anything, the sensor registers 0 newtons, but that can quickly jump to 20 newtons once contact is made).

Combining Vision and Touch

How do we combine vision and touch when they have such different characteristics?

Our encoder architectures to fuse the multimodal inputs.

We can leverage a deep neural network to learn features from our high-dimensional raw sensor data. The above figure shows our multimodal representation learning neural network architecture, which we train to create a fused vector representation of RGB images, force sensor readings (from a wrist-attached force/torque sensor), and robot states (the position and velocity of the robot wrist to which the peg is attached).

Because our sensor readings have such different characteristics, we use a different network architecture to encode each modality:

  • The image encoder is a simplified FlowNet [1] network, with a 6-layer convolutional neural network (CNN). This will be helpful for our self-supervised objective.

  • Because our force reading is time-series data with temporal correlation, we take causal convolutions of our force readings. This is similar to the architecture of WaveNet [2], which has been shown to work well with time-sequenced audio data.

  • For proprioceptive sensor readings (end-effector position and velocity), we encode them with fully connected layers, as is commonly done in robotics. (A rough sketch of these three encoders follows below.)
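
Below is a minimal tf.keras sketch of what such modality-specific encoders could look like. The layer counts, kernel sizes, and input shapes are placeholders chosen for the example and do not match the network used in the paper.

import tensorflow as tf

# Illustrative encoder sizes only; the actual architecture in the paper differs.
def build_encoders(feat_dim=128):
    # Image encoder: a small CNN standing in for the simplified FlowNet-style features
    image_enc = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu", input_shape=(128, 128, 3)),
        tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
        tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(feat_dim),
    ])
    # Force encoder: causal (left-padded) temporal convolutions over a 6-D force/torque series
    force_enc = tf.keras.Sequential([
        tf.keras.layers.Conv1D(16, 3, padding="causal", activation="relu", input_shape=(32, 6)),
        tf.keras.layers.Conv1D(32, 3, padding="causal", dilation_rate=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(feat_dim),
    ])
    # Proprioception encoder: fully connected layers over end-effector position/velocity
    proprio_enc = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(6,)),
        tf.keras.layers.Dense(feat_dim),
    ])
    return image_enc, force_enc, proprio_enc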

Each encoder produces a feature vector. If we want a deterministic representation, we can combine them into one vector by just concatenating them together. If we use a probabilistic representation, where each feature vector actually has a mean vector and a variance vector (assuming Gaussian distributions), we can combine the different modality distributions using the Product of Experts idea: multiplying the densities of the distributions together, which weights each mean by its inverse variance. The resulting combined vector is our multimodal representation.
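
For Gaussians with diagonal covariance, this Product of Experts fusion has a simple closed form: precisions (inverse variances) add, and the fused mean is a precision-weighted average. A small numpy sketch, not tied to the paper's code:

import numpy as np

def product_of_experts(means, variances):
    # Fuse per-modality Gaussian features (diagonal covariance) into one Gaussian.
    # `means` and `variances` have shape (num_modalities, feat_dim).
    precisions = 1.0 / variances
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Two toy modalities disagreeing about a 3-D feature; the more certain one dominates.
means = np.array([[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]])
variances = np.array([[0.1, 1.0, 1.0], [1.0, 1.0, 1.0]])
print(product_of_experts(means, variances))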

How do we learn multimodal features without manual labeling?

Our modality encoders have close to half a million learnable parameters, which would require large amounts of labeled data to train with supervised learning. It would be very costly to manually label our data. However, we can design training objectives whose labels are automatically generated during data collection. In other words, we can train the encoders using self-supervised learning. Imagine trying to annotate 1000 hours of video of a robot doing a task or trying to manually label the poses of the objects. Intuitively, you’d much rather just write down a rule like ‘keep track of the force on the robot arm and label the state and action pair when force readings are too high’, rather than checking each frame one by one for when the robot is touching the box. We do something similar, by algorithmically labeling the data we collect from the robot rollouts.

Our self-supervised learning objectives.

We design two learning objectives that capture the dynamics of the sensor modalities: (i) predicting the optical flow of the robot generated by the action and (ii) predicting whether the robot will make contact with the environment given the action. Since we usually know the geometry, kinematics, and meshes of a robot, ground-truth optical flow annotations can be automatically generated given the joint positions and robot kinematics. Contact prediction can also be automatically generated by looking for spikes in the force sensor data.
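
For example, contact labels can be produced by something as simple as thresholding the force signal. The sketch below is a toy version with made-up thresholds and hysteresis, not the exact rule used in the paper:

import numpy as np

def label_contact(force_z, on_thresh=5.0, off_thresh=2.0):
    # Mark timesteps as "in contact" when the force magnitude spikes above a
    # threshold, with hysteresis so brief dips don't flip the label.
    labels = np.zeros(len(force_z), dtype=bool)
    in_contact = False
    for t, f in enumerate(np.abs(force_z)):
        if not in_contact and f > on_thresh:
            in_contact = True
        elif in_contact and f < off_thresh:
            in_contact = False
        labels[t] = in_contact
    return labels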

Our last self-supervised learning objective attempts to capture the time-locked correlation between the two different sensor modalities of vision and touch, and learn the relationship between them. When a robot touches an environment, a camera captures the interaction and the force sensor captures the contact at the same time. So, this objective predicts whether our input modalities are time aligned. During training, we give our network both time-aligned data and also randomly shifted sensor data. Our network needs to be able to predict from our representation whether the inputs are aligned or not.
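
Generating training pairs for this alignment objective can be sketched as follows; the shapes, shift range, and 50/50 split are assumptions made for illustration:

import numpy as np

def make_alignment_pairs(images, forces, max_shift=20, rng=np.random):
    # Half the pairs keep vision and force time-aligned (label 1); the other half
    # shift the force stream by a random offset (label 0).
    pairs = []
    for t in range(max_shift, len(images) - max_shift):
        if rng.rand() < 0.5:
            pairs.append((images[t], forces[t], 1))            # aligned
        else:
            offset = rng.randint(1, max_shift) * rng.choice([-1, 1])
            pairs.append((images[t], forces[t + offset], 0))   # misaligned
    return pairs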

To train our model, we collected 100,000 data points in 90 minutes by having the robot perform random actions as well as pre-defined actions that encourage peg insertion and collecting self-supervised labels as described above. Then, we learn our representation via standard stochastic gradient descent, training for 20 epochs.

How do we know if we have a good multimodal representation?

A good representation should:

  • Enable us to learn a policy that is able to accomplish a contact-rich manipulation task (e.g. a peg insertion task) in a sample-efficient manner

  • Generalize across task instances (e.g. different peg geometries for peg insertion)

  • Enable us to learn a policy that is robust to sensor noises, external perturbations, and different goal locations

To study how to learn this multimodal representation, we use a peg insertion task as an experimental setup. Our multimodal inputs are raw RGB image, force readings from a force/torque sensor, and end-effector position and velocity. And unlike classical works on tight tolerance peg insertion that need prior knowledge of peg geometries, we will be learning policies for different geometries directly from raw RGB images and force/torque sensor readings. More importantly, we want to learn a representation from one peg geometry, and see if that representation can generalize to new unseen geometries.

Learning a policy

We want the robot to be able to learn policies directly from its own interactions with the environment. Here, we turn to deep reinforcement learning (RL) algorithms, which enable agents to learn from trial and error and a reward function.
Deep reinforcement learning has shown great advances in playing video games, robotic grasping, and solving Rubik’s cubes. Specifically, we use Trust Region Policy Optimization [3], an on-policy RL algorithm, and a dense reward that guides the robot towards the hole for peg insertion.
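
A dense reward of this kind might look roughly like the sketch below; the shaping terms and coefficients are illustrative assumptions, not the reward used in the paper:

import numpy as np

def dense_insertion_reward(peg_pos, hole_pos, inserted_depth):
    # Pay the agent for getting the peg tip close to the hole, and add a bonus
    # that grows with insertion depth once the peg is inside.
    dist = np.linalg.norm(np.asarray(peg_pos) - np.asarray(hole_pos))
    reach_reward = 1.0 - np.tanh(5.0 * dist)       # in [0, 1], highest at the hole
    insert_bonus = 2.0 * max(inserted_depth, 0.0)  # reward going deeper into the hole
    return reach_reward + insert_bonus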

Once we learn the representation, we feed the representation directly to an RL policy. We are then able to learn a peg insertion task for different peg geometries in about 5 hours from raw sensory inputs.

Here is the robot when it first starts learning the task.

About 100 episodes in (which is 1.5 hours), the robot starts touching the box.

And in 5 hours, the robot is able to reliably insert the peg for a round peg, triangular peg, and also a semi-circular peg.

Evaluation of our representation

We evaluate how well our representation captures our multimodal sensor inputs by testing how well the representation generalizes to new task instances, how robust our policy is with the representation as state input, and how the different modalities (or lack thereof) affect the representation learning.

Generalization of our representation

We examine the potential of transferring the learned policies and representations to two novel shapes previously unseen in representation and policy training, the hexagonal peg and the square peg. For policy transfer, we take the representation model and the policy trained for the triangular peg, and execute with the new unseen square peg. As you can see in the gif below, when we do policy transfer, our success rate drops from 92% to 62%. This shows that a policy learned for one peg geometry does not necessarily transfer to a new peg geometry.

A better transfer performance can be achieved by taking the representation model trained on the triangular peg, and training a new policy for the new hexagonal peg. As seen in the gif, our peg insertion rate goes up to 92% again when we transfer the multimodal representation. Even though the learned policies do not transfer to new geometries, we show that our multimodal representation from visual and tactile feedback can transfer to new task instances. Our representation generalizes to new unseen peg geometries, and captures task-relevant information across task instances.

Policy robustness

We showed that our policy is robust to sensor noises for the force/torque sensors and for the camera.

Force Sensor Perturbation: When we tap the force/torque sensor, this sometimes tricks the robot into thinking it is making contact with the environment. But the policy is still able to recover from these perturbations and noises.

Camera Occlusion: When we intermittently occlude the camera after the robot has already made contact with the environment, the policy is still able to find the hole from the robot states, force readings, and the occluded images.

Goal Target Movement: We can move the box to a new location that has never been seen by the robot during training, and our robot is still able to complete the insertion.

External Forces: We can also perturb the robot and apply external forces directly on it, and it is still able to finish the insertion.

Also notice we run our policies on two different robots, the orange KUKA IIWA robot and the white Franka Panda robot, which shows that our method works on different robots.

Ablation study

To study the effects of how the different modalities affect the representation, we ran an ablation study in simulation. In our simulation experiments where we randomize the box location, we can study how each sensor is being used by completely taking away a modality during representation and policy training. If we only have force data, our policy is not able to find the box. With only image data, we achieve a 49% task success rate, but our policy really struggles with aligning the peg with the hole, since the camera cannot capture these small precise movements. With both force and image inputs, our task completion rate goes up to 77% in simulation.

Simulation results for modality ablation study

The learning curves also demonstrate that the Full Model and the Image Only Model (No Haptics) have similar returns in the beginning of the training. As training goes on and the robot learns to get closer to the box, the returns start to diverge when the Full Model is able to more quickly and robustly learn how to insert the peg with both visual and force feedback.

Policy learning curves for modality ablation study

It’s not surprising that learning a representation with more modalities improves policy learning, but our result also shows that our representation and policy are using all the modalities for contact-rich tasks.

Summary

As an overview of our method, we collect self-labeled data through self-supervision, which takes about 90 minutes to collect 100k data points. We can learn a representation from this data, which takes about 24 hours training on a GPU, but is done fully offline. Afterward, you can learn new policies from the same representation, which only takes 5 hours of real robot training. This method can be done on different robots or for different kinds of tasks.

Here are some of the key takeaways from this work. The first is that self-supervision, specifically dynamics and temporal concurrency prediction, can give us rich objectives for training a representation model of different modalities.

Second, our representation that captures our modality concurrency and forward dynamics can generalize across task instances (e.g. peg geometries and hole location) and is robust to sensor noise. This suggests that the features from each modality and the relationship between them are useful across different instances of contact-rich tasks.

Lastly, our experiments show that learning multimodal representation leads to learning efficiency and policy robustness.

For future work, we want our method to be able to generalize beyond a task family to completely different contact-rich tasks (e.g. chopping vegetables, changing a lightbulb, inserting an electric plug). To do so, we might need to utilize more modalities, such as incorporating temperature, audio, or tactile sensors, and also find algorithms that can give us quick adaptations to new tasks.


This blog post is based on the two following papers:

For further details on this work, check out our video and our 2020 GTC Talk.

The code and multimodal dataset are available here.

Acknowledgements

Many thanks to Andrey Kurenkov, Yuke Zhu, and Jeannette Bohg for comments and edits on this blog post.

  1. Fischer et al. FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015. 

  2. Van Den Oord et al. WaveNet: A Generative Model for Raw Audio. SSW, 2016. 

  3. Schulman et al. Trust Region Policy Optimization. ICML, 2015. 


Read More

Using AI to predict retinal disease progression

Vision loss among the elderly is a major healthcare issue: about one in three people have some vision-reducing disease by the age of 65. Age-related macular degeneration (AMD) is the most common cause of blindness in the developed world. In Europe, approximately 25% of those 60 and older have AMD. The dry form is relatively common among people over 65, and usually causes only mild sight loss. However, about 15% of patients with dry AMD go on to develop a more serious form of the disease, exudative AMD (exAMD), which can result in rapid and permanent loss of sight. Fortunately, there are treatments that can slow further vision loss. Although there are no preventative therapies available at present, these are being explored in clinical trials. The period before the development of exAMD may therefore represent a critical window to target for therapeutic innovations: can we predict which patients will progress to exAMD, and help prevent sight loss before it even occurs?

Read More

Enabling E-Textile Microinteractions: Gestures and Light through Helical Structures

Posted by Alex Olwal, Research Scientist, Google Research

Textiles have the potential to help technology blend into our everyday environments and objects by improving aesthetics, comfort, and ergonomics. Consumer devices have started to leverage these opportunities through fabric-covered smart speakers and braided headphone cords, while advances in materials and flexible electronics have enabled the incorporation of sensing and display into soft form factors, such as jackets, dresses, and blankets.

A scalable interactive E-textile architecture with embedded touch sensing, gesture recognition and visual feedback.

In “E-textile Microinteractions” (Proceedings of ACM CHI 2020), we bring interactivity to soft devices and demonstrate how machine learning (ML) combined with an interactive textile topology enables parallel use of discrete and continuous gestures. This work extends our previously introduced E-textile architecture (Proceedings of ACM UIST 2018). This research focuses on cords, due to their modular use as drawstrings in garments, and as wired connections for data and power across consumer devices. By exploiting techniques from textile braiding, we integrate both gesture sensing and visual feedback along the surface through a repeating matrix topology.

For insight into how this works, please see this video about E-textile microinteractions and this video about the E-textile architecture.

E-textile microinteractions combining continuous sensing with discrete motion and grasps.

The Helical Sensing Matrix (HSM)
Braiding generally refers to the diagonal interweaving of three or more material strands. While braids are traditionally used for aesthetics and structural integrity, they can also be used to enable new sensing and display capabilities.

Whereas cords can be made to detect basic touch gestures through capacitive sensing, we developed a helical sensing matrix (HSM) that enables a larger gesture space. The HSM is a braid consisting of electrically insulated conductive textile yarns and passive support yarns, where conductive yarns in opposite directions take the role of transmit and receive electrodes to enable mutual capacitive sensing. The capacitive coupling at their intersections is modulated by the user’s fingers, and these interactions can be sensed anywhere on the cord since the braided pattern repeats along the length.

Left: A Helical Sensing Matrix based on a 4×4 braid (8 conductive threads spiraled around the core). Magenta/cyan are conductive yarns, used as receive/transmit lines. Grey are passive yarns (cotton). Center: Flattened matrix, that illustrates the infinite number of 4×4 matrices (colored circles 0-F), which repeat along the length of the cord. Right: Yellow are fiber optic lines, which provide visual feedback.

Rotation Detection
A key insight is that the two axial columns in an HSM that share a common set of electrodes (and color in the diagram of the flattened matrix) are 180º opposite each other. Thus, pinching and rolling the cord activates a set of electrodes and allows us to track relative motion across these columns. Rotation detection identifies the current phase with respect to the set of time-varying sinusoidal signals that are offset by 90º. The braid allows the user to initiate rotation anywhere, and is scalable with a small set of electrodes.

Rotation is deduced from horizontal finger motion across the columns. The plots below show the relative capacitive signal strengths, which change with finger proximity.
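
The phase-based decoding described above can be illustrated with a small numpy sketch, assuming two capacitive channels that are 90º out of phase (the quadrature signals here are simulated, not recorded from the actual braid):

import numpy as np

def estimate_twist(signal_a, signal_b):
    # Decode relative rotation from two channels assumed to be in quadrature.
    # The unwrapped phase increases or decreases with rolling direction, and its
    # slope reflects rolling speed.
    return np.unwrap(np.arctan2(signal_b, signal_a))

t = np.linspace(0, 2 * np.pi, 100)
a, b = np.cos(3 * t), np.sin(3 * t)   # simulated finger rolling at constant speed
twist = estimate_twist(a, b)
print(np.diff(twist).mean())          # roughly constant positive increments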

Interaction Techniques and Design Guidelines
This e-textile architecture makes the cord touch-sensitive, but its softness and malleability limit suitable interactions compared to rigid touch surfaces. With the unique material in mind, our design guidelines emphasize:

  • Simple gestures. We design for short interactions where the user either makes a single discrete gesture or performs a continuous manipulation.
  • Closed-loop feedback. We want to help the user discover functionality and get continuous feedback on their actions. Where possible, we provide visual, tactile, and audio feedback integrated in the device.

Based on these principles, we leverage our e-textile architecture to enable interaction techniques based on our ability to sense proximity, area, contact time, roll and pressure.

Our e-textile enables interaction based on capacitive sensing of proximity, contact area, contact time, roll, and pressure.

The inclusion of fiber optic strands that can display color of varying intensity enables dynamic real-time feedback to the user.

Braided fiber optics strands create the illusion of directional motion.

Motion Gestures (Flicks and Slides) and Grasping Styles (Pinch, Grab, Pat)
We conducted a gesture elicitation study, which showed opportunities for an expanded gesture set. Inspired by these results, we decided to investigate five motion gestures based on flicks and slides, along with single-touch gestures (pinch, grab and pat).

Gesture elicitation study with imagined touch sensing.

We collected data from 12 new participants, which resulted in 864 gesture samples (12 participants performed eight gestures each, repeating nine times), each having 16 features linearly interpolated to 80 observations over time. Participants performed the eight gestures in their own style without feedback, as we wanted to accommodate individual differences, since the classification is highly dependent on user style (“contact”), preference (“how to pinch/grab”) and anatomy (e.g., hand size). Our pipeline was thus designed for user-dependent training to enable individual styles with differences across participants, such as the inconsistent use of clockwise/counterclockwise, overlap between temporal gestures (e.g., flick vs. flick and hold), and similar pinch and grab gestures. For a user-independent system, we would need to address such differences, for example with stricter instructions for consistency, data from a larger population, and in more diverse settings. Real-time feedback during training will also help mitigate differences as the user learns to adjust their behavior.

Twelve participants (horizontal axis) performed 9 repetitions (animation) for the eight gestures (vertical axis). Each sub-image shows 16 overlaid feature vectors, interpolated to 80 observations over time.

We performed cross-validation for each user across the gestures by training on eight repetitions and testing on one, through nine permutations, and achieved a gesture recognition accuracy of ~94%. This result is encouraging, especially given the expressivity enabled by such a low-resolution sensor matrix (eight electrodes).
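
A leave-one-repetition-out evaluation of this kind can be sketched as follows; the classifier here is a generic scikit-learn SVM stand-in, not the classifier used in the study, and the feature layout is an assumption:

import numpy as np
from sklearn.svm import SVC

def per_user_accuracy(X, y, reps, n_reps=9):
    # X: flattened gesture samples for one user (e.g. 16 features x 80 time steps),
    # y: gesture labels, reps: numpy array with the repetition index (0..8) of each sample.
    scores = []
    for held_out in range(n_reps):
        train, test = reps != held_out, reps == held_out
        clf = SVC(kernel="linear").fit(X[train], y[train])
        scores.append(clf.score(X[test], y[test]))
    return float(np.mean(scores))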

Notable here is that inherent relationships in the repeated sensing matrices are well-suited for machine learning classification. The ML classifier used in our research enables quick training with limited data, which makes a user-dependent interaction system reasonable. In our experience, training for a typical gesture takes less than 30s, which is comparable to the amount of time required to train a fingerprint sensor.

User-Independent, Continuous Twist: Quantifying Precision and Speed
The per-user trained gesture recognition enabled eight new discrete gestures. For continuous interactions, we also wanted to quantify how well user-independent, continuous twist performs for precision tasks. We compared our e-textile with two baselines, a capacitive multi-touch trackpad (“Scroll”) and the familiar headphone cord remote control (“Buttons”). We designed a lab study where the three devices controlled 1D movement in a targeting task.

We analysed three dependent variables for the 1800 trials, covering 12 participants and three techniques: time on task (milliseconds), total motion, and motion during end-of-trial. Participants also provided qualitative feedback through rankings and comments.

Our quantitative analysis suggests that our e-textile’s twisting is faster than existing headphone button controls and comparable in speed to a touch surface. Qualitative feedback also indicated a preference for e-textile interaction over headphone controls.

Left: Weighted average subjective feedback. We mapped the 7-point Likert scale to a score in the range [-3, 3] and multiplied by the number of times the technique received that rating, and computed an average for all the scores. Right: Mean completion times for target distances show that Buttons were consistently slower.
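
One way to read that scoring scheme, with hypothetical rating counts, is the short computation below; averaging over participants (rather than over scale points) is an assumption here.

def weighted_likert_score(counts):
    # counts[i] = number of participants who gave rating i+1 on the 1..7 Likert scale.
    # Ratings are mapped to scores -3..+3, weighted by their counts, and averaged.
    scores = [(rating - 4) * n for rating, n in enumerate(counts, start=1)]
    return sum(scores) / sum(counts)

print(weighted_likert_score([0, 0, 1, 2, 3, 4, 2]))  # hypothetical distribution -> ~1.33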

These results are particularly interesting given that our e-textile was more sensitive, compared to the rigid input devices. One explanation might be its expressiveness — users can twist quickly or slowly anywhere on the cord, and the actions are symmetric and reversible. Conventional buttons on headphones require users to find their location and change grips for actions, which adds a high cost to pressing the wrong button. We use a high-pass filter to limit accidental skin contact, but further work is needed to characterize robustness and evaluate long-term performance in actual contexts of use.

Gesture Prototypes: Headphones, Hoodie Drawstrings, and Speaker Cord
We developed different prototypes to demonstrate the capabilities of our e-textile architecture: e-textile USB-C headphones to control media playback on the phone, a hoodie drawstring to invisibly add music control to clothing, and an interactive cord for gesture controls of smart speakers.

Left: Tap = Play/Pause; Center: Double-tap = Next track; Right: Roll = Volume +/-
Interactive speaker cord for simultaneous use of continuous (twisting/rolling) and discrete gestures (pinch/pat) to control music playback.

Conclusions and Future Directions
We introduce an interactive e-textile architecture for embedded sensing and visual feedback, which can enable both precise small-scale and large-scale motion in a compact cord form factor. With this work, we hope to advance textile user interfaces and inspire the use of microinteractions for future wearable interfaces and smart fabrics, where eyes-free access and casual, compact and efficient input is beneficial. We hope that our e-textile will inspire others to augment physical objects with scalable techniques, while preserving industrial design and aesthetics.

Acknowledgements
This work is a collaboration across multiple teams at Google. Key contributors to the project include Alex Olwal, Thad Starner, Jon Moeller, Greg Priest-Dorman, Ben Carroll, and Gowa Mainini. We thank the Google ATAP Jacquard team for our collaboration, especially Shiho Fukuhara, Munehiko Sato, and Ivan Poupyrev. We thank Google Wearables, and Kenneth Albanowski and Karissa Sawyer, in particular. Finally, we would like to thank Mark Zarich for illustrations, Bryan Allen for videography, Frank Li for data processing, Mathieu Le Goc for valuable discussions, and Carolyn Priest-Dorman for textile advice.