Amazon Scholar David Card wins half the award, while academic research consultant Guido Imbens shares in the other half.
An ML-Based Framework for COVID-19 Epidemiology
Posted by Joel Shor, Software Engineer, Google Research and Sercan Arik, Research Scientist, Google Research, Cloud AI Team
Over the past 20 months, the COVID-19 pandemic has had a profound impact on daily life, presented logistical challenges for businesses planning for supply and demand, and created difficulties for governments and organizations working to support communities with timely public health responses. While well-studied epidemiological models can help predict COVID-19 cases and deaths to address these challenges, this pandemic has generated an unprecedented amount of real-time, publicly available data, which makes it possible to apply more advanced machine learning techniques to improve results.
In “A prospective evaluation of AI-augmented epidemiology to forecast COVID-19 in the USA and Japan”, accepted to npj Digital Medicine, we continued our previous work [1, 2, 3, 4] and proposed a framework designed to simulate the effect of certain policy changes, such as school closings or a state of emergency, on COVID-19 cases and deaths at the US-state, US-county, and Japan-prefecture level, using only publicly available data. We conducted a 2-month prospective assessment of our public forecasts, during which our US model tied or outperformed all other 33 models on the COVID-19 Forecast Hub. We also released a fairness analysis of the performance on protected sub-groups in the US and Japan. Like other Google initiatives to help with COVID-19 [1, 2, 3], we are releasing daily forecasts based on this work to the public for free, on the web [us, ja] and through BigQuery.
The Model
Models for infectious diseases have been studied by epidemiologists for decades. Compartmental models are the most common, as they are simple, interpretable, and can fit different disease phases effectively. In a compartmental model, the population is divided into mutually exclusive groups, or compartments, based on disease status (such as susceptible, exposed, or recovered), and people flow between compartments as their disease status changes. The rates of change between compartments are fit to past data; for example, susceptible people becoming exposed decreases the susceptible compartment and increases the exposed compartment, at a rate that depends on disease-spreading characteristics. Observed data for COVID-19 associated outcomes, such as confirmed cases, hospitalizations, and deaths, are used to train these models.
In this work, we propose a number of extensions to Susceptible-Exposed-Infectious-Removed (SEIR) type compartmental models.
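To make the basic SEIR dynamics concrete before describing our extensions, here is a minimal discrete-time update in Python. The rate names and values (beta, sigma, gamma) are illustrative assumptions, not the learned parameters used in the paper:

```python
# Minimal discrete-time SEIR update (illustrative rates, not the paper's learned values).
def seir_step(S, E, I, R, beta, sigma, gamma, N):
    """Advance the S, E, I, R compartments by one day.

    beta:  infection rate (S -> E), sigma: incubation rate (E -> I),
    gamma: removal rate (I -> R), N: total population.
    """
    new_exposed = beta * S * I / N      # susceptible people who become exposed
    new_infectious = sigma * E          # exposed people who become infectious
    new_removed = gamma * I             # infectious people who recover or die
    S -= new_exposed
    E += new_exposed - new_infectious
    I += new_infectious - new_removed
    R += new_removed
    return S, E, I, R

# Example: simulate 28 days for a population of 1M with 100 initial infections.
S, E, I, R = 999_900.0, 0.0, 100.0, 0.0
for _ in range(28):
    S, E, I, R = seir_step(S, E, I, R, beta=0.3, sigma=1 / 5, gamma=1 / 10, N=1_000_000)
```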
Our framework proposes a number of novel technical innovations:
- Learned transition rates: Instead of using static rates for transitions between compartments across all locations and times, we learn these rates from data (see the sketch after this list). This allows us to take advantage of the vast amount of available data with informative signals, such as Google’s COVID-19 Community Mobility Reports, healthcare supply, demographics, and econometric features.
- Explainability: Our framework provides explainability for decision makers, offering insights on disease propagation trends via its compartmental structure, and suggesting which factors may be most important for driving compartmental transitions.
- Expanded compartments: We add hospitalization, ICU, ventilator, and vaccine compartments and demonstrate efficient training despite data sparsity.
- Information sharing across locations: As opposed to fitting to an individual location, we have a single model for all locations in a country (e.g., >3000 US counties) with distinct dynamics and characteristics, and we show the benefit of transferring information across locations.
- Seq2seq modeling: We use a sequence-to-sequence model with a novel partial teacher forcing approach that minimizes amplified growth of errors into the future.
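As a rough illustration of the learned transition rates idea, the sketch below replaces a static rate with a small network that maps location covariates to a bounded, time-varying rate. The architecture, feature names, and bounds are assumptions for illustration, not the model described in the paper:

```python
import tensorflow as tf

# Hypothetical per-location/day covariates: [mobility_change, pct_over_65, icu_beds_per_1k, ...]
NUM_FEATURES = 8

def make_rate_head(max_rate):
    """Small MLP mapping covariates to a transition rate bounded in (0, max_rate)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
        tf.keras.layers.Lambda(lambda x: max_rate * x),  # keep the rate in a plausible range
    ])

infection_rate_head = make_rate_head(max_rate=2.0)   # e.g., drives the S -> E transition
fatality_rate_head = make_rate_head(max_rate=0.05)   # e.g., drives the I -> death transition

# The predicted rates would drive the compartment updates, and the whole model would be
# fit jointly to observed cases and deaths across all locations.
covariates = tf.random.normal([4, NUM_FEATURES])     # a batch of 4 location-days
beta_t = infection_rate_head(covariates)             # shape [4, 1]
```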
Forecast Accuracy
Each day, we train models to predict COVID-19 associated outcomes (primarily deaths and cases) 28 days into the future. We report the mean absolute percentage error (MAPE) for both a country-wide score and a location-level score, with both cumulative values and weekly incremental values for COVID-19 associated outcomes.
We compare our framework with alternatives for the US from the COVID-19 Forecast Hub. On MAPE, our models outperform all other 33 models except one — the ensemble forecast that also includes our model’s predictions, where the difference is not statistically significant.
We also used prediction uncertainty to estimate whether a forecast is likely to be accurate. If we reject forecasts that the model considers uncertain, we can improve the accuracy of the forecasts that we do release. This is possible because our model has well-calibrated uncertainty.
Mean absolute percentage error (MAPE; lower is better) decreases as we remove uncertain forecasts, increasing accuracy.
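As a sketch of how this evaluation works, the snippet below computes MAPE and then recomputes it after dropping the forecasts the model is least certain about. The numbers and the uncertainty threshold are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (lower is better)."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Hypothetical 28-day-ahead forecasts, ground truth, and per-forecast uncertainty.
actual      = np.array([120., 340., 95., 410., 60.])
predicted   = np.array([130., 310., 90., 520., 58.])
uncertainty = np.array([0.05, 0.08, 0.04, 0.40, 0.03])  # model's own uncertainty estimate

print("MAPE, all forecasts:       %.1f%%" % mape(actual, predicted))
keep = uncertainty < 0.2                                 # reject the most uncertain forecasts
print("MAPE, confident forecasts: %.1f%%" % mape(actual[keep], predicted[keep]))
```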
What-If Tool to Simulate Pandemic Management Policies and Strategies
In addition to understanding the most probable scenario given past data, decision makers are interested in how different decisions could affect future outcomes, for example, understanding the impact of school closures, mobility restrictions, and different vaccination strategies. Our framework allows counterfactual analysis by replacing the forecasted values of selected variables with their counterfactual counterparts. The results of our simulations reinforce the risk of relaxing non-pharmaceutical interventions (NPIs) prematurely, before rapid disease spread has been reduced. Similarly, the Japan simulations show that maintaining the state of emergency while having a high vaccination rate greatly reduces infection rates.
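A minimal sketch of this kind of counterfactual analysis: run the fitted model once with the forecasted covariates and once with selected covariates overridden. The function, data layout, and covariate names below are hypothetical, not the released API:

```python
import copy
import numpy as np

def what_if(model, forecast_inputs, overrides):
    """Re-run a fitted forecaster with selected future covariates replaced.

    forecast_inputs: dict mapping covariate names to arrays over the forecast horizon.
    overrides: dict mapping a covariate name (e.g., "state_of_emergency") to the value
    to impose over the horizon; everything else keeps its forecasted value.
    """
    counterfactual_inputs = copy.deepcopy(forecast_inputs)
    for name, value in overrides.items():
        counterfactual_inputs[name] = np.full_like(forecast_inputs[name], value)
    baseline = model.predict(forecast_inputs)        # most probable scenario
    scenario = model.predict(counterfactual_inputs)  # e.g., state of emergency maintained
    return baseline, scenario

# Hypothetical usage: compare forecasted deaths with and without a state of emergency.
# baseline, scenario = what_if(model, inputs, {"state_of_emergency": 1.0})
```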
Fairness Analysis
To ensure that our models do not create or reinforce unfairly biased decision making, in alignment with our AI Principles, we performed a fairness analysis separately for forecasts in the US and Japan by quantifying whether the model’s accuracy was worse on protected sub-groups. These categories include age, gender, income, and ethnicity in the US, and age, gender, income, and country of origin in Japan. In all cases, we demonstrated no consistent pattern of errors among these groups once we controlled for the number of COVID-19 deaths and cases that occur in each subgroup.
Real-World Use Cases
In addition to quantitative analyses to measure the performance of our models, we conducted a structured survey in the US and Japan to understand how organizations were using our model forecasts. In total, seven organizations responded with the following results on the applicability of the model.
- Organization type: Academia (3), Government (2), Private industry (2)
- Main user job role: Analyst/Scientist (3), Healthcare professional (1), Statistician (2), Managerial (1)
- Location: USA (4), Japan (3)
- Predictions used: Confirmed cases (7), Deaths (4), Hospitalizations (4), ICU (3), Ventilator (2), Infected (2)
- Model use case: Resource allocation (2), Business planning (2), Scenario planning (1), General understanding of COVID-19 spread (1), Confirming existing forecasts (1)
- Frequency of use: Daily (1), Weekly (1), Monthly (1)
- Was the model helpful?: Yes (7)
To share a few examples: in the US, the Harvard Global Health Institute and Brown School of Public Health used the forecasts to help create COVID-19 testing targets that the media used to help inform the public. The US Department of Defense used the forecasts to help determine where to allocate resources and to take specific events into account. In Japan, the model was used to make business decisions: one large company with stores in more than 20 prefectures used the forecasts to improve its sales planning and to adjust store hours.
Limitations and next steps
Our approach has a few limitations. First, it is limited by available data: we can only release daily forecasts as long as there is reliable, high-quality public data. For instance, public transportation usage could be very informative, but that information is not publicly available. Second, compartmental models have limited model capacity and cannot capture very complex dynamics of COVID-19 propagation. Third, the distributions of case counts and deaths are very different between the US and Japan. For example, most of Japan’s COVID-19 cases and deaths have been concentrated in a few of its 47 prefectures, with the others experiencing low values. This means that our per-prefecture models, which are trained to perform well across all Japanese prefectures, often have to strike a delicate balance between avoiding overfitting to noise and getting useful supervision from these relatively COVID-19-free prefectures.
We have updated our models to take into account large changes in disease dynamics, such as the increasing number of vaccinations. We are also expanding to new engagements with city governments, hospitals, and private organizations. We hope that our public releases continue to help the public and policy-makers address the challenges of the ongoing pandemic, and we hope that our method will be useful to epidemiologists and public health officials in this and future health crises.
Acknowledgements
This paper was the result of hard work from a variety of teams within Google and collaborators around the globe. We’d especially like to thank our paper co-authors from the School of Medicine at Keio University, Graduate School of Public Health at St Luke’s International University, and Graduate School of Medicine at The University of Tokyo.
ML Community Day: Save the date
Posted by the TensorFlow team
Please join us for ML Community Day, a virtual developer event on November 9th. You’ll hear from the TensorFlow, JAX, and DeepMind teams and the community, covering new products, updates, and more.
This event takes place on TensorFlow’s sixth birthday (time flies!), and celebrates many years of open-source work by the developer community. We’ll have talks on on-device Machine Learning, JAX, Responsible AI, Cloud TPUs, Machine Learning in production, and sessions from the community.
You’ll also learn how you can get involved with the machine learning community by connecting with Google Developer Experts, joining Special Interest Groups, TensorFlow User Groups, and more.
Later in the day we’ll host the TensorFlow Contributor Summit, in which Google Developer Experts, Special Interest Groups, TensorFlow User Group organizers, and other community leaders gather to connect in a small group to discuss topics such as documentation and how to contribute to the TensorFlow ecosystem. During this event, we will also host the first TensorFlow Awards Ceremony to recognize outstanding community contributions.
You can find more details at the event website. We’ll see you soon!
Helmut Katzgraber elected fellow of the American Physical Society
Amazon quantum computing scientist recognized for ‘outstanding contributions to physics’.
Self-Supervised Learning Advances Medical Image Classification
Posted by Shekoofeh Azizi, AI Resident, Google Research
In recent years, there has been increasing interest in applying deep learning to medical imaging tasks, with exciting progress in various applications like radiology, pathology and dermatology. Despite the interest, it remains challenging to develop medical imaging models, because high-quality labeled data is often scarce due to the time-consuming effort needed to annotate medical images. Given this, transfer learning is a popular paradigm for building medical imaging models. With this approach, a model is first pre-trained using supervised learning on a large labeled dataset (like ImageNet) and then the learned generic representation is fine-tuned on in-domain medical data.
Other more recent approaches that have proven successful in natural image recognition tasks, especially when labeled examples are scarce, use self-supervised contrastive pre-training, followed by supervised fine-tuning (e.g., SimCLR and MoCo). In pre-training with contrastive learning, generic representations are learned by simultaneously maximizing agreement between differently transformed views of the same image and minimizing agreement between transformed views of different images. Despite their successes, these contrastive learning methods have received limited attention in medical image analysis and their efficacy is yet to be explored.
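For readers unfamiliar with the mechanics, below is a compact sketch of an NT-Xent (normalized temperature-scaled cross-entropy) loss of the kind used by SimCLR-style methods: embeddings of two augmented views of the same image form positive pairs, and all other images in the batch act as negatives. This is a generic illustration, not the training code used in the paper:

```python
import tensorflow as tf

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss over a batch of paired views.

    z1, z2: [N, D] embeddings of two augmented views of the same N images.
    View i of z1 and view i of z2 form a positive pair; the other 2N - 2
    embeddings in the batch serve as negatives.
    """
    z1 = tf.math.l2_normalize(z1, axis=1)
    z2 = tf.math.l2_normalize(z2, axis=1)
    z = tf.concat([z1, z2], axis=0)                         # [2N, D]
    sim = tf.matmul(z, z, transpose_b=True) / temperature   # cosine similarities
    n = tf.shape(z1)[0]
    logits = sim - tf.eye(2 * n) * 1e9                      # exclude self-similarity
    # The positive for row i is row (i + N) mod 2N: the other view of the same image.
    labels = tf.concat([tf.range(n, 2 * n), tf.range(0, n)], axis=0)
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    return tf.reduce_mean(loss)

# Example: random embeddings for a batch of 8 images, two views each.
loss = nt_xent_loss(tf.random.normal([8, 128]), tf.random.normal([8, 128]))
```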
In “Big Self-Supervised Models Advance Medical Image Classification”, to appear at the International Conference on Computer Vision (ICCV 2021), we study the effectiveness of self-supervised contrastive learning as a pre-training strategy within the domain of medical image classification. We also propose Multi-Instance Contrastive Learning (MICLe), a novel approach that generalizes contrastive learning to leverage special characteristics of medical image datasets. We conduct experiments on two distinct medical image classification tasks: dermatology condition classification from digital camera images (27 categories) and multilabel chest X-ray classification (5 categories). We observe that self-supervised learning on ImageNet, followed by additional self-supervised learning on unlabeled domain-specific medical images, significantly improves the accuracy of medical image classifiers. Specifically, we demonstrate that self-supervised pre-training outperforms supervised pre-training, even when the full ImageNet dataset (14M images and 21.8K classes) is used for supervised pre-training.
SimCLR and Multi-Instance Contrastive Learning (MICLe)
Our approach consists of three steps: (1) self-supervised pre-training on unlabeled natural images (using SimCLR); (2) further self-supervised pre-training using unlabeled medical data (using either SimCLR or MICLe); followed by (3) task-specific supervised fine-tuning using labeled medical data.
Our approach comprises three steps: (1) self-supervised pre-training on unlabeled ImageNet using SimCLR; (2) additional self-supervised pre-training using unlabeled medical images (if multiple images of each medical condition are available, a novel Multi-Instance Contrastive Learning (MICLe) strategy is used to construct more informative positive pairs based on different images); and (3) supervised fine-tuning on labeled medical images. Note that unlike step (1), steps (2) and (3) are task and dataset specific.
After the initial pre-training with SimCLR on unlabeled natural images is complete, we train the model to capture the special characteristics of medical image datasets. This, too, can be done with SimCLR, but that method constructs positive pairs only through augmentation and does not readily leverage patients’ metadata for positive pair construction. Alternatively, we use MICLe, which uses multiple images of the underlying pathology for each patient case, when available, to construct more informative positive pairs for self-supervised learning. Such multi-instance data is often available in medical imaging datasets — e.g., frontal and lateral views of mammograms, retinal fundus images from each eye, etc.
Given multiple images of a given patient case, MICLe constructs a positive pair for self-supervised contrastive learning by drawing two crops from two distinct images from the same patient case. Such images may be taken from different viewing angles and show different body parts with the same underlying pathology. This presents a great opportunity for self-supervised learning algorithms to learn representations that are robust to changes of viewpoint, imaging conditions, and other confounding factors in a direct way. MICLe does not require class label information and only relies on different images of an underlying pathology, the type of which may be unknown.
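A rough sketch of how MICLe-style positive pairs could be constructed from patient case metadata follows. The data structure and field names are hypothetical; the actual pipeline is described in the paper:

```python
import random

def micle_positive_pairs(cases, crop_fn):
    """Build positive pairs by cropping two *different* images of the same patient case.

    cases: list of dicts like {"case_id": ..., "images": [img1, img2, ...]}
    crop_fn: random crop / augmentation applied independently to each image.
    Falls back to two crops of the same image when only one image is available
    (i.e., standard SimCLR-style behavior).
    """
    pairs = []
    for case in cases:
        images = case["images"]
        if len(images) >= 2:
            img_a, img_b = random.sample(images, 2)   # two distinct views of the same pathology
        else:
            img_a = img_b = images[0]
        pairs.append((crop_fn(img_a), crop_fn(img_b)))
    return pairs
```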
Combining these self-supervised learning strategies, we show that even in a highly competitive production setting we can achieve a sizable gain of 6.7% in top-1 accuracy on dermatology skin condition classification and an improvement of 1.1% in mean AUC on chest X-ray classification, outperforming strong supervised baselines pre-trained on ImageNet (the prevailing protocol for training medical image analysis models). In addition, we show that self-supervised models are robust to distribution shift and can learn efficiently with only a small number of labeled medical images.
Comparison of Supervised and Self-Supervised Pre-training
Despite its simplicity, we observe that pre-training with MICLe consistently improves the performance of dermatology classification over the original method of pre-training with SimCLR under different pre-training dataset and base network architecture choices. Using MICLe for pre-training translates to a (1.18 ± 0.09)% increase in top-1 accuracy for dermatology classification over using SimCLR alone. The results demonstrate the benefit accrued from utilizing additional metadata or domain knowledge to construct more semantically meaningful augmentations for contrastive pre-training. In addition, our results suggest that wider and deeper models yield greater performance gains, with ResNet-152 (2x width) models often outperforming ResNet-50 (1x width) models or smaller counterparts.
Improved Generalization with Self-Supervised Models
For each task, we perform pre-training and fine-tuning using the in-domain unlabeled and labeled data, respectively. We also use another dataset obtained in a different clinical setting as a shifted dataset to further evaluate the robustness of our method to out-of-domain data. For the chest X-ray task, we note that self-supervised pre-training with either ImageNet or CheXpert data improves generalization, but stacking them both yields further gains. As expected, we also note that when only ImageNet is used for self-supervised pre-training, the model performs worse than when only in-domain data is used for pre-training.
To test the performance under distribution shift, for each task, we held out additional labeled datasets for testing that were collected under different clinical settings. We find that the performance improvement in the distribution-shifted dataset (ChestX-ray14) by using self-supervised pre-training (both using ImageNet and CheXpert data) is more pronounced than the original improvement on the CheXpert dataset. This is a valuable finding, as generalization under distribution shift is of paramount importance to clinical applications. On the dermatology task, we observe similar trends for a separate shifted dataset that was collected in skin cancer clinics and had a higher prevalence of malignant conditions. This demonstrates that the robustness of the self-supervised representations to distribution shifts is consistent across tasks.
Improved Label Efficiency
We further investigate the label-efficiency of the self-supervised models for medical image classification by fine-tuning the models on different fractions of labeled training data. We use label fractions ranging from 10% to 90% for both the Derm and CheXpert training datasets and examine how performance varies across the available label fractions for the dermatology task. First, we observe that pre-training using self-supervised models can compensate for low label efficiency for medical image classification, and across the sampled label fractions, self-supervised models consistently outperform the supervised baseline. These results also suggest that MICLe yields proportionally larger gains when fine-tuning with fewer labeled examples. In fact, MICLe is able to match baselines using only 20% of the training data for ResNet-50 (4x) and 30% of the training data for ResNet-152 (2x).
Conclusion
Supervised pre-training on natural image datasets is commonly used to improve medical image classification. We investigate an alternative strategy based on self-supervised pre-training on unlabeled natural and medical images and find that it can significantly improve upon supervised pre-training, the standard paradigm for training medical image analysis models. This approach can lead to models that are more accurate and label-efficient and are robust to distribution shifts. In addition, our proposed Multi-Instance Contrastive Learning method (MICLe) enables the use of additional metadata to create realistic augmentations, yielding a further performance boost for image classifiers.
Self-supervised pre-training is much more scalable than supervised pre-training because class label annotation is not required. We hope this paper will help popularize the use of self-supervised approaches in medical image analysis, yielding label-efficient and robust models suited for clinical deployment at scale in the real world.
Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of researchers, software engineers, clinicians, and cross-functional contributors across Google Health and Google Brain. We thank our co-authors: Basil Mustafa, Fiona Ryan, Zach Beaver, Jan Freyberg, Jon Deaton, Aaron Loh, Alan Karthikesalingam, Simon Kornblith, Ting Chen, Vivek Natarajan, and Mohammad Norouzi. We also thank Yuan Liu from Google Health for valuable feedback and our partners for access to the datasets used in the research.
3 questions with Seyed Sajjadi: How to utilize a video analytics platform to automate the process of learning
Sajjadi, a co-founder and CEO of Alexa Fund company nFlux.ai, explains how procedure monitoring can help humans, from astronauts to manufacturers, and even home cooks.
At the crossroads of language, technology, and empathy
Rujul Gandhi’s love of reading blossomed into a love of language at age 6, when she discovered a book at a garage sale called “What’s Behind the Word?” With forays into history, etymology, and language genealogies, the book captivated Gandhi, who as an MIT senior remains fascinated with words and how we use them.
Growing up partially in the U.S. and mostly in India, Gandhi was surrounded by a variety of languages and dialects. When she moved to India at age 8, she could already see how knowing the Marathi language allowed her to connect more easily to her classmates — an early lesson in how language shapes our human experiences.
Initially thinking she might want to study creative writing or theater, Gandhi first learned about linguistics as its own field of study through an online course in ninth grade. Now a linguistics major at MIT, she is studying the structure of language from the syllable to sentence level, and also learning about how we perceive language. She finds the human aspects of how we use language, and the fact that languages are constantly changing, particularly compelling.
“When you learn to appreciate language, you can then appreciate culture,” she says.
Communicating and connecting, with a technological assist
Taking advantage of MIT’s Global Teaching Labs program, Gandhi traveled to Kazakhstan in January 2020 to teach linguistics and biology to high school students. Lacking a solid grasp of the language, she cautiously navigated conversations with her students and hosts. However, she soon found that working to understand the language, giving culturally relevant examples, and writing her assignments in Russian and Kazakh allowed her to engage more meaningfully with her students.
Technology also helped bridge the communication barrier between Gandhi and her Russian-speaking host father, who spoke no English. With help from Google Translate, they bonded over shared interests, including 1950s and ’60s Bollywood music.
As she began to study computer science at MIT, Gandhi saw more opportunities to connect people through both language and technology, thus leading her to pursue a double major in linguistics and in computer science and electrical engineering.
“The problems I understand through linguistics, I can try to find solutions to through computer science,” she explains.
Energized by ambitious projects
Gandhi is determined to prioritize social impact while looking for those solutions. Through various leadership roles in on-campus organizations during her time at MIT, especially in the student-run Educational Studies Program (ESP), she realized how much working directly with people and being on the logistical side of large projects energizes her. With ESP, she helps organize events that bring thousands of high school and middle school students to campus each year for classes and other activities led by MIT students.
After her second directing program, Spark 2020, was cancelled last March because of the pandemic, Gandhi eventually embraced the virtual experience. She planned and co-directed a virtual program, Splash 2020, hosting about 1,100 students. “Interacting with the ESP community convinced me that an organization can function efficiently with a strong commitment to its values,” she says.
The pandemic also heightened Gandhi’s appreciation for the MIT community, as many people reached out to her offering a place to stay when campus shut down. She says she sees MIT as home — a place where she not only feels cared for, but also relishes the opportunity to care for others.
Now, she is bridging cultural barriers on campus through performing art. Dance is another one of Gandhi’s loves. When she couldn’t find a group to practice Indian classical dance with, Gandhi took matters into her own hands. In 2019, she and a couple of friends founded Nritya, a student organization at MIT. The group hopes to have its first in-person performance this fall. “Dance is like its own language,” she observes.
Technology born out of empathy
In her academic work, Gandhi relishes researching linguistics problems from a theoretical perspective, and then applying that knowledge through hands-on experiences. “The good thing about MIT is it lets you go out of your comfort zone,” she says.
For example, in IAP 2019 she worked on a geographical dialect survey of her native Marathi language with Deccan College, a center of linguistics in her hometown. And, through the Undergraduate Research Opportunities Program (UROP), she is currently working on a research project in phonetics and phonology, examining how language “contact,” or interaction between languages, influences the sounds that speakers use.
The following winter, she also worked with Tarjimly, a nonprofit connecting refugees with interpreters through a smartphone app. She notes that translating systems have advanced quickly in terms of allowing people to communicate more effectively, but she also recognizes that there is great potential to improve them to benefit and reach even more people.
“How are people going to advocate for themselves and make use of public infrastructure if they can’t interface with it?” she asks.
Mulling over other ideas, Gandhi says it would be interesting to explore how sign language might be more effectively interpreted through a smartphone translating app. And she sees a need to further improve regional translations to better connect with the culture and context of the areas where a language is spoken, accounting for dialectal differences and new developments.
Looking ahead, Gandhi wants to focus on designing systems that better integrate theoretical developments in linguistics and on making language technology widely accessible. She says she finds the work of bringing together technology and linguistics to be most rewarding when it involves people, and that she finds the most meaning in her projects when they are centered around empathy for others’ experiences.
“The technology born out of empathy is the technology that I want to be working on,” she explains. “Language is fundamentally a people thing; you can’t ignore the people when you’re designing technology that relates to language.”
Selective Classification Can Magnify Disparities Across Groups
Selective classification, where models are allowed to “abstain” when they are uncertain about a prediction, is a useful approach for deploying models in settings where errors are costly. For example, in medicine, model errors can have life-or-death ramifications, but abstentions can be easily handled by backing off to a doctor, who then makes a diagnosis. Across a range of applications from vision [1, 2, 3] and NLP [4, 5], even simple selective classifiers, relying only on model logits, routinely and often dramatically improve accuracy by abstaining. This makes selective classification a compelling tool for ML practitioners [6, 7].
However, in our recent ICLR paper, we find that despite reliably improving average accuracy, selective classification can fail to improve and even hurt the accuracy over certain subpopulations of the data. As a motivating example, consider the task of diagnosing pleural effusion, or fluid in the lungs, from chest X-rays. Pleural effusion is often treated with a chest drain, so many pleural effusion cases also have chest drains, while most cases without pleural effusion do not have chest drains [8]. While selective classification improves average accuracy for this task, we find that it does not appreciably improve accuracy on the most clinically relevant subgroup, or subpopulation, of the data: those that have pleural effusion but don’t yet have a chest drain, i.e., those that have pleural effusion but have not yet been treated for it. Practitioners should thus be wary of these potential failure modes when using selective classification in the wild.
To further outline this critical failure mode of selective classification, we’ll first provide an overview of selective classification. We then demonstrate empirically that selective classification can hurt or fail to significantly improve accuracy on certain subgroups of the data. We next outline our theoretical results, which suggest that selective classification is rarely a good tool to resolve differences in accuracy between subgroups. Finally, we suggest methods for building more equitable selective classifiers.
Selective classification basics
Imagine you are trying to build a model that classifies X-rays as either pleural effusion positive or negative. With standard classification, the model is required to either output positive or negative on each input. In contrast, a selective classifier can additionally abstain from making a prediction when it is not sufficiently confident in any class [9, 10, 11]. By abstaining, selective classifiers aim to avoid making predictions on examples they are likely to classify incorrectly, say a corrupted or difficult-to-classify X-ray, which increases their average accuracy.
One key question in selective classification is how to choose which examples to abstain on. Selective classifiers can be viewed as two models: one that outputs a prediction (say, negative), and another that outputs a confidence in that prediction (say, 0.7 out of 1). Whenever the confidence is above a certain (confidence) threshold, the selective classifier outputs the original prediction; for example, if the threshold were 0.6, the selective classifier would predict negative. Otherwise, the selective classifier abstains. In our work, we primarily use softmax response [11] to extract confidences: the confidence in a prediction is simply the maximum softmax probability over the possible classes.
Selective classifiers are typically measured in terms of the accuracy (also called selective accuracy) on predicted examples, and the coverage, or fraction of examples the selective classifier makes predictions on [12]. We can tweak both coverage and accuracy by adjusting the confidence threshold: a lower threshold for making predictions increases the coverage, since the model’s confidence for more examples is sufficiently high. However, this tends to lower average accuracy, as the model is less confident on average in its predictions. In contrast, higher thresholds increase the confidence required to make a prediction, reducing the coverage but generally increasing average accuracy.
Typically, researchers measure the performance of selective classifiers by plotting accuracy as a function of coverage. In particular, for each possible coverage (ranging from 0: abstain on everything to 1: predict on everything) they compute the maximum threshold that achieves that coverage, and then plot the accuracy at that threshold. One particularly useful reference point is the full-coverage accuracy: the accuracy of the selective classifier at coverage 1, which is the accuracy of the regular classifier.
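A minimal sketch of how such an accuracy-coverage curve can be computed with softmax response, assuming you already have the model’s softmax probabilities and the true labels (the random data at the end is just for illustration):

```python
import numpy as np

def accuracy_coverage_curve(probs, labels):
    """Accuracy of a softmax-response selective classifier at every coverage level.

    probs: [N, C] softmax probabilities; labels: [N] integer labels.
    Returns (coverages, accuracies), implicitly sweeping the confidence threshold.
    """
    confidence = probs.max(axis=1)                  # softmax response
    correct = (probs.argmax(axis=1) == labels)
    order = np.argsort(-confidence)                 # most confident predictions first
    correct_sorted = correct[order]
    n = len(labels)
    coverages = np.arange(1, n + 1) / n             # predict on the k most confident examples
    accuracies = np.cumsum(correct_sorted) / np.arange(1, n + 1)
    return coverages, accuracies

# Example with random predictions over 3 classes:
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=1000)
labels = rng.integers(0, 3, size=1000)
cov, acc = accuracy_coverage_curve(probs, labels)
# Worst-group curves can be obtained by restricting `probs`/`labels` to one subgroup.
```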
Selective classification can magnify accuracy disparities between subgroups
While prior work mostly focuses on average accuracy for selective classifiers, we instead focus on the accuracy of different subgroups of the data. In particular, we focus on datasets where models often latch onto spurious correlations. For example, in the above pleural effusion task, the model might learn to predict whether or not there is a chest drain, instead of directly diagnosing pleural effusion, because chest drains are highly correlated with pleural effusion; this correlation is spurious because not all pleural effusions have a chest drain. We consider subgroups that highlight this spurious correlation: two groups for when the spurious correlation gives the correct result (positive pleural effusion with chest drain, negative pleural effusion without a chest drain), and two groups when it does not (positive pleural effusion with no chest drain, negative pleural effusion with a chest drain). As a result, a model that learns this spurious correlation obtains high accuracy for the first two subgroups, but low accuracy for the latter two.
In principle, selective classification seems like a reasonable approach towards resolving these accuracy discrepancies between different subgroups of the data. Since we empirically see that selective classification reliably improves average accuracy, it must be more likely to cause a model to abstain when an example would be classified incorrectly. Incorrect examples disproportionately come from the lowest-accuracy subgroups of the data, suggesting that without bias in the confidence function, worst-group accuracy should increase faster than average accuracy.
To test this, we plot the accuracy-coverage curves over a range of tasks, including hair color classification (CelebA), bird type classification (Waterbirds), pleural effusion classification (CheXpert-device), toxicity classification (CivilComments) and natural language inference (MultiNLI). CelebA, Waterbirds, and MultiNLI use the same spurious correlation setup presented in [2]. CivilComments exhibits the same spurious correlations as described in the WILDS benchmark [13]. Finally, we created the CheXpert-device dataset by subsampling the original CheXpert dataset [3] such that the presence of a chest drain even more strongly correlates with pleural effusion.
Reading the accuracy-coverage curves from right to left (i.e., as coverage decreases), average accuracy reliably increases, but the worst-group accuracies do not always increase and exhibit a range of undesirable behaviors. On CelebA, worst-group accuracy actually decreases: the more confident predictions are more likely to be incorrect. For Waterbirds, CheXpert-device, and CivilComments, worst-group accuracy sometimes increases, but never by more than 10 points until the noisy low-coverage regime, and sometimes decreases. For MultiNLI, worst-group accuracy does slowly improve, but does not even reach 80% until very low coverages.
These results highlight that practitioners should be wary: even if selective classification reliably increases average accuracy, it will not necessarily improve the accuracy of different subgroups.
Selective classification rarely overcomes accuracy disparities
To better understand why selective classification can sometimes hurt worst-group accuracy and does not reduce full-coverage accuracy disparities, we theoretically characterize, for a broad class of distributions: (1) when selective classification improves accuracy as the confidence threshold decreases, and (2) when selective classification disproportionately helps the worst group.
At a high level, our analysis focuses on the margin, or the model’s confidence for a given prediction multiplied by -1 if that prediction was incorrect. Intuitively, the more negative the margin, the “worse” the prediction. Using only the margin distribution, we can recreate the accuracy-coverage curve by abstaining on density between the negative and positive threshold, and computing the fraction of remaining density that is correct.
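To make this concrete, one way to write the margin and the resulting selective accuracy at confidence threshold $\tau$ (this formalization is ours, consistent with the description above, not quoted from the paper): with confidence $c(x)$, prediction $\hat{y}(x)$, and margin random variable $M$,

$$
m(x, y) \;=\; c(x)\,\bigl(2\,\mathbf{1}[\hat{y}(x) = y] - 1\bigr),
\qquad
\mathrm{acc}(\tau) \;=\; \frac{\Pr[M \ge \tau]}{\Pr[M \ge \tau] + \Pr[M \le -\tau]},
\qquad
\mathrm{coverage}(\tau) \;=\; \Pr[\,|M| \ge \tau\,].
$$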
The key result of our theoretical analysis is that the full-coverage accuracy of a subgroup dramatically impacts how well selective classification performs on that subgroup, which amplifies disparities. For a wide range of margin distributions, full-coverage accuracy and a property of the margin distribution we call left-log-concavity completely determine whether the accuracy of a selective classifier monotonically increases or decreases. When a margin distribution is left-log-concave, as many standard distributions (e.g., Gaussians) are, accuracy monotonically increases when full-coverage accuracy is at least 50% and decreases otherwise.
Next steps
So far, we have painted a fairly bleak picture of selective classification: even though it reliably improves average accuracy, it can, both theoretically and empirically, exacerbate accuracy disparities between subgroups. There are still, however, mechanisms to improve selective classification, which we outline below.
One natural step towards improving selective classification is to develop confidence functions that allow selective classifiers to overcome accuracy disparities between groups. In our paper, we test the two most widely used methods: softmax response and Monte Carlo dropout [10]. We consistently find that both are disproportionately overconfident on incorrect examples from the worst groups. However, new confidence functions that are better calibrated across groups would likely help resolve disparities [14], and developing them is an important direction for future work.
In the short term, however, we find that the most promising method to improve worst-group accuracy with selective classification is to build selective classifiers on top of already-equitable models, or models that achieve similar full-coverage accuracies across the relevant subgroups. One method to train such models is group DRO, which minimizes the maximum loss over subgroups [2]. We find empirically that selective classifiers trained with group DRO improve the accuracy of subgroups at roughly the same rate when they have the same accuracy at full coverage. However, group DRO is far from a perfect fix: it requires a priori knowledge of the relevant subgroups, and subgroup labels for each training example, which may be costly to obtain. Nevertheless, it is a promising start, and developing more broadly applicable methods for training already-equitable models is a critical area for future work.
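A minimal sketch of the group DRO objective mentioned above: instead of the average loss, each update minimizes the loss of the currently worst-performing group. This is simplified from the full algorithm in [2], which uses an exponentially-weighted soft maximum over groups rather than a hard max; the names and values below are illustrative:

```python
import tensorflow as tf

def group_dro_loss(per_example_losses, group_ids, num_groups):
    """Worst-group loss: average the per-example losses within each group, take the max."""
    group_losses = []
    for g in range(num_groups):
        mask = tf.cast(tf.equal(group_ids, g), per_example_losses.dtype)
        group_losses.append(tf.reduce_sum(mask * per_example_losses) /
                            tf.maximum(tf.reduce_sum(mask), 1.0))
    return tf.reduce_max(tf.stack(group_losses))

# Usage inside a training step (losses and group labels are placeholders):
losses = tf.constant([0.2, 1.3, 0.4, 0.9])
groups = tf.constant([0, 1, 0, 1])
worst_group_loss = group_dro_loss(losses, groups, num_groups=2)  # 1.1, the mean loss of group 1
```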
To conclude, despite the intuition that selective classification should improve worst-group accuracy, and selective classification’s ability to consistently improve average accuracy, common selective classifiers can severely exacerbate accuracy discrepancies between subgroups. We hope our work encourages practitioners to apply selective classification with caution, and in general focus on how different methods affect different subgroups of the data.
Acknowledgements
Thanks to the SAIL blog editors, Pang Wei Koh, and Shiori Sagawa for their helpful feedback on this blog post. This post is based on our ICLR 2021 paper:
Selective Classification Can Magnify Disparities Across Groups. Erik Jones*, Shiori Sagawa*, Pang Wei Koh*, Ananya Kumar, and Percy Liang. ICLR 2021.
1. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
2. Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.
3. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In Association for the Advancement of Artificial Intelligence (AAAI), volume 33, pp. 590–597, 2019.
4. Daniel Borkan, Lucas Dixon, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. Nuanced metrics for measuring unintended bias with real data for text classification. In World Wide Web (WWW), pp. 491–500, 2019.
5. Adina Williams, Nikita Nangia, and Samuel Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Association for Computational Linguistics (ACL), pp. 1112–1122, 2018.
6. Yonatan Geifman and Ran El-Yaniv. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning (ICML), 2019.
7. Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert. In International Conference on Machine Learning (ICML), 2020.
8. Luke Oakden-Rayner, Jared Dunnmon, Gustavo Carneiro, and Christopher Ré. Hidden stratification causes clinically meaningful failures in machine learning for medical imaging. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 151–159, 2020.
9. C. K. Chow. An optimum character recognition system using decision functions. In IRE Transactions on Electronic Computers, 1957.
10. Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (ICML), 2016.
11. Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
12. Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. Journal of Machine Learning Research (JMLR), 11, 2010.
13. Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, Tony Lee, Etienne David, Ian Stavness, Wei Guo, Berton A. Earnshaw, Imran S. Haque, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, and Percy Liang. WILDS: A benchmark of in-the-wild distribution shifts. arXiv, 2020.
14. Yoav Wald, Amir Feder, Daniel Greenfeld, and Uri Shalit. On calibration and out-of-domain generalization. arXiv preprint arXiv:2102.10395, 2021.
Deep learning helps predict traffic crashes before they happen
Today’s world is one big maze, connected by layers of concrete and asphalt that afford us the luxury of navigation by vehicle. For many of our road-related advancements — GPS lets us fire fewer neurons thanks to map apps, cameras alert us to potentially costly scrapes and scratches, and electric autonomous cars have lower fuel costs — our safety measures haven’t quite caught up. We still rely on a steady diet of traffic signals, trust, and the steel surrounding us to safely get from point A to point B.
To get ahead of the uncertainty inherent to crashes, scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Qatar Center for Artificial Intelligence developed a deep learning model that predicts very high-resolution crash risk maps. Fed on a combination of historical crash data, road maps, satellite imagery, and GPS traces, the risk maps describe the expected number of crashes over a period of time in the future, to identify high-risk areas and predict future crashes.
Typically, these types of risk maps are captured at much lower resolutions that hover around hundreds of meters, which means glossing over crucial details since the roads become blurred together. The new maps, though, use 5×5-meter grid cells, and the higher resolution brings newfound clarity: the scientists found that a highway road, for example, has a higher risk than nearby residential roads, and ramps merging onto and exiting the highway have an even higher risk than other roads.
“By capturing the underlying risk distribution that determines the probability of future crashes at all places, and without any historical data, we can find safer routes, enable auto insurance companies to provide customized insurance plans based on driving trajectories of customers, help city planners design safer roads, and even predict future crashes,” says MIT CSAIL PhD student Songtao He, a lead author on a new paper about the research.
Even though car crashes are sparse, they cost about 3 percent of the world’s GDP and are the leading cause of death in children and young adults. This sparsity makes inferring maps at such a high resolution a tricky task. Crashes at this level are thinly scattered — the average annual odds of a crash in a 5×5 grid cell is about one-in-1,000 — and they rarely happen at the same location twice. Previous attempts to predict crash risk have been largely “historical,” as an area would only be considered high-risk if there was a previous nearby crash.
The team’s approach casts a wider net to capture critical data. It identifies high-risk locations using GPS trajectory patterns, which give information about density, speed, and direction of traffic, and satellite imagery that describes road structures, such as the number of lanes, whether there’s a shoulder, or if there’s a large number of pedestrians. Then, even if a high-risk area has no recorded crashes, it can still be identified as high-risk, based on its traffic patterns and topology alone.
To evaluate the model, the scientists used crashes and data from 2017 and 2018, and tested its performance at predicting crashes in 2019 and 2020. Many locations were identified as high-risk, even though they had no recorded crashes, and also experienced crashes during the follow-up years.
“Our model can generalize from one city to another by combining multiple clues from seemingly unrelated data sources. This is a step toward general AI, because our model can predict crash maps in uncharted territories,” says Amin Sadeghi, a lead scientist at Qatar Computing Research Institute (QCRI) and an author on the paper. “The model can be used to infer a useful crash map even in the absence of historical crash data, which could translate to positive use for city planning and policymaking by comparing imaginary scenarios.”
The dataset covered 7,500 square kilometers from Los Angeles, New York City, Chicago and Boston. Among the four cities, L.A. was the most unsafe, since it had the highest crash density, followed by New York City, Chicago, and Boston.
“If people can use the risk map to identify potentially high-risk road segments, they can take action in advance to reduce the risk of trips they take. Apps like Waze and Apple Maps have incident feature tools, but we’re trying to get ahead of the crashes — before they happen,” says He.
He and Sadeghi wrote the paper alongside Sanjay Chawla, research director at QCRI, and MIT professors of electrical engineering and computer science Mohammad Alizadeh, Hari Balakrishnan, and Sam Madden. They will present the paper at the 2021 International Conference on Computer Vision.
Get Started with TensorFlow Lite Micro on Sony’s Spresense
A guest post by Daniel Sandblom, Sony
Editor’s note: an earlier version of this article appeared on the Sony Developers site.
Now you can develop solutions with TensorFlow Lite Micro (TFLM) for the Spresense microcontroller board from Sony. TFLM is designed to run on microcontroller systems, where hardware resources are more limited than on larger computer systems. The footprint of TFLM is typically on the order of only tens of kilobytes.
What you get is a combination of a leading machine learning ecosystem with a high-performance microcontroller running at very low power consumption. The Spresense board was designed with camera and hi-res audio inputs as core features, which open up a substantial set of use cases. Pete Warden, a research engineer on the TensorFlow team, shares his view on TFLM now being available for the Spresense board: “It’s great to see this kind of compute capability tightly integrated into a low power sensor, the combination will help make machine learning accessible to developers in medical, agriculture, industrial monitoring and many other areas where a small form factor and energy are strong constraints.”
The development of TFLM has been a tight collaboration between Google and Arm to optimize functionality while keeping the footprint to a minimum. Fredrik Knutsson, Team Lead at Arm, explains how TFLM has been optimized for the ARM processor architecture: “Arm’s open source CMSIS-NN library provides high performance implementations of common neural network functions for Arm Cortex-M processors. Arm’s engineers have worked closely with the TensorFlow team to develop optimized versions of the TensorFlow Lite kernels in the CMSIS-NN library, delivering extremely fast performance on Arm Cortex-M cores like Spresense.”
How to get started with TensorFlow on Spresense
The easiest and quickest way to get started with TensorFlow on Spresense is to run one of the examples. The hello_world example shows the basic steps and functionality, the micro_speech example uses Spresense’s audio capabilities, and the person_detection example uses the Spresense camera. The latter two examples demonstrate how to connect audio and visual sensors to the inputs of TensorFlow models.
Below are the general steps to run the examples:
- Set up the Spresense SDK: Getting started with TensorFlow for Spresense
- Download the Spresense repository including the examples
- Build and flash the binary to the Spresense main board
- Run the example
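If you later want to deploy your own model rather than one of the bundled examples, a typical first step is converting a trained TensorFlow model into a quantized TensorFlow Lite flatbuffer small enough for a microcontroller. The sketch below shows that general workflow; the model, paths, and layer sizes are hypothetical, and details for Spresense are covered in the SDK documentation:

```python
import tensorflow as tf

# Hypothetical tiny model; replace with your own trained model.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # post-training (dynamic range) quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
# The resulting flatbuffer is typically embedded as a C array and loaded by the
# TensorFlow Lite Micro interpreter in the microcontroller application.
```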
Heads-up: we will run an upcoming webinar for “TensorFlow on Spresense” on October 14 – register here!
Check out these links for more info: