These startups are using AI to decarbonize buildings, create sustainable agriculture, protect biodiversity, and remove carbon from the atmosphere.
MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks
Vision-language foundation models are built on the premise of a single pre-training followed by subsequent adaptation to multiple downstream tasks. Two main and disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict whether image-text pairs correctly match, effectively building visual and text representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next text token in a sequence, thus learning to generate text according to the required task. Contrastive learning enables image-text and text-image retrieval tasks, such as finding the image that best matches a certain description, and next-token learning enables text-generative tasks, such as image captioning and visual question answering (VQA). While both approaches have demonstrated powerful results, a model that is pre-trained contrastively typically does not fare well on text-generative tasks, and vice versa. Furthermore, adaptation to other tasks is often done with complex or inefficient methods. For example, in order to extend a vision-language model to videos, some models need to run inference on each video frame separately. This limits the videos that can be processed to only a few frames and does not fully take advantage of the motion information available across frames.
Motivated by this, we present “A Simple Architecture for Joint Learning for MultiModal Tasks”, called MaMMUT, which is able to train jointly for these competing objectives and which provides a foundation for many vision-language tasks, either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model that trains across contrastive, text-generative, and localization-aware objectives. It consists of a single image encoder and a text decoder, which allows for direct reuse of both components. Furthermore, a straightforward adaptation to video-text tasks requires using the image encoder only once and can handle many more frames than prior work. In line with recent language models (e.g., PaLM, GLaM, GPT-3), our architecture uses a decoder-only text model and can be thought of as a simple extension of language models. While modest in size, our model outperforms the state of the art or achieves competitive performance on image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA.
The MaMMUT model enables a wide range of tasks such as image-text/text-image retrieval (top left and top right), VQA (middle left), open-vocabulary detection (middle right), and VideoQA (bottom). |
Decoder-only model architecture
One surprising finding is that a single language decoder is sufficient for all these tasks, which obviates the need for both the complex constructs and the training procedures of prior approaches. For example, our model (presented on the left in the figure below) consists of a single visual encoder and a single text decoder, connected via cross-attention, and trains simultaneously on both contrastive and text-generative types of losses. Comparatively, prior work either is not able to handle image-text retrieval tasks, or applies only some losses to only some parts of the model. To enable multimodal tasks and fully take advantage of the decoder-only model, we need to jointly train with both contrastive losses and text-generative, captioning-like losses.
MaMMUT architecture (left) is a simple construct consisting of a single vision encoder and a single text decoder. Compared to other popular vision-language models — e.g., PaLI (middle) and ALBEF, CoCa (right) — it trains jointly and efficiently for multiple vision-language tasks, with both contrastive and text-generative losses, fully sharing the weights between the tasks. |
Decoder two-pass learning
Decoder-only models for language learning show clear advantages in performance with smaller model size (almost half the parameters). The main challenge for applying them to multimodal settings is to unify the contrastive learning (which uses unconditional sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representations within the decoder. During the first pass, we utilize cross attention and causal masking to learn the caption generation task — the text features can attend to the image features and predict the tokens in sequence. On the second pass, we disable the cross-attention and causal masking to learn the contrastive task. The text features will not see the image features but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Completing this two-pass approach within the same decoder allows for accommodating both types of tasks that were previously hard to reconcile. While simple, we show that this model architecture is able to provide a foundation for multiple multimodal tasks.
MaMMUT decoder-only two-pass learning enables both contrastive and generative learning paths by the same model. |
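To make the two-pass idea concrete, here is a minimal, hypothetical sketch (illustrative layer sizes, mean pooling, and a single decoder layer are our assumptions, not the actual MaMMUT implementation): the same decoder block is run once with causal masking and cross-attention for captioning, and once bidirectionally without image conditioning to produce the contrastive text embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoPassDecoderLayer(nn.Module):
    """One decoder block reused for both the generative and contrastive passes."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text, image=None, causal=True):
        T = text.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), 1) if causal else None
        x = text + self.self_attn(text, text, text, attn_mask=mask)[0]
        if image is not None:  # cross-attention is only enabled in the generative pass
            x = x + self.cross_attn(x, image, image)[0]
        return x + self.ffn(x)

layer = TwoPassDecoderLayer()
text = torch.randn(2, 16, 256)   # token embeddings
image = torch.randn(2, 49, 256)  # vision-encoder features

# Pass 1: generative (captioning) — causal masking plus cross-attention to the image.
gen_features = layer(text, image=image, causal=True)

# Pass 2: contrastive — bidirectional self-attention, no image conditioning.
txt_embed = layer(text, image=None, causal=False).mean(dim=1)  # pooled text representation
img_embed = image.mean(dim=1)
# CLIP-style similarity matrix; a symmetric cross-entropy over it gives the contrastive loss.
logits = F.normalize(txt_embed, dim=-1) @ F.normalize(img_embed, dim=-1).T
```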
Another advantage of our architecture is that, since it is trained for these disjoint tasks, it can be seamlessly applied to multiple applications such as image-text and text-image retrieval, VQA, and captioning.
Moreover, MaMMUT easily adapts to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which required applying it multiple times. This is slow and restricts the number of frames the model can handle, typically to only 6–8. With MaMMUT, we use sparse video tubes for lightweight adaptation directly via the spatio-temporal information from the video. Furthermore, adapting the model to Open-Vocabulary Detection is done by simply training to detect bounding-boxes via an object-detection head.
Results
Our model achieves excellent zero-shot results on image-text and text-image retrieval without any adaptation, outperforming all previous state-of-the-art models. The results on VQA are competitive with state-of-the-art results, which are achieved by much larger models. The PaLI model (17B parameters) and the Flamingo model (80B) have the best performance on the VQA2.0 dataset, but MaMMUT (2B) has the same accuracy as the 15B PaLI.
MaMMUT outperforms the state of the art (SOTA) on Zero-Shot Image-Text (I2T) and Text-Image (T2I) retrieval on both MS-COCO (top) and Flickr (bottom) benchmarks. |
Performance on the VQA2.0 dataset is competitive but does not outperform large models such as Flamingo-80B and PaLI-17B. Performance is evaluated in the more challenging open-ended text generation setting. |
MaMMUT also outperforms the state of the art on VideoQA, as shown below on the MSRVTT-QA and MSVD-QA datasets. Note that we outperform much bigger models such as Flamingo, which is specifically designed for image+video pre-training and is pre-trained with both image-text and video-text data.
Our results also outperform the state of the art on open-vocabulary detection fine-tuning, as shown below.
MaMMUT open-vocabulary detection results on the LVIS dataset compared to state-of-the-art methods. We report the average precision on rare classes (APr), as previously adopted in the literature. |
Key ingredients
We show that joint training of contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are served better by different design choices. We see that fewer cross-attention connections are better for retrieval tasks, while more are preferred by VQA tasks. Yet, although this means that our model’s design choices might be suboptimal for individual tasks, our model is more effective than more complex, or larger, models.
Ablation studies showing that fewer cross-attention connections (1-2) are better for retrieval tasks (top), whereas more connections favor text-generative tasks such as VQA (bottom). |
Conclusion
We presented MaMMUT, a simple and compact vision-encoder, language-decoder model that jointly trains on a number of conflicting objectives to reconcile contrastive-like and text-generative tasks. Our model also serves as a foundation for many more vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, VideoQA, video captioning, open-vocabulary detection, and VQA. We hope it can be further used for more multimodal applications.
Acknowledgements
The work described is co-authored by: Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedhosseini, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zirui Wang, Yonghui Wu, Runze Li, Jie Mei, Radu Soricut, Qingqing Huang, Andy Ly, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, Zoubin Ghahramani for their help and support.
IndoorSim-to-OutdoorReal: Learning to navigate outdoors without any outdoor experience
Teaching mobile robots to navigate in complex outdoor environments is critical to real-world applications, such as delivery or search and rescue. However, this is also a challenging problem as the robot needs to perceive its surroundings, and then explore to identify feasible paths towards the goal. Another common challenge is that the robot needs to overcome uneven terrains, such as stairs, curbs, or rockbed on a trail, while avoiding obstacles and pedestrians. In our prior work, we investigated the second challenge by teaching a quadruped robot to tackle challenging uneven obstacles and various outdoor terrains.
In “IndoorSim-to-OutdoorReal: Learning to Navigate Outdoors without any Outdoor Experience”, we present our recent work to tackle the robotic challenge of reasoning about the perceived surroundings to identify a viable navigation path in outdoor environments. We introduce a learning-based indoor-to-outdoor transfer algorithm that uses deep reinforcement learning to train a navigation policy in simulated indoor environments, and successfully transfers that same policy to real outdoor environments. We also introduce Context-Maps (maps with environment observations created by a user), which are applied to our algorithm to enable efficient long-range navigation. We demonstrate that with this policy, robots can successfully navigate hundreds of meters in novel outdoor environments, around previously unseen outdoor obstacles (trees, bushes, buildings, pedestrians, etc.), and in different weather conditions (sunny, overcast, sunset).
PointGoal navigation
User inputs can tell a robot where to go with commands like “go to the Android statue”, pictures showing a target location, or by simply picking a point on a map. In this work, we specify the navigation goal (a selected point on a map) as a coordinate relative to the robot’s current position (i.e., “go to ∆x, ∆y”), also known as the PointGoal Visual Navigation (PointNav) task. PointNav is a general formulation for navigation tasks and is one of the standard choices for indoor navigation. However, due to the diverse visuals, uneven terrains, and long-distance goals in outdoor environments, training PointNav policies for outdoor environments is a challenging task.
Indoor-to-outdoor transfer
Recent successes in training wheeled and legged robotic agents to navigate in indoor environments were enabled by the development of fast, scalable simulators and the availability of large-scale datasets of photorealistic 3D scans of indoor environments. To leverage these successes, we develop an indoor-to-outdoor transfer technique that enables our robots to learn from simulated indoor environments and to be deployed in real outdoor environments.
To overcome the differences between simulated indoor environments and real outdoor environments, we apply kinematic control and image augmentation techniques in our learning system. When using kinematic control, we assume the existence of a reliable low-level locomotion controller that can control the robot to precisely reach a new location. This assumption allows us to directly move the robot to the target location during simulation training through a forward Euler integration and relieves us from having to explicitly model the underlying robot dynamics in simulation, which drastically improves the throughput of simulation data generation. Prior work has shown that kinematic control can lead to better sim-to-real transfer compared to a dynamic control approach, where full robot dynamics are modeled and a low-level locomotion controller is required for moving the robot.
Left: Kinematic control; Right: Dynamic control |
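As a rough illustration of the kinematic control described above (the variable names, pose representation, and time step are our assumptions, not the actual training setup), the policy's commanded velocities can simply be integrated with a forward Euler step instead of simulating full robot dynamics:

```python
import numpy as np

def kinematic_step(x, y, yaw, v_lin, v_ang, dt=0.1):
    """Advance the robot pose assuming the low-level controller tracks the command exactly."""
    x += v_lin * np.cos(yaw) * dt   # forward Euler integration of the commanded motion
    y += v_lin * np.sin(yaw) * dt
    yaw += v_ang * dt
    return x, y, yaw

# One simulated control step: move at 0.5 m/s while turning at 0.2 rad/s.
print(kinematic_step(0.0, 0.0, 0.0, v_lin=0.5, v_ang=0.2))
```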
We created an outdoor maze-like environment using objects found indoors for initial experiments, where we used Boston Dynamics’ Spot robot for test navigation. We found that the robot could navigate around novel obstacles in the new outdoor environment.
The Spot robot successfully navigates around obstacles found in indoor environments, with a policy trained entirely in simulation. |
However, when faced with unfamiliar outdoor obstacles not seen during training, such as a large slope, the robot was unable to navigate the slope.
The robot is unable to navigate up slopes, as slopes are rare in indoor environments and the robot was not trained to tackle them. |
To enable the robot to walk up and down slopes, we apply an image augmentation technique during simulation training. Specifically, we randomly tilt the simulated camera on the robot during training, pointing it up or down by as much as 30 degrees. This augmentation effectively makes the robot perceive slopes even though the floor is level. Training on these perceived slopes enables the robot to navigate slopes in the real world.
By randomly tilting the camera angle during training in simulation, the robot is now able to walk up and down slopes. |
Since the robots were only trained in simulated indoor environments, in which they typically need to walk to a goal just a few meters away, we found that the learned network failed to process longer-range inputs — e.g., the policy failed to walk forward for 100 meters in an empty space. To enable the policy network to handle the long-range inputs that are common in outdoor navigation, we normalize the goal vector using the log of the goal distance, as sketched below.
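A goal 100 m away would otherwise produce inputs far outside the range seen during indoor training. Here is a small sketch of one possible log normalization (the exact formulation used in the paper may differ):

```python
import numpy as np

def normalize_goal(dx, dy, eps=1e-6):
    """Compress the goal-vector magnitude with a log, keeping its direction."""
    dist = np.hypot(dx, dy)
    scale = np.log(1.0 + dist) / (dist + eps)
    return np.array([dx * scale, dy * scale])

print(normalize_goal(100.0, 0.0))   # a 100 m goal becomes roughly [4.6, 0.0]
```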
Context-Maps for complex long-range navigation
Putting everything together, the robot can navigate outdoors towards the goal, while walking on uneven terrain and avoiding trees, pedestrians, and other outdoor obstacles. However, there is still one key component missing: the robot’s ability to plan an efficient long-range path. At this scale of navigation, taking a wrong turn and backtracking can be costly. For example, we find that the local exploration strategy learned by standard PointNav policies is insufficient for finding a long-range goal and usually leads to a dead end (shown below). This is because the robot navigates without context of its environment, and the optimal path may not be visible to the robot from the start.
Navigation policies without context of the environment do not handle complex long-range navigation goals. |
To enable the robot to take this context into consideration and purposefully plan an efficient path, we provide a Context-Map (a binary image that represents a top-down occupancy map of the region that the robot is within) as an additional observation for the robot. An example Context-Map is given below, where the black region denotes areas occupied by obstacles and the white region is walkable by the robot. The green and red circles denote the start and goal locations of the navigation task. Through the Context-Map, we can provide hints to the robot (e.g., the narrow opening in the route below) to help it plan an efficient navigation route. In our experiments, we create the Context-Map for each route guided by Google Maps satellite images. We denote this variant of PointNav with environmental context as Context-Guided PointNav.
Example of the Context-Map (right) for a navigation task (left). |
It is important to note that the Context-Map does not need to be accurate, because it only serves as a rough outline for planning. During navigation, the robot still needs to rely on its onboard cameras to identify pedestrians, which are absent on the map, and adapt its path accordingly. In our experiments, a human operator quickly sketches the Context-Map from the satellite image, masking out the regions to be avoided. The Context-Map, together with other onboard sensory inputs, including depth images and the relative position to the goal, is fed into a neural network with attention models (i.e., transformers), which is trained using DD-PPO, a distributed implementation of proximal policy optimization, in large-scale simulations.
The Context-Guided PointNav architecture consists of a 3-layer convolutional neural network (CNN) to process depth images from the robot’s camera, and a multilayer perceptron (MLP) to process the goal vector. The features are passed into a gated recurrent unit (GRU). We use an additional CNN encoder to process the context-map (top-down map). We compute the scaled dot product attention between the map and the depth image, and use a second GRU to process the attended features (Context Attn., Depth Attn.). The output of the policy are linear and angular velocities for the Spot robot to follow. |
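The following condensed sketch mirrors the architecture described in the caption above; the layer sizes, pooling, and the single-query attention are illustrative assumptions rather than the trained configuration.

```python
import torch
import torch.nn as nn

class ContextGuidedPointNav(nn.Module):
    """Condensed sketch of the Context-Guided PointNav policy; sizes are illustrative."""
    def __init__(self, dim=128):
        super().__init__()
        def cnn():
            return nn.Sequential(
                nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, dim, 3, stride=2), nn.ReLU())
        self.depth_enc, self.map_enc = cnn(), cnn()
        self.goal_mlp = nn.Sequential(nn.Linear(2, dim), nn.ReLU())
        self.gru_depth = nn.GRUCell(2 * dim, dim)
        self.gru_attn = nn.GRUCell(dim, dim)
        self.head = nn.Linear(dim, 2)  # linear and angular velocity commands

    def forward(self, depth, context_map, goal, h1, h2):
        d = self.depth_enc(depth).flatten(2).mean(-1)             # (B, dim) depth feature
        m = self.map_enc(context_map).flatten(2).transpose(1, 2)  # (B, N, dim) map tokens
        h1 = self.gru_depth(torch.cat([d, self.goal_mlp(goal)], -1), h1)
        # Scaled dot-product attention: the depth feature queries the Context-Map tokens.
        scores = (m @ d.unsqueeze(-1)).squeeze(-1) / d.size(-1) ** 0.5
        attended = (torch.softmax(scores, -1).unsqueeze(-1) * m).sum(1)
        h2 = self.gru_attn(attended + h1, h2)
        return self.head(h2), h1, h2

policy = ContextGuidedPointNav()
h1 = h2 = torch.zeros(1, 128)
vel, h1, h2 = policy(torch.randn(1, 1, 128, 128),   # depth image
                     torch.randn(1, 1, 128, 128),   # binary Context-Map
                     torch.randn(1, 2), h1, h2)     # normalized goal vector
```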
Results
We evaluate our system across three long-range outdoor navigation tasks. The provided Context-Maps are rough, incomplete environment outlines that omit obstacles, such as cars, trees, or chairs.
With the proposed algorithm, our robot can successfully reach the distant goal location 100% of the time, without a single collision or human intervention. The robot was able to navigate around pedestrians and real-world clutter that are not present on the context-map, and navigate on various terrain including dirt slopes and grass.
Route 1
Route 2
Route 3
Conclusion
This work opens up robotic navigation research to the less explored domain of diverse outdoor environments. Our indoor-to-outdoor transfer algorithm uses zero real-world experience and does not require the simulator to model predominantly-outdoor phenomena (terrain, ditches, sidewalks, cars, etc.). The success of the approach comes from a combination of robust locomotion control, a low sim-to-real gap in depth and map sensors, and large-scale training in simulation. We demonstrate that providing robots with approximate, high-level maps can enable long-range navigation in novel outdoor environments. Our results provide compelling evidence for challenging the (admittedly reasonable) hypothesis that a new simulator must be designed for every new scenario we wish to study. For more information, please see our project page.
Acknowledgements
We would like to thank Sonia Chernova, Tingnan Zhang, April Zitkovich, Dhruv Batra, and Jie Tan for advising and contributing to the project. We would also like to thank Naoki Yokoyama, Nubby Lee, Diego Reyes, Ben Jyenis, and Gus Kouretas for help with the robot experiment setup.
Checks, Google’s AI-powered privacy platform
Checks takes the complexity out of compliance, using AI to help companies quickly discover, communicate and fix issues.
Try 4 new Arts and AI experiments
Four new online interactive artworks from Google Arts & Culture Lab artists in residence.
Google at ICLR 2023
The Eleventh International Conference on Learning Representations (ICLR 2023) is being held this week as a hybrid event in Kigali, Rwanda. We are proud to be a Diamond Sponsor of ICLR 2023, a premier conference on deep learning, where Google researchers contribute at all levels. This year we are presenting over 100 papers and are actively involved in organizing and hosting a number of different events, including workshops and interactive sessions.
If you’re registered for ICLR 2023, we hope you’ll visit the Google booth to learn more about the exciting work we’re doing across topics spanning representation and reinforcement learning, theory and optimization, social impact, safety and privacy, and applications from generative AI to speech and robotics. Continue below to find the many ways in which Google researchers are engaged at ICLR 2023, including workshops, papers, posters and talks (Google affiliations in bold).
Board and Organizing Committee
Board Members include: Shakir Mohamed, Tara Sainath
Senior Program Chairs include: Been Kim
Workshop Chairs include: Aisha Walcott-Bryant, Rose Yu
Diversity, Equity & Inclusion Chairs include: Rosanne Liu
Outstanding Paper awards
Emergence of Maps in the Memories of Blind Navigation Agents
Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
DreamFusion: Text-to-3D Using 2D Diffusion
Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall
Keynote speaker
Learned Optimizers: Why They’re the Future, Why They’re Hard, and What They Can Do Now
Jascha Sohl-Dickstein
Workshops
Kaggle@ICLR 2023: ML Solutions in Africa
Organizers include: Julia Elliott, Phil Culliton, Ray Harvey
Facilitators: Julia Elliott, Walter Reade
Reincarnating Reinforcement Learning (Reincarnating RL)
Organizers include: Rishabh Agarwal, Ted Xiao, Max Schwarzer
Speakers include: Sergey Levine
Panelists include: Marc G. Bellemare, Sergey Levine
Trustworthy and Reliable Large-Scale Machine Learning Models
Organizers include: Sanmi Koyejo
Speakers include: Nicholas Carlini
Physics for Machine Learning (Physics4ML)
Speakers include: Yasaman Bahri
AI for Agent-Based Modelling Community (AI4ABM)
Organizers include: Pablo Samuel Castro
Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)
Organizers include: Mathilde Caron, Tengyu Ma, Hanie Sedghi
Speakers include: Yasaman Bahri, Yann Dauphin
Neurosymbolic Generative Models 2023 (NeSy-GeMs)
Organizers include: Kevin Ellis
Speakers include: Daniel Tarlow, Tuan Anh Le
What Do We Need for Successful Domain Generalization?
Panelists include: Boqing Gong
The 4th Workshop on Practical ML for Developing Countries: Learning Under Limited/Low Resource Settings
Keynote Speaker: Adji Bousso Dieng
Machine Learning for Remote Sensing
Speakers include: Abigail Annkah
Multimodal Representation Learning (MRL): Perks and Pitfalls
Organizers include: Petra Poklukar
Speakers include: Arsha Nagrani
Pitfalls of Limited Data and Computation for Trustworthy ML
Organizers include: Prateek Jain
Speakers include: Nicholas Carlini, Praneeth Netrapalli
Sparsity in Neural Networks: On Practical Limitations and Tradeoffs Between Sustainability and Efficiency
Organizers include: Trevor Gale, Utku Evci
Speakers include: Aakanksha Chowdhery, Jeff Dean
Time Series Representation Learning for Health
Speakers include: Katherine Heller
Deep Learning for Code (DL4C)
Organizers include: Gabriel Orlanski
Speakers include: Alex Polozov, Daniel Tarlow
Affinity Workshops
Tiny Papers Showcase Day (a DEI initiative)
Organizers include: Rosanne Liu
Papers
Evolve Smoothly, Fit Consistently: Learning Smooth Latent Dynamics for Advection-Dominated Systems
Zhong Yi Wan, Leonardo Zepeda-Nunez, Anudhyan Boral, Fei Sha
Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, Chiyuan Zhang
Emergence of Maps in the Memories of Blind Navigation Agents (Outstanding Paper Award)
Erik Wijmans, Manolis Savva, Irfan Essa, Stefan Lee, Ari S. Morcos, Dhruv Batra
Offline Q-Learning on Diverse Multi-task Data Both Scales and Generalizes (see blog post)
Aviral Kumar, Rishabh Agarwal, Xingyang Geng, George Tucker, Sergey Levine
ReAct: Synergizing Reasoning and Acting in Language Models (see blog post)
Shunyu Yao*, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, Yuan Cao
Prompt-to-Prompt Image Editing with Cross-Attention Control
Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, Daniel Cohen-Or
DreamFusion: Text-to-3D Using 2D Diffusion (Outstanding Paper Award)
Ben Poole, Ajay Jain, Jonathan T. Barron, Ben Mildenhall
A System for Morphology-Task Generalization via Unified Representation and Behavior Distillation
Hiroki Furuta, Yusuke Iwasawa, Yutaka Matsuo, Shixiang Shane Gu
Sample-Efficient Reinforcement Learning by Breaking the Replay Ratio Barrier
Pierluca D’Oro, Max Schwarzer, Evgenii Nikishin, Pierre-Luc Bacon, Marc G Bellemare, Aaron Courville
Dichotomy of Control: Separating What You Can Control from What You Cannot
Sherry Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum
Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search
Michał Zawalski, Michał Tyrolski, Konrad Czechowski, Tomasz Odrzygóźdź, Damian Stachura, Piotr Piekos, Yuhuai Wu, Łukasz Kucinski, Piotr Miłos
The Trade-Off Between Universality and Label Efficiency of Representations from Contrastive Learning
Zhenmei Shi, Jiefeng Chen, Kunyang Li, Jayaram Raghuram, Xi Wu, Yingyu Liang, Somesh Jha
Sparsity-Constrained Optimal Transport
Tianlin Liu*, Joan Puigcerver, Mathieu Blondel
Unmasking the Lottery Ticket Hypothesis: What’s Encoded in a Winning Ticket’s Mask?
Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite
Extreme Q-Learning: MaxEnt RL without Entropy
Divyansh Garg, Joey Hejna, Matthieu Geist, Stefano Ermon
Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs
Albert Qiaochu Jiang, Sean Welleck, Jin Peng Zhou, Timothee Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, Yuhuai Wu
SimPer: Simple Self-Supervised Learning of Periodic Targets
Yuzhe Yang, Xin Liu, Jiang Wu, Silviu Borac, Dina Katabi, Ming-Zher Poh, Daniel McDuff
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S. Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
What Learning Algorithm Is In-Context Learning? Investigations with Linear Models
Ekin Akyurek*, Dale Schuurmans, Jacob Andreas, Tengyu Ma*, Denny Zhou
Preference Transformer: Modeling Human Preferences Using Transformers for RL
Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee
Iterative Patch Selection for High-Resolution Image Recognition
Benjamin Bergner, Christoph Lippert, Aravindh Mahendran
Open-Vocabulary Object Detection upon Frozen Vision and Language Models
Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
(Certified!!) Adversarial Robustness for Free!
Nicholas Carlini, Florian Tramér, Krishnamurthy (Dj) Dvijotham, Leslie Rice, Mingjie Sun, J. Zico Kolter
REPAIR: REnormalizing Permuted Activations for Interpolation Repair
Keller Jordan, Hanie Sedghi, Olga Saukh, Rahim Entezari, Behnam Neyshabur
Discrete Predictor-Corrector Diffusion Models for Image Synthesis
José Lezama, Tim Salimans, Lu Jiang, Huiwen Chang, Jonathan Ho, Irfan Essa
Feature Reconstruction From Outputs Can Mitigate Simplicity Bias in Neural Networks
Sravanti Addepalli, Anshul Nasery, Praneeth Netrapalli, Venkatesh Babu R., Prateek Jain
An Exact Poly-time Membership-Queries Algorithm for Extracting a Three-Layer ReLU Network
Amit Daniely, Elad Granot
Language Models Are Multilingual Chain-of-Thought Reasoners
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, Jason Wei
Scaling Forward Gradient with Local Losses
Mengye Ren*, Simon Kornblith, Renjie Liao, Geoffrey Hinton
Treeformer: Dense Gradient Trees for Efficient Attention Computation
Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain
LilNetX: Lightweight Networks with EXtreme Model Compression and Structured Sparsification
Sharath Girish, Kamal Gupta, Saurabh Singh, Abhinav Shrivastava
DiffusER: Diffusion via Edit-Based Reconstruction
Machel Reid, Vincent J. Hellendoorn, Graham Neubig
Leveraging Unlabeled Data to Track Memorization
Mahsa Forouzesh, Hanie Sedghi, Patrick Thiran
A Mixture-of-Expert Approach to RL-Based Dialogue Management
Yinlam Chow, Aza Tulepbergenov, Ofir Nachum, Dhawal Gupta, Moonkyung Ryu, Mohammad Ghavamzadeh, Craig Boutilier
Easy Differentially Private Linear Regression
Kareem Amin, Matthew Joseph, Monica Ribero, Sergei Vassilvitskii
KwikBucks: Correlation Clustering with Cheap-Weak and Expensive-Strong Signals
Sandeep Silwal*, Sara Ahmadian, Andrew Nystrom, Andrew McCallum, Deepak Ramachandran, Mehran Kazemi
Massively Scaling Heteroscedastic Classifiers
Mark Collier, Rodolphe Jenatton, Basil Mustafa, Neil Houlsby, Jesse Berent, Effrosyni Kokiopoulou
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar
Compositional Semantic Parsing with Large Language Models
Andrew Drozdov, Nathanael Scharli, Ekin Akyurek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, Denny Zhou
Extremely Simple Activation Shaping for Out-of-Distribution Detection
Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, Rosanne Liu
Long Range Language Modeling via Gated State Spaces
Harsh Mehta, Ankit Gupta, Ashok Cutkosky, Behnam Neyshabur
Investigating Multi-task Pretraining and Generalization in Reinforcement Learning
Adrien Ali Taiga, Rishabh Agarwal, Jesse Farebrother, Aaron Courville, Marc G. Bellemare
Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets
Edo Cohen-Karlik, Itamar Menuhin-Gruman, Raja Giryes, Nadav Cohen, Amir Globerson
Weighted Ensemble Self-Supervised Learning
Yangjun Ruan*, Saurabh Singh, Warren Morningstar, Alexander A. Alemi, Sergey Ioffe, Ian Fischer, Joshua V. Dillon
Calibrating Sequence Likelihood Improves Conditional Language Generation
Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, Peter J. Liu
SMART: Sentences as Basic Units for Text Evaluation
Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan
Leveraging Importance Weights in Subset Selection
Gui Citovsky, Giulia DeSalvo, Sanjiv Kumar, Srikumar Ramalingam, Afshin Rostamizadeh, Yunjuan Wang*
Proto-Value Networks: Scaling Representation Learning with Auxiliary Tasks
Jesse Farebrother, Joshua Greaves, Rishabh Agarwal, Charline Le Lan, Ross Goroshin, Pablo Samuel Castro, Marc G. Bellemare
An Extensible Multi-modal Multi-task Object Dataset with Materials
Trevor Standley, Ruohan Gao, Dawn Chen, Jiajun Wu, Silvio Savarese
Measuring Forgetting of Memorized Training Examples
Matthew Jagielski, Om Thakkar, Florian Tramér, Daphne Ippolito, Katherine Lee, Nicholas Carlini, Eric Wallace, Shuang Song, Abhradeep Thakurta, Nicolas Papernot, Chiyuan Zhang
Bidirectional Language Models Are Also Few-Shot Learners
Ajay Patel, Bryan Li, Mohammad Sadegh Rasooli, Noah Constant, Colin Raffel, Chris Callison-Burch
Is Attention All That NeRF Needs?
Mukund Varma T., Peihao Wang, Xuxi Chen, Tianlong Chen, Subhashini Venugopalan, Zhangyang Wang
Automating Nearest Neighbor Search Configuration with Constrained Optimization
Philip Sun, Ruiqi Guo, Sanjiv Kumar
Static Prediction of Runtime Errors by Learning to Execute Programs with External Resource Descriptions
David Bieber, Rishab Goel, Daniel Zheng, Hugo Larochelle, Daniel Tarlow
Composing Ensembles of Pre-trained Models via Iterative Consensus
Shuang Li, Yilun Du, Joshua B. Tenenbaum, Antonio Torralba, Igor Mordatch
Λ-DARTS: Mitigating Performance Collapse by Harmonizing Operation Selection Among Cells
Sajad Movahedi, Melika Adabinejad, Ayyoob Imani, Arezou Keshavarz, Mostafa Dehghani, Azadeh Shakery, Babak N. Araabi
Blurring Diffusion Models
Emiel Hoogeboom, Tim Salimans
Part-Based Models Improve Adversarial Robustness
Chawin Sitawarin, Kornrapat Pongmala, Yizheng Chen, Nicholas Carlini, David Wagner
Learning in Temporally Structured Environments
Matt Jones, Tyler R. Scott, Mengye Ren, Gamaleldin ElSayed, Katherine Hermann, David Mayo, Michael C. Mozer
SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models
Ziyi Wu, Nikita Dvornik, Klaus Greff, Thomas Kipf, Animesh Garg
Robust Algorithms on Adaptive Inputs from Bounded Adversaries
Yeshwanth Cherapanamjeri, Sandeep Silwal, David P. Woodruff, Fred Zhang, Qiuyi (Richard) Zhang, Samson Zhou
Agnostic Learning of General ReLU Activation Using Gradient Descent
Pranjal Awasthi, Alex Tang, Aravindan Vijayaraghavan
Analog Bits: Generating Discrete Data Using Diffusion Models with Self-Conditioning
Ting Chen, Ruixiang Zhang, Geoffrey Hinton
Any-Scale Balanced Samplers for Discrete Space
Haoran Sun*, Bo Dai, Charles Sutton, Dale Schuurmans, Hanjun Dai
Augmentation with Projection: Towards an Effective and Efficient Data Augmentation Paradigm for Distillation
Ziqi Wang*, Yuexin Wu, Frederick Liu, Daogao Liu, Le Hou, Hongkun Yu, Jing Li, Heng Ji
Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD
Konstantinos E. Nikolakakis, Farzin Haddadpour, Amin Karbasi, Dionysios S. Kalogerias
Causal Estimation for Text Data with (Apparent) Overlap Violations
Lin Gui, Victor Veitch
Contrastive Learning Can Find an Optimal Basis for Approximately View-Invariant Functions
Daniel D. Johnson, Ayoub El Hanchi, Chris J. Maddison
Differentially Private Adaptive Optimization with Delayed Preconditioners
Tian Li, Manzil Zaheer, Ziyu Liu, Sashank Reddi, Brendan McMahan, Virginia Smith
Distributionally Robust Post-hoc Classifiers Under Prior Shifts
Jiaheng Wei*, Harikrishna Narasimhan, Ehsan Amid, Wen-Sheng Chu, Yang Liu, Abhishek Kumar
Human Alignment of Neural Network Representations
Lukas Muttenthaler, Jonas Dippel, Lorenz Linhardt, Robert A. Vandermeulen, Simon Kornblith
Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data
Spencer Frei, Gal Vardi, Peter Bartlett, Nathan Srebro, Wei Hu
Koopman Neural Operator Forecaster for Time-Series with Temporal Distributional Shifts
Rui Wang*, Yihe Dong, Sercan Ö. Arik, Rose Yu
Latent Variable Representation for Reinforcement Learning
Tongzheng Ren, Chenjun Xiao, Tianjun Zhang, Na Li, Zhaoran Wang, Sujay Sanghavi, Dale Schuurmans, Bo Dai
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, Ed Chi
Mind’s Eye: Grounded Language Model Reasoning Through Simulation
Ruibo Liu, Jason Wei, Shixiang Shane Gu, Te-Yen Wu, Soroush Vosoughi, Claire Cui, Denny Zhou, Andrew M. Dai
MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models
Chenglin Yang*, Siyuan Qiao, Qihang Yu, Xiaoding Yuan, Yukun Zhu, Alan Yuille, Hartwig Adam, Liang-Chieh Chen
Novel View Synthesis with Diffusion Models
Daniel Watson, William Chan, Ricardo Martin-Brualla, Jonathan Ho, Andrea Tagliasacchi, Mohammad Norouzi
On Accelerated Perceptrons and Beyond
Guanghui Wang, Rafael Hanashiro, Etash Guha, Jacob Abernethy
On Compositional Uncertainty Quantification for Seq2seq Graph Parsing
Zi Lin*, Du Phan, Panupong Pasupat, Jeremiah Liu, Jingbo Shang
On the Robustness of Safe Reinforcement Learning Under Observational Perturbations
Zuxin Liu, Zijian Guo, Zhepeng Cen, Huan Zhang, Jie Tan, Bo Li, Ding Zhao
Online Low Rank Matrix Completion
Prateek Jain, Soumyabrata Pal
Out-of-Distribution Detection and Selective Generation for Conditional Language Models
Jie Ren, Jiaming Luo, Yao Zhao, Kundan Krishna*, Mohammad Saleh, Balaji Lakshminarayanan, Peter J. Liu
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish V. Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme Ruiz, Andreas Peter Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
Phenaki: Variable Length Video Generation from Open Domain Textual Descriptions
Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro*, Julius Kunze*, Dumitru Erhan
Promptagator: Few-Shot Dense Retrieval from 8 Examples
Zhuyun Dai, Vincent Y. Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, Ming-Wei Chang
Pushing the Accuracy-Group Robustness Frontier with Introspective Self-Play
Jeremiah Zhe Liu, Krishnamurthy Dj Dvijotham, Jihyeon Lee, Quan Yuan, Balaji Lakshminarayanan, Deepak Ramachandran
Re-Imagen: Retrieval-Augmented Text-to-Image Generator
Wenhu Chen, Hexiang Hu, Chitwan Saharia, William W. Cohen
Recitation-Augmented Language Models
Zhiqing Sun, Xuezhi Wang, Yi Tay, Yiming Yang, Denny Zhou
Regression with Label Differential Privacy
Badih Ghazi, Pritish Kamath, Ravi Kumar, Ethan Leeman, Pasin Manurangsi, Avinash Varadarajan, Chiyuan Zhang
Revisiting the Entropy Semiring for Neural Speech Recognition
Oscar Chang, Dongseong Hwang, Olivier Siohan
Robust Active Distillation
Cenk Baykal, Khoa Trinh, Fotis Iliopoulos, Gaurav Menghani, Erik Vee
Score-Based Continuous-Time Discrete Diffusion Models
Haoran Sun*, Lijun Yu, Bo Dai, Dale Schuurmans, Hanjun Dai
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou
Self-Supervision Through Random Segments with Autoregressive Coding (RandSAC)
Tianyu Hua, Yonglong Tian, Sucheng Ren, Michalis Raptis, Hang Zhao, Leonid Sigal
Serving Graph Compression for Graph Neural Networks
Si Si, Felix Yu, Ankit Singh Rawat, Cho-Jui Hsieh, Sanjiv Kumar
Sequential Attention for Feature Selection
Taisuke Yasuda*, MohammadHossein Bateni, Lin Chen, Matthew Fahrbach, Gang Fu, Vahab Mirrokni
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Aran Komatsuzaki*, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, Neil Houlsby
Spectral Decomposition Representation for Reinforcement Learning
Tongzheng Ren, Tianjun Zhang, Lisa Lee, Joseph Gonzalez, Dale Schuurmans, Bo Dai
Spotlight: Mobile UI Understanding Using Vision-Language Models with a Focus (see blog post)
Gang Li, Yang Li
Supervision Complexity and Its Role in Knowledge Distillation
Hrayr Harutyunyan*, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon Kim, Sanjiv Kumar
Teacher Guided Training: An Efficient Framework for Knowledge Transfer
Manzil Zaheer, Ankit Singh Rawat, Seungyeon Kim, Chong You, Himanshu Jain, Andreas Veit, Rob Fergus, Sanjiv Kumar
TEMPERA: Test-Time Prompt Editing via Reinforcement Learning
Tianjun Zhang, Xuezhi Wang, Denny Zhou, Dale Schuurmans, Joseph E. Gonzalez
UL2: Unifying Language Learning Paradigms
Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, Denny Zhou, Neil Houlsby, Donald Metzler
* Work done while at Google
An ML-based approach to better characterize lung diseases
The combination of the environment an individual experiences and their genetic predispositions determines the majority of their risk for various diseases. Large national efforts, such as the UK Biobank, have created large, public resources to better understand the links between environment, genetics, and disease. This has the potential to help individuals better understand how to stay healthy, clinicians to treat illnesses, and scientists to develop new medicines.
One challenge in this process is making sense of the vast amount of clinical measurements — the UK Biobank has many petabytes of imaging, metabolic tests, and medical records spanning 500,000 individuals. To best use this data, we need to be able to represent the information present as succinct, informative labels about meaningful diseases and traits, a process called phenotyping. That is where we can use the ability of ML models to pick up on subtle, intricate patterns in large amounts of data.
We’ve previously demonstrated the ability to use ML models to quickly phenotype at scale for retinal diseases. Nonetheless, these models were trained using labels from clinician judgment, and access to clinical-grade labels is a limiting factor due to the time and expense needed to create them.
In “Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models”, published in Nature Genetics, we’re excited to highlight a method for training accurate ML models for genetic discovery of diseases, even when using noisy and unreliable labels. We demonstrate the ability to train ML models that can phenotype directly from raw clinical measurement and unreliable medical record information. This reduced reliance on medical domain experts for labeling greatly expands the range of applications for our technique to a panoply of diseases and has the potential to improve their prevention, diagnosis, and treatment. We showcase this method with ML models that can better characterize lung function and chronic obstructive pulmonary disease (COPD). Additionally, we show the usefulness of these models by demonstrating a better ability to identify genetic variants associated with COPD, improved understanding of the biology behind the disease, and successful prediction of outcomes associated with COPD.
ML for deeper understanding of exhalation
For this demonstration, we focused on COPD, the third leading cause of death worldwide in 2019, in which airway inflammation and impeded airflow can progressively reduce lung function. Lung function for COPD and other diseases is measured by recording an individual’s exhalation volume over time (the record is called a spirogram; see an example below). Although there are guidelines (called GOLD) for determining COPD status from exhalation, these use only a few specific data points on the curve and apply fixed thresholds to those values. Much of the rich data in these spirograms is discarded in this analysis of lung function.
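For concreteness, this kind of analysis boils an entire exhalation curve down to a couple of summary numbers, such as the FEV1/FVC ratio used later in this post as the expert-defined baseline. Here is a small illustrative sketch (the array format and synthetic curve are our assumptions):

```python
import numpy as np

def fev1_fvc_ratio(time_s, volume_l):
    """time_s: exhalation time in seconds; volume_l: cumulative exhaled volume in liters."""
    fev1 = np.interp(1.0, time_s, volume_l)   # volume exhaled in the first second
    fvc = volume_l[-1]                        # total exhaled volume (forced vital capacity)
    return fev1 / fvc

# GOLD-style criteria flag airflow obstruction when this ratio falls below a fixed
# threshold (commonly around 0.7), discarding the rest of the curve.
t = np.linspace(0.0, 6.0, 601)
v = 4.0 * (1.0 - np.exp(-t / 1.5))            # synthetic exhalation curve
print(fev1_fvc_ratio(t, v))                   # roughly 0.5 for this synthetic example
```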
We reasoned that ML models trained to classify spirograms would be able to use the rich data present more completely and result in more accurate and comprehensive measures of lung function and disease, similar to what we have seen in other classification tasks like mammography or histology. We trained ML models to predict whether an individual has COPD using the full spirograms as inputs.
The common method of training models for this problem, supervised learning, requires samples to be associated with labels. Determining those labels can require the effort of very time-constrained experts. For this work, to show that we do not necessarily need medically graded labels, we decided to use a variety of widely available sources of medical record information to create those labels without medical expert review. These labels are less reliable and noisy for two reasons. First, there are gaps in the medical records of individuals because they use multiple health services. Second, COPD is often undiagnosed, meaning many with the disease will not be labeled as having it even if we compile the complete medical records. Nonetheless, we trained a model to predict these noisy labels from the spirogram curves and treat the model predictions as a quantitative COPD liability or risk score.
Noisy COPD status labels were derived using various medical record sources (clinical data). A COPD liability model is then trained to predict COPD status from raw flow-volume spirograms. |
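A hypothetical sketch of this setup (the architecture and sizes are ours for illustration, not the model from the paper): a small 1D convolutional network is trained on raw spirogram curves against the noisy binary labels, and its sigmoid output is then treated as a continuous COPD liability score.

```python
import torch
import torch.nn as nn

# Tiny 1D CNN over raw spirogram curves (illustrative architecture).
model = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=9, stride=2), nn.ReLU(),
    nn.Conv1d(32, 64, kernel_size=9, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 1))

spirograms = torch.randn(8, 1, 1000)             # batch of raw spirogram curves
noisy_labels = torch.randint(0, 2, (8, 1)).float()  # labels derived from medical records

loss = nn.BCEWithLogitsLoss()(model(spirograms), noisy_labels)
loss.backward()

risk_scores = torch.sigmoid(model(spirograms))   # COPD liability used downstream
```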
Predicting COPD outcomes
We then investigated whether the risk scores produced by our model could better predict a variety of binary COPD outcomes (for example, an individual’s COPD status, whether they were hospitalized for COPD, or whether they died from it). For comparison, we benchmarked the model against expert-defined measurements required to diagnose COPD, specifically FEV1/FVC, which compares specific points on the spirogram curve via a simple ratio. We observed an improvement in the ability to predict these outcomes, as seen in the precision-recall curves below.
Precision-recall curves for COPD status and outcomes for our ML model (green) compared to traditional measures. Confidence intervals are shown by lighter shading. |
We also observed that separating populations by their COPD model score was predictive of all-cause mortality. This plot suggests that individuals with higher COPD risk are more likely to die earlier from any cause, and that the risk probably has implications beyond COPD alone.
Survival analysis of a cohort of UK Biobank individuals stratified by the COPD model’s predicted risk quartile. The decrease of the curve indicates individuals in the cohort dying over time. For example, p100 represents the 25% of the cohort with the greatest predicted risk, while p50 represents the 2nd quartile. |
Identifying the genetic links with COPD
Since the goal of large scale biobanks is to bring together large amounts of both phenotype and genetic data, we also performed a test called a genome-wide association study (GWAS) to identify the genetic links with COPD and genetic predisposition. A GWAS measures the strength of the statistical association between a given genetic variant — a change in a specific position of DNA — and the observations (e.g., COPD) across a cohort of cases and controls. Genetic associations discovered in this manner can inform drug development that modifies the activity or products of a gene, as well as expand our understanding of the biology for a disease.
We showed with our ML-phenotyping method that not only do we rediscover almost all known COPD variants found by manual phenotyping, but we also find many novel genetic variants significantly associated with COPD. In addition, we see good agreement on the effect sizes of the variants discovered by both our ML approach and the manual one (R2=0.93), which provides strong evidence for the validity of the newly found variants.
Finally, our collaborators at Harvard Medical School and Brigham and Women’s Hospital further examined the plausibility of these findings by providing insights into the possible biological role of the novel variants in development and progression of COPD (you can see more discussion on these insights in the paper).
Conclusion
We demonstrated that our earlier methods for phenotyping with ML can be expanded to a wide range of diseases and can provide novel and valuable insights. We made two key observations by using this approach to predict COPD from spirograms and to discover new genetic insights. First, domain knowledge was not necessary to make predictions from raw medical data. Interestingly, we showed that the raw medical data is probably underutilized and that the ML model can find patterns in it that are not captured by expert-defined measurements. Second, we do not need medically graded labels; instead, noisy labels defined from widely available medical records can be used to generate clinically predictive and genetically informative risk scores. We hope that this work will broadly expand the ability of the field to use noisy labels and will improve our collective understanding of lung function and disease.
Acknowledgments
This work is the combined output of multiple contributors and institutions. We thank all contributors: Justin Cosentino, Babak Alipanahi, Zachary R. McCaw, Cory Y. McLean, Farhad Hormozdiari (Google), Davin Hill (Northeastern University), Tae-Hwi Schwantes-An and Dongbing Lai (Indiana University), Brian D. Hobbs and Michael H. Cho (Brigham and Women’s Hospital, and Harvard Medical School). We also thank Ted Yun and Nick Furlotte for reviewing the manuscript, Greg Corrado and Shravya Shetty for support, and Howard Yang, Kavita Kulkarni, and Tammi Huynh for helping with publication logistics.
Robust and efficient medical imaging with self-supervision
Despite recent progress in the field of medical artificial intelligence (AI), most existing models are narrow, single-task systems that require large quantities of labeled data to train. Moreover, these models cannot be easily reused in new clinical contexts, as they often require the collection, de-identification, and annotation of site-specific data for every new deployment environment, which is both laborious and expensive. This problem of data-efficient generalization (a model’s ability to generalize to new settings using minimal new data) continues to be a key translational challenge for medical machine learning (ML) models and has, in turn, prevented their broad uptake in real-world healthcare settings.
The emergence of foundation models offers a significant opportunity to rethink development of medical AI to make it more performant, safer, and equitable. These models are trained using data at scale, often by self-supervised learning. This process results in generalist models that can rapidly be adapted to new tasks and environments with less need for supervised data. With foundation models, it may be possible to safely and efficiently deploy models across various clinical contexts and environments.
In “Robust and Efficient MEDical Imaging with Self-supervision” (REMEDIS), to be published in Nature Biomedical Engineering, we introduce a unified large-scale self-supervised learning framework for building foundation medical imaging models. This strategy combines large scale supervised transfer learning with self-supervised learning and requires minimal task-specific customization. REMEDIS shows significant improvement in data-efficient generalization across medical imaging tasks and modalities with a 3–100x reduction in site-specific data for adapting models to new clinical contexts and environments. Building on this, we are excited to announce Medical AI Research Foundations (hosted by PhysioNet), an expansion of the public release of chest X-ray Foundations in 2022. Medical AI Research Foundations is a collection of open-source non-diagnostic models (starting with REMEDIS models), APIs, and resources to help researchers and developers accelerate medical AI research.
Large scale self-supervision for medical imaging
REMEDIS uses a combination of natural (non-medical) images and unlabeled medical images to develop strong medical imaging foundation models. Its pre-training strategy consists of two steps. The first involves supervised representation learning on a large-scale dataset of labeled natural images (pulled from ImageNet-21k or JFT) using the Big Transfer (BiT) method.
The second step involves intermediate self-supervised learning, which does not require any labels and instead trains a model to learn medical data representations independently of labels. The specific approach used for pre-training and learning representations is SimCLR. The method works by maximizing agreement between differently augmented views of the same training example via a contrastive loss in a hidden layer of a feed-forward neural network with multilayer perceptron (MLP) outputs. However, REMEDIS is equally compatible with other contrastive self-supervised learning methods. This training method is applicable to healthcare environments, as many hospitals acquire raw data (images) as a routine practice. While processes would have to be implemented to make this data usable within models (e.g., patient consent prior to gathering the data, de-identification, etc.), the costly, time-consuming, and difficult task of labeling that data could be avoided using REMEDIS.
REMEDIS leverages large-scale supervised learning using natural images and self-supervised learning using unlabeled medical data to create strong foundation models for medical imaging. |
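As a compact illustration of the SimCLR-style objective used in the second pre-training step (the shapes, temperature, and encoder names are assumptions; the exact configuration in the paper may differ), the contrastive NT-Xent loss over two augmented views of the same unlabeled images can be sketched as:

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) projections of two augmented views of the same N images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) unit-norm embeddings
    sim = z @ z.T / temperature                          # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                    # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])  # positive = other view
    return F.cross_entropy(sim, targets)

# e.g. z1, z2 = projection_head(encoder(aug1(x))), projection_head(encoder(aug2(x)))
loss = nt_xent(torch.randn(16, 128), torch.randn(16, 128))
```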
Given ML model parameter constraints, it is important that our proposed approach works when using both small and large model architecture sizes. To study this in detail, we considered two ResNet architectures with commonly used depth and width multipliers, ResNet-50 (1×) and ResNet-152 (2×) as the backbone encoder networks.
After pre-training, the model was fine-tuned using labeled task-specific medical data and evaluated for in-distribution task performance. In addition, to evaluate the data-efficient generalization, the model was also optionally fine-tuned using small amounts of out-of-distribution (OOD) data.
Evaluation and results
To evaluate the REMEDIS model’s performance, we simulate realistic scenarios using retrospective de-identified data across a broad range of medical imaging tasks and modalities, including dermatology, retinal imaging, chest X-ray interpretation, pathology, and mammography. We further introduce the notion of data-efficient generalization, capturing the model’s ability to generalize to new deployment distributions with a significantly reduced need for expert-annotated data from the new clinical setting. This is measured as (1) improvement in zero-shot generalization to OOD settings (assessing performance in an OOD evaluation set, with zero access to training data from the OOD dataset) and (2) significant reduction in the need for annotated data from the OOD settings to reach performance equivalent to clinical experts (or a threshold demonstrating clinical utility). REMEDIS exhibits significantly improved in-distribution performance, with up to 11.5% relative improvement in diagnostic accuracy over a strongly supervised baseline.
More importantly, our strategy leads to data-efficient generalization of medical imaging models, matching strong supervised baselines resulting in a 3–100x reduction in the need for retraining data. While SimCLR is the primary self-supervised learning approach used in the study, we also show that REMEDIS is compatible with other approaches, such as MoCo-V2, RELIC and Barlow Twins. Furthermore, the approach works across model architecture sizes.
REMEDIS is compatible with MoCo-V2, RELIC and Barlow Twins as alternate self-supervised learning strategies. All the REMEDIS variants lead to data-efficient generalization improvements over the strong supervised baseline for dermatology condition classification (T1), diabetic macular edema classification (T2), and chest X-ray condition classification (T3). The gray shaded area indicates the performance of the strong supervised baseline pre-trained on JFT. |
Medical AI Research Foundations
Building on REMEDIS, we are excited to announce Medical AI Research Foundations, an expansion of the public release of chest X-ray Foundations in 2022. Medical AI Research Foundations is a repository of open-source medical foundation models hosted by PhysioNet. This expands the previous API-based approach to also encompass non-diagnostic models, to help researchers and developers accelerate their medical AI research. We believe that REMEDIS and the release of the Medical AI Research Foundations are a step toward building medical models that can generalize across healthcare settings and tasks.
We are seeding Medical AI Research Foundations with REMEDIS models for chest X-ray and pathology (with related code). Whereas the existing chest X-ray Foundation approach focuses on providing frozen embeddings for application-specific fine tuning from a model trained on several large private datasets, the REMEDIS models (trained on public datasets) enable users to fine-tune end-to-end for their application, and to run on local devices. We recommend users test different approaches based on their unique needs for their desired application. We expect to add more models and resources for training medical foundation models such as datasets and benchmarks in the future. We also welcome the medical AI research community to contribute to this.
Conclusion
These results suggest that REMEDIS has the potential to significantly accelerate the development of ML systems for medical imaging, which can preserve their strong performance when deployed in a variety of changing contexts. We believe this is an important step forward for medical imaging AI to deliver a broad impact. Beyond the experimental results presented, the approach and insights described here have been integrated into several of Google’s medical imaging research projects, such as dermatology, mammography and radiology among others. We’re using a similar self-supervised learning approach with our non-imaging foundation model efforts, such as Med-PaLM and Med-PaLM 2.
With REMEDIS, we demonstrated the potential of foundation models for medical imaging applications. Such models hold exciting possibilities in medical applications with the opportunity of multimodal representation learning. The practice of medicine is inherently multimodal and incorporates information from images, electronic health records, sensors, wearables, genomics and more. We believe ML systems that leverage these data at scale using self-supervised learning with careful consideration of privacy, safety, fairness and ethics will help lay the groundwork for the next generation of learning health systems that scale world-class healthcare to everyone.
Acknowledgements
This work involved extensive collaborative efforts from a multidisciplinary team of researchers, software engineers, clinicians, and cross-functional contributors across Google Health AI and Google Brain. In particular, we would like to thank our first co-author Jan Freyberg and our lead senior authors of these projects, Vivek Natarajan, Alan Karthikesalingam, Mohammad Norouzi and Neil Houlsby for their invaluable contributions and support. We also thank Lauren Winer, Sami Lachgar, Yun Liu and Karan Singhal for their feedback on this post and Tom Small for support in creating the visuals. Finally, we also thank the PhysioNet team for their support on hosting Medical AI Research Foundations. Users with questions can reach out to medical-ai-research-foundations at google.com.
LayerNAS: Neural Architecture Search in Polynomial Complexity
Every byte and every operation matters when trying to build a faster model, especially if the model is to run on-device. Neural architecture search (NAS) algorithms design sophisticated model architectures by searching through a larger model space than is possible manually. Different NAS algorithms, such as MNasNet and TuNAS, have been proposed and have discovered several efficient model architectures, including MobileNetV3 and EfficientNet.
Here we present LayerNAS, an approach that reformulates the multi-objective NAS problem within the framework of combinatorial optimization to greatly reduce the complexity, which results in an order of magnitude reduction in the number of model candidates that must be searched, less computation required for multi-trial searches, and the discovery of model architectures that perform better overall. Using a search space built on backbones taken from MobileNetV2 and MobileNetV3, we find models with top-1 accuracy on ImageNet up to 4.9% better than current state-of-the-art alternatives.
Problem formulation
NAS tackles a variety of different problems on different search spaces. To understand what LayerNAS is solving, let’s start with a simple example: You are the owner of GBurger and are designing the flagship burger, which is made up of three layers, each of which has four options with different costs. Burgers taste different with different combinations of options. You want to make the most delicious burger you can that comes in under a certain budget.
Make up your burger with different options available for each layer, each of which has different costs and provides different benefits. |
Just like the architecture of a neural network, the search space for the perfect burger follows a layerwise pattern, where each layer has several options with different effects on cost and performance. This simplified model illustrates a common approach for setting up search spaces. For example, for models based on convolutional neural networks (CNNs), like MobileNet, the NAS algorithm can select among different options for each convolution layer, such as the number of filters, the stride, or the kernel size.
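To make the layerwise structure concrete, here is a minimal Python sketch of such a search space; the per-layer options are hypothetical and stand in for the configuration choices a real NAS system would explore. Enumerating every combination, as exhaustive search must, grows exponentially with depth:

import itertools

# Hypothetical per-layer options for a small CNN search space.
# Each option is (kernel_size, num_filters); real search spaces also vary
# strides, expansion ratios, activation functions, etc.
LAYER_OPTIONS = [
    [(3, 16), (3, 32), (5, 16), (5, 32)],  # conv layer 1
    [(3, 32), (3, 64), (5, 32), (5, 64)],  # conv layer 2
    [(3, 64), (3, 96), (5, 64), (5, 96)],  # conv layer 3
]

# Exhaustive search enumerates every combination: with 4 options per layer
# over 3 layers there are already 4**3 = 64 candidate architectures, and the
# count grows exponentially as layers and options are added.
all_candidates = list(itertools.product(*LAYER_OPTIONS))
print(len(all_candidates))  # 64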
Method
We base our approach on search spaces that satisfy two conditions:
- An optimal model can be constructed using one of the model candidates generated from searching the previous layer and applying those search options to the current layer.
- If we set a FLOPs constraint on the model up to and including the current layer, we can derive a constraint on the previous layers by subtracting the FLOPs contributed by the current layer.
Under these conditions it is possible to search linearly, from layer 1 to layer n, knowing that when searching for the best option for layer i, a change in any previous layer will not improve the performance of the model. We can then bucket candidates by their cost, so that only a limited number of candidates are stored per layer. If two models have the same FLOPs but one has better accuracy, we keep only the better one and assume this won’t affect the architecture of the following layers. Whereas the search space of a full treatment would expand exponentially with the number of layers, since the full range of options is available at each layer, our layerwise cost-based approach significantly reduces the search space while letting us rigorously reason about the polynomial complexity of the algorithm. Our experimental evaluation shows that within these constraints we are able to discover top-performance models.
NAS as a combinatorial optimization problem
By applying a layerwise-cost approach, we reduce NAS to a combinatorial optimization problem. That is, for layer i, we can compute the cost and reward after training with a given component Si. This implies the following combinatorial problem: how can we get the best reward if we select one choice per layer within a cost budget? This problem can be solved with many different methods, one of the most straightforward of which is dynamic programming, as described in the following pseudocode:
while True:
    # Select a candidate to search in layer i.
    candidate = select_candidate(layer_i)
    if searchable(candidate):
        # Use the layerwise structural information to generate the children.
        children = generate_children(candidate)
        reward = train(children)
        bucket = bucketize(children)
        # memorial_table[i][bucket] holds (best reward, best candidate) for this cost bucket.
        if reward > memorial_table[i][bucket][0]:
            memorial_table[i][bucket] = (reward, children)
    # Move to the next layer.
Pseudocode of LayerNAS. |
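For intuition, here is a small, self-contained Python sketch of this layerwise dynamic program applied to a toy three-layer search space in the spirit of the GBurger example. The option names, costs, and rewards are invented for illustration, rewards are treated as additive for simplicity, and the cost stands in for a FLOPs budget; in LayerNAS the reward of a candidate comes from actually training it:

# Toy layerwise search space: each option is (name, cost, reward).
# Names, costs, and rewards are invented for illustration only.
SEARCH_SPACE = [
    [("small", 1, 0.20), ("medium", 2, 0.35), ("large", 3, 0.45)],  # layer 1
    [("small", 1, 0.15), ("medium", 2, 0.30), ("large", 4, 0.50)],  # layer 2
    [("small", 2, 0.25), ("medium", 3, 0.40), ("large", 5, 0.55)],  # layer 3
]
BUDGET = 7  # total cost constraint (stands in for a FLOPs budget)


def layerwise_search(search_space, budget):
    """Search layer by layer, keeping only the best candidate per cost bucket."""
    # Maps cumulative cost -> (total reward, chosen options so far).
    memorial_table = {0: (0.0, [])}
    for layer in search_space:
        next_table = {}
        for cost_so_far, (reward_so_far, choices) in memorial_table.items():
            for name, cost, reward in layer:
                total_cost = cost_so_far + cost
                if total_cost > budget:
                    continue  # prune candidates that already exceed the budget
                total_reward = reward_so_far + reward
                best = next_table.get(total_cost)
                # Same cost bucket: keep only the higher-reward candidate.
                if best is None or total_reward > best[0]:
                    next_table[total_cost] = (total_reward, choices + [name])
        memorial_table = next_table
    # Best complete architecture within the budget.
    return max(memorial_table.values())


if __name__ == "__main__":
    reward, architecture = layerwise_search(SEARCH_SPACE, BUDGET)
    print(f"best reward {reward:.2f} with options {architecture}")

The number of stored candidates per layer is bounded by the number of cost buckets rather than by the exponential number of option combinations, which is the source of the polynomial complexity.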
Experimental results
When comparing NAS algorithms, we evaluate the following metrics:
- Quality: What is the most accurate model that the algorithm can find?
- Stability: How stable is the selection of a good model? Can high-accuracy models be consistently discovered in consecutive trials of the algorithm?
- Efficiency: How long does it take for the algorithm to find a high-accuracy model?
We evaluate our algorithm on the standard benchmark NATS-Bench using 100 NAS runs, and we compare against other NAS algorithms, previously described in the NATS-Bench paper: random search, regularized evolution, and proximal policy optimization. Below, we visualize the differences between these search algorithms for the metrics described above. For each comparison, we record the average accuracy and variation in accuracy (variation is noted by a shaded region corresponding to the 25% to 75% interquartile range).
NATS-Bench size search defines a 5-layer CNN model, where each layer can choose from eight different options, each with different channels on the convolution layers. Our goal is to find the best model with 50% of the FLOPs required by the largest model. LayerNAS performance stands apart because it formulates the problem in a different way, separating the cost and reward to avoid searching a significant number of irrelevant model architectures. We found that model candidates with fewer channels in earlier layers tend to yield better performance, which explains how LayerNAS discovers better models much faster than other algorithms, as it avoids spending time on models outside the desired cost range. Note that the accuracy curve drops slightly after searching longer due to the lack of correlation between validation accuracy and test accuracy, i.e., some model architectures with higher validation accuracy have a lower test accuracy in NATS-Bench size search.
Top: NATS-Bench size search test accuracy on Cifar10; Middle: On Cifar100; Bottom: On ImageNet16-120. Average on 100 runs compared with random search (random), Regularized Evolution (evolution), and Proximal Policy Optimization (PPO). |
We construct search spaces based on MobileNetV2, MobileNetV2 1.4x, MobileNetV3 Small, and MobileNetV3 Large and search for an optimal model architecture under different #MAdds (number of multiply-additions per image) constraints. In all settings, LayerNAS finds models with better accuracy on ImageNet. See the paper for details.
Comparison on models under different #MAdds. |
Conclusion
In this post, we demonstrated how to reformulate NAS as a combinatorial optimization problem and proposed LayerNAS as a solution that requires only polynomial search complexity. We compared LayerNAS with existing popular NAS algorithms and showed that it can find improved models on NATS-Bench. We also used the method to find better architectures based on MobileNetV2 and MobileNetV3.
Acknowledgements
We would like to thank Jingyue Shen, Keshav Kumar, Daiyi Peng, Mingxing Tan, Esteban Real, Peter Young, Weijun Wang, Qifei Wang, Xuanyi Dong, Xin Wang, Yingjie Miao, Yun Long, Zhuo Wang, Da-Cheng Juan, Deqiang Chen, Fotis Iliopoulos, Han-Byul Kim, Rino Lee, Andrew Howard, Erik Vee, Rina Panigrahy, Ravi Kumar and Andrew Tomkins for their contribution, collaboration and advice.
Google at CHI 2023
This week, the Conference on Human Factors in Computing Systems (CHI 2023) is being held in Hamburg, Germany. We are proud to be a Hero Sponsor of CHI 2023, a premier conference on human-computer interaction, where Google researchers contribute at all levels. This year we are presenting over 30 papers and are actively involved in organizing and hosting a number of different events across workshops, courses, and interactive sessions.
If you’re registered for CHI 2023, we hope you’ll visit the Google booth to learn more about the exciting work across various topics, including language interactions, causal inference, question answering and more. Take a look below to learn more about the Google research being presented at CHI 2023 (Google affiliations in bold).
Board and Organizing Committee
Technical Program Chairs include: Tesh Goyal
Case Studies Chairs include: Frank Bentley
Keynotes Chairs include: Elizabeth Churchill
Best Paper Award
Infrastructuring Care: How Trans and Non-Binary People Meet Health and Well-Being Needs through Technology
Lauren Wilcox, Renee Shelby, Rajesh Veeraraghavan, Oliver Haimson, Gabriela Erickson, Michael Turken, Beka Gulotta
Accepted papers
NewsComp: Facilitating Diverse News Reading through Comparative Annotation
Md Momen Bhuiyan, Sang Won Lee, Nitesh Goyal, Tanushree Mitra
WordGesture-GAN: Modeling Word-Gesture Movement with Generative Adversarial Network (Honorable Mention)
Jeremy Chu, Dongsheng An, Yan Ma, Wenzhe Cui, Shumin Zhai, Xianfeng David Gu, Xiaojun Bi
“The less I type, the better”: How AI Language Models can Enhance or Impede Communication for AAC Users
Stephanie Valencia, Richard Cave, Krystal Kallarackal, Katie Seaver, Michael Terry, Shaun Kane
A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures (Honorable Mention)
Amanda Baughan*, Xuezhi Wang, Ariel Liu, Allison Mercurio, Jilin Chen, Xiao Ma
“There’s so much responsibility on users right now:” Expert Advice for Staying Safer From Hate and Harassment
Miranda Wei, Sunny Consolvo, Patrick Gage Kelley, Tadayoshi Kohno, Franziska Roesner, Kurt Thomas
ThingShare: Ad-Hoc Digital Copies of Physical Objects for Sharing Things in Video Meetings
Erzhen Hu, Jens Emil Sloth Grønbæk, Wen Ying, Ruofei Du, Seongkook Heo
Understanding Digital-Safety Experiences of Youth in the U.S.
Diana Freed, Natalie N. Bazarova, Sunny Consolvo, Eunice Han, Patrick Gage Kelley, Kurt Thomas, Dan Cosley
Slide Gestalt: Automatic Structure Extraction in Slide Decks for Non-Visual Access
Yi-Hao Peng*, Peggy Chi, Anjuli Kannan, Meredith Ringel Morris, Irfan Essa
Using Logs Data to Identify When Engineers Experience Flow or Focused Work
Adam Brown, Sarah D’Angelo, Ben Holtz, Ciera Jaspan, Collin Green
Enabling Conversational Interaction with Mobile UI Using Large Language Models
Bryan Wang*, Gang Li, Yang Li
Practicing Information Sensibility: How Gen Z Engages with Online Information (Honorable Mention)
Amelia Hassoun, Ian Beacock, Sunny Consolvo, Beth Goldberg, Patrick Gage Kelley, Daniel M. Russell
How Bold Can We Be? The Impact of Adjusting Font Grade on Readability in Light and Dark Polarities
Hilary Palmen, Michael Gilbert, Dave Crossland
Investigating How Practitioners Use Human-AI Guidelines: A Case Study on the People + AI Guidebook (Honorable Mention)
Nur Yildirim*, Mahima Pushkarna, Nitesh Goyal, Martin Wattenberg, Fernanda Viegas
From Plane Crashes to Algorithmic Harm: Applicability of Safety Engineering Frameworks for Responsible ML
Shalaleh Rismani, Renee Shelby, Andrew Smart, Edgar W. Jatho, Joshua A. Kroll, AJung Moon, Negar Rostamzadeh
Designing Responsible AI: Adaptations of UX Practice to Meet Responsible AI Challenges
Qiaosi Wang*, Michael Madaio, Shaun Kane, Shivani Kapania, Michael Terry, Lauren Wilcox
“It is currently hodgepodge”: Examining AI/ML Practitioners’ Challenges during Co-production of Responsible AI Values
Rama Adithya Varanasi, Nitesh Goyal
A Hunt for the Snark: Annotator Diversity in Data Practices (Honorable Mention)
Shivani Kapania, Alex S. Taylor, Ding Wang
Visual Captions: Augmenting Verbal Communication with On-the-Fly Visuals
Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Alex Olwal, Peggy Chi, Xiang “Anthony” Chen, Ruofei Du
Infrastructuring Care: How Trans and Non-Binary People Meet Health and Well-Being Needs through Technology (Best Paper Award)
Lauren Wilcox, Renee Shelby, Rajesh Veeraraghavan, Oliver Haimson, Gabriela Erickson, Michael Turken, Beka Gulotta
Kaleidoscope: Semantically-Grounded, Context-Specific ML Model Evaluation
Harini Suresh, Divya Shanmugam, Tiffany Chen, Annie G. Bryan, Alexander D’Amour, John Guttag, Arvind Satyanarayan
Rapsai: Accelerating Machine Learning Prototyping of Multimedia Applications through Visual Programming (Honorable Mention; see blog post)
Ruofei Du, Na Li, Jing Jin, Michelle Carney, Scott Miles, Maria Kleiner, Xiuxiu Yuan, Yinda Zhang, Anuva Kulkarni, Xingyu “Bruce” Liu, Ahmed Sabie, Sergio Orts-Escolano, Abhishek Kar, Ping Yu, Ram Iyengar, Adarsh Kowdle, Alex Olwal
Exploring Users’ Perceptions and Expectations of Shapes for Dialog Designs
Xinghui “Erica” Yan, Julia Feldman, Frank Bentley, Mohammed Khwaja, Michael Gilbert
Exploring the Future of Design Tooling: The Role of Artificial Intelligence in Tools for User Experience Professionals
Tiffany Knearem, Mohammed Khwaja, Yuling Gao, Frank Bentley, Clara E. Kliman-Silver
SpeakFaster Observer: Long-Term Instrumentation of Eye-Gaze Typing for Measuring AAC Communication
Shanqing Cai, Subhashini Venugopalan, Katrin Tomanek, Shaun Kane, Meredith Ringel Morris, Richard Cave, Robert MacDonald, Jon Campbell, Blair Casey, Emily Kornman, Daniel E. Vance, Jay Beavers
Designerly Tele-Experiences: A New Approach to Remote Yet Still Situated Co-design
Ferran Altarriba Bertran, Alexandra Pometko, Muskan Gupta, Lauren Wilcox, Reeta Banerjee, Katherine Isbister
“I Just Wanted to Triple Check . . . They Were All Vaccinated”: Supporting Risk Negotiation in the Context of COVID-19
Margaret E. Morris, Jennifer Brown, Paula Nurius, Savanna Yee, Jennifer C. Mankoff, Sunny Consolvo
Expectation vs Reality in Users’ Willingness to Delegate to Digital Assistants
Ekaterina Svikhnushina*, Marcel Schellenberg, Anna K. Niedbala, Iva Barisic, Jeremy N. Miles
Interactive Visual Exploration of Knowledge Graphs with Embedding-Based Guidance
Chao-Wen Hsuan Yuan, Tzu-Wei Yu, Jia-Yu Pan, Wen-Chieh Lin
Measuring the Impact of Explanation Bias: A Study of Natural Language Justifications for Recommender Systems
Krisztian Balog, Filip Radlinski, Andrey Petrov
Modeling and Improving Text Stability in Live Captions
Xingyu “Bruce” Liu, Jun Zhang, Leonardo Ferrer, Susan Xu, Vikas Bahirwani, Boris Smus, Alex Olwal, Ruofei Du
Programming without a Programming Language: Challenges and Opportunities for Designing Developer Tools for Prompt Programming
Alexander J. Fiannaca, Chinmay Kulkarni, Carrie J. Cai, Michael Terry
PromptInfuser: Bringing User Interface Mock-ups to Life with Large Language Models
Savvas Petridis, Michael Terry, Carrie J. Cai
Prototypes, Platforms and Protocols: Identifying Common Issues with Remote, Unmoderated Studies and Their Impact on Research Participants
Steven Schirra, Sasha Volkov, Frank Bentley, Shraddhaa Narasimha
Human-Centered Responsible Artificial Intelligence: Current & Future Trends
Mohammad Tahaei, Marios Constantinides, Daniele Quercia, Sean Kennedy, Michael Muller, Simone Stumpf, Q. Vera Liao, Ricardo Baeza-Yates, Lora Aroyo, Jess Holbrook, Ewa Luger, Michael Madaio, Ilana Golbin Blumenfeld, Maria De-Arteaga, Jessica Vitak, Alexandra Olteanu
Interactive sessions
Experiencing Rapid Prototyping of Machine Learning Based Multimedia Applications in Rapsai (see blog post)
Ruofei Du, Na Li, Jing Jin, Michelle Carney, Xiuxiu Yuan, Ram Iyengar, Ping Yu, Adarsh Kowdle, Alex Olwal
Workshops
The Second Workshop on Intelligent and Interactive Writing Assistants
Organizers include: Minsuk Chang
Combating Toxicity, Harassment, and Abuse in Online Social Spaces: A Workshop at CHI 2023
Organizers include: Nitesh Goyal
The Future of Computational Approaches for Understanding and Adapting User Interfaces
Keynote Speaker: Yang Li
The EmpathiCH Workshop: Unraveling Empathy-Centric Design
Panelists include: Cindy Bennett
Workshop on Trust and Reliance in AI-Human Teams (TRAIT)
Keynote Speakers: Carrie J. Cai, Michael Terry
Program committee includes: Aaron Springer, Michael Terry
Socially Assistive Robots as Decision Makers: Transparency, Motivations, and Intentions
Organizers include: Maja Matarić
Courses
Human-Computer Interaction and AI: What Practitioners Need to Know to Design and Build Effective AI Systems from a Human Perspective (Part I; Part II)
Daniel M. Russell, Q. Vera Liao, Chinmay Kulkarni, Elena L. Glassman, Nikolas Martelaro
* Work done while at Google