
Google Research, 2022 & beyond: Research community engagement

February 28, 2023

by Google AI

Posted by Leslie Yeh, Director, University Relations

(This is Part 9 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Sharing knowledge is essential to Google’s research philosophy — it accelerates technological progress and expands capabilities community-wide. Solving complex problems requires bringing together diverse minds and resources collaboratively. This can be accomplished through building local and global connections with multidisciplinary experts and impacted communities. In partnership with these stakeholders, we bring our technical leadership, product footprint, and resources to make progress against some of society’s greatest opportunities and challenges.

We at Google see it as our responsibility to disseminate our work as contributing members of the scientific community and to help train the next generation of researchers. To do this well, collaborating with experts and researchers outside of Google is essential. In fact, just over half of our scientific publications highlight work done jointly with authors outside of Google. We are grateful to work collaboratively across the globe and have only increased our efforts with the broader research community over the past year. In this post, we will talk about some of the opportunities afforded by such partnerships, including:

· Addressing social challenges together

· Training the next generation of researchers

· Collaborating to advance scientific innovations

· Fueling innovation in products and engineering

· Open-sourcing datasets and tools

Addressing social challenges together

Engaging the wider community helps us progress on seemingly intractable problems. For example, access to timely, accurate health information is a significant challenge among women in rural and densely populated urban areas across India. To solve this challenge, ARMMAN developed mMitra, a free mobile service that sends preventive care information to expectant and new mothers. Adherence to such public health programs is a prevalent challenge, so researchers from Google Research and the Indian Institute of Technology, Madras worked with ARMMAN to design an ML system that alerts healthcare providers about participants at risk of dropping out of the health information program. This early identification helps ARMMAN provide better-targeted support, improving maternal health outcomes.

Google Research worked with ARMMAN to design a system to alert healthcare providers about participants at risk for dropping out of their preventative care information program for expectant mothers. This plot shows the cumulative engagement drops prevented using our restless multi-armed bandit model (RMAB) compared to the control group (Round Robin).

We also support Responsible AI projects directly for other organizations — including our commitment of $3M to fund the new INSAIT research center based in Bulgaria. Further, to help build a foundation of fairness, interpretability, privacy, and security, we are supporting the establishment of a first-of-its-kind multidisciplinary Center for Responsible AI with a grant of $1M to the Indian Institute of Technology, Madras.


Training the next generation of researchers

Part of our responsibility in guiding how technology affects society is to help train the next generation of researchers. For example, we support equitable student persistence in computing research through our Computer Science Research Mentorship Program, in which Googlers have mentored over one thousand students since 2018, 86% of whom identify as part of a historically marginalized group.

We work towards inclusive goals and work across the globe to achieve them. In 2022, we expanded our research interactions and programs to faculty and students across Latin America, which included grants to women in computer science in Ecuador. We partnered with ENS, a university in France, to help fund scholarships for students to train through research. Another example is our collaboration with the Computing Alliance of Hispanic-Serving Institutions (CAHSI) to provide $4.8 million to support more than 30 collaborative research projects and over 3,000 Hispanic students and faculty across a network of Hispanic-serving institutions.

Efforts like these foster the research ecosystem and help the community give back. Through exploreCSR, we partner with universities to provide students with introductory experiences in research, such as Rice University’s regional workshop on applications and research in data science (ReWARDS), which was delivered in rural Peru by faculty from Rice. Similarly, one of our Awards for Inclusion Research led to a faculty member helping startups in Africa use AI.

The funding we provide is most often unrestricted and leads to inspiring results. Last year, for example, Kean University was one of 53 institutions to receive an exploreCSR award. It used the funding to create the Research Recruits Program, a two-semester program designed to give undergraduates an introductory opportunity to participate in research with a faculty mentor. A student at Kean with a chronic condition that requires him to take different medications every day, a struggle that affects so many, decided to pursue research on the topic with a peer. Their research, set to be published this year, demonstrates an ML solution, built with Google’s TensorFlow, that can identify pills with 99.8% certainty when used correctly. Results like these are why we continue to invest in younger generations, further demonstrated by our long-term commitment to funding PhD Fellows every year across the globe.

Building an inclusive ecosystem is imperative. To this end, we’ve also partnered with the non-profit Black in Robotics (BiR), formed to address the systemic inequities in the robotics community. Together, we established doctoral student awards that help financially support graduate students and BiR’s newly established Bay Area Robotics lab. We also help make global conferences accessible to more researchers around the world, for example, by funding 24 students this year to attend Deep Learning Indaba in Tunisia.


Collaborating to advance scientific innovations

In 2022 Google sponsored over 150 research conferences and even more workshops, which leads to invaluable engagements with the broader research community. At research conferences, Googlers serve on program committees and organize workshops, tutorials and numerous other activities to collectively advance the field. Additionally, last year, we hosted over 14 dedicated workshops to bring together researchers, such as the 2022 Quantum Symposium, which generates new ideas and directions for the research field, further advancing research initiatives. In 2022, we authored 2400 papers, many of which were presented at leading research conferences, such as NeurIPS, EMNLP, ECCV, Interspeech, ICML, CVPR, ICLR, and many others. More than 50% of these papers were authored in collaboration with researchers beyond Google.

Over the past year, we’ve expanded our engagement models to facilitate students, faculty, and Google’s research scientists coming together across schools to form constructive research triads. One such project, undertaken in partnership with faculty and students from Georgia Tech, aims to develop a robot guide dog with human behavior modeling and safe reinforcement learning. Throughout 2022, we gave over 224 grants to researchers and over $10M in Google Cloud Platform credits for topics ranging from the improvement of algorithms for post-quantum cryptography with collaborators at CNRS in France to fostering cybersecurity research at TU Munich and Fraunhofer AISEC in Germany.

In 2022, we made 22 new multi-year commitments totaling approximately $80M to 65 institutions across nine countries, where each year we will host workshops to select over 100 research projects of mutual interest. For example, in a growing partnership, we are supporting the new Max Planck VIA-Center in Germany to work together on robotics. Another large area of investment is a close partnership with four universities in Taiwan (NTU, NCKU, NYCU, NTHU) to increase innovation in silicon chip design and improve competitiveness in semiconductor design and manufacturing. We aim to collaborate by default and were proud to be recently named one of Australia’s top collaborating companies.


Fueling innovation in products and engineering

The community fuels innovation at Google. For example, by facilitating student researchers to work with us on defined research projects, we’ve experienced both incremental and more dramatic improvements. Together with visiting researchers, we combine information, compute power, and a great deal of expertise to bring about breakthroughs, such as leveraging our undersea internet cables to detect earthquakes. Visiting Researchers also worked hand-in-hand with us to develop Minerva, a state-of-the-art solution that came about by training a deep learning model on a dataset that contains quantitative reasoning with symbolic expressions.

Minerva incorporates recent prompting and evaluation techniques to better solve mathematical questions. It then employs majority voting, in which it generates multiple solutions to each question and chooses the most common answer as the solution, thus improving performance significantly.
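
The majority-voting step described above can be illustrated with a minimal sketch. The names `majority_vote`, `solve_with_voting`, and `sample_answer` are hypothetical and are not part of Minerva; `sample_answer` stands in for any function that samples one final answer from a model. Ties are broken in favor of the answer seen first, which is an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most common answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    # most_common orders equal counts by first occurrence.
    return counts.most_common(1)[0][0]


def solve_with_voting(
    question: str,
    sample_answer: Callable[[str], str],  # hypothetical model sampler
    num_samples: int = 16,                # assumed value, for illustration only
) -> str:
    """Sample several final answers for one question and keep the most frequent."""
    answers = [sample_answer(question) for _ in range(num_samples)]
    return majority_vote(answers)


if __name__ == "__main__":
    # A canned sampler that stands in for a stochastic language model.
    canned = iter(["12", "15", "12", "12", "9"])
    print(solve_with_voting("What is 3 * 4?", lambda q: next(canned), num_samples=5))
```

Only the final answers are compared here, so two different derivations that reach the same result count as one vote each for that result.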


Open-sourcing datasets and tools

Engaging with the broader research community is a core part of our efforts to build a more collaborative ecosystem. We support the general advancement of ML and related research through the release of open-source code and datasets. We continued to grow open source datasets in 2022, for example, in natural language processing and vision, and expanded our global index of available datasets in Google Dataset Search. We also continued to release sustainability data via Data Commons and invite others to use it for their research. See some of the datasets and tools we released in 2022 listed below.

Dataset	Description

Auto-Arborist	A multiview urban tree classification dataset that consists of ~2.6M trees covering >320 genera, which can aid in the development of models for urban forest monitoring.

Bazel GitHub Metrics	A dataset with GitHub download counts of release artifacts from selected bazelbuild repositories.

BC-Z demonstration	Episodes of a robotic arm performing 100 different manipulation tasks. Data for each episode includes the RGB video, the robot’s end-effector positions, and the natural language embedding.

BEGIN V2	A benchmark dataset for evaluating dialog systems and natural language generation metrics.

CLSE: Corpus of Linguistically Significant Entities	A dataset of named entities annotated by linguistic experts. It includes 34 languages and 74 different semantic types to support various applications from airline ticketing to video games.

CocoChorales	A dataset consisting of over 1,400 hours of audio mixtures containing four-part chorales performed by 13 instruments, all synthesized with realistic-sounding generative models.

Crossmodal-3600	A geographically diverse dataset of 3,600 images, each annotated with human-generated reference captions in 36 languages.

CVSS: A Massively Multilingual Speech-to-Speech Translation Corpus	A Common Voice-based Speech-to-Speech translation corpus that includes 2,657 hours of speech-to-speech translation sentence pairs from 21 languages into English.

DSTC11 Challenge Task	This challenge evaluates task-oriented dialog systems end-to-end, from users’ spoken utterances to inferred slot values.

EditBench	A comprehensive diagnostic and evaluation dataset for text-guided image editing.

Few-shot Regional Machine Translation	FRMT is a few-shot evaluation dataset containing en-pt and en-zh bitexts translated from Wikipedia, in two regional varieties for each non-English language (pt-BR and pt-PT; zh-CN and zh-TW).

Google Patent Phrase Similarity	A human-rated contextual phrase-to-phrase matching dataset focused on technical terms from patents.

Hinglish-TOP	Hinglish-TOP is the largest code-switched semantic parsing dataset with 10k entries annotated by humans, and 170K generated utterances using the CST5 augmentation technique introduced in the paper.

ImPaKT	A dataset that contains semantic parsing annotations for 2,489 sentences from shopping web pages in the C4 corpus, corresponding to annotations of 3,719 expressed implication relationships and 6,117 typed and summarized attributes.

InFormal	A formality style transfer dataset for four Indic Languages, made up of a pair of sentences and a corresponding gold label identifying the more formal and semantic similarity.

MAVERICS	A suite of test-only visual question answering datasets, created from Visual Question Answering image captions with question answering validation and manual verification.

MetaPose	A dataset with 3D human poses and camera estimates predicted by the MetaPose model for a subset of the public Human36M dataset with input files necessary to reproduce these results from scratch.

MGnify proteins	A 2.4B-sequence protein database with annotations.

MiQA: Metaphorical Inference Questions and Answers	MiQA assesses the capability of language models to reason with conventional metaphors. It combines the previously isolated topics of metaphor detection and commonsense reasoning into a single task that requires a model to make inferences by selecting between the literal and metaphorical register.

MT-Opt	A dataset of task episodes collected across a fleet of real robots, following the RLDS format to represent steps and episodes.

MultiBERTs Predictions on Winogender	Predictions of BERT on Winogender before and after several different interventions.

Natural Language Understanding Uncertainty Evaluation	NaLUE is a relabelled and aggregated version of three large NLU corpuses CLINC150, Banks77 and HWU64. It contains 50k utterances spanning 18 verticals, 77 domains, and ~260 intents.

NewsStories	A collection of url links to publicly available news articles with their associated images and videos.

Open Images V7	Open Images V7 expands the Open Images dataset with new point-level label annotations, which provide localization information for 5.8k classes, and a new all-in-one visualization tool for better data exploration.

Pfam-NUniProt2	A set of 6.8 million new protein sequence annotations.

Re-contextualizing Fairness in NLP for India	A dataset of region and religion-based societal stereotypes in India, with a list of identity terms and templates for reproducing the results from the “Re-contextualizing Fairness in NLP” paper.

Scanned Objects	A dataset with 1,000 common household objects that have been 3D scanned for use in robotic simulation and synthetic perception research.

Specialized Rater Pools	This dataset comes from a study designed to understand whether annotators with different self-described identities interpret toxicity differently. It contains the unaggregated toxicity annotations of 25,500 comments from pools of raters who self-identify as African American, LGBTQ, or neither.

UGIF	A multi-lingual, multi-modal UI grounded dataset for step-by-step task completion on the smartphone.

UniProt Protein Names	Data release of ~49M protein name annotations predicted from their amino acid sequence.

upwelling irradiance from GOES-16	Climate researchers can use the 4 years of outgoing longwave radiation and reflected shortwave radiation data to analyze important climate forcers, such as aircraft condensation trails.

UserLibri	The UserLibri dataset reorganizes the existing popular LibriSpeech dataset into individual “user” datasets consisting of paired audio-transcript examples and domain-matching text-only data for each user. This dataset can be used for research in speech personalization or other language processing fields.

VideoCC	A dataset containing (video-URL, caption) pairs for training video-text machine learning models.

Wiki-conciseness	A manually curated evaluation set in English for concise rewrites of 2,000 Wikipedia sentences.

Wikipedia Translated Clusters	Introductions to English Wikipedia articles and their parallel versions in 10 other languages, with machine translations to English. Also includes synthetic corruptions to the English versions, to be identified with NLI models.

Workload Traces 2022	A dataset with traces that aim to help system designers better understand warehouse-scale computing workloads and develop new solutions for front-end and data-access bottlenecks.

Tool	Description

Differential Privacy Open Source Library	An open-source library to enable developers to use analytic techniques based on DP.

Mood Board Search	The result of collaborative work with artists, photographers, and image researchers to demonstrate how ML can enable people to visually explore subjective concepts in image datasets.

Project Relate	An Android beta app that uses ML to help people with non-standard speech make their voices heard.

TensorStore	TensorStore is an open-source C++ and Python library designed for storage and manipulation of n-dimensional data, which can address key engineering challenges in scientific computing through better management and processing of large datasets.

The Data Cards Playbook	A Toolkit for Transparency in Dataset Documentation.


Conclusion

Research is an amplifier, an accelerator, and an enabler — and we are grateful to partner with so many incredible people to harness it for the good of humanity. Even when investing in research that advances our products and engineering, we recognize that, ultimately, this fuels what we can offer our users. We welcome more partners to engage with us and maximize the benefits of AI for the world.

Acknowledgements

Thank you to our many research partners across the globe, including academics, universities, NGOs, and research organizations, for continuing to engage and work with Google on exciting research efforts. There are many teams within Google who make this work possible, including Google’s research teams and community, research partnerships, education, and policy teams. Finally, I would especially like to thank those who provided helpful feedback in the development of this post, including Sepi Hejazi Moghadam, Jill Alvidrez, Melanie Saldaña, Ashwani Sharma, Adriana Budura Skobeltsyn, Aimin Zhu, Michelle Hurtado, Salil Banerjee and Esmeralda Cardenas.


Google Research, 2022 & beyond

This was the ninth and final blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models	Computer Vision	Multimodal Models
Generative Models	Responsible AI	ML & Computer Systems
Efficient Deep Learning	Algorithmic Advances	Robotics
Natural Sciences	Health	Community Engagement

How 4 Black Founders Fund recipients are building with AI

February 27, 2023

by Google AI

Meet Google for Startups Black Founders Fund recipients from Africa, Brazil, Europe and the United States using Google AI technology to help people and society.

A vision-language approach for foundational UI understanding

February 24, 2023

by Google AI

Posted by Yang Li, Research Scientist, and Gang Li, Software Engineer, Google Research

The computational understanding of user interfaces (UI) is a key step towards achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including widget captioning, screen summarization, and command grounding, that address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. These works along with those developed by others in the field have showcased how deep neural networks can potentially transform end user experiences and the interaction design practice.

With these successes in addressing individual UI tasks, a natural question is whether we can obtain foundational understandings of UIs that can benefit specific UI tasks. As our first attempt to answer this question, we developed a multi-task model to address a range of UI tasks simultaneously. Although the work made some progress, a few challenges remain. Previous UI models heavily rely on UI view hierarchies — i.e., the structure or metadata of a mobile UI screen like the Document Object Model for a webpage — that allow a model to directly acquire detailed information of UI objects on the screen (e.g., their types, text content and positions). This metadata has given previous models advantages over their vision-only counterparts. However, view hierarchies are not always accessible, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the short-term gains from using view hierarchies, it may ultimately hamper the model performance and applicability. In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.

In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding completely from raw pixels. We introduce a unified approach to represent diverse UI tasks, the information for which can be universally represented by two core modalities: vision and language. The vision modality captures what a person would see from a UI screen, and the language modality can be natural language or any token sequences related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding and tappability prediction.

Spotlight Model

The Spotlight model input includes a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. The output is a text description or response about the region of interest. This simple input and output representation of the model is expressive enough to capture various UI tasks and allows scalable model architectures. This model design allows a spectrum of learning strategies and setups, from task-specific fine-tuning to multi-task learning and few-shot learning. The Spotlight model, as illustrated in the above figure, leverages existing architecture building blocks such as ViT and T5 that are pre-trained in the high-resourced, general vision-language domain, which allows us to build on top of the success of these general domain models.

Because UI tasks are often concerned with a specific object or area on the screen, which requires a model to be able to focus on the object or area of interest, we introduce a Focus Region Extractor to a vision-language model that enables the model to concentrate on the region in light of the screen context.

In particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings by using attention queries generated from the bounding box of the region (see paper for more details). Specifically, each coordinate (a scalar value, i.e., the left, top, right or bottom) of the bounding box, denoted as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) as a collection of dense vectors, and then fed to a Transformer model along with their coordinate-type embeddings. The dense vectors and their corresponding coordinate-type embeddings are color coded to indicate their affiliation with each coordinate value. Coordinate queries then attend to screen encodings output by ViT via cross attention, and the final attention output of the Transformer is used as the region representation for the downstream decoding by T5.

A target region on the screen is summarized by using its bounding box to query into screen encodings from ViT via attentional mechanisms.
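
To make the data flow of the Region Summarizer more concrete, here is a rough, dependency-free sketch. It is not the Spotlight implementation. It assumes one dense vector per coordinate (the post describes a collection of vectors), a single attention head, random stand-in weights, and no learned query, key, or value projections or Transformer layers. All function and parameter names are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(dim: int, hidden: int, seed: int = 0) -> Dict[str, list]:
    """Random stand-ins for weights that would be learned in practice."""
    rng = random.Random(seed)

    def rand_vec(n: int) -> Vector:
        return [rng.gauss(0.0, 0.5) for _ in range(n)]

    return {
        "w1": rand_vec(hidden),                        # scalar -> hidden
        "b1": rand_vec(hidden),
        "w2": [rand_vec(hidden) for _ in range(dim)],  # hidden -> dim
        "b2": rand_vec(dim),
        # One type embedding each for left, top, right, bottom.
        "type_embeddings": [rand_vec(dim) for _ in range(4)],
    }


def embed_coordinate(value: float, params: Dict[str, list]) -> Vector:
    """Tiny MLP: scalar coordinate -> ReLU hidden layer -> dense vector."""
    hidden = [max(0.0, w * value + b) for w, b in zip(params["w1"], params["b1"])]
    return [
        sum(h * w for h, w in zip(hidden, row)) + b
        for row, b in zip(params["w2"], params["b2"])
    ]


def cross_attend(queries: List[Vector], patches: List[Vector]) -> List[Vector]:
    """Scaled dot-product attention from coordinate queries to screen patches."""
    dim = len(patches[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, p)) / math.sqrt(dim) for p in patches]
        weights = softmax(scores)
        outputs.append([sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)])
    return outputs


def summarize_region(bbox: Sequence[float], patches: List[Vector],
                     params: Dict[str, list]) -> List[Vector]:
    """bbox is (left, top, right, bottom); returns one vector per coordinate query."""
    queries = []
    for value, type_emb in zip(bbox, params["type_embeddings"]):
        coord = embed_coordinate(value, params)
        queries.append([c + t for c, t in zip(coord, type_emb)])
    return cross_attend(queries, patches)


if __name__ == "__main__":
    dim = 8
    params = init_params(dim=dim, hidden=16)
    rng = random.Random(1)
    # Six fake patch encodings standing in for the ViT output.
    screen = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(6)]
    region = summarize_region((0.10, 0.25, 0.40, 0.35), screen, params)
    print(len(region), len(region[0]))
```

Because each output is a softmax-weighted average of the patch encodings, the region representation always stays within the range of the screen encodings, and the bounding box only decides which patches receive the most weight.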

Results

We pre-train the Spotlight model using two unlabeled datasets: an internal dataset based on the C4 corpus, with 80 million web pages, and an internal mobile dataset, with 2.5 million mobile UI screens. We then separately fine-tune the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar a model text description is to a set of references created by human raters. For command grounding, we report accuracy that measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report F1 scores that measure the model’s ability to tell tappable objects from untappable ones.
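
As a small illustration of two of the metrics named above, the sketch below computes grounding accuracy and tappability F1 from lists of predictions. It assumes that a grounding prediction counts as correct when the predicted object identifier equals the target identifier, which the post does not specify, and it returns fractions rather than percentages. CIDEr is omitted because it is considerably more involved.

```python
from typing import Hashable, Sequence


def accuracy(predicted: Sequence[Hashable], target: Sequence[Hashable]) -> float:
    """Fraction of examples where the predicted object matches the target."""
    if len(predicted) != len(target) or not target:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, target)) / len(target)


def f1_score(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """F1 for the positive class (True means the object is tappable)."""
    if len(predicted) != len(actual):
        raise ValueError("inputs must be of equal length")
    true_pos = sum(p and a for p, a in zip(predicted, actual))
    false_pos = sum(p and not a for p, a in zip(predicted, actual))
    false_neg = sum(a and not p for p, a in zip(predicted, actual))
    if true_pos == 0:
        return 0.0  # no correct positives, so precision or recall is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up labels, for illustration only.
    print(accuracy(["btn_ok", "btn_back", "img_1"], ["btn_ok", "btn_back", "img_2"]))
    print(f1_score([True, True, False, False], [True, False, True, False]))
```

Multiplying either value by 100 gives numbers on the same scale as the accuracy and F1 figures reported in the tables.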

In this experiment, we compare Spotlight with several benchmark models. Widget Caption uses view hierarchy and the image of each UI object to generate a text description for the object. Similarly, Screen2Words uses view hierarchy and the screenshot as well as auxiliary features (e.g., app description) to generate a summary for the screen. In the same vein, VUT combines screenshots and view hierarchies for performing multiple tasks. Finally, the original Tappability model leverages object metadata from view hierarchy and the screenshot to predict object tappability. Taperception, a follow-up model of Tappability, uses a vision-only tappability prediction approach. We examine two Spotlight model variants that differ in the size of their ViT building block: B/16 and L/16. Spotlight drastically exceeded the state-of-the-art across four UI modeling tasks.

Model	Captioning	Summarization	Grounding	Tappability
Baselines
Widget Caption		–	–	–
Screen2Words	–	61.3	–	–
VUT	99.3	65.6	82.1	–
Taperception	–	–	–	85.5
Tappability	–	–	–	87.9
Spotlight
B/16	136.6	103.5	95.7	86.9
L/16	141.8	106.7	95.8	88.4

We then pursue a more challenging setup where we ask the model to learn multiple tasks simultaneously because a multi-task model can substantially reduce model footprint. As shown in the table below, the experiments showed that our model still performs competitively.

Model	Captioning	Summarization	Grounding	Tappability
VUT multi-task	99.3	65.1	80.8	–
Spotlight B/16	140	102.7	90.8	89.4
Spotlight L/16	141.3	99.2	94.2	89.5

To understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the attention weights (which indicate where the model attention is on the screenshot) for both widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts “select Chelsea team” for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heatmap (which illustrates the distribution of attention weights) on the right that the model learns to attend to not only the target region of the check box, but also the text “Chelsea” on the far left to generate the caption. For the screen summarization example, the model predicts “page displaying the tutorial of a learning app” given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to important parts on the screen for summarization.

For the widget captioning task, the attention heatmap shows the model attending to the checkbox, i.e., the target object, and the text label on its left when generating a caption for the object. The red bounding box in the figure is for illustration purposes.

For the screen summarization task, in which the target region encloses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.

Conclusion

We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as the input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility and automation to interaction design and evaluation. Our vision-only approach for mobile UI understanding alleviates the need to use view hierarchy, allows the architecture to easily scale and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small and our experiments show the trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and potentially advance the fronts of many interaction and user experience tasks.

Acknowledgment

We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback and for proofreading the paper. Thanks to Tom Small for his help in creating animated figures in this post.

Pre-training generalist agents using offline reinforcement learning

February 23, 2023

by Google AI

Posted by Aviral Kumar, Student Researcher, and Sergey Levine, Research Scientist, Google Research

Reinforcement learning (RL) algorithms can learn skills to solve decision-making tasks like playing games, enabling robots to pick up objects, or even optimizing microchip designs. However, running RL algorithms in the real world requires expensive active data collection. Pre-training on diverse datasets has proven to enable data-efficient fine-tuning for individual downstream tasks in natural language processing (NLP) and vision problems. In the same way that BERT or GPT-3 models provide general-purpose initialization for NLP, large RL–pre-trained models could provide general-purpose initialization for decision-making. So, we ask the question: Can we enable similar pre-training to accelerate RL methods and create a general-purpose “backbone” for efficient RL across various tasks?

In “Offline Q-learning on Diverse Multi-Task Data Both Scales and Generalizes”, to be published at ICLR 2023, we discuss how we scaled offline RL, which can be used to train value functions on previously collected static datasets, to provide such a general pre-training method. We demonstrate that Scaled Q-Learning using a diverse dataset is sufficient to learn representations that facilitate rapid transfer to novel tasks and fast online learning on new variations of a task, improving significantly over existing representation learning approaches and even Transformer-based methods that use much larger models.

Scaled Q-learning: Multi-task pre-training with conservative Q-learning

To provide a general-purpose pre-training approach, offline RL needs to be scalable, allowing us to pre-train on data across different tasks and utilize expressive neural network models to acquire powerful pre-trained backbones, specialized to individual downstream tasks. We based our offline RL pre-training method on conservative Q-learning (CQL), a simple offline RL method that combines standard Q-learning updates with an additional regularizer that minimizes the value of unseen actions. With discrete actions, the CQL regularizer is equivalent to a standard cross-entropy loss, which is a simple, one-line modification on standard deep Q-learning. A few crucial design decisions made this possible:

Neural network size: We found that multi-game Q-learning required large neural network architectures. While prior methods often used relatively shallow convolutional networks, we found that models as large as a ResNet 101 led to significant improvements over smaller models.
Neural network architecture: To learn pre-trained backbones that are useful for new games, our final architecture uses a shared neural network backbone, with separate 1-layer heads outputting Q-values of each game. This design avoids interference between the games during pre-training, while still providing enough data sharing to learn a single shared representation. Our shared vision backbone also utilized a learned position embedding (akin to Transformer models) to keep track of spatial information in the game.
Representational regularization: Recent work has observed that Q-learning tends to suffer from representational collapse issues, where even large neural networks can fail to learn effective representations. To counteract this issue, we leverage our prior work to normalize the last layer features of the shared part of the Q-network. Additionally, we utilized a categorical distributional RL loss for Q-learning, which is known to provide richer representations that improve downstream task performance.
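To make the cross-entropy equivalence mentioned above concrete, here is a minimal sketch of the CQL regularizer for discrete actions. It is illustrative only: the function names, the example Q-values, and the plain squared TD error are assumptions chosen for brevity (the method described here uses a categorical distributional loss instead), so this is not the implementation from the paper.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def cql_regularizer(q_values, dataset_action):
    """Conservative term: logsumexp over all actions minus Q of the logged action.

    Minimizing it lowers the Q-values of actions absent from the data and
    raises the Q-value of the action that was actually taken.
    """
    return logsumexp(q_values) - q_values[dataset_action]


def cross_entropy(logits, label):
    """Standard softmax cross-entropy with an integer class label."""
    return -(logits[label] - logsumexp(logits))


def cql_loss(q_values, dataset_action, td_target, alpha=1.0):
    """Squared TD error plus the weighted conservative term (assumed form)."""
    td_error = q_values[dataset_action] - td_target
    return 0.5 * td_error ** 2 + alpha * cql_regularizer(q_values, dataset_action)


# Hypothetical Q-values for one state with three discrete actions.
q = [1.0, 2.0, 0.5]
# Treating the Q-values as logits, the regularizer and cross-entropy coincide.
assert abs(cql_regularizer(q, 1) - cross_entropy(q, 1)) < 1e-12
```

The weight `alpha` is a hypothetical knob that trades off the TD error against conservatism; setting it to zero recovers an ordinary Q-learning loss.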

The multi-task Atari benchmark

We evaluate our approach for scalable offline RL on a suite of Atari games, where the goal is to train a single RL agent to play a collection of games using heterogeneous data from low-quality (i.e., suboptimal) players, and then use the resulting network backbone to quickly learn new variations of pre-training games or completely new games. Training a single policy that can play many different Atari games is difficult enough even with standard online deep RL methods, as each game requires a different strategy and different representations. In the offline setting, some prior works, such as multi-game decision transformers, proposed to dispense with RL entirely, and instead utilize conditional imitation learning in an attempt to scale with large neural network architectures, such as transformers. However, in this work, we show that this kind of multi-game pre-training can be done effectively via RL by employing CQL in combination with a few careful design decisions, which we describe above.

Scalability on training games

We evaluate the Scaled Q-Learning method’s performance and scalability using two data compositions: (1) near optimal data, consisting of all the training data appearing in replay buffers of previous RL runs, and (2) low quality data, consisting of data from the first 20% of the trials in the replay buffer (i.e., only data from highly suboptimal policies). In our results below, we compare Scaled Q-Learning with an 80-million parameter model to multi-game decision transformers (DT) with either 40-million or 80-million parameter models, and a behavioral cloning (imitation learning) baseline (BC). We observe that Scaled Q-Learning is the only approach that improves over the offline data, attaining about 80% of human normalized performance.
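The post does not spell out how human-normalized performance is computed. The sketch below assumes the convention commonly used for Atari, in which a random policy scores 0 and a human reference player scores 1. The function name and the example scores are hypothetical.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Rescale a raw game score so that random play maps to 0 and human play to 1."""
    return (agent_score - random_score) / (human_score - random_score)


# Hypothetical raw scores for a single game (not taken from the paper).
score = human_normalized_score(agent_score=50.0, random_score=10.0, human_score=60.0)
print(f"human-normalized score: {score:.2f}")
```

Under this convention, a value of 0.8 corresponds to the "80% of human normalized performance" phrasing. How per-game values are aggregated across games is a separate choice not covered here.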

Further, as shown below, Scaled Q-Learning not only improves performance but also enjoys favorable scaling properties: just as the performance of pre-trained language and vision models improves as network sizes get bigger, following what is typically referred to as “power-law scaling”, the performance of Scaled Q-learning improves with model size in a similar way. While this may be unsurprising, this kind of scaling has been elusive in RL, with performance often deteriorating with larger model sizes. This suggests that Scaled Q-Learning in combination with the above design choices better unlocks the ability of offline RL to utilize large models.

Fine-tuning to new games and variations

To evaluate fine-tuning from this offline initialization, we consider two settings: (1) fine-tuning to a new, entirely unseen game with a small amount of offline data from that game, corresponding to 2M transitions of gameplay, and (2) fine-tuning to a new variant of the games with online interaction. The fine-tuning from offline gameplay data is illustrated below. Note that this condition is generally more favorable to imitation-style methods, Decision Transformer and behavioral cloning, since the offline data for the new games is of relatively high-quality. Nonetheless, we see that in most cases Scaled Q-learning improves over alternative approaches (80% on average), as well as dedicated representation learning methods, such as MAE or CPC, which only use the offline data to learn visual representations rather than value functions.

In the online setting, we see even larger improvements from pre-training with Scaled Q-learning. In this case, representation learning methods like MAE yield minimal improvement during online RL, whereas Scaled Q-Learning can successfully integrate prior knowledge about the pre-training games to significantly improve the final score after 20k online interaction steps.

These results demonstrate that pre-training generalist value function backbones with multi-task offline RL can significantly boost performance of RL on downstream tasks, both in offline and online mode. Note that these fine-tuning tasks are quite difficult: the various Atari games, and even variants of the same game, differ significantly in appearance and dynamics. For example, the target blocks in Breakout disappear in the variation of the game as shown below, making control difficult. However, the success of Scaled Q-learning, particularly as compared to visual representation learning techniques, such as MAE and CPC, suggests that the model is in fact learning some representation of the game dynamics, rather than merely providing better visual features.

Fine-tuning with online RL for variants of the game Freeway, Hero, and Breakout. The new variant used in fine-tuning is shown in the bottom row of each figure, the original game seen in pre-training is in the top row. Fine-tuning from Scaled Q-Learning significantly outperforms MAE (a visual representation learning method) and learning from scratch with single-game DQN.

Conclusion and takeaways

We presented Scaled Q-Learning, a pre-training method for scaled offline RL that builds on the CQL algorithm, and demonstrated how it enables efficient offline RL for multi-task training. This work made initial progress towards enabling more practical real-world training of RL agents as an alternative to costly and complex simulation-based pipelines or large-scale experiments. Perhaps in the long run, similar work will lead to generally capable pre-trained RL agents that develop broadly applicable exploration and interaction skills from large-scale offline pre-training. Validating these results on a broader range of more realistic tasks, in domains such as robotics (see some initial results) and NLP, is an important direction for future research. Offline RL pre-training has a lot of potential, and we expect that we will see many advances in this area in future work.

Acknowledgements

This work was done by Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Special thanks to Sherry Yang, Ofir Nachum, and Kuang-Huei Lee for help with the multi-game decision transformer codebase for evaluation and the multi-game Atari benchmark, and Tom Small for illustrations and animation.

Google Research, 2022 & beyond: Health

February 23, 2023

by Google AI

Posted by Greg Corrado, Distinguished Scientist, and Yossi Matias, VP Engineering and Research, Google Research

(This is Part 8 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Google’s focus on AI stems from the conviction that this transformational technology will benefit society through its capacity to assist, complement, and empower people in almost every field and sector. In no area is the magnitude of this opportunity greater than in the spheres of healthcare and medicine. Commensurate with our mission to demonstrate these societal benefits, Google Research’s programs in applied machine learning (ML) have helped place Alphabet among the top five most impactful corporate research institutions in the health and life sciences publications on the Nature Impact Index in every year from 2019 through 2022.

Our Health research publications have had broad impact, spanning the fields of biomarkers, consumer sensors, dermatology, endoscopy, epidemiology, medicine, genomics, oncology, ophthalmology, pathology, public & environmental health, and radiology. Today we examine three specific themes that came to the fore in the last year:

· Criticality of technology partnerships

· Shift towards mobile health

· Generative ML in health applications

In each section, we emphasize the importance of a measured and collaborative approach to innovation in health. Unlike the “launch and iterate” approach typical in consumer product development, applying ML to health requires thoughtful assessment, ecosystem awareness, and rigorous testing. All healthcare technologies must demonstrate to regulators that they are safe and effective prior to deployment and need to meet rigorous patient privacy and performance monitoring standards. But ML systems, as new entrants to the field, additionally must discover their best uses in the health workflows and earn the trust of healthcare professionals and patients. This domain-specific integration and validation work is not something tech companies should embark upon alone, but should do so only in close collaboration with expert health partners.

Criticality of technology partnerships

Responsible innovation requires the patience and sustained investment to collectively follow the long arc from primary research to human impact. In our own journey to promote the use of ML to prevent blindness in underserved diabetic populations, six years elapsed between our publication of the primary algorithmic research, and the recent deployment study demonstrating the real-world accuracy of the integrated ML solution in a community-based screening setting. Fortunately, we have found that we can radically accelerate this journey from benchtop-ML to AI-at-the-bedside with thoughtfully constructed technology partnerships.

The need for accelerated release of health-related ML technologies is apparent, for example, in oncology. Breast cancer and lung cancer are two of the most common cancer types, and for both, early detection is key. If ML can yield greater accuracy and expanded availability of screening for these cancers, patient outcomes will improve — but the longer we wait to deploy these advances, the fewer people will be helped. Partnership can allow new technologies to safely reach patients with less delay — established med-tech companies can integrate new AI capabilities into existing product suites, seek the appropriate regulatory clearances, and use their existing customer base to rapidly deploy these technologies.

We’ve seen this play out first hand. Just two and a half years after sharing our primary research using ML to improve breast cancer screening, we partnered with iCAD, a leading purveyor of mammography software, to begin integrating our technology into their products. We see this same accelerated pattern in translating our research on deep learning for low-dose CT scans to lung cancer screening workflows through our partnership with RadNet’s Aidence.

Genomics is another area where partnership has proven a powerful accelerant for ML technology. This past year, we collaborated with Stanford University to rapidly diagnose genetic disease by combining novel sequencing technologies and ML to sequence a patient’s entire genome in record-setting time, allowing life-saving interventions. Separately, we announced a partnership with Pacific Biosciences to further advance genomic technologies in research and the clinic by layering our ML techniques on top of their sequencing methods, building on our long running open source projects in deep learning genomics. Later in the same year PacBio announced Revio, a new genome sequencing tool powered by our technology.

Partnerships between med-tech companies and AI-tech companies can accelerate translation of technology, but these partnerships are a complement to, not a substitute for, open research and open software that moves the entire field forward. For example, within our medical imaging portfolio, we introduced a new approach to simplify transfer learning for chest x-ray model development, methods to accelerate the life-cycle of ML systems for medical imaging via robust and efficient self-supervision, and techniques to make medical imaging systems more robust to outliers — all within 2022.

Moving forward, we believe this mix of scientific openness and cross-industry partnerships will be a critical catalyst in realizing the benefits of human-centered AI in healthcare and medicine.

Shift towards mobile medicine

In healthcare overall, and recapitulated in ML research in health applications, there has been a shift in emphasis away from concentrated centralized care (e.g., hospitalizations) and towards distributed care (e.g., reaching patients in their communities). Thus, we’re working to develop mobile ML-solutions that can be brought to the patient, rather than bringing the patient to the (ML-powered) clinic. In 2021, we shared some of our early work using smartphone cameras to measure heart rate and to help identify skin conditions. In 2022, we shared new research on the potential for smartphone camera selfies to assess cardiovascular health and metabolic risks to eyesight and the potential for smartphone microphones held to the chest to help interpret heart and lung sounds.

These examples all use the sensors that already exist on every smartphone. While these advances are valuable, there is still great potential in extending mobile health capabilities by developing new sensing technologies. One of our most exciting research projects in this area leverages new sensors that easily connect to modern smartphones to enable mobile maternal ultrasound in under-resourced communities.

Each year, complications from pregnancy & childbirth contribute to 295,000 maternal deaths and 2.4 million neonatal deaths, disproportionately impacting low income populations globally. Obstetric ultrasound is an important component of quality antenatal care, but up to 50% of women in low-and-middle-income countries receive no ultrasound screening during pregnancy. Innovators in ultrasound hardware have made rapid progress towards low-cost, handheld, portable ultrasound probes that can be driven with just a smartphone, but there’s a critical missing piece — a shortage of field technicians with the skills and expertise to operate the ultrasound probe and interpret its shadowy images. Remote interpretation is feasible of course, but is impractical in settings with unreliable or slow internet connectivity.

With the right ML-powered mobile ultrasounds, providers such as midwives, nurses, and community health workers could have the potential to bring obstetric ultrasound to those most in need and catch problems before it’s too late. Previous work had shown that convolutional neural networks (CNNs) could interpret ultrasounds acquired by trained sonographers using a standardized acquisition protocol. Recognizing this opportunity for AI to unblock access to potentially lifesaving information, we’ve spent the last couple of years working in collaboration with academic partners and researchers in the US and Zambia to improve and expand the ability to automatically interpret ultrasound video captures acquired by simply sweeping an ultrasound probe across the mother’s abdomen, a procedure that can easily be taught to non-experts.

This ultrasound acquisition procedure can be performed by novices with a few hours of ultrasound training.

Using just a low cost, battery-powered ultrasound device and a smartphone, the accuracy of this method is on par with existing clinical standards for professional sonographers to estimate gestational age and fetal malpresentation.

The accuracy of this AI enabled procedure is on-par with the clinical standard for estimating gestational age.

We are in the early stages of a wide-spread transformation in portable medical imaging. In the future, ML-powered mobile ultrasound will augment the phone’s built-in sensors to allow in-the-field triage and screening for a wide range of medical issues, all with minimal training, extending access to care for millions.

Generative ML in Health

As the long arc of the application of ML to health plays out, we expect generative modeling to settle into a role complementary to the pattern recognition systems that are now relatively commonplace. In the past we’ve explored the suitability of generative image models in data augmentation, discussed how generative models might be used to capture interactions among correlated clinical events, and even used them to generate realistic, but entirely synthetic, electronic medical records for research purposes.

Generating synthetic data from the original data with EHR-Safe.

Any discussion of today’s outlook on applied generative modeling would be incomplete without mention of recent developments in the field of large language models (LLMs). Nearly a decade of research in the making, publicly available demonstrations of text synthesis via generative recurrent neural networks have captured the world’s imagination. These technologies undoubtedly have real-world applications; in fact, Google was among the first to deploy earlier variants of these networks in live consumer products. But when considering their applications to health, we must again return to our mantra of measurement: we have a fundamental responsibility to test technologies thoroughly and proceed with caution. The gravity of building an ML system that might one day impact real people with real health issues cannot be overstated.

To that end, in December of last year we published a pre-print on LLMs and the encoding of clinical knowledge which (1) collated and expanded benchmarks for evaluating automated medical question answering systems, and (2) introduced our own research-grade medical question answering LLM, Med-PaLM. For example, if one asked Med-PaLM, “Does stress cause nosebleeds?” the LLM would generate a response explaining that yes, stress can cause nosebleeds, and detail some possible mechanisms. The purpose of Med-PaLM is to allow researchers to experiment with and improve upon the representation, retrieval, and communication of health information by LLMs; it is not a finished medical question answering product.

We were excited to report that Med-PaLM substantially outperformed other systems on these benchmarks, across the board. That said, a critical take-away of our paper is that merely receiving a “passing” mark on a set of medical exam questions (which ours and some other ML systems do) still falls well short of the safety and accuracy required to support real-world use for medical question answering. We expect that progress in this area will be brisk — but that much like our journey bringing CNNs to medical imaging, the maturation of LLMs for applications in health will require further research, partnership, care, and patience.

Our model, Med-PaLM, obtains state-of-the-art performance on the MedQA USMLE dataset exceeding previous best by 7%.

Concluding thoughts

We expect all these trends to continue, and perhaps even accelerate, in 2023. In a drive to more efficiently map the arc from innovation to impact in AI for healthcare, we will see increased collaboration between academic, med-tech, AI-tech, and healthcare organizations. This is likely to interact positively with the measured, but nonetheless transformational, expansion of the role of phones and mobile sensors in the provisioning of care, potentially well beyond what we presently imagine telehealth to be. And of course, it’s hard to be in the field of AI these days, and not be excited at the prospects for generative AI and large language models. But particularly in the health domain, it is essential that we use the tools of partnership, and the highest standards of testing to realize this promise. Technology will keep changing, and what we know about human health will keep changing too. What will remain the same is the people caring for each other, and trying to do things better than before. We are excited about the role AI can play in improving healthcare in years to come.

Google Research, 2022 & beyond

This was the eighth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models	Computer Vision	Multimodal Models
Generative Models	Responsible AI	ML & Computer Systems
Efficient Deep Learning	Algorithmic Advances	Robotics
Natural Sciences	Health	Community Engagement*

* Articles will be linked as they are released.

Suppressing quantum errors by scaling a surface code logical qubit

February 22, 2023

by Google AI

Posted by Hartmut Neven, VP of Engineering, and Julian Kelly, Director of Quantum Hardware, on behalf of the Google Quantum AI Team

Many years from today, scientists will be able to use fault-tolerant quantum computers for large-scale computations with applications across science and industry. These quantum computers will be much bigger than today, consisting of millions of coherent quantum bits, or qubits. But there’s a catch — these basic building blocks must be good enough or the systems will be overrun with errors.

Currently, the error rates of the qubits on our 3rd generation Sycamore processor are typically between 1 in 10,000 and 1 in 100. Through our work and that of others, we understand that developing large-scale quantum computers will require far lower error rates. We will need rates in the range of 1 in 10⁹ to 1 in 10⁶ to run quantum circuits that can solve industrially relevant problems.

So how do we get there, knowing that squeezing three to six orders of magnitude of better performance from our current physical qubits is unlikely? Our team has created a roadmap that has directed our research for the last several years, improving the performance of our quantum computers in gradual steps toward a fault-tolerant quantum computer.

Roadmap for building a useful error-corrected quantum computer with key milestones. We are currently building one logical qubit that we will scale in the future.

Today, in “Suppressing Quantum Errors by Scaling a Surface Code Logical Qubit”, published in Nature, we are announcing that we have reached the second milestone on our roadmap. Our experimental results demonstrate a prototype of the basic unit of an error-corrected quantum computer known as a logical qubit, with performance nearing the regime that enables scalable fault-tolerant quantum computing.

A paradigm shift: from physical qubits to logical qubits

Quantum error correction (QEC) represents a paradigm shift from today’s quantum computing, where each physical qubit on the processor acts as a unit of computation. It provides the recipe to reach low errors by trading many good qubits for an excellent one: information is encoded across several physical qubits to construct a single logical qubit that is more resilient and capable of running large-scale quantum algorithms. Under the right conditions, the more physical qubits used to build a logical qubit, the better that logical qubit becomes.

However, this will not work if the added errors from each additional physical qubit outweigh the benefits of QEC. Until now, the high physical error rates have always won out.

To that end, we use a particular error-correcting code called a surface code and show for the first time that increasing the size of the code decreases the error rate of the logical qubit. A first-ever for any quantum computing platform, this was achieved by painstakingly mitigating many error sources as we scaled from 17 to 49 physical qubits. This work is evidence that with enough care, we can produce the logical qubits necessary for a large-scale error-corrected quantum computer.

Quantum error correction with surface codes

How does an error-correcting code protect information? Take a simple example from classical communication: Bob wants to send Alice a single bit that reads “1” across a noisy communication channel. Recognizing that the message is lost if the bit flips to “0”, Bob instead sends three bits: “111”. If one erroneously flips, Alice could take a majority vote (a simple error-correcting code) of all the received bits and still understand the intended message. Repeating the information more than three times — increasing the “size” of the code — would enable the code to tolerate more individual errors.
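The Bob-and-Alice example can be sketched in a few lines. The code below assumes each transmitted bit flips independently with the same probability `p` (an assumption made for illustration, not a claim about any real channel) and shows how a larger repetition code lowers the chance that the majority vote is wrong.

```python
from math import comb


def encode(bit, n=3):
    """Repeat a single bit n times, e.g. 1 -> [1, 1, 1]."""
    return [bit] * n


def decode(received):
    """Majority vote over the received bits (n is assumed to be odd)."""
    return 1 if sum(received) > len(received) / 2 else 0


def logical_error_rate(p, n):
    """Probability that more than half of n bits flip, each independently with probability p."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# One flipped bit out of three is still decoded correctly.
assert decode([1, 0, 1]) == 1

# With a hypothetical 10% flip probability, growing the code helps.
for n in (3, 5, 7):
    print(n, logical_error_rate(0.1, n))
```

The benefit only appears when `p` is small enough. With `p` above one half, adding more bits makes the majority vote worse, which is the classical counterpart of the threshold discussed later in this post.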

Many physical qubits on a quantum processor acting as one logical qubit in an error-correcting code called a surface code.

A surface code takes this principle and imagines a practical quantum implementation. It has to satisfy two additional constraints. First, the surface code must be able to correct not just bit flips, taking a qubit from |0⟩ to |1⟩, but also phase flips. This error is unique to quantum states and transforms a qubit in a superposition state, for example from “|0⟩ + |1⟩” to “|0⟩ – |1⟩”. Second, checking the qubits’ states would destroy their superpositions, so one needs a way of detecting errors without measuring the states directly.

To address these constraints, we arrange two types of qubits on a checkerboard. “Data” qubits on the vertices make up the logical qubit, while “measure” qubits at the center of each square are used for so-called “stabilizer measurements.” These measurements tell us whether the qubits are all the same, as desired, or different, signaling that an error occurred, without actually revealing the value of the individual data qubits.

We tile two types of stabilizer measurements in a checkerboard pattern to protect the logical data from bit- and phase-flips. If some of the stabilizer measurements register an error, then correlations in the stabilizer measurements are used to identify which error(s) occurred and where.
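A classical analogy may help show how checks can flag an error without revealing the protected value. The sketch below is not a surface code. It uses parity checks between neighbouring bits of a repetition code, and the function names are hypothetical. The checks report only whether neighbours agree, so the codewords 000 and 111 produce the same check results.

```python
def syndrome(bits):
    """Parity of each neighbouring pair: 0 means 'same', 1 means 'different'."""
    return [bits[i] ^ bits[i + 1] for i in range(len(bits) - 1)]


def locate_single_flip(bits):
    """Infer which single bit flipped from the checks alone (needs at least 3 bits)."""
    fired = [i for i, s in enumerate(syndrome(bits)) if s]
    if not fired:
        return None  # no check fired, so no error was detected
    if len(fired) == 2 and fired[1] == fired[0] + 1:
        return fired[1]  # interior bit shared by the two neighbouring checks
    if fired == [0]:
        return 0  # first bit is touched by only one check
    if fired == [len(bits) - 2]:
        return len(bits) - 1  # last bit is touched by only one check
    raise ValueError("syndrome is not consistent with a single flip")


# Both codewords give the same all-zero syndrome: the value stays hidden.
assert syndrome([0, 0, 0]) == syndrome([1, 1, 1]) == [0, 0]
# The same error gives the same syndrome regardless of the encoded value.
assert syndrome([0, 1, 0]) == syndrome([1, 0, 1]) == [1, 1]
assert locate_single_flip([1, 0, 1]) == 1
```

As in the surface code, an error on an interior bit is registered by the two nearest checks. This classical version handles only bit flips, whereas the surface code tiles a second type of check to catch phase flips as well.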

Surface-code QEC. Data qubits (yellow) are at the vertices of a checkerboard. Measure qubits at the center of each square are used for stabilizer measurements (blue squares). Dark blue squares check for bit-flip errors, while light blue squares check for phase-flip errors. Left: A phase-flip error. The two nearest light blue stabilizer measurements register the error (light red). Right: A bit-flip error. The two nearest dark blue stabilizer measurements register the error (dark red).

Just as Bob’s message to Alice in the example above became more robust against errors with increasing code size, a larger surface code better protects the logical information it contains. The surface code can withstand a number of bit- and phase-flip errors each equal to less than half the distance, where the distance is the number of data qubits that span the surface code in either dimension.

But here’s the problem: every individual physical qubit is prone to errors, so the more qubits in a code, the more opportunity for errors. We want the higher protection offered by QEC to outweigh the increased opportunities for errors as we increase the number of qubits. For this to happen, the physical qubits must have errors below the so-called “fault-tolerant threshold.” For the surface code, this threshold is quite low. So low that it hasn’t been experimentally feasible until recently. We are now on the precipice of reaching this coveted regime.

Making and controlling high-quality physical qubits

Entering the regime where QEC improves with scale required improving every aspect of our quantum computers, from nanofabrication of the physical qubits to the optimized control of the full quantum system. These experiments ran on a state-of-the-art 3rd generation Sycamore processor architecture optimized for QEC using the surface code with improvements across the board:

Increased qubit relaxation and dephasing lifetimes through an improved fabrication process and environmental noise reduction near the quantum processor.
Lowered cross-talk between all physical qubits during parallel operation by optimizing quantum processor circuit design and nanofabrication.
Reduced drift and improved qubit control fidelity through upgraded custom electronics.
Implemented faster and higher-fidelity readout and reset operations compared with previous generations of the Sycamore processor.
Reduced calibration errors by extensively modeling the full quantum system and employing better system-optimization algorithms.
Developed context-aware and fully parallel calibrations to minimize drift and optimize control parameters for QEC circuits.
Enhanced dynamical decoupling protocols to protect physical qubits from noise and cross-talk during idling operations.

Running surface code circuits

With these upgrades in place, we ran experiments to measure the ratio (𝚲_3,5) between the logical error rate of a distance-3 surface code (ε₃) with 17 qubits and that of a distance-5 surface code (ε₅) with 49 qubits: 𝚲_3,5 = ε₃ / ε₅.

Comparison of logical fidelity (defined as 1-ε) between distance-3 (d=3) and distance-5 (d=5) surface codes. The distance-5 code contains four possible distance-3 arrangements, with one example shown in the red outline (left). As improvements were made, the d=5 fidelity increased faster than that of the d=3, eventually overtaking the distance-3 code, as shown in the top-right data points (right), whose average lies slightly to the left of the ε₃ = ε₅ line.

The results of these experiments are shown above on the right. Continued improvements over several months allowed us to reduce the logical errors of both grids, leading to the distance-5 grid (ε₅ = 2.914%) outperforming the distance-3 grids (ε₃ = 3.028%) by 4% (𝚲_3,5 = 1.04) with 5𝛔 confidence. While this might seem like a small improvement, it’s important to emphasize that the result represents a first for the field since Peter Shor’s 1995 QEC proposal. A larger code outperforming a smaller one is a key signature of QEC, and all quantum computing architectures will need to pass this hurdle to realize a path to the low errors that are necessary for quantum applications.
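The arithmetic behind the quoted ratio is short enough to check directly. The snippet below uses only the two logical error rates stated above; the function name is a hypothetical helper.

```python
def error_suppression_factor(eps_smaller_code, eps_larger_code):
    """Ratio of logical error rates; values above 1 mean the larger code wins."""
    return eps_smaller_code / eps_larger_code


eps_3 = 3.028e-2  # distance-3 logical error rate, as quoted in the text
eps_5 = 2.914e-2  # distance-5 logical error rate, as quoted in the text

lambda_3_5 = error_suppression_factor(eps_3, eps_5)
print(f"Lambda_3,5 = {lambda_3_5:.2f}")
```

The ratio rounds to the 1.04 quoted in the text, which is where the "4%" figure comes from.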

The path forward

These results indicate that we are entering a new era of practical QEC. The Google Quantum AI team has spent the last few years thinking about how we define success in this new era, and how we measure progress along the way.

The ultimate goal is to demonstrate a pathway to achieving the low errors needed for using quantum computers in meaningful applications. To this end, our target remains achieving logical error rates of 1 in 10⁶ or lower per cycle of QEC. In the figure below on the left, we outline the path that we anticipate to reach this target. As we continue improving our physical qubits (and hence the performance of our logical qubits), we expect to gradually increase 𝚲 from close to 1 in this work to larger numbers. The figure below shows that a value of 𝚲 = 4 and a code distance of 17 (577 physical qubits with good enough quality) will yield a logical error rate below our target of 1 in 10⁶.
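The qubit counts quoted in this post (17, 49, and 577) are consistent with a distance-d surface code using d² data qubits and d² − 1 measure qubits. The sketch below checks those counts and also estimates how much suppression a given 𝚲 would buy. The second function rests on a loudly labeled assumption: that 𝚲 stays constant across distances, so each increase of the distance by 2 divides the logical error rate by 𝚲. It reports only a factor relative to distance 3, since the post gives no absolute starting error rate for the 𝚲 = 4 scenario.

```python
def surface_code_qubits(distance):
    """Total physical qubits: distance**2 data qubits plus distance**2 - 1 measure qubits."""
    data_qubits = distance**2
    measure_qubits = distance**2 - 1
    return data_qubits + measure_qubits


def suppression_relative_to_d3(lam, distance):
    """Error reduction factor versus a distance-3 code, ASSUMING a constant Lambda.

    Each step of +2 in (odd) distance is assumed to divide the error by lam.
    """
    steps = (distance - 3) // 2
    return lam**steps


for d in (3, 5, 17):
    print(d, surface_code_qubits(d))

# Seven steps of +2 separate distance 3 from distance 17.
print(suppression_relative_to_d3(4, 17))
```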

While this result is still a few years out, we have an experimental technique to probe error rates this low with today’s hardware, albeit in limited circumstances. While two-dimensional surface codes allow us to correct both bit- and phase-flip errors, we can also construct one-dimensional repetition codes that correct only one type of error but have relaxed requirements. On the right below, we show that a distance-25 repetition code can reach error rates per cycle close to 1 in 10⁶. At such low errors, we see new kinds of error mechanisms that are not yet observable with our surface codes. By controlling for these error mechanisms, we can improve repetition codes to error rates near 1 in 10⁷.

Left: Expected progression as we improve performance (quantified by 𝚲) and scale (quantified by code distance) for surface codes. Right: Experimentally measured logical error rates per cycle versus the distance of one-dimensional repetition codes and two-dimensional surface codes.

Reaching this milestone reflects three years of focused work by the entire Google Quantum AI team following our demonstration of a quantum computer outperforming a classical computer. In our march toward building fault-tolerant quantum computers, we will continue to use the target error rates in the figure above to measure our progress. With further improvements toward our next milestone, we anticipate entering the fault-tolerant regime, where we can exponentially suppress logical errors and unlock the first useful error-corrected quantum applications. In the meantime, we continue to explore various ways of solving problems using quantum computers in topics ranging from condensed matter physics to chemistry, machine learning, and materials science.

Our progress toward quantum error correction

February 22, 2023

by Google AI

Our CEO Sundar Pichai shares news about our latest milestone in quantum computing.

Google Research, 2022 & beyond: Natural sciences

February 21, 2023

by Google AI

Posted by John Platt, Distinguished Scientist, Google Research

(This is Part 7 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

It’s an incredibly exciting time to be a scientist. With the amazing advances in machine learning (ML) and quantum computing, we now have powerful new tools that enable us to act on our curiosity, collaborate in new ways, and radically accelerate progress toward breakthrough scientific discoveries.

Since joining Google Research eight years ago, I’ve had the privilege of being part of a community of talented researchers fascinated by applying cutting-edge computing to push the boundaries of what is possible in applied science. Our teams are exploring topics across the physical and natural sciences. So, for this year’s blog post I want to focus on high-impact advances we’ve made recently in the fields of biology and physics, from helping to organize the world’s protein and genomics information to benefit people’s lives to improving our understanding of the nature of the universe with quantum computers. We are inspired by the great potential of this work.

Using machine learning to unlock mysteries in biology

Many of our researchers are fascinated by the extraordinary complexity of biology, from the mysteries of the brain, to the potential of proteins, and to the genome, which encodes the very language of life. We’ve been working alongside scientists from other leading organizations around the world to tackle important challenges in the fields of connectomics, protein function prediction, and genomics, and to make our innovations accessible and useful to the greater scientific community.

Neurobiology

One exciting application of our Google-developed ML methods was to explore how information travels through the neuronal pathways in the brains of zebrafish, which provides insight into how the fish engage in social behavior like swarming. In collaboration with researchers from the Max Planck Institute for Biological Intelligence, we were able to computationally reconstruct a portion of zebrafish brains imaged with 3D electron microscopy — an exciting advance in the use of imaging and computational pipelines to map out the neuronal circuitry in small brains, and another step forward in our long-standing contributions to the field of connectomics.

Reconstruction of the neural circuitry of a larval zebrafish brain, courtesy of the Max Planck Institute for Biological Intelligence.

The technical advances necessary for this work will have applications even beyond neuroscience. For example, to address the difficulty of working with such large connectomics datasets, we developed and released TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data. We look forward to seeing the ways it is used in other fields for the storage of large datasets.

We’re also using ML to shed light on how human brains perform remarkable feats like language by comparing human language processing and autoregressive deep language models (DLMs). For this study, a collaboration with colleagues at Princeton University and New York University Grossman School of Medicine, participants listened to a 30-minute podcast while their brain activity was recorded using electrocorticography. The recordings suggested that the human brain and DLMs share computational principles for processing language, including continuous next-word prediction, reliance on contextual embeddings, and calculation of post-onset surprise based on word match (we can measure how surprised the human brain is by the word, and correlate that surprise signal with how well the word is predicted by the DLM). These results provide new insights into language processing in the human brain, and suggest that DLMs can be used to reveal valuable insights about the neural basis of language.

Biochemistry

ML has also allowed us to make significant advances in understanding biological sequences. In 2022, we leveraged recent advances in deep learning to accurately predict protein function from raw amino acid sequences. We also worked in close collaboration with the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) to carefully assess model performance and add hundreds of millions of functional annotations to the public protein databases UniProt, Pfam/InterPro, and MGnify. Human annotation of protein databases can be a laborious and slow process and our ML methods enabled a giant leap forward — for example, increasing the number of Pfam annotations by a larger number than all other efforts during the past decade combined. The millions of scientists worldwide who access these databases each year can now use our annotations for their research.

Google Research contributions to Pfam exceed in size all expansion efforts made to the database over the last decade.

Although the first draft of the human genome was released in 2003, it was incomplete and had many gaps due to technical limitations in the sequencing technologies. In 2022 we celebrated the remarkable achievements of the Telomere-to-Telomere (T2T) Consortium in resolving these previously unavailable regions, including five full chromosome arms and nearly 200 million base pairs of novel DNA sequences, which are interesting and important for questions of human biology, evolution, and disease. Our open-source genomics variant caller, DeepVariant, was one of the tools used by the T2T Consortium to prepare their release of a complete 3.055 billion base pair sequence of a human genome. The T2T Consortium is also using our newer open-source method DeepConsensus, which provides on-device error correction for Pacific Biosciences long-read sequencing instruments, in their latest research toward comprehensive pan-genome resources that can represent the breadth of human genetic diversity.

Using quantum computing for new physics discoveries

When it comes to making scientific discoveries, quantum computing is still in its infancy, but has a lot of potential. We’re exploring ways of advancing the capabilities of quantum computing so that it can become a tool for scientific discovery and breakthroughs. In collaboration with physicists from around the world, we are also starting to use our existing quantum computers to create interesting new experiments in physics.

As an example of such experiments, consider the problem where a sensor measures something, and a computer then processes the data from the sensor. Traditionally, this means the sensor’s data is processed as classical information on our computers. Instead, one idea in quantum computing is to directly process quantum data from sensors. Feeding data from quantum sensors directly to quantum algorithms without going through classical measurements may provide a large advantage. In a recent Science paper written in collaboration with researchers from multiple universities, we show that quantum computing can extract information from exponentially fewer experiments than classical computing, as long as the quantum computer is coupled directly to the quantum sensors and is running a learning algorithm. This “quantum machine learning” can yield an exponential advantage in dataset size, even with today’s noisy intermediate-scale quantum computers. Because experimental data is often the limiting factor in scientific discovery, quantum ML has the potential to unlock the vast power of quantum computers for scientists. Even better, the insights from this work are also applicable to learning on the output of quantum computations, such as the output of quantum simulations that may otherwise be difficult to extract.

Even without quantum ML, a powerful application of quantum computers is to experimentally explore quantum systems that would be otherwise impossible to observe or simulate. In 2022, the Quantum AI team used this approach to observe the first experimental evidence of multiple microwave photons in a bound state using superconducting qubits. Photons typically do not interact with one another, and require an additional element of non-linearity to cause them to interact. The results of our quantum computer simulations of these interactions surprised us — we thought the existence of these bound states relied on fragile conditions, but instead we found that they were robust even to relatively strong perturbations that we applied.

Occupation probability versus discrete time step for n-photon bound states. We observe that the majority of the photons (darker colors) remain bound together.

Given the initial successes we have had in applying quantum computing to make physics breakthroughs, we are hopeful about the possibility of this technology to enable future groundbreaking discoveries that could have as significant a societal impact as the creation of transistors or GPS. The future of quantum computing as a scientific tool is exciting!

Acknowledgements

I would like to thank everyone who worked hard on the advances described in this post, including the Google Applied Sciences, Quantum AI, Genomics and Brain teams and their collaborators across Google Research and externally. Finally, I would like to thank the many Googlers who provided feedback in the writing of this post, including Lizzie Dorfman, Erica Brand, Elise Kleeman, Abe Asfaw, Viren Jain, Lucy Colwell, Andrew Carroll, Ariel Goldstein and Charina Chou.


Google Research, 2022 & beyond

This was the seventh blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models	Computer Vision	Multimodal Models
Generative Models	Responsible AI	ML & Computer Systems
Efficient Deep Learning	Algorithmic Advances	Robotics
Natural Sciences	Health*	Community Engagement

* Articles will be linked as they are released.

7 ways AI is already making your Pixel more helpful

February 20, 2023

by Google AI

Whether you’re using your Pixel to translate a foreign language, edit photos or take a phone call in a noisy area, AI is making everyday moments easier.

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

February 17, 2023

by Google AI

Posted by Parker Riley, Software Engineer, and Jan Botha, Research Scientist, Google Research

Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.

In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.

Challenge: Few-Shot Generalization

Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.

An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).

Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.

Data Collection

The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random.

The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence with the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20-30 terms that have regionally distinctive translations according to blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers from each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., bus). The same process was carried out independently for Mandarin.
The Entity bucket is populated in a similar way and concerns people, locations or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
1. The strong geographical association between Lisbon and Portugal might influence a model to generate a European Portuguese translation instead, e.g., by selecting “autocarro” rather than “ônibus”.
2. Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but would be semantically inaccurate, even in an otherwise fluent translation.
The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 randomly sampled articles from Wikipedia’s “featured” and “good” collections.

Evaluation Methodology

To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.
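
The scoring idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the MQM weighting used in the study. The severity labels and weights below (1.0 for a major error, 0.2 for a minor one) are hypothetical placeholders, chosen so that the score reads roughly as major errors per sentence, and the function names are invented for this example.

```python
from typing import Dict, List

# ASSUMED weights for illustration only. The real scheme is category-wise
# and is described in the MQM framework, not reproduced here.
ASSUMED_WEIGHTS: Dict[str, float] = {"major": 1.0, "minor": 0.2}


def mqm_score(errors_per_sentence: List[List[str]],
              weights: Dict[str, float] = ASSUMED_WEIGHTS) -> float:
    """Average weighted error count per sentence (lower is better)."""
    if not errors_per_sentence:
        raise ValueError("need at least one sentence")
    total = sum(weights[severity]
                for sentence_errors in errors_per_sentence
                for severity in sentence_errors)
    return total / len(errors_per_sentence)


def regional_gap(matched: List[List[str]], mismatched: List[List[str]]) -> float:
    """Score difference between region-mismatched and region-matched ratings."""
    return mqm_score(mismatched) - mqm_score(matched)


if __name__ == "__main__":
    # Each inner list holds the error severities one rater marked in one sentence.
    matched = [[], ["minor"]]
    mismatched = [["major", "major"], ["major", "major", "minor"]]
    print(regional_gap(matched, mismatched))
```

A positive gap means raters found more errors in translations aimed at the other region, which is the signal used above to confirm that the dataset captures region-specific phenomena.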

While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we discovered that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).

Metric			Pearson’s ρ
chrF			0.48
BLEU			0.58
BLEURT			0.65

Correlation between different automatic metrics and human judgements of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
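
For readers who want to repeat this kind of comparison on their own system outputs, the Pearson coefficient is straightforward to compute. The sketch below is a plain-Python implementation of the standard formula. The function name and the example numbers are illustrative and are not taken from the FRMT evaluation.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up scores: one automatic metric value and one human rating per item.
    metric_scores = [0.61, 0.72, 0.55, 0.80]
    human_scores = [3.0, 4.0, 2.5, 4.5]
    print(pearson(metric_scores, human_scores))
```

Because a lower MQM score is better while a higher score is better for metrics such as BLEURT, the human scores may need their sign flipped before comparison so that a higher correlation means better agreement.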

System Performance

Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.

Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).

    Translate the following texts from English to European Portuguese.
    English: [English example 1].
    European Portuguese: [correct translation 1].
    ...
    English: [input].
    European Portuguese: _____
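
A prompt in this shape can be assembled programmatically. The helper below is a minimal sketch that follows the template above. The function name, its parameters, and the example sentence pair are assumptions made for illustration, and the sketch does not call any model. It ends the prompt with the bare variety label so that a language model can continue from there.

```python
from typing import List, Tuple


def build_prompt(exemplars: List[Tuple[str, str]],
                 source_text: str,
                 variety: str = "European Portuguese") -> str:
    """Build a few-shot translation prompt for one target regional variety.

    exemplars: (English text, translation in the target variety) pairs.
    """
    lines = [f"Translate the following texts from English to {variety}."]
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{variety}: {translation}")
    lines.append(f"English: {source_text}")
    lines.append(f"{variety}:")  # left open for the model to complete
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative exemplar only, not drawn from the FRMT data.
    prompt = build_prompt(
        exemplars=[("The bus arrived.", "O autocarro chegou.")],
        source_text="In Lisbon, I often took the bus.",
    )
    print(prompt)
```

Switching the target region only requires changing the variety label and supplying exemplars in that variety, which is what makes the few-shot setup easy to extend.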

PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.

MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.

Conclusion

In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance for region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT have significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.

Acknowledgements

We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.



Copyright © 2019-2025 Vedere AI. All Rights Reserved.