A vision-language approach for foundational UI understanding

The computational understanding of user interfaces (UI) is a key step towards achieving intelligent UI behaviors. Previously, we investigated various UI modeling tasks, including widget captioning, screen summarization, and command grounding, that address diverse interaction scenarios such as automation and accessibility. We also demonstrated how machine learning can help user experience practitioners improve UI quality by diagnosing tappability confusion and providing insights for improving UI design. These works along with those developed by others in the field have showcased how deep neural networks can potentially transform end user experiences and the interaction design practice.

With these successes in addressing individual UI tasks, a natural question is whether we can obtain foundational understandings of UIs that can benefit specific UI tasks. As our first attempt to answer this question, we developed a multi-task model to address a range of UI tasks simultaneously. Although the work made some progress, a few challenges remain. Previous UI models heavily rely on UI view hierarchies (i.e., the structure or metadata of a mobile UI screen, like the Document Object Model for a webpage) that allow a model to directly acquire detailed information about UI objects on the screen (e.g., their types, text content and positions). This metadata has given previous models advantages over their vision-only counterparts. However, view hierarchies are not always accessible, and are often corrupted with missing object descriptions or misaligned structure information. As a result, despite the short-term gains, relying on view hierarchies may ultimately hamper model performance and applicability. In addition, previous models had to deal with heterogeneous information across datasets and UI tasks, which often resulted in complex model architectures that were difficult to scale or generalize across tasks.

In “Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus”, accepted for publication at ICLR 2023, we present a vision-only approach that aims to achieve general UI understanding completely from raw pixels. We introduce a unified approach to represent diverse UI tasks, the information for which can be universally represented by two core modalities: vision and language. The vision modality captures what a person would see from a UI screen, and the language modality can be natural language or any token sequences related to the task. We demonstrate that Spotlight substantially improves accuracy on a range of UI tasks, including widget captioning, screen summarization, command grounding and tappability prediction.

Spotlight Model

The Spotlight model input includes a tuple of three items: the screenshot, the region of interest on the screen, and the text description of the task. The output is a text description or response about the region of interest. This simple input and output representation is expressive enough to capture various UI tasks and allows scalable model architectures. The design also supports a spectrum of learning strategies and setups, from task-specific fine-tuning to multi-task learning and few-shot learning. The Spotlight model leverages existing architecture building blocks such as ViT and T5 that are pre-trained in the high-resourced, general vision-language domain, which allows us to build on top of the success of these general domain models.

Because UI tasks are often concerned with a specific object or area on the screen, which requires a model to be able to focus on the object or area of interest, we introduce a Focus Region Extractor to a vision-language model that enables the model to concentrate on the region in light of the screen context.

In particular, we design a Region Summarizer that acquires a latent representation of a screen region based on ViT encodings by using attention queries generated from the bounding box of the region (see paper for more details). Specifically, each coordinate (a scalar value, i.e., the left, top, right or bottom) of the bounding box, denoted as a yellow box on the screenshot, is first embedded via a multilayer perceptron (MLP) as a collection of dense vectors, and then fed to a Transformer model along with its coordinate-type embedding. The dense vectors and their corresponding coordinate-type embeddings are color coded to indicate their affiliation with each coordinate value. Coordinate queries then attend to screen encodings output by ViT via cross attention, and the final attention output of the Transformer is used as the region representation for the downstream decoding by T5.

A target region on the screen is summarized by using its bounding box to query into screen encodings from ViT via attentional mechanisms.
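
The Region Summarizer described above can be illustrated with a small NumPy sketch. This is not the paper's implementation: it collapses the Transformer into a single-head cross-attention layer, uses random weights in place of learned ones, and the sizes `D`, `K` and `HIDDEN`, as well as all function and parameter names, are assumptions chosen for illustration.

```python
import numpy as np

D = 16       # hidden width (hypothetical)
K = 2        # dense vectors produced per coordinate (hypothetical)
HIDDEN = 32  # MLP hidden width (hypothetical)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng):
    """Random stand-ins for weights that would normally be learned."""
    return {
        "w1": rng.normal(size=(1, HIDDEN)), "b1": np.zeros(HIDDEN),
        "w2": rng.normal(size=(HIDDEN, K * D)), "b2": np.zeros(K * D),
        "type_emb": rng.normal(size=(4, D)),  # one per left/top/right/bottom
        "wq": rng.normal(size=(D, D)),
        "wk": rng.normal(size=(D, D)),
        "wv": rng.normal(size=(D, D)),
    }


def region_summarizer(bbox, screen_enc, p):
    """bbox: (4,) [left, top, right, bottom] in [0, 1]; screen_enc: (N, D)."""
    coords = np.asarray(bbox, dtype=float).reshape(4, 1)
    # Embed each scalar coordinate with an MLP into K dense vectors.
    hidden = np.maximum(coords @ p["w1"] + p["b1"], 0.0)
    dense = (hidden @ p["w2"] + p["b2"]).reshape(4, K, D)
    # Tag every dense vector with the embedding of its coordinate type.
    queries = (dense + p["type_emb"][:, None, :]).reshape(4 * K, D)
    # Coordinate queries attend to the screen encodings (cross attention).
    q, k, v = queries @ p["wq"], screen_enc @ p["wk"], screen_enc @ p["wv"]
    attn = softmax(q @ k.T / np.sqrt(D))  # (4K, N) weights over patches
    return attn @ v, attn


rng = np.random.default_rng(0)
params = init_params(rng)
screen_enc = rng.normal(size=(9, D))  # stand-in for ViT patch encodings
region, attn = region_summarizer([0.1, 0.2, 0.4, 0.3], screen_enc, params)
print(region.shape, attn.shape)
```

Each row of `attn` is a distribution over screen patches for one coordinate query, which is the kind of quantity an attention heatmap would visualize. In the full model, the returned region representation would be passed on to the T5 decoder.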

Results

We pre-train the Spotlight model using two unlabeled datasets (an internal dataset based on C4 corpus and an internal mobile dataset) with 2.5 million mobile UI screens and 80 million web pages. We then separately fine-tune the pre-trained model for each of the four downstream tasks (captioning, summarization, grounding, and tappability). For widget captioning and screen summarization tasks, we report CIDEr scores, which measure how similar a model text description is to a set of references created by human raters. For command grounding, we report accuracy that measures the percentage of times the model successfully locates a target object in response to a user command. For tappability prediction, we report F1 scores that measure the model’s ability to tell tappable objects from untappable ones.
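
As a rough illustration of the two classification-style metrics above, the following sketch computes grounding accuracy and the tappability F1 score from predictions and labels. The function names and the toy data are hypothetical, and the code is not taken from the Spotlight evaluation pipeline. CIDEr is omitted because it needs n-gram statistics over a reference corpus.

```python
def accuracy(predictions, targets):
    """Fraction of examples where the predicted object matches the target."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def f1_score(predictions, targets, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(predictions, targets))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up): 1 = tappable, 0 = not tappable.
predicted = [1, 1, 0, 1, 0, 0]
labels = [1, 0, 0, 1, 1, 0]

tappability_f1 = f1_score(predicted, labels)
grounding_acc = accuracy(["btn_3", "btn_1"], ["btn_3", "btn_2"])
print(tappability_f1, grounding_acc)
```

With this toy data there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and so is the F1 score.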

In this experiment, we compare Spotlight with several baseline models. Widget Caption uses the view hierarchy and the image of each UI object to generate a text description for the object. Similarly, Screen2Words uses the view hierarchy and the screenshot as well as auxiliary features (e.g., app description) to generate a summary for the screen. In the same vein, VUT combines screenshots and view hierarchies for performing multiple tasks. Finally, the original Tappability model leverages object metadata from the view hierarchy and the screenshot to predict object tappability. Taperception, a follow-up model of Tappability, uses a vision-only tappability prediction approach. We examine two Spotlight model variants that differ in the size of the ViT building block: B/16 and L/16. As the table below shows, Spotlight L/16 exceeds the previous state of the art on all four UI modeling tasks, with especially large margins on captioning, summarization, and grounding.

Group      Model            Captioning   Summarization   Grounding   Tappability
Baselines  Widget Caption   97           -               -           -
           Screen2Words     -            61.3            -           -
           VUT              99.3         65.6            82.1        -
           Taperception     -            -               -           85.5
           Tappability      -            -               -           87.9
Spotlight  B/16             136.6        103.5           95.7        86.9
           L/16             141.8        106.7           95.8        88.4

We then pursue a more challenging setup where we ask the model to learn multiple tasks simultaneously, because a multi-task model can substantially reduce model footprint. As shown in the table below, our model still performs competitively in this setup.

Model               Captioning   Summarization   Grounding   Tappability
VUT multi-task      99.3         65.1            80.8        -
Spotlight B/16      140          102.7           90.8        89.4
Spotlight L/16      141.3        99.2            94.2        89.5

To understand how the Region Summarizer enables Spotlight to focus on a target region and relevant areas on the screen, we analyze the attention weights (which indicate where the model attention is on the screenshot) for both widget captioning and screen summarization tasks. In the figure below, for the widget captioning task, the model predicts “select Chelsea team” for the checkbox on the left side, highlighted with a red bounding box. We can see from its attention heatmap (which illustrates the distribution of attention weights) on the right that the model learns to attend to not only the target region of the check box, but also the text “Chelsea” on the far left to generate the caption. For the screen summarization example, the model predicts “page displaying the tutorial of a learning app” given the screenshot on the left. In this example, the target region is the entire screen, and the model learns to attend to important parts on the screen for summarization.

For the widget captioning task, the attention heatmap shows the model attending to the checkbox, i.e., the target object, and the text label on its left when generating a caption for the object. The red bounding box in the figure is for illustration purposes.
For the screen summarization task, where the target region encloses the entire screen, the attention heatmap shows the model attending to various locations on the screen that contribute to generating the summary.

Conclusion

We demonstrate that Spotlight outperforms previous methods that use both screenshots and view hierarchies as the input, and establishes state-of-the-art results on multiple representative UI tasks. These tasks range from accessibility and automation to interaction design and evaluation. Our vision-only approach for mobile UI understanding alleviates the need to use view hierarchy, allows the architecture to easily scale and benefits from the success of large vision-language models pre-trained for the general domain. Compared to recent large vision-language model efforts such as Flamingo and PaLI, Spotlight is relatively small, and our experiments show a trend that larger models yield better performance. Spotlight can be easily applied to more UI tasks and can potentially advance many interaction and user experience tasks.

Acknowledgment

We thank Mandar Joshi and Tao Li for their help in processing the web pre-training dataset, and Chin-Yi Cheng and Forrest Huang for their feedback and for proofreading the paper. Thanks to Tom Small for his help in creating animated figures in this post.

Pre-training generalist agents using offline reinforcement learning

Reinforcement learning (RL) algorithms can learn skills to solve decision-making tasks like playing games, enabling robots to pick up objects, or even optimizing microchip designs. However, running RL algorithms in the real world requires expensive active data collection. Pre-training on diverse datasets has proven to enable data-efficient fine-tuning for individual downstream tasks in natural language processing (NLP) and vision problems. In the same way that BERT or GPT-3 models provide general-purpose initialization for NLP, large RL–pre-trained models could provide general-purpose initialization for decision-making. So, we ask the question: Can we enable similar pre-training to accelerate RL methods and create a general-purpose “backbone” for efficient RL across various tasks?

In “Offline Q-learning on Diverse Multi-Task Data Both Scales and Generalizes”, to be published at ICLR 2023, we discuss how we scaled offline RL, which can be used to train value functions on previously collected static datasets, to provide such a general pre-training method. We demonstrate that Scaled Q-Learning using a diverse dataset is sufficient to learn representations that facilitate rapid transfer to novel tasks and fast online learning on new variations of a task, improving significantly over existing representation learning approaches and even Transformer-based methods that use much larger models.

Scaled Q-learning: Multi-task pre-training with conservative Q-learning

To provide a general-purpose pre-training approach, offline RL needs to be scalable, allowing us to pre-train on data across different tasks and utilize expressive neural network models to acquire powerful pre-trained backbones that can then be specialized to individual downstream tasks. We based our offline RL pre-training method on conservative Q-learning (CQL), a simple offline RL method that combines standard Q-learning updates with an additional regularizer that minimizes the value of unseen actions. With discrete actions, the CQL regularizer is equivalent to a standard cross-entropy loss, which is a simple, one-line modification on standard deep Q-learning. A few crucial design decisions made this possible:

  • Neural network size: We found that multi-game Q-learning required large neural network architectures. While prior methods often used relatively shallow convolutional networks, we found that models as large as a ResNet 101 led to significant improvements over smaller models.
  • Neural network architecture: To learn pre-trained backbones that are useful for new games, our final architecture uses a shared neural network backbone, with separate 1-layer heads outputting Q-values of each game. This design avoids interference between the games during pre-training, while still providing enough data sharing to learn a single shared representation. Our shared vision backbone also utilized a learned position embedding (akin to Transformer models) to keep track of spatial information in the game.
  • Representational regularization: Recent work has observed that Q-learning tends to suffer from representational collapse issues, where even large neural networks can fail to learn effective representations. To counteract this issue, we leverage our prior work to normalize the last layer features of the shared part of the Q-network. Additionally, we utilized a categorical distributional RL loss for Q-learning, which is known to provide richer representations that improve downstream task performance.
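
The equivalence between the CQL regularizer and a cross-entropy loss for discrete actions can be checked numerically. The sketch below assumes the common form of the regularizer, the log-sum-exp of Q-values over actions minus the Q-value of the dataset action, and pairs it with a plain squared TD error for simplicity, whereas the method described here uses a categorical distributional loss. The weight `alpha` and all function names are hypothetical.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log-sum-exp over the last (action) axis."""
    m = x.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=-1, keepdims=True)))[..., 0]


def cql_regularizer(q_values, actions):
    """Mean of logsumexp_a Q(s, a) - Q(s, a_data) over the batch."""
    q_data = q_values[np.arange(len(actions)), actions]
    return float(np.mean(logsumexp(q_values) - q_data))


def cross_entropy(logits, labels):
    """Standard softmax cross-entropy with integer labels."""
    log_probs = logits - logsumexp(logits)[:, None]
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))


def cql_loss(q_values, actions, td_targets, alpha=0.05):
    """Squared TD error plus the weighted regularizer (alpha is hypothetical)."""
    q_data = q_values[np.arange(len(actions)), actions]
    td_error = float(np.mean((q_data - td_targets) ** 2))
    return td_error + alpha * cql_regularizer(q_values, actions)


rng = np.random.default_rng(0)
q_values = rng.normal(size=(4, 3))  # Q(s, a) for 4 states and 3 actions
actions = np.array([0, 2, 1, 1])    # actions stored in the offline dataset
td_targets = rng.normal(size=4)     # stand-in for bootstrapped TD targets

reg = cql_regularizer(q_values, actions)
ce = cross_entropy(q_values, actions)
loss = cql_loss(q_values, actions, td_targets)
print(reg, ce, loss)
```

Treating the Q-values as logits and the dataset action as the label, `reg` and `ce` agree up to floating-point error. Minimizing the regularizer therefore pushes down the values of actions that do not appear in the data relative to the one that does.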

The multi-task Atari benchmark

We evaluate our approach for scalable offline RL on a suite of Atari games, where the goal is to train a single RL agent to play a collection of games using heterogeneous data from low-quality (i.e., suboptimal) players, and then use the resulting network backbone to quickly learn new variations in pre-training games or completely new games. Training a single policy that can play many different Atari games is difficult enough even with standard online deep RL methods, as each game requires a different strategy and different representations. In the offline setting, some prior works, such as multi-game decision transformers, proposed to dispense with RL entirely, and instead utilize conditional imitation learning in an attempt to scale with large neural network architectures, such as transformers. However, in this work, we show that this kind of multi-game pre-training can be done effectively via RL by employing CQL in combination with the few careful design decisions described above.

Scalability on training games

We evaluate the Scaled Q-Learning method’s performance and scalability using two data compositions: (1) near optimal data, consisting of all the training data appearing in replay buffers of previous RL runs, and (2) low quality data, consisting of data from the first 20% of the trials in the replay buffer (i.e., only data from highly suboptimal policies). In our results below, we compare Scaled Q-Learning with an 80-million parameter model to multi-game decision transformers (DT) with either 40-million or 80-million parameter models, and a behavioral cloning (imitation learning) baseline (BC). We observe that Scaled Q-Learning is the only approach that improves over the offline data, attaining about 80% of human normalized performance.
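
For readers unfamiliar with the metric, human-normalized performance in the Atari literature conventionally rescales a raw game score so that a random policy maps to 0 and a human reference player maps to 1. The sketch below shows that convention. The post does not spell out the exact formula or how scores are aggregated across games, so treat this as an assumption, and note that the numbers are made up for illustration.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Rescale a raw score so that random play is 0.0 and human play is 1.0."""
    return (agent_score - random_score) / (human_score - random_score)


# Hypothetical raw scores for a single game (not taken from the paper).
random_score = 2.0
human_score = 30.0
agent_score = 24.4

normalized = human_normalized_score(agent_score, random_score, human_score)
print(f"{normalized:.0%} of human-normalized performance")
```

Under this convention, a value of 0.8 means the agent closes 80% of the gap between random play and the human reference on that game.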

Further, as shown below, Scaled Q-Learning not only improves performance but also enjoys favorable scaling properties: just as the performance of pre-trained language and vision models improves as network sizes get bigger, following what is typically referred to as “power-law scaling”, the performance of Scaled Q-Learning improves in a similar way. While this may be unsurprising, this kind of scaling has been elusive in RL, with performance often deteriorating with larger model sizes. This suggests that Scaled Q-Learning in combination with the above design choices better unlocks the ability of offline RL to utilize large models.

Fine-tuning to new games and variations

To evaluate fine-tuning from this offline initialization, we consider two settings: (1) fine-tuning to a new, entirely unseen game with a small amount of offline data from that game, corresponding to 2M transitions of gameplay, and (2) fine-tuning to a new variant of the games with online interaction. The fine-tuning from offline gameplay data is illustrated below. Note that this condition is generally more favorable to imitation-style methods, such as Decision Transformer and behavioral cloning, since the offline data for the new games is of relatively high quality. Nonetheless, we see that in most cases Scaled Q-learning improves over alternative approaches (80% on average), as well as dedicated representation learning methods, such as MAE or CPC, which only use the offline data to learn visual representations rather than value functions.

In the online setting, we see even larger improvements from pre-training with Scaled Q-learning. In this case, representation learning methods like MAE yield minimal improvement during online RL, whereas Scaled Q-Learning can successfully integrate prior knowledge about the pre-training games to significantly improve the final score after 20k online interaction steps.

These results demonstrate that pre-training generalist value function backbones with multi-task offline RL can significantly boost performance of RL on downstream tasks, both in offline and online mode. Note that these fine-tuning tasks are quite difficult: the various Atari games, and even variants of the same game, differ significantly in appearance and dynamics. For example, the target blocks in Breakout disappear in the variation of the game as shown below, making control difficult. However, the success of Scaled Q-learning, particularly as compared to visual representation learning techniques, such as MAE and CPC, suggests that the model is in fact learning some representation of the game dynamics, rather than merely providing better visual features.

Fine-tuning with online RL for variants of the game Freeway, Hero, and Breakout. The new variant used in fine-tuning is shown in the bottom row of each figure, the original game seen in pre-training is in the top row. Fine-tuning from Scaled Q-Learning significantly outperforms MAE (a visual representation learning method) and learning from scratch with single-game DQN.

Conclusion and takeaways

We presented Scaled Q-Learning, a pre-training method for scaled offline RL that builds on the CQL algorithm, and demonstrated how it enables efficient offline RL for multi-task training. This work made initial progress towards enabling more practical real-world training of RL agents as an alternative to costly and complex simulation-based pipelines or large-scale experiments. Perhaps in the long run, similar work will lead to generally capable pre-trained RL agents that develop broadly applicable exploration and interaction skills from large-scale offline pre-training. Validating these results on a broader range of more realistic tasks, in domains such as robotics (see some initial results) and NLP, is an important direction for future research. Offline RL pre-training has a lot of potential, and we expect that we will see many advances in this area in future work.

Acknowledgements

This work was done by Aviral Kumar, Rishabh Agarwal, Xinyang Geng, George Tucker, and Sergey Levine. Special thanks to Sherry Yang, Ofir Nachum, and Kuang-Huei Lee for help with the multi-game decision transformer codebase for evaluation and the multi-game Atari benchmark, and Tom Small for illustrations and animation.

Google Research, 2022 & beyond: Health

(This is Part 8 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Google’s focus on AI stems from the conviction that this transformational technology will benefit society through its capacity to assist, complement, and empower people in almost every field and sector. In no area is the magnitude of this opportunity greater than in the spheres of healthcare and medicine. Commensurate with our mission to demonstrate these societal benefits, Google Research’s programs in applied machine learning (ML) have helped place Alphabet among the top five most impactful corporate research institutions in the health and life sciences publications on the Nature Impact Index in every year from 2019 through 2022.

Our Health research publications have had broad impact, spanning the fields of biomarkers, consumer sensors, dermatology, endoscopy, epidemiology, medicine, genomics, oncology, ophthalmology, pathology, public & environmental health, and radiology. Today we examine three specific themes that came to the fore in the last year:

  • Criticality of technology partnerships
  • Shift towards mobile medicine
  • Generative ML in Health

In each section, we emphasize the importance of a measured and collaborative approach to innovation in health. Unlike the “launch and iterate” approach typical in consumer product development, applying ML to health requires thoughtful assessment, ecosystem awareness, and rigorous testing. All healthcare technologies must demonstrate to regulators that they are safe and effective prior to deployment and need to meet rigorous patient privacy and performance monitoring standards. But ML systems, as new entrants to the field, additionally must discover their best uses in the health workflows and earn the trust of healthcare professionals and patients. This domain-specific integration and validation work is not something tech companies should embark upon alone; they should do so only in close collaboration with expert health partners.

Criticality of technology partnerships

Responsible innovation requires the patience and sustained investment to collectively follow the long arc from primary research to human impact. In our own journey to promote the use of ML to prevent blindness in underserved diabetic populations, six years elapsed between our publication of the primary algorithmic research, and the recent deployment study demonstrating the real-world accuracy of the integrated ML solution in a community-based screening setting. Fortunately, we have found that we can radically accelerate this journey from benchtop-ML to AI-at-the-bedside with thoughtfully constructed technology partnerships.

The need for accelerated release of health-related ML technologies is apparent, for example, in oncology. Breast cancer and lung cancer are two of the most common cancer types, and for both, early detection is key. If ML can yield greater accuracy and expanded availability of screening for these cancers, patient outcomes will improve — but the longer we wait to deploy these advances, the fewer people will be helped. Partnership can allow new technologies to safely reach patients with less delay — established med-tech companies can integrate new AI capabilities into existing product suites, seek the appropriate regulatory clearances, and use their existing customer base to rapidly deploy these technologies.

We’ve seen this play out firsthand. Just two and a half years after sharing our primary research using ML to improve breast cancer screening, we partnered with iCAD, a leading purveyor of mammography software, to begin integrating our technology into their products. We see this same accelerated pattern in translating our research on deep learning for low-dose CT scans to lung cancer screening workflows through our partnership with RadNet’s Aidence.

Genomics is another area where partnership has proven a powerful accelerant for ML technology. This past year, we collaborated with Stanford University to rapidly diagnose genetic disease by combining novel sequencing technologies and ML to sequence a patient’s entire genome in record-setting time, allowing life-saving interventions. Separately, we announced a partnership with Pacific Biosciences to further advance genomic technologies in research and the clinic by layering our ML techniques on top of their sequencing methods, building on our long running open source projects in deep learning genomics. Later in the same year PacBio announced Revio, a new genome sequencing tool powered by our technology.

Diagnosing a rare genetic disease may depend on finding a handful of novel mutations out of billions of base pairs in the patient’s genome.

Partnerships between med-tech companies and AI-tech companies can accelerate translation of technology, but these partnerships are a complement to, not a substitute for, open research and open software that moves the entire field forward. For example, within our medical imaging portfolio, we introduced a new approach to simplify transfer learning for chest x-ray model development, methods to accelerate the life-cycle of ML systems for medical imaging via robust and efficient self-supervision, and techniques to make medical imaging systems more robust to outliers — all within 2022.

Moving forward, we believe this mix of scientific openness and cross-industry partnerships will be a critical catalyst in realizing the benefits of human-centered AI in healthcare and medicine.

Shift towards mobile medicine

In healthcare overall, and recapitulated in ML research in health applications, there has been a shift in emphasis away from concentrated centralized care (e.g., hospitalizations) and towards distributed care (e.g., reaching patients in their communities). Thus, we’re working to develop mobile ML-solutions that can be brought to the patient, rather than bringing the patient to the (ML-powered) clinic. In 2021, we shared some of our early work using smartphone cameras to measure heart rate and to help identify skin conditions. In 2022, we shared new research on the potential for smartphone camera selfies to assess cardiovascular health and metabolic risks to eyesight and the potential for smartphone microphones held to the chest to help interpret heart and lung sounds.

These examples all use the sensors that already exist on every smartphone. While these advances are valuable, there is still great potential in extending mobile health capabilities by developing new sensing technologies. One of our most exciting research projects in this area leverages new sensors that easily connect to modern smartphones to enable mobile maternal ultrasound in under-resourced communities.

Each year, complications from pregnancy & childbirth contribute to 295,000 maternal deaths and 2.4 million neonatal deaths, disproportionately impacting low income populations globally. Obstetric ultrasound is an important component of quality antenatal care, but up to 50% of women in low-and-middle-income countries receive no ultrasound screening during pregnancy. Innovators in ultrasound hardware have made rapid progress towards low-cost, handheld, portable ultrasound probes that can be driven with just a smartphone, but there’s a critical missing piece — a shortage of field technicians with the skills and expertise to operate the ultrasound probe and interpret its shadowy images. Remote interpretation is feasible of course, but is impractical in settings with unreliable or slow internet connectivity.

With the right ML-powered mobile ultrasounds, providers such as midwives, nurses, and community health workers could have the potential to bring obstetric ultrasound to those most in need and catch problems before it’s too late. Previous work had shown that convolutional neural networks (CNNs) could interpret ultrasounds acquired by trained sonographers using a standardized acquisition protocol. Recognizing this opportunity for AI to unblock access to potentially lifesaving information, we’ve spent the last couple of years working in collaboration with academic partners and researchers in the US and Zambia to improve and expand the ability to automatically interpret ultrasound video captures acquired by simply sweeping an ultrasound probe across the mother’s abdomen, a procedure that can easily be taught to non-experts.

This ultrasound acquisition procedure can be performed by novices with a few hours of ultrasound training.

Using just a low cost, battery-powered ultrasound device and a smartphone, the accuracy of this method is on par with existing clinical standards for professional sonographers to estimate gestational age and fetal malpresentation.

The accuracy of this AI-enabled procedure is on par with the clinical standard for estimating gestational age.

We are in the early stages of a widespread transformation in portable medical imaging. In the future, ML-powered mobile ultrasound will augment the phone’s built-in sensors to allow in-the-field triage and screening for a wide range of medical issues, all with minimal training, extending access to care for millions.

Generative ML in Health

As the long arc of the application of ML to health plays out, we expect generative modeling to settle into a role complementary to the pattern recognition systems that are now relatively commonplace. In the past we’ve explored the suitability of generative image models in data augmentation, discussed how generative models might be used to capture interactions among correlated clinical events, and even used them to generate realistic, but entirely synthetic, electronic medical records for research purposes.

Generating synthetic data from the original data with EHR-Safe.

Any discussion of today’s outlook on applied generative modeling would be incomplete without mention of recent developments in the field of large language models (LLMs). Nearly a decade of research in the making, publicly available demonstrations of text synthesis via generative recurrent neural networks have captured the world’s imagination. These technologies undoubtedly have real-world applications; in fact, Google was among the first to deploy earlier variants of these networks in live consumer products. But when considering their applications to health, we must again return to our mantra of measurement: we have a fundamental responsibility to test technologies responsibly and proceed with caution. The gravity of building an ML system that might one day impact real people with real health issues should not be underestimated.

To that end, in December of last year we published a pre-print on LLMs and the encoding of clinical knowledge which (1) collated and expanded benchmarks for evaluating automated medical question answering systems, and (2) introduced our own research-grade medical question answering LLM, Med-PaLM. For example, if one asked Med-PaLM, “Does stress cause nosebleeds?” the LLM would generate a response explaining that yes, stress can cause nosebleeds, and detail some possible mechanisms. The purpose of Med-PaLM is to allow researchers to experiment with and improve upon the representation, retrieval, and communication of health information by LLMs, but it is not a finished medical question answering product.

We were excited to report that Med-PaLM substantially outperformed other systems on these benchmarks, across the board. That said, a critical take-away of our paper is that merely receiving a “passing” mark on a set of medical exam questions (which ours and some other ML systems do) still falls well short of the safety and accuracy required to support real-world use for medical question answering. We expect that progress in this area will be brisk — but that much like our journey bringing CNNs to medical imaging, the maturation of LLMs for applications in health will require further research, partnership, care, and patience.

Our model, Med-PaLM, obtains state-of-the-art performance on the MedQA USMLE dataset, exceeding the previous best by 7%.

Concluding thoughts

We expect all these trends to continue, and perhaps even accelerate, in 2023. In a drive to more efficiently map the arc from innovation to impact in AI for healthcare, we will see increased collaboration between academic, med-tech, AI-tech, and healthcare organizations. This is likely to interact positively with the measured, but nonetheless transformational, expansion of the role of phones and mobile sensors in the provisioning of care, potentially well beyond what we presently imagine telehealth to be. And of course, it’s hard to be in the field of AI these days, and not be excited at the prospects for generative AI and large language models. But particularly in the health domain, it is essential that we use the tools of partnership, and the highest standards of testing to realize this promise. Technology will keep changing, and what we know about human health will keep changing too. What will remain the same is the people caring for each other, and trying to do things better than before. We are excited about the role AI can play in improving healthcare in years to come.

Google Research, 2022 & beyond

This was the seventh blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Suppressing quantum errors by scaling a surface code logical qubit

Many years from today, scientists will be able to use fault-tolerant quantum computers for large-scale computations with applications across science and industry. These quantum computers will be much bigger than today, consisting of millions of coherent quantum bits, or qubits. But there’s a catch — these basic building blocks must be good enough or the systems will be overrun with errors.

Currently, the error rates of the qubits on our 3rd generation Sycamore processor are typically between 1 in 10,000 and 1 in 100. Through our work and that of others, we understand that developing large-scale quantum computers will require far lower error rates. We will need rates in the range of 1 in 10⁹ to 1 in 10⁶ to run quantum circuits that can solve industrially relevant problems.

So how do we get there, knowing that squeezing three to six orders of magnitude of better performance from our current physical qubits is unlikely? Our team has created a roadmap that has directed our research for the last several years, improving the performance of our quantum computers in gradual steps toward a fault-tolerant quantum computer.

Roadmap for building a useful error-corrected quantum computer with key milestones. We are currently building one logical qubit that we will scale in the future.

Today, in “Suppressing Quantum Errors by Scaling a Surface Code Logical Qubit”, published in Nature, we are announcing that we have reached the second milestone on our roadmap. Our experimental results demonstrate a prototype of the basic unit of an error-corrected quantum computer known as a logical qubit, with performance nearing the regime that enables scalable fault-tolerant quantum computing.

A paradigm shift: from physical qubits to logical qubits

Quantum error correction (QEC) represents a paradigm shift from today’s quantum computing, where each physical qubit on the processor acts as a unit of computation. It provides the recipe to reach low errors by trading many good qubits for an excellent one: information is encoded across several physical qubits to construct a single logical qubit that is more resilient and capable of running large-scale quantum algorithms. Under the right conditions, the more physical qubits used to build a logical qubit, the better that logical qubit becomes.

However, this will not work if the added errors from each additional physical qubit outweigh the benefits of QEC. Until now, the high physical error rates have always won out.

To that end, we use a particular error-correcting code called a surface code and show for the first time that increasing the size of the code decreases the error rate of the logical qubit. A first-ever for any quantum computing platform, this was achieved by painstakingly mitigating many error sources as we scaled from 17 to 49 physical qubits. This work is evidence that with enough care, we can produce the logical qubits necessary for a large-scale error-corrected quantum computer.

Quantum error correction with surface codes

How does an error-correcting code protect information? Take a simple example from classical communication: Bob wants to send Alice a single bit that reads “1” across a noisy communication channel. Recognizing that the message is lost if the bit flips to “0”, Bob instead sends three bits: “111”. If one erroneously flips, Alice could take a majority vote (a simple error-correcting code) of all the received bits and still understand the intended message. Repeating the information more than three times — increasing the “size” of the code — would enable the code to tolerate more individual errors.
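The majority-vote scheme in the Bob and Alice example can be sketched in a few lines of Python. This is only an illustration of the classical repetition code described above; the function names `encode`, `decode`, and `max_correctable_flips` are made up for this sketch.

```python
def encode(bit: int, n: int = 3) -> list[int]:
    """Repeat one bit n times; n should be odd so a majority always exists."""
    return [bit] * n


def decode(received: list[int]) -> int:
    """Majority vote: return 1 if more than half of the received bits are 1."""
    return int(sum(received) > len(received) / 2)


def max_correctable_flips(n: int) -> int:
    """A majority vote over n bits survives any (n - 1) // 2 flips."""
    return (n - 1) // 2


sent = encode(1)            # Bob sends [1, 1, 1]
received = [1, 0, 1]        # the channel flips the middle bit
print(decode(received))     # Alice still recovers 1
print(max_correctable_flips(3), max_correctable_flips(5))
```

Increasing `n` from 3 to 5 raises the number of tolerated flips from one to two, which is the sense in which a larger code is more robust.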

Many physical qubits on a quantum processor acting as one logical qubit in an error-correcting code called a surface code.

A surface code takes this principle and imagines a practical quantum implementation. It has to satisfy two additional constraints. First, the surface code must be able to correct not just bit flips, taking a qubit from |0⟩ to |1⟩, but also phase flips. This error is unique to quantum states and transforms a qubit in a superposition state, for example from “|0⟩ + |1⟩” to “|0⟩ – |1⟩”. Second, checking the qubits’ states would destroy their superpositions, so one needs a way of detecting errors without measuring the states directly.

To address these constraints, we arrange two types of qubits on a checkerboard. “Data” qubits on the vertices make up the logical qubit, while “measure” qubits at the center of each square are used for so-called “stabilizer measurements.” These measurements tell us whether the qubits are all the same, as desired, or different, signaling that an error occurred, without actually revealing the value of the individual data qubits.

We tile two types of stabilizer measurements in a checkerboard pattern to protect the logical data from bit- and phase-flips. If some of the stabilizer measurements register an error, then correlations in the stabilizer measurements are used to identify which error(s) occurred and where.
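A rough classical analogy may help show how parity-style checks locate an error without revealing the stored value. The sketch below uses the three-bit repetition code rather than a real surface code: it ignores phase flips and measure qubits entirely, and the helper names `syndrome` and `locate_single_flip` are hypothetical.

```python
def syndrome(bits: list[int]) -> tuple[int, ...]:
    """Parity of each neighbouring pair: 0 means 'same', 1 means 'different'."""
    return tuple(a ^ b for a, b in zip(bits, bits[1:]))


def locate_single_flip(s: tuple[int, int]):
    """Map a 3-bit-code syndrome to the index of the flipped bit (None = no error)."""
    return {(0, 0): None, (1, 0): 0, (1, 1): 1, (0, 1): 2}[s]


# Both codewords give the same "all parities agree" syndrome,
# so the checks say nothing about which value is stored.
print(syndrome([0, 0, 0]), syndrome([1, 1, 1]))

# A flip of the middle bit produces the same syndrome for either stored value.
print(syndrome([0, 1, 0]), syndrome([1, 0, 1]))
print(locate_single_flip(syndrome([1, 0, 1])))
```

The pattern of which checks fire identifies where the error sits, which mirrors how correlations between stabilizer measurements are used in the surface code.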

Surface-code QEC. Data qubits (yellow) are at the vertices of a checkerboard. Measure qubits at the center of each square are used for stabilizer measurements (blue squares). Dark blue squares check for bit-flip errors, while light blue squares check for phase-flip errors. Left: A phase-flip error. The two nearest light blue stabilizer measurements register the error (light red). Right: A bit-flip error. The two nearest dark blue stabilizer measurements register the error (dark red).

Just as Bob’s message to Alice in the example above became more robust against errors with increasing code size, a larger surface code better protects the logical information it contains. The surface code can withstand a number of bit- and phase-flip errors each equal to less than half the distance, where the distance is the number of data qubits that span the surface code in either dimension.
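As a small worked example of how distance relates to code size, the sketch below assumes a layout with d × d data qubits and d × d − 1 measure qubits. The post does not state this formula, but it reproduces the qubit counts quoted in this post (17, 49, and 577). The function names are illustrative only.

```python
def surface_code_qubits(d: int) -> tuple[int, int, int]:
    """Return (data, measure, total) qubit counts for an assumed distance-d layout."""
    data = d * d
    measure = d * d - 1
    return data, measure, data + measure


def correctable_errors(d: int) -> int:
    """Largest whole number of errors that is less than half the distance."""
    return (d - 1) // 2


for d in (3, 5, 17):
    data, measure, total = surface_code_qubits(d)
    print(f"d={d}: {data} data + {measure} measure = {total} qubits, "
          f"tolerates {correctable_errors(d)} error(s) of each type")
```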

But here’s the problem: every individual physical qubit is prone to errors, so the more qubits in a code, the more opportunity for errors. We want the higher protection offered by QEC to outweigh the increased opportunities for errors as we increase the number of qubits. For this to happen, the physical qubits must have errors below the so-called “fault-tolerant threshold.” For the surface code, this threshold is quite low, so low that it hasn’t been experimentally feasible until recently. We are now on the cusp of reaching this coveted regime.

Making and controlling high-quality physical qubits

Entering the regime where QEC improves with scale required improving every aspect of our quantum computers, from nanofabrication of the physical qubits to the optimized control of the full quantum system. These experiments ran on a state-of-the-art 3rd generation Sycamore processor architecture optimized for QEC using the surface code with improvements across the board:

  • Increased qubit relaxation and dephasing lifetimes through an improved fabrication process and environmental noise reduction near the quantum processor.
  • Lowered cross-talk between all physical qubits during parallel operation by optimizing quantum processor circuit design and nanofabrication.
  • Reduced drift and improved qubit control fidelity through upgraded custom electronics.
  • Implemented faster and higher-fidelity readout and reset operations compared with previous generations of the Sycamore processor.
  • Reduced calibration errors by extensively modeling the full quantum system and employing better system-optimization algorithms.
  • Developed context-aware and fully parallel calibrations to minimize drift and optimize control parameters for QEC circuits.
  • Enhanced dynamical decoupling protocols to protect physical qubits from noise and cross-talk during idling operations.

Running surface code circuits

With these upgrades in place, we ran experiments to measure the ratio (𝚲3,5) of the logical error rate of a distance-3 surface code (ε3) with 17 qubits to that of a distance-5 surface code (ε5) with 49 qubits: 𝚲3,5 = ε3 / ε5.

Comparison of logical fidelity (defined as 1-ε) between distance-3 (d=3) and distance-5 (d=5) surface codes. The distance-5 code contains four possible distance-3 arrangements, with one example shown in the red outline (left). As improvements were made, the d=5 fidelity increased faster than that of the d=3, eventually overtaking the distance-3 code, as shown in the top-right data points (right), whose average lies slightly to the left of the ε3 = ε5 line.

The results of these experiments are shown above on the right. Continued improvements over several months allowed us to reduce the logical errors of both grids, leading to the distance-5 grid (ε5 = 2.914%) outperforming the distance-3 grids (ε3 = 3.028%) by 4% (𝚲3,5 = 1.04) with 5𝛔 confidence. While this might seem like a small improvement, it’s important to emphasize that the result represents a first for the field since Peter Shor’s 1995 QEC proposal. A larger code outperforming a smaller one is a key signature of QEC, and all quantum computing architectures will need to pass this hurdle to realize a path to the low errors that are necessary for quantum applications.
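The reported ratio can be checked directly from the two logical error rates quoted above. The snippet is plain arithmetic on those published numbers; the variable names are chosen for this sketch.

```python
eps3 = 0.03028   # logical error rate of the distance-3 codes (3.028%)
eps5 = 0.02914   # logical error rate of the distance-5 code (2.914%)

lam_3_5 = eps3 / eps5              # ratio defined in the text
fidelity3 = 1 - eps3               # logical fidelity, defined as 1 - ε
fidelity5 = 1 - eps5

print(f"Lambda_3,5 = {lam_3_5:.2f}")
print(f"fidelity d=3: {fidelity3:.5f}, d=5: {fidelity5:.5f}")
```

A ratio above 1 means the larger code has the lower logical error rate, which is the signature the experiment set out to demonstrate.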

The path forward

These results indicate that we are entering a new era of practical QEC. The Google Quantum AI team has spent the last few years thinking about how we define success in this new era, and how we measure progress along the way.

The ultimate goal is to demonstrate a pathway to achieving the low errors needed for using quantum computers in meaningful applications. To this end, our target remains achieving logical error rates of 1 in 10⁶ or lower per cycle of QEC. In the figure below on the left, we outline the path that we anticipate taking to reach this target. As we continue improving our physical qubits (and hence the performance of our logical qubits), we expect to gradually increase 𝚲 from close to 1 in this work to larger numbers. The figure below shows that a value of 𝚲 = 4 and a code distance of 17 (577 physical qubits with good enough quality) will yield a logical error rate below our target of 1 in 10⁶.
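To see why a larger 𝚲 matters so much, here is a minimal sketch that extrapolates the logical error rate under one assumption the post does not spell out: that the same ratio 𝚲 applies every time the code distance grows by 2. The starting error rate of 1% at distance 3 is a hypothetical value chosen for illustration, not a number from the post, and `projected_error` is a made-up helper.

```python
def projected_error(eps_ref: float, d_ref: int, lam: float, d: int) -> float:
    """Extrapolate a logical error rate from distance d_ref to distance d,
    assuming each increase of the distance by 2 divides the error by lam."""
    if d < d_ref or (d - d_ref) % 2:
        raise ValueError("d must be >= d_ref and differ from it by an even number")
    steps = (d - d_ref) // 2
    return eps_ref / lam ** steps


# Hypothetical starting point: 1% logical error per cycle at distance 3.
for lam in (1.04, 2, 4):
    eps17 = projected_error(0.01, 3, lam, 17)
    print(f"Lambda={lam}: projected error at d=17 is about {eps17:.1e}")
```

Under these assumptions, a ratio close to 1 barely reduces the error over seven distance steps, while a ratio of 4 brings it below 1 in 10⁶, in line with the trend the paragraph describes.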

While this result is still a few years out, we have an experimental technique to probe error rates this low with today’s hardware, albeit in limited circumstances. While two-dimensional surface codes allow us to correct both bit- and phase-flip errors, we can also construct one-dimensional repetition codes that are only able to correct one type of error, with relaxed requirements. On the right below, we show that a distance-25 repetition code can reach error rates per cycle close to 1 in 10⁶. At such low errors, we see new kinds of error mechanisms that are not yet observable with our surface codes. By controlling for these error mechanisms, we can improve repetition codes to error rates near 1 in 10⁷.

Left: Expected progression as we improve performance (quantified by 𝚲) and scale (quantified by code distance) for surface codes. Right: Experimentally measured logical error rates per cycle versus the distance of one-dimensional repetition codes and two-dimensional surface codes.

Reaching this milestone reflects three years of focused work by the entire Google Quantum AI team following our demonstration of a quantum computer outperforming a classical computer. In our march toward building fault-tolerant quantum computers, we will continue to use the target error rates in the figure above to measure our progress. With further improvements toward our next milestone, we anticipate entering the fault-tolerant regime, where we can exponentially suppress logical errors and unlock the first useful error-corrected quantum applications. In the meantime, we continue to explore various ways of solving problems using quantum computers in topics ranging from condensed matter physics to chemistry, machine learning, and materials science.

Google Research, 2022 & beyond: Natural sciences

(This is Part 7 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

It’s an incredibly exciting time to be a scientist. With the amazing advances in machine learning (ML) and quantum computing, we now have powerful new tools that enable us to act on our curiosity, collaborate in new ways, and radically accelerate progress toward breakthrough scientific discoveries.

Since joining Google Research eight years ago, I’ve had the privilege of being part of a community of talented researchers fascinated by applying cutting-edge computing to push the boundaries of what is possible in applied science. Our teams are exploring topics across the physical and natural sciences. So, for this year’s blog post I want to focus on high-impact advances we’ve made recently in the fields of biology and physics, from helping to organize the world’s protein and genomics information to benefit people’s lives to improving our understanding of the nature of the universe with quantum computers. We are inspired by the great potential of this work.

Using machine learning to unlock mysteries in biology

Many of our researchers are fascinated by the extraordinary complexity of biology, from the mysteries of the brain, to the potential of proteins, and to the genome, which encodes the very language of life. We’ve been working alongside scientists from other leading organizations around the world to tackle important challenges in the fields of connectomics, protein function prediction, and genomics, and to make our innovations accessible and useful to the greater scientific community.

Neurobiology

One exciting application of our Google-developed ML methods was to explore how information travels through the neuronal pathways in the brains of zebrafish, which provides insight into how the fish engage in social behavior like swarming. In collaboration with researchers from the Max Planck Institute for Biological Intelligence, we were able to computationally reconstruct a portion of zebrafish brains imaged with 3D electron microscopy — an exciting advance in the use of imaging and computational pipelines to map out the neuronal circuitry in small brains, and another step forward in our long-standing contributions to the field of connectomics.

Reconstruction of the neural circuitry of a larval zebrafish brain, courtesy of the Max Planck Institute for Biological Intelligence.

The technical advances necessary for this work will have applications even beyond neuroscience. For example, to address the difficulty of working with such large connectomics datasets, we developed and released TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data. We look forward to seeing the ways it is used in other fields for the storage of large datasets.

We’re also using ML to shed light on how human brains perform remarkable feats like language by comparing human language processing and autoregressive deep language models (DLMs). For this study, a collaboration with colleagues at Princeton University and New York University Grossman School of Medicine, participants listened to a 30-minute podcast while their brain activity was recorded using electrocorticography. The recordings suggested that the human brain and DLMs share computational principles for processing language, including continuous next-word prediction, reliance on contextual embeddings, and calculation of post-onset surprise based on word match (we can measure how surprised the human brain is by the word, and correlate that surprise signal with how well the word is predicted by the DLM). These results provide new insights into language processing in the human brain, and suggest that DLMs can be used to reveal valuable insights about the neural basis of language.

Biochemistry

ML has also allowed us to make significant advances in understanding biological sequences. In 2022, we leveraged recent advances in deep learning to accurately predict protein function from raw amino acid sequences. We also worked in close collaboration with the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) to carefully assess model performance and add hundreds of millions of functional annotations to the public protein databases UniProt, Pfam/InterPro, and MGnify. Human annotation of protein databases can be a laborious and slow process and our ML methods enabled a giant leap forward — for example, increasing the number of Pfam annotations by a larger number than all other efforts during the past decade combined. The millions of scientists worldwide who access these databases each year can now use our annotations for their research.

Google Research contributions to Pfam exceed in size all expansion efforts made to the database over the last decade.

Although the first draft of the human genome was released in 2003, it was incomplete and had many gaps due to technical limitations in the sequencing technologies. In 2022 we celebrated the remarkable achievements of the Telomere-to-Telomere (T2T) Consortium in resolving these previously unavailable regions (including five full chromosome arms and nearly 200 million base pairs of novel DNA sequences), which are interesting and important for questions of human biology, evolution, and disease. Our open source genomics variant caller, DeepVariant, was one of the tools used by the T2T Consortium to prepare their release of a complete 3.055 billion base pair sequence of a human genome. The T2T Consortium is also using our newer open source method DeepConsensus, which provides on-device error correction for Pacific Biosciences long-read sequencing instruments, in their latest research toward comprehensive pan-genome resources that can represent the breadth of human genetic diversity.

Using quantum computing for new physics discoveries

When it comes to making scientific discoveries, quantum computing is still in its infancy, but has a lot of potential. We’re exploring ways of advancing the capabilities of quantum computing so that it can become a tool for scientific discovery and breakthroughs. In collaboration with physicists from around the world, we are also starting to use our existing quantum computers to create interesting new experiments in physics.

As an example of such experiments, consider the problem where a sensor measures something, and a computer then processes the data from the sensor. Traditionally, this means the sensor’s data is processed as classical information on our computers. Instead, one idea in quantum computing is to directly process quantum data from sensors. Feeding data from quantum sensors directly to quantum algorithms without going through classical measurements may provide a large advantage. In a recent Science paper written in collaboration with researchers from multiple universities, we show that quantum computing can extract information from exponentially fewer experiments than classical computing, as long as the quantum computer is coupled directly to the quantum sensors and is running a learning algorithm. This “quantum machine learning” can yield an exponential advantage in dataset size, even with today’s noisy intermediate-scale quantum computers. Because experimental data is often the limiting factor in scientific discovery, quantum ML has the potential to unlock the vast power of quantum computers for scientists. Even better, the insights from this work are also applicable to learning on the output of quantum computations, such as the output of quantum simulations that may otherwise be difficult to extract.

Even without quantum ML, a powerful application of quantum computers is to experimentally explore quantum systems that would be otherwise impossible to observe or simulate. In 2022, the Quantum AI team used this approach to observe the first experimental evidence of multiple microwave photons in a bound state using superconducting qubits. Photons typically do not interact with one another, and require an additional element of non-linearity to cause them to interact. The results of our quantum computer simulations of these interactions surprised us — we thought the existence of these bound states relied on fragile conditions, but instead we found that they were robust even to relatively strong perturbations that we applied.

Occupation probability versus discrete time step for n-photon bound states. We observe that the majority of the photons (darker colors) remain bound together.

Given the initial successes we have had in applying quantum computing to make physics breakthroughs, we are hopeful about the possibility of this technology to enable future groundbreaking discoveries that could have as significant a societal impact as the creation of transistors or GPS. The future of quantum computing as a scientific tool is exciting!

Acknowledgements

I would like to thank everyone who worked hard on the advances described in this post, including the Google Applied Sciences, Quantum AI, Genomics and Brain teams and their collaborators across Google Research and externally. Finally, I would like to thank the many Googlers who provided feedback in the writing of this post, including Lizzie Dorfman, Erica Brand, Elise Kleeman, Abe Asfaw, Viren Jain, Lucy Colwell, Andrew Carroll, Ariel Goldstein and Charina Chou.

FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.

In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.

Challenge: Few-Shot Generalization

Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.

An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).

Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.

Data Collection

The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random.

  1. The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence with the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20-30 terms that have regionally distinctive translations according to blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers from each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., bus). The same process was carried out independently for Mandarin.
  2. The Entity bucket is populated in a similar way and concerns people, locations or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
    1. The strong geographical association between Lisbon and Portugal might influence a model to generate a European Portuguese translation instead, e.g., by selecting “autocarro” rather than “ônibus”.
    2. Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but would be semantically inaccurate, even in an otherwise fluent translation.
  3. The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 randomly sampled articles from Wikipedia’s “featured” and “good” collections.

Evaluation Methodology

To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.

While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we discovered that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).

Metric    Pearson’s ρ
chrF      0.48
BLEU      0.58
BLEURT    0.65

Correlation between different automatic metrics and human judgements of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
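For readers who want to run the same kind of comparison on their own systems, the following is a minimal sketch of computing a Pearson correlation between an automatic metric and human scores. The score lists are invented toy values, not FRMT data. Negating the MQM scores so that higher means better is an assumption of this sketch, since MQM counts errors while the automatic metrics reward quality.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy, made-up scores for five translations (not taken from the benchmark).
metric_scores = [0.71, 0.64, 0.80, 0.55, 0.68]   # higher is better
mqm_scores = [1.2, 2.5, 0.6, 3.1, 1.9]           # lower is better

# Flip the sign of MQM so both lists point in the same direction.
human_quality = [-s for s in mqm_scores]
print(f"Pearson correlation: {pearson(metric_scores, human_quality):.2f}")
```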

System Performance

Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.

Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).

    Translate the following texts from English to European Portuguese.
    English: [English example 1].
    European Portuguese: [correct translation 1].
    ...
    English: [input].
    European Portuguese: _____
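A prompt of this shape can be assembled programmatically. The sketch below is only an illustration of the template shown above: `build_prompt` is a hypothetical helper, the exemplar pair is a made-up example, and nothing here calls PaLM or any real model API.

```python
def build_prompt(exemplars: list[tuple[str, str]],
                 source_text: str,
                 target_variety: str = "European Portuguese") -> str:
    """Assemble a few-shot translation prompt from (English, translation) pairs."""
    lines = [f"Translate the following texts from English to {target_variety}."]
    for english, translation in exemplars:
        lines.append(f"English: {english}")
        lines.append(f"{target_variety}: {translation}")
    # The final pair is left open so the model fills in the translation.
    lines.append(f"English: {source_text}")
    lines.append(f"{target_variety}:")
    return "\n".join(lines)


# Illustrative exemplar using the regional word for "bus" discussed earlier.
exemplars = [("The bus arrived.", "O autocarro chegou.")]
prompt = build_prompt(exemplars, "In Lisbon, I often took the bus.")
print(prompt)
```

Switching `target_variety` and the exemplars is all that is needed to target another regional variety, which is what makes the few-shot setup easy to extend.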

PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.

MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.

Conclusion

In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance for region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT have significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.

Acknowledgements

We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.

FriendlyCore: A novel differentially private aggregation framework

Differential privacy (DP) machine learning algorithms protect user data by limiting the effect of each data point on an aggregated output with a mathematical guarantee. Intuitively the guarantee implies that changing a single user’s contribution should not significantly change the output distribution of the DP algorithm.

However, DP algorithms tend to be less accurate than their non-private counterparts because satisfying DP is a worst-case requirement: one has to add noise to “hide” changes in any potential input point, including “unlikely points” that have a significant impact on the aggregation. For example, suppose we want to privately estimate the average of a dataset, and we know that a sphere of diameter Λ contains all possible data points. The sensitivity of the average to a single point is bounded by Λ, and therefore it suffices to add noise proportional to Λ to each coordinate of the average to ensure DP.

A sphere of diameter Λ containing all possible data points.

Now assume that all the data points are “friendly,” meaning they are close together, and each affects the average by at most 𝑟, which is much smaller than Λ. Still, the traditional way for ensuring DP requires adding noise proportional to Λ to account for a neighboring dataset that contains one additional “unfriendly” point that is unlikely to be sampled.
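
To make the gap between Λ and 𝑟 concrete, here is a minimal sketch of Gaussian noise added to an average. It uses the standard calibration of the Gaussian mechanism under zCDP (σ = sensitivity / √(2ρ)) and treats the sensitivity bound as an input. The function names and the values of Λ, 𝑟 and ρ are illustrative assumptions, not settings from the paper.

```python
import numpy as np


def gaussian_sigma(sensitivity, rho):
    """Noise scale of the Gaussian mechanism under rho-zCDP."""
    return sensitivity / np.sqrt(2.0 * rho)


def noisy_average(points, sensitivity, rho, rng):
    """Average the rows of `points` and add Gaussian noise to every coordinate."""
    mean = points.mean(axis=0)
    sigma = gaussian_sigma(sensitivity, rho)
    return mean + rng.normal(0.0, sigma, size=mean.shape)


rng = np.random.default_rng(0)
# A tightly packed ("friendly") dataset of 800 points in 3 dimensions.
points = rng.normal(loc=5.0, scale=0.1, size=(800, 3))

big_lambda, r, rho = 100.0, 0.5, 0.5  # illustrative values only
worst_case = noisy_average(points, big_lambda, rho, rng)  # noise scaled to Lambda
friendly = noisy_average(points, r, rho, rng)             # noise scaled to r
print(gaussian_sigma(big_lambda, rho), gaussian_sigma(r, rho))
```

The noise scale shrinks by the factor Λ/𝑟, which is the accuracy gain available whenever the data can be assumed friendly.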

Two adjacent datasets that differ in a single outlier. A DP algorithm would have to add noise proportional to Λ to each coordinate to hide this outlier.

In “FriendlyCore: Practical Differentially Private Aggregation”, presented at ICML 2022, we introduce a general framework for computing differentially private aggregations. The FriendlyCore framework pre-processes data, extracting a “friendly” subset (the core) and consequently reducing the private aggregation error seen with traditional DP algorithms. The private aggregation step adds less noise since we do not need to account for unfriendly points that negatively impact the aggregation.

In the averaging example, we first apply FriendlyCore to remove outliers, and in the aggregation step, we add noise proportional to 𝑟 (not Λ). The challenge is to make our overall algorithm (outlier removal + aggregation) differentially private. This constrains our outlier removal scheme and stabilizes the algorithm so that two adjacent inputs that differ by a single point (outlier or not) should produce any (friendly) output with similar probabilities.

FriendlyCore Framework

We begin by formalizing when a dataset is considered friendly, which depends on the type of aggregation needed and should capture datasets for which the sensitivity of the aggregate is small. For example, if the aggregate is averaging, the term friendly should capture datasets with a small diameter.

To abstract away the particular application, we define friendliness using a predicate 𝑓 that is positive on points 𝑥 and 𝑦 if they are “close” to each other. For example, in the averaging application, 𝑥 and 𝑦 are close if the distance between them is less than 𝑟. We say that a dataset is friendly (for this predicate) if, for every pair of points 𝑥 and 𝑦 in it, both are close to a third point 𝑧 (not necessarily in the data).
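
For the averaging predicate, this definition can be checked directly. The sketch below assumes Euclidean distance; in that case the midpoint of 𝑥 and 𝑦 is the best candidate for 𝑧, so testing it is sufficient. The function names are illustrative.

```python
import itertools

import numpy as np


def is_close(x, y, r):
    """Predicate f for averaging: x and y are close if their distance is below r."""
    return np.linalg.norm(np.asarray(x, float) - np.asarray(y, float)) < r


def is_friendly(points, r):
    """Check that every pair of points has a common point z close to both.

    In Euclidean space the midpoint minimizes the larger of the two
    distances, so if the midpoint fails, no other z can succeed.
    """
    for x, y in itertools.combinations(points, 2):
        z = (np.asarray(x, float) + np.asarray(y, float)) / 2.0
        if not (is_close(x, z, r) and is_close(y, z, r)):
            return False
    return True


cluster = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5)]
print(is_friendly(cluster, r=1.0))                   # tight cluster
print(is_friendly(cluster + [(10.0, 10.0)], r=1.0))  # same cluster plus a far outlier
```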

Once we have fixed 𝑓 and defined when a dataset is friendly, two tasks remain. First, we construct the FriendlyCore algorithm that extracts a large friendly subset (the core) of the input stably. FriendlyCore is a filter satisfying two requirements: (1) It has to remove outliers to keep only elements that are close to many others in the core, and (2) for neighboring datasets that differ by a single element, 𝑦, the filter outputs each element except 𝑦 with almost the same probability. Furthermore, the union of the cores extracted from these neighboring datasets is friendly.

The idea underlying FriendlyCore is simple: The probability that we add a point, 𝑥, to the core is a monotonic and stable function of the number of elements close to 𝑥. In particular, if 𝑥 is close to all other points, it’s not considered an outlier and can be kept in the core with probability 1.
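
The sampling idea can be sketched as follows. The keep-probability used here is a hypothetical linear ramp chosen only to be monotone in the number of close points. It is not the function from the paper, and this sketch makes no claim to satisfy the stability requirement or any privacy guarantee.

```python
import numpy as np


def keep_probability(num_close, n):
    """Hypothetical monotone ramp (assumes n >= 2).

    Returns 0 when at most half of the other points are close,
    and 1 when all n - 1 other points are close.
    """
    half = (n - 1) / 2.0
    return float(np.clip((num_close - half) / half, 0.0, 1.0))


def sample_core(points, r, rng):
    """Keep each point with a probability that grows with its number of close points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    num_close = (dists < r).sum(axis=1) - 1  # subtract 1 to exclude the point itself
    probs = np.array([keep_probability(c, n) for c in num_close])
    keep = rng.random(n) < probs
    return points[keep], probs


rng = np.random.default_rng(0)
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05], [5.0, 5.0]]
core, probs = sample_core(points, r=0.5, rng=rng)
print(probs)
```

In this toy input the last point is close to no other point, so its keep-probability is 0 and it never enters the core, while a point close to all others would be kept with probability 1.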

Second, we develop the Friendly DP algorithm that satisfies a weaker notion of privacy by adding less noise to the aggregate. This means that the outcomes of the aggregation are guaranteed to be similar only for neighboring datasets 𝐶 and 𝐶’ such that the union of 𝐶 and 𝐶’ is friendly.

Our main theorem states that if we apply a friendly DP aggregation algorithm to the core produced by a filter with the requirements listed above, then this composition is differentially private in the regular sense.

Clustering and other applications

Other applications of our aggregation method are clustering and learning the covariance matrix of a Gaussian distribution. Consider the use of FriendlyCore to develop a differentially private k-means clustering algorithm. Given a database of points, we partition it into random equal-size smaller subsets and run a good non-private k-means clustering algorithm on each small set. If the original dataset contains k large clusters then each smaller subset will contain a significant fraction of each of these k clusters. It follows that the tuples (ordered sets) of k-centers we get from the non-private algorithm for each small subset are similar. This dataset of tuples is expected to have a large friendly core (for an appropriate definition of closeness).

We use our framework to aggregate the resulting tuples of k-centers (k-tuples). We define two such k-tuples to be close if there is a matching between them such that a center is substantially closer to its mate than to any other center.
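
One way to turn this closeness notion into code is sketched below. It makes two simplifying assumptions that are not taken from the paper: each center in the first tuple is matched to its nearest center in the second tuple, and “substantially closer” means closer by a hypothetical factor `gamma`.

```python
import numpy as np


def tuples_are_close(a, b, gamma=2.0):
    """Return True if the k-tuples a and b can be matched center by center.

    Simplified, one-directional check: every center in `a` is paired with its
    nearest center in `b`; the pairing must be one-to-one, and each center
    must be at least `gamma` times closer to its mate than to any other
    center in `b`.
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # d[i, j] = |a_i - b_j|
    mates = d.argmin(axis=1)
    if len(set(mates.tolist())) != len(a):
        return False  # two centers share a mate, so there is no matching
    for i, j in enumerate(mates):
        others = np.delete(d[i], j)
        if others.size and gamma * d[i, j] >= others.min():
            return False
    return True


red = [(0.0, 0.0), (10.0, 10.0)]
blue = [(10.2, 9.9), (0.1, -0.1)]  # same two centers, listed in a different order
pink = [(5.0, 5.0), (5.5, 5.0)]    # centers that match neither red center well
print(tuples_are_close(red, blue), tuples_are_close(red, pink))
```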

In this picture, any pair of the red, blue, and green tuples are close to each other, but none of them is close to the pink tuple. So the pink tuple is removed by our filter and is not in the core.

We then extract the core by our generic sampling scheme and aggregate it using the following steps:

  1. Pick a random k-tuple 𝑇 from the core.
  2. Partition the data by putting each point in a bucket according to its closest center in 𝑇.
  3. Privately average the points in each bucket to get our final k-centers.
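
The data flow of these three steps can be sketched as below. Step 3 uses plain Gaussian noise with a caller-supplied sensitivity as a stand-in for a private averaging routine, and the fallback for empty buckets is an assumption. As written, this sketch is not a complete differentially private algorithm.

```python
import numpy as np


def aggregate_core(core_tuples, data, sensitivity, rho, rng):
    """Illustrative version of the three aggregation steps."""
    core_tuples = np.asarray(core_tuples, float)
    data = np.asarray(data, float)

    # Step 1: pick a random k-tuple T from the core.
    centers = core_tuples[rng.integers(len(core_tuples))]

    # Step 2: put each point in the bucket of its closest center in T.
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
    labels = d.argmin(axis=1)

    # Step 3: noisy average per bucket (placeholder for private averaging).
    sigma = sensitivity / np.sqrt(2.0 * rho)
    result = []
    for j in range(len(centers)):
        bucket = data[labels == j]
        mean = bucket.mean(axis=0) if len(bucket) else centers[j]  # assumed fallback
        result.append(mean + rng.normal(0.0, sigma, size=mean.shape))
    return np.array(result), labels


rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),   # cluster near (0, 0)
    rng.normal(10.0, 0.1, size=(50, 2)),  # cluster near (10, 10)
])
core_tuples = [[(0.1, 0.0), (9.9, 10.0)], [(0.0, 0.1), (10.0, 10.1)]]
centers, labels = aggregate_core(core_tuples, data, sensitivity=0.01, rho=0.5, rng=rng)
print(centers)
```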

Empirical results

Below are the empirical results of our algorithms based on FriendlyCore. We implemented them in the zero-Concentrated Differential Privacy (zCDP) model, which gives improved accuracy in our setting (with similar privacy guarantees as the more well-known (𝜖, 𝛿)-DP).

Averaging

We tested the mean estimation of 800 samples from a spherical Gaussian with an unknown mean. We compared it to the algorithm CoinPress. In contrast to FriendlyCore, CoinPress requires an upper bound 𝑅 on the norm of the mean. The figures below show the effect on accuracy when increasing 𝑅 or the dimension 𝑑. Our averaging algorithm performs better on large values of these parameters since it is independent of 𝑅 and 𝑑.

Left: Averaging in 𝑑 = 1000, varying 𝑅. Right: Averaging with 𝑅 = √𝑑, varying 𝑑.

Clustering

We tested the performance of our private clustering algorithm for k-means. We compared it to the Chung and Kamath algorithm that is based on recursive locality-sensitive hashing (LSH-clustering). For each experiment, we performed 30 repetitions and present the medians along with the 0.1 and 0.9 quantiles. In each repetition, we normalize the losses by the loss of k-means++ (where a smaller number is better).
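
The reporting protocol described here (normalize by the baseline loss in each repetition, then report the median with the 0.1 and 0.9 quantiles) can be sketched as follows. The loss values are random placeholders rather than experimental results, and the function names are illustrative.

```python
import numpy as np


def kmeans_loss(data, centers):
    """Sum of squared distances from each point to its nearest center."""
    data, centers = np.asarray(data, float), np.asarray(centers, float)
    d = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
    return float((d.min(axis=1) ** 2).sum())


def summarize(losses, baseline_losses):
    """Normalize each repetition by its baseline loss; return median, q0.1, q0.9."""
    normalized = np.asarray(losses, float) / np.asarray(baseline_losses, float)
    q10, median, q90 = np.quantile(normalized, [0.1, 0.5, 0.9])
    return median, q10, q90


# Placeholder numbers standing in for 30 repetitions of an experiment.
rng = np.random.default_rng(0)
baseline = rng.uniform(90.0, 110.0, size=30)         # stand-in for k-means++ losses
private = baseline * rng.uniform(1.0, 1.5, size=30)  # stand-in for private losses
median, q10, q90 = summarize(private, baseline)
print(median, q10, q90)
```

A normalized loss of 1 means the private algorithm matched the non-private baseline in that repetition.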

The left figure below compares the k-means results on a uniform mixture of eight separated Gaussians in two dimensions. For small values of 𝑛 (the number of samples from the mixture), FriendlyCore often fails and yields inaccurate results. Yet, increasing 𝑛 increases the success probability of our algorithm (because the generated tuples become closer to each other) and yields very accurate results, while LSH-clustering lags behind.

Left: k-means results in 𝑑 = 2 and k = 8, for varying 𝑛 (number of samples). Right: A graphical illustration of the centers in one of the iterations for 𝑛 = 2 × 10⁵. Green points are the centers of our algorithm and red points are the centers of LSH-clustering.

FriendlyCore also performs well on large datasets, even without clear separation into clusters. We used the Fonollosa and Huerta gas sensors dataset, which contains 8M rows, each a 16-dimensional point defined by the measurements of 16 sensors at a given point in time. We compared the clustering algorithms for varying k. FriendlyCore performs well except for k = 5, where it fails due to the instability of the non-private algorithm used by our method (there are two different solutions for k = 5 with similar cost, which makes our approach fail because we do not get one set of tuples that are close to each other).

k-means results on gas sensors’ measurements over time, varying k.

Conclusion

FriendlyCore is a general framework for filtering metric data before privately aggregating it. The filtered data is stable and makes the aggregation less sensitive, enabling us to increase its accuracy with DP. Our algorithms outperform private algorithms tailored for averaging and clustering, and we believe this technique can be useful for additional aggregation tasks. Initial results show that it can effectively reduce utility loss when we deploy DP aggregations. To learn more, and see how we apply it for estimating the covariance matrix of a Gaussian distribution, see our paper.

Acknowledgements

This work was led by Eliad Tsfadia in collaboration with Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Avinatan Hassidim and Yossi Matias.

Google Research, 2022 & beyond: Robotics

(This is Part 6 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robots can be broadly useful in helping with practical day-to-day tasks in people-centered spaces (spaces designed for people, not machines), they need to be able to safely and competently provide assistance to people.

In 2022, we focused on challenges that come with enabling robots to be more helpful to people: 1) allowing robots and humans to communicate more efficiently and naturally; 2) enabling robots to understand and apply common sense knowledge in real-world situations; and 3) scaling the number of low-level skills robots need to effectively perform tasks in unstructured environments.

An undercurrent this past year has been the exploration of how large, generalist models, like PaLM, can work alongside other approaches to surface capabilities allowing robots to learn from a breadth of human knowledge and allowing people to engage with robots more naturally. As we do this, we’re transforming robot learning into a scalable data problem so that we can scale learning of generalized low-level skills, like manipulation. In this blog post, we’ll review key learnings and themes from our explorations in 2022.

Bringing the capabilities of LLMs to robotics

An incredible feature of large language models (LLMs) is their ability to encode descriptions and context into a format that’s understandable by both people and machines. When applied to robotics, LLMs let people task robots more easily — just by asking — with natural language. When combined with vision models and robotics learning approaches, LLMs give robots a way to understand the context of a person’s request and make decisions about what actions should be taken to complete it.

One of the underlying concepts is using LLMs to prompt other pretrained models for information that can build context about what is happening in a scene and make predictions about multimodal tasks. This is similar to the Socratic method in teaching, where a teacher asks students questions to lead them through a rational thought process. In “Socratic Models”, we showed that this approach can achieve state-of-the-art performance in zero-shot image captioning and video-to-text retrieval tasks. It also enables new capabilities, like answering free-form questions about video and predicting future activity from it, multimodal assistive dialogue, and, as we’ll discuss next, robot perception and planning.

In “Towards Helpful Robots: Grounding Language in Robotic Affordances”, we partnered with Everyday Robots to ground the PaLM language model in a robotics affordance model to plan long horizon tasks. In previous machine-learned approaches, robots were limited to short, hard-coded commands, like “Pick up the sponge,” because they struggled with reasoning about the steps needed to complete a task — which is even harder when the task is given as an abstract goal like, “Can you help clean up this spill?”

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

For this approach to work, one needs both an LLM that can predict the sequence of steps to complete long-horizon tasks and an affordance model representing the skills a robot can actually perform in a given situation. In “Extracting Skill-Centric State Abstractions from Value Functions”, we showed that the value function in reinforcement learning (RL) models can be used to build the affordance model, an abstract representation of the actions a robot can perform under different states. This lets us connect long-horizon real-world tasks, like “tidy the living room”, to the short-horizon skills needed to complete them, like correctly picking, placing, and arranging items.
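
As a rough illustration of how the two models can be combined, the sketch below scores each candidate skill by multiplying a language-model score (how useful the skill is for the instruction) by an affordance value (how likely the skill is to succeed in the current state). The multiplication rule, the skill names and all numbers are assumptions made for this example.

```python
def choose_skill(llm_scores, affordances):
    """Pick the skill with the highest combined score.

    llm_scores:  skill -> score from the language model (hypothetical values)
    affordances: skill -> estimated success probability in the current state
    """
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Instruction: "Can you help clean up this spill?" (scores are made up).
llm_scores = {"find a sponge": 0.5, "pick up the sponge": 0.4, "go to the table": 0.1}
affordances = {"find a sponge": 0.9, "pick up the sponge": 0.1, "go to the table": 0.8}

best, combined = choose_skill(llm_scores, affordances)
print(best)
```

Here “pick up the sponge” is useful according to the language model but not yet feasible because no sponge is in view, so the combined score favors finding one first.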

Having both an LLM and an affordance model doesn’t mean that the robot will actually be able to complete the task successfully. However, with Inner Monologue, we closed the loop on LLM-based task planning with other sources of information, like human feedback or scene understanding, to detect when the robot fails to complete the task correctly. Using a robot from Everyday Robots, we show that LLMs can effectively replan if the current or previous plan steps failed, allowing the robot to recover from failures and complete complex tasks like “Put a coke in the top drawer,” as shown in the video below.
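
The closed-loop idea can be sketched as a simple execute, observe, replan loop. In this toy version the planner and the success detector are stubs: in a real system the replanning would come from the LLM and the feedback from perception or a person. All names here are hypothetical.

```python
def run_with_replanning(plan, execute, replan, max_attempts=10):
    """Execute steps one by one and ask for a new plan whenever a step fails."""
    history = []
    steps = list(plan)
    attempts = 0
    while steps and attempts < max_attempts:
        step = steps.pop(0)
        success = execute(step)
        history.append((step, success))  # feedback that the planner can read
        attempts += 1
        if not success:
            steps = list(replan(history, [step] + steps))
    # Completed only if nothing is left and the last executed step succeeded.
    completed = not steps and all(ok for _, ok in history[-1:])
    return history, completed


failures = {"pick up the coke": 1}  # stub: this step fails once, then succeeds


def execute(step):
    if failures.get(step, 0) > 0:
        failures[step] -= 1
        return False
    return True


def replan(history, remaining):
    return remaining  # simplest possible "replan": retry the remaining steps


plan = ["open the top drawer", "pick up the coke", "place the coke in the drawer"]
history, completed = run_with_replanning(plan, execute, replan)
print(history, completed)
```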

An emergent capability from closing the loop on LLM-based task planning that we saw with Inner Monologue is that the robot can react to changes in the high-level goal mid-task. For example, a person might tell the robot to change its behavior as it is happening, by offering quick corrections or redirecting the robot to another task. This behavior is especially useful to let people interactively control and customize robot tasks when robots are working near people.

While natural language makes it easier for people to specify and modify robot tasks, one of the challenges is being able to react in real time to the full vocabulary people can use to describe tasks that a robot is capable of doing. In “Talking to Robots in Real Time”, we demonstrated a large-scale imitation learning framework for producing real-time, open-vocabulary, language-conditionable robots. With one policy we were able to address over 87,000 unique instructions, with an estimated average success rate of 93.5%. As part of this project, we released Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.

Examples of long horizon goals reached under real time human language guidance.

We’re also excited about the potential for LLMs to write code that can control robot actions. Code-writing approaches, like in “Robots That Write Their Own Code”, show promise in increasing the complexity of tasks robots can complete by autonomously generating new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

Turning robot learning into a scalable data problem

Large language and multimodal models help robots understand the context in which they’re operating, like what’s happening in a scene and what the robot is expected to do. But robots also need low-level physical skills to complete tasks in the physical world, like picking up and precisely placing objects.

While we often take these physical skills for granted, executing them hundreds of times every day without even thinking, they present significant challenges to robots. For example, to pick up an object, the robot needs to perceive and understand the environment, reason about the spatial relation and contact dynamics between its gripper and the object, actuate the high degrees-of-freedom arm precisely, and exert the right amount of force to stably grasp the object without breaking it. The difficulty of learning these low-level skills is known as Moravec’s paradox: reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources.

Inspired by the recent success of LLMs, which shows that the generalization and performance of large Transformer-based models scale with the amount of data, we are taking a data-driven approach, turning the problem of learning low-level physical skills into a scalable data problem. With Robotics Transformer-1 (RT-1), we trained a robot manipulation policy on a large-scale, real-world robotics dataset of 130k episodes that cover 700+ tasks using a fleet of 13 robots from Everyday Robots and showed the same trend for robotics: increasing the scale and diversity of data improves the model’s ability to generalize to new tasks, environments, and objects.

Example PaLM-SayCan-RT1 executions of long-horizon tasks in real kitchens.

Behind both language models and many of our robotics learning approaches, like RT-1, are Transformers, which allow models to make sense of Internet-scale data. Unlike LLMs, robotics is challenged by multimodal representations of constantly changing environments and limited compute. In 2020, we introduced Performers as an approach to make Transformers more computationally efficient, which has implications for many applications beyond robotics. In Performer-MPC, we applied this to introduce a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). We show a >40% improvement on the robot reaching its goal and a >65% improvement on social metrics when navigating around humans in comparison to a standard MPC policy. Performer-MPC provides 8 ms latency for the 8.3M parameter model, making on-robot deployment of Transformers practical.

Navigation robot maneuvering through highly constrained spaces using: Regular MPC, Explicit Policy, and Performer-MPC.

In the last year, our team has shown that data-driven approaches are generally applicable on different robotic platforms in diverse environments to learn a wide range of tasks, including mobile manipulation, navigation, locomotion and table tennis. This shows us a clear path forward for learning low-level robot skills: scalable data collection. Unlike video and text data that is abundant on the Internet, robotic data is extremely scarce and hard to acquire. Finding approaches to collect and efficiently use rich datasets representative of real-world interactions is the key for our data-driven approaches.

Simulation is a fast, safe, and easily parallelizable option, but it is difficult to replicate the full environment, especially physics and human-robot interactions, in simulation. In i-Sim2Real, we showed an approach to address the sim-to-real gap and learn to play table tennis with a human opponent by bootstrapping from a simple model of human behavior and alternating between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

Learning to play table tennis with a human opponent.

While simulation helps, collecting data in the real world is essential for fine-tuning simulation policies or adapting existing policies to new environments. While learning, robots are prone to failure, which can damage the robots themselves and their surroundings, especially in the early stages of learning when they are still exploring how to interact with the world. We need to collect training data safely, even while the robot is learning, and enable the robot to autonomously recover from failure. In “Learning Locomotion Skills Safely in the Real World”, we introduced a safe RL framework that switches between a “learner policy” optimized to perform the desired task and a “safe recovery policy” that keeps the robot out of unsafe states. In “Legged Robots that Keep on Learning”, we trained a reset policy so the robot can recover from failures, like learning to stand up by itself after falling.
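
The switching between the two policies can be illustrated with a toy one-dimensional example. The safety trigger, the policies and the dynamics below are all hypothetical stand-ins; the criterion used in the actual framework is not described in this post.

```python
def select_action(state, learner_policy, recovery_policy, is_near_unsafe):
    """Hand control to the recovery policy whenever the state is close to unsafe."""
    if is_near_unsafe(state):
        return recovery_policy(state), "recovery"
    return learner_policy(state), "learner"


# Toy setup: the state is a tilt angle, and large tilts are unsafe.
def learner_policy(tilt):
    return 0.2  # explores by always pushing in one direction


def recovery_policy(tilt):
    return -tilt  # pushes the robot back to upright


def is_near_unsafe(tilt):
    return abs(tilt) > 0.5  # hypothetical safety trigger


tilt, tilts, modes = 0.0, [], []
for _ in range(10):
    action, mode = select_action(tilt, learner_policy, recovery_policy, is_near_unsafe)
    tilt += action
    tilts.append(tilt)
    modes.append(mode)
print(modes)
```

The learner keeps control while the tilt stays small, and the recovery policy takes over only when the trigger fires, so data collection can continue without the state drifting further into the unsafe region.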

Automatic reset policies enable the robot to continue learning in a lifelong fashion without human supervision.

While robot data is scarce, videos of people performing different tasks are abundant. Of course, robots aren’t built like people — so the idea of robotic learning from people raises the problem of transferring learning across different embodiments. In “Robot See, Robot Do”, we developed Cross-Embodiment Inverse Reinforcement Learning to learn new tasks by watching people. Instead of trying to replicate the task exactly as a person would, we learn the high-level task objective, and summarize that knowledge in the form of a reward function. This type of demonstration learning could allow robots to learn skills by watching videos readily available on the internet.

We’re also progressing towards making our learning algorithms more data efficient so that we’re not relying only on scaling data collection. We improved the efficiency of RL approaches by incorporating prior information, including predictive information, adversarial motion priors, and guide policies. Further improvements are gained by utilizing a novel structured dynamical systems architecture and combining RL with trajectory optimization, supported by novel solvers. These types of prior information helped alleviate the exploration challenges, served as good regularizers, and significantly reduced the amount of data required. Furthermore, our team has invested heavily in more data-efficient imitation learning. We showed that a simple imitation learning approach, BC-Z, can enable zero-shot generalization to new tasks that were not seen during training. We also introduced an iterative imitation learning algorithm, GoalsEye, which combined Learning from Play and Goal-Conditioned Behavior Cloning for high-speed and high-precision table tennis games. On the theoretical front, we investigated dynamical-systems stability for characterizing the sample complexity of imitation learning, and the role of capturing failure-and-recovery within demonstration data to better condition offline learning from smaller datasets.

Closing

Advances in large models across the field of AI have spurred a leap in capabilities for robot learning. This past year, we’ve seen the sense of context and sequencing of events captured in LLMs help solve long-horizon planning for robotics and make robots easier for people to interact with and task. We’ve also seen a scalable path to learning robust and generalizable robot behaviors by applying a transformer model architecture to robot learning. We continue to open source data sets, like “Scanned Objects: A Dataset of 3D-Scanned Common Household Items”, and models, like RT-1, in the spirit of participating in the broader research community. We’re excited about building on these research themes in the coming year to enable helpful robots.

Acknowledgements

We would like to thank everyone who supported our research. This includes the entire Robotics at Google team, and collaborators from Everyday Robots and Google Research. We also want to thank our external collaborators, including UC Berkeley, Stanford, Gatech, University of Washington, MIT, CMU and U Penn.

Google Research, 2022 & beyond

This was the sixth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | ML & Computer Systems
Efficient Deep Learning | Algorithmic Advances | Robotics
Health* | General Science & Quantum | Community Engagement

* Articles will be linked as they are released.
