FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation

Many languages spoken worldwide cover numerous regional varieties (sometimes called dialects), such as Brazilian and European Portuguese or Mainland and Taiwan Mandarin Chinese. Although such varieties are often mutually intelligible to their speakers, there are still important differences. For example, the Brazilian Portuguese word for “bus” is ônibus, while the European Portuguese word is autocarro. Yet, today’s machine translation (MT) systems typically do not allow users to specify which variety of a language to translate into. This may lead to confusion if the system outputs the “wrong” variety or mixes varieties in an unnatural way. Also, region-unaware MT systems tend to favor whichever variety has more data available online, which disproportionately affects speakers of under-resourced language varieties.

In “FRMT: A Benchmark for Few-Shot Region-Aware Machine Translation”, accepted for publication in Transactions of the Association for Computational Linguistics, we present an evaluation dataset used to measure MT systems’ ability to support regional varieties through a case study on Brazilian vs. European Portuguese and Mainland vs. Taiwan Mandarin Chinese. With the release of the FRMT data and accompanying evaluation code, we hope to inspire and enable the research community to discover new ways of creating MT systems that are applicable to the large number of regional language varieties spoken worldwide.

Challenge: Few-Shot Generalization

Most modern MT systems are trained on millions or billions of example translations, such as an English input sentence and its corresponding Portuguese translation. However, the vast majority of available training data doesn’t specify what regional variety the translation is in. In light of this data scarcity, we position FRMT as a benchmark for few-shot translation, measuring an MT model’s ability to translate into regional varieties when given no more than 100 labeled examples of each language variety. MT models need to use the linguistic patterns showcased in the small number of labeled examples (called “exemplars”) to identify similar patterns in their unlabeled training examples. In this way, models can generalize, producing correct translations of phenomena not explicitly shown in the exemplars.

An illustration of a few-shot MT system translating the English sentence, “The bus arrived,” into two regional varieties of Portuguese: Brazilian (🇧🇷; left) and European (🇵🇹; right).

Few-shot approaches to MT are attractive because they make it much easier to add support for additional regional varieties to an existing system. While our work is specific to regional varieties of two languages, we anticipate that methods that perform well will be readily applicable to other languages and regional varieties. In principle, those methods should also work for other language distinctions, such as formality and style.

Data Collection

The FRMT dataset consists of partial English Wikipedia articles, sourced from the Wiki40b dataset, that have been translated by paid, professional translators into different regional varieties of Portuguese and Mandarin. In order to highlight key region-aware translation challenges, we designed the dataset using three content buckets: (1) Lexical, (2) Entity, and (3) Random.

  1. The Lexical bucket focuses on regional differences in word choice, such as the “ônibus” vs. “autocarro” distinction when translating a sentence with the word “bus” into Brazilian vs. European Portuguese, respectively. We manually collected 20-30 terms that have regionally distinctive translations according to blogs and educational websites, and filtered and vetted the translations with feedback from volunteer native speakers from each region. Given the resulting list of English terms, we extracted texts of up to 100 sentences each from the associated English Wikipedia articles (e.g., bus). The same process was carried out independently for Mandarin.
  2. The Entity bucket is populated in a similar way and concerns people, locations or other entities strongly associated with one of the two regions in question for a given language. Consider an illustrative sentence like, “In Lisbon, I often took the bus.” In order to translate this correctly into Brazilian Portuguese, a model must overcome two potential pitfalls:
    1. The strong geographical association between Lisbon and Portugal might influence a model to generate a European Portuguese translation instead, e.g., by selecting “autocarro” rather than “ônibus“.
    2. Replacing “Lisbon” with “Brasília” might be a naive way for a model to localize its output toward Brazilian Portuguese, but would be semantically inaccurate, even in an otherwise fluent translation.
  3. The Random bucket is used to check that a model correctly handles other diverse phenomena, and consists of text from 100 randomly sampled articles from Wikipedia’s “featured” and “good” collections.

Evaluation Methodology

To verify that the translations collected for the FRMT dataset capture region-specific phenomena, we conducted a human evaluation of their quality. Expert annotators from each region used the Multi-dimensional Quality Metrics (MQM) framework to identify and categorize errors in the translations. The framework includes a category-wise weighting scheme to convert the identified errors into a single score that roughly represents the number of major errors per sentence; so a lower number indicates a better translation. For each region, we asked MQM raters to score both translations from their region and translations from their language’s other region. For example, Brazilian Portuguese raters scored both the Brazilian and European Portuguese translations. The difference between these two scores indicates the prevalence of linguistic phenomena that are acceptable in one variety but not the other. We found that in both Portuguese and Chinese, raters identified, on average, approximately two more major errors per sentence in the mismatched translations than in the matched ones. This indicates that our dataset truly does capture region-specific phenomena.

While human evaluation is the best way to be sure of model quality, it is often slow and expensive. We therefore wanted to find an existing automatic metric that researchers can use to evaluate their models on our benchmark, and considered chrF, BLEU, and BLEURT. Using the translations from a few baseline models that were also evaluated by our MQM raters, we discovered that BLEURT has the best correlation with human judgments, and that the strength of that correlation (0.65 Pearson correlation coefficient, ρ) is comparable to the inter-annotator consistency (0.70 intraclass correlation).

Metric       Pearson’s ρ
chrF       0.48
BLEU       0.58
BLEURT       0.65

Correlation between different automatic metrics and human judgements of translation quality on a subset of FRMT. Values are between -1 and 1; higher is better.
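
For researchers who want to run the same sanity check on another metric, the comparison boils down to a per-segment correlation. The sketch below uses hypothetical score lists in place of real BLEURT and MQM outputs.

    # Minimal sketch: correlating an automatic metric with human MQM ratings.
    # The score lists are hypothetical stand-ins; in practice there would be
    # one value per evaluated translation segment.
    from scipy.stats import pearsonr

    bleurt_scores = [0.71, 0.43, 0.88, 0.56, 0.62, 0.95, 0.34, 0.77]
    mqm_scores = [1.2, 3.5, 0.0, 2.1, 1.8, 0.3, 4.0, 0.9]  # lower is better

    # MQM counts errors (lower is better) while BLEURT rewards quality
    # (higher is better), so a strong anti-correlation is expected; we
    # report its magnitude.
    r, p_value = pearsonr(bleurt_scores, mqm_scores)
    print(f"Pearson's r = {abs(r):.2f} (p = {p_value:.3f})")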

System Performance

Our evaluation covered a handful of recent models capable of few-shot control. Based on human evaluation with MQM, the baseline methods all showed some ability to localize their output for Portuguese, but for Mandarin, they mostly failed to use knowledge of the targeted region to produce superior Mainland or Taiwan translations.

Google’s recent language model, PaLM, was rated best overall among the baselines we evaluated. In order to produce region-targeted translations with PaLM, we feed an instructive prompt into the model and then generate text from it to fill in the blank (see the example shown below).

    Translate the following texts from English to European Portuguese.
    English: [English example 1].
    European Portuguese: [correct translation 1].
    ...
    English: [input].
    European Portuguese: _____
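
As an illustration only (the helper below is our own sketch, not the tooling used in the paper), such a prompt can be assembled programmatically from a handful of labeled exemplars:

    # Hypothetical helper that assembles a few-shot, region-targeted prompt
    # in the format shown above. Exemplars are (English, translation) pairs
    # labeled for the target variety.
    def build_region_prompt(exemplars, source_text, variety="European Portuguese"):
        lines = [f"Translate the following texts from English to {variety}."]
        for english, translation in exemplars:
            lines.append(f"English: {english}")
            lines.append(f"{variety}: {translation}")
        lines.append(f"English: {source_text}")
        lines.append(f"{variety}:")  # the model completes this line
        return "\n".join(lines)

    prompt = build_region_prompt(
        exemplars=[("The bus arrived.", "O autocarro chegou.")],
        source_text="I took the bus to work.")
    print(prompt)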

PaLM obtained strong results using a single example, and had marginal quality gains on Portuguese when increasing to ten examples. This performance is impressive when taking into consideration that PaLM was trained in an unsupervised way. Our results also suggest language models like PaLM may be particularly adept at memorizing region-specific word choices required for fluent translation. However, there is still a significant performance gap between PaLM and human performance. See our paper for more details.

MQM performance across dataset buckets using human and PaLM translations. Thick bars represent the region-matched case, where raters from each region evaluate translations targeted at their own region. Thin, inset bars represent the region-mismatched case, where raters from each region evaluate translations targeted at the other region. Human translations exhibit regional phenomena in all cases. PaLM translations do so for all Portuguese buckets and the Mandarin lexical bucket only.

Conclusion

In the near future, we hope to see a world where language generation systems, especially machine translation, can support all speaker communities. We want to meet users where they are, generating language fluent and appropriate for their locale or region. To that end, we have released the FRMT dataset and benchmark, enabling researchers to easily compare performance for region-aware MT models. Validated via our thorough human-evaluation studies, the language varieties in FRMT have significant differences that outputs from region-aware MT models should reflect. We are excited to see how researchers utilize this benchmark in development of new MT models that better support under-represented language varieties and all speaker communities, leading to improved equitability in natural-language technologies.

Acknowledgements

We gratefully acknowledge our paper co-authors for all their contributions to this project: Timothy Dozat, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and Noah Constant. For helpful discussion and comments on the paper, we thank Jacob Eisenstein, Noah Fiedel, Macduff Hughes and Mingfei Lau. For essential feedback around specific regional language differences, we thank Andre Araujo, Chung-Ching Chang, Andreia Cunha, Filipe Gonçalves, Nuno Guerreiro, Mandy Guo, Luis Miranda, Vitor Rodrigues and Linting Xue. For logistical support in collecting human translations and ratings, we thank the Google Translate team. We thank the professional translators and MQM raters for their role in producing the dataset. We also thank Tom Small for providing the animation in this post.

FriendlyCore: A novel differentially private aggregation framework

Differentially private (DP) machine learning algorithms protect user data by limiting the effect of each data point on an aggregated output with a mathematical guarantee. Intuitively, the guarantee implies that changing a single user’s contribution should not significantly change the output distribution of the DP algorithm.

However, DP algorithms tend to be less accurate than their non-private counterparts because satisfying DP is a worst-case requirement: one has to add noise to “hide” changes in any potential input point, including “unlikely points’’ that have a significant impact on the aggregation. For example, suppose we want to privately estimate the average of a dataset, and we know that a sphere of diameter, Λ, contains all possible data points. The sensitivity of the average to a single point is bounded by Λ, and therefore it suffices to add noise proportional to Λ to each coordinate of the average to ensure DP.

A sphere of diameter Λ containing all possible data points.

Now assume that all the data points are “friendly,” meaning they are close together, and each affects the average by at most 𝑟, which is much smaller than Λ. Still, the traditional way for ensuring DP requires adding noise proportional to Λ to account for a neighboring dataset that contains one additional “unfriendly” point that is unlikely to be sampled.

Two adjacent datasets that differ in a single outlier. A DP algorithm would have to add noise proportional to Λ to each coordinate to hide this outlier.
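
For concreteness, here is a minimal sketch of that traditional baseline, assuming the Laplace mechanism; the per-coordinate calibration is simplified and ignores dimension-dependent factors.

    # Minimal sketch of the baseline DP average described above, using the
    # Laplace mechanism. Replacing one point can move each coordinate of the
    # mean by at most diameter / n, so the noise is calibrated to that scale
    # (a simplified, per-coordinate calibration for illustration).
    import numpy as np

    def dp_mean(points, diameter, epsilon, rng=None):
        """points: (n, d) array assumed to lie in a ball of the given diameter."""
        rng = rng or np.random.default_rng()
        n, d = points.shape
        sensitivity = diameter / n  # per-coordinate sensitivity of the mean
        noise = rng.laplace(scale=sensitivity / epsilon, size=d)
        return points.mean(axis=0) + noise

    rng = np.random.default_rng(0)
    data = rng.normal(size=(800, 10))  # 800 points in 10 dimensions
    print(dp_mean(data, diameter=100.0, epsilon=1.0, rng=rng))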

In “FriendlyCore: Practical Differentially Private Aggregation”, presented at ICML 2022, we introduce a general framework for computing differentially private aggregations. The FriendlyCore framework pre-processes data, extracting a “friendly” subset (the core) and consequently reducing the private aggregation error seen with traditional DP algorithms. The private aggregation step adds less noise since we do not need to account for unfriendly points that negatively impact the aggregation.

In the averaging example, we first apply FriendlyCore to remove outliers, and in the aggregation step, we add noise proportional to 𝑟 (not Λ). The challenge is to make our overall algorithm (outlier removal + aggregation) differentially private. This constrains our outlier removal scheme and stabilizes the algorithm so that two adjacent inputs that differ by a single point (outlier or not) should produce any (friendly) output with similar probabilities.

FriendlyCore Framework

We begin by formalizing when a dataset is considered friendly, which depends on the type of aggregation needed and should capture datasets for which the sensitivity of the aggregate is small. For example, if the aggregate is averaging, the term friendly should capture datasets with a small diameter.

To abstract away the particular application, we define friendliness using a predicate 𝑓 that is positive on points 𝑥 and 𝑦 if they are “close” to each other. For example, in the averaging application 𝑥 and 𝑦 are close if the distance between them is less than 𝑟. We say that a dataset is friendly (for this predicate) if every pair of points 𝑥 and 𝑦 are both close to a third point 𝑧 (not necessarily in the data).

Once we have fixed 𝑓 and defined when a dataset is friendly, two tasks remain. First, we construct the FriendlyCore algorithm that extracts a large friendly subset (the core) of the input stably. FriendlyCore is a filter satisfying two requirements: (1) It has to remove outliers to keep only elements that are close to many others in the core, and (2) for neighboring datasets that differ by a single element, 𝑦, the filter outputs each element except 𝑦 with almost the same probability. Furthermore, the union of the cores extracted from these neighboring datasets is friendly.

The idea underlying FriendlyCore is simple: The probability that we add a point, 𝑥, to the core is a monotonic and stable function of the number of elements close to 𝑥. In particular, if 𝑥 is close to all other points, it’s not considered an outlier and can be kept in the core with probability 1.
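
A deliberately simplified sketch of this filtering idea for the averaging predicate is shown below; the piecewise-linear keep probability is illustrative only, since the paper’s actual weighting is chosen to make the filter provably stable.

    # Simplified illustration of the FriendlyCore filtering idea: a point's
    # chance of entering the core grows monotonically with how many other
    # points are close to it (within distance r).
    import numpy as np

    def keep_probability(num_close, n):
        # 0 if close to fewer than half the points, ramping up to 1 near n.
        return float(np.clip((num_close - n / 2) / (n / 2), 0.0, 1.0))

    def friendly_core(points, r, rng=None):
        rng = rng or np.random.default_rng()
        n = len(points)
        dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        close_counts = (dists <= r).sum(axis=1) - 1  # exclude the point itself
        keep = np.array([rng.random() < keep_probability(c, n) for c in close_counts])
        return points[keep]

    rng = np.random.default_rng(1)
    data = np.vstack([rng.normal(0, 1, size=(200, 2)),   # one "friendly" cluster
                      rng.normal(50, 1, size=(3, 2))])   # a few far-away outliers
    print(len(friendly_core(data, r=6.0, rng=rng)), "of", len(data), "points kept")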

Second, we develop the Friendly DP algorithm that satisfies a weaker notion of privacy by adding less noise to the aggregate. This means that the outcomes of the aggregation are guaranteed to be similar only for neighboring datasets 𝐶 and 𝐶’ such that the union of 𝐶 and 𝐶’ is friendly.

Our main theorem states that if we apply a friendly DP aggregation algorithm to the core produced by a filter with the requirements listed above, then this composition is differentially private in the regular sense.

Clustering and other applications

Other applications of our aggregation method are clustering and learning the covariance matrix of a Gaussian distribution. Consider the use of FriendlyCore to develop a differentially private k-means clustering algorithm. Given a database of points, we partition it into random equal-size smaller subsets and run a good non-private k-means clustering algorithm on each small set. If the original dataset contains k large clusters then each smaller subset will contain a significant fraction of each of these k clusters. It follows that the tuples (ordered sets) of k-centers we get from the non-private algorithm for each small subset are similar. This dataset of tuples is expected to have a large friendly core (for an appropriate definition of closeness).

We use our framework to aggregate the resulting tuples of k-centers (k-tuples). We define two such k-tuples to be close if there is a matching between them such that a center is substantially closer to its mate than to any other center.

In this picture, any pair of the red, blue, and green tuples are close to each other, but none of them is close to the pink tuple. So the pink tuple is removed by our filter and is not in the core.

We then extract the core by our generic sampling scheme and aggregate it using the following steps:

  1. Pick a random k-tuple 𝑇 from the core.
  2. Partition the data by putting each point in a bucket according to its closest center in 𝑇.
  3. Privately average the points in each bucket to get our final k-centers.
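
A rough sketch of these three steps, assuming the core of k-tuples has already been extracted, might look as follows; the per-bucket averaging is simplified (for instance, bucket sizes are treated as public).

    # Rough sketch of the aggregation steps above. `core` is the list of
    # k-tuples kept by the filter (each of shape (k, d)) and `data` is the
    # original point set of shape (n, d).
    import numpy as np

    def dp_kmeans_aggregate(core, data, diameter, epsilon, rng=None):
        rng = rng or np.random.default_rng()
        # 1. Pick a random k-tuple T from the core.
        T = core[rng.integers(len(core))]
        # 2. Put each point in a bucket according to its closest center in T.
        assignments = np.argmin(
            np.linalg.norm(data[:, None, :] - T[None, :, :], axis=-1), axis=1)
        # 3. Privately average the points in each bucket to get the final centers.
        centers = []
        for j in range(len(T)):
            bucket = data[assignments == j]
            noise = rng.laplace(scale=diameter / (max(len(bucket), 1) * epsilon),
                                size=data.shape[1])
            centers.append(bucket.mean(axis=0) + noise if len(bucket) else T[j])
        return np.stack(centers)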

Empirical results

Below are the empirical results of our algorithms based on FriendlyCore. We implemented them in the zero-Concentrated Differential Privacy (zCDP) model, which gives improved accuracy in our setting (with similar privacy guarantees as the more well-known (𝜖, 𝛿)-DP).

Averaging

We tested the mean estimation of 800 samples from a spherical Gaussian with an unknown mean. We compared it to the algorithm CoinPress. In contrast to FriendlyCore, CoinPress requires an upper bound 𝑅 on the norm of the mean. The figures below show the effect on accuracy when increasing 𝑅 or the dimension 𝑑. Our averaging algorithm performs better on large values of these parameters since it is independent of 𝑅 and 𝑑.

Left: Averaging in 𝑑 = 1000, varying 𝑅. Right: Averaging with 𝑅 = √𝑑, varying 𝑑.

Clustering

We tested the performance of our private clustering algorithm for k-means. We compared it to the Chang and Kamath algorithm, which is based on recursive locality-sensitive hashing (LSH-clustering). For each experiment, we performed 30 repetitions and present the medians along with the 0.1 and 0.9 quantiles. In each repetition, we normalize the losses by the loss of k-means++ (where a smaller number is better).

The left figure below compares the k-means results on a uniform mixture of eight separated Gaussians in two dimensions. For small values of 𝑛 (the number of samples from the mixture), FriendlyCore often fails and yields inaccurate results. Yet, increasing 𝑛 increases the success probability of our algorithm (because the generated tuples become closer to each other) and yields very accurate results, while LSH-clustering lags behind.

Left: k-means results in 𝑑 = 2 and k = 8, for varying 𝑛 (number of samples). Right: A graphical illustration of the centers in one of the iterations for 𝑛 = 2×10⁵. Green points are the centers of our algorithm and the red points are the centers of LSH-clustering.

FriendlyCore also performs well on large datasets, even without clear separation into clusters. We used the Fonollosa and Huerta gas sensors dataset, which contains 8M rows, each a 16-dimensional point given by the measurements of 16 sensors at a given point in time. We compared the clustering algorithms for varying k. FriendlyCore performs well except at k = 5, where it fails due to the instability of the non-private algorithm used by our method: there are two different solutions for k = 5 with similar cost, so we do not obtain one set of tuples that are close to each other.

k-means results on gas sensors’ measurements over time, varying k.

Conclusion

FriendlyCore is a general framework for filtering metric data before privately aggregating it. The filtered data is stable and makes the aggregation less sensitive, enabling us to increase its accuracy with DP. Our algorithms outperform private algorithms tailored for averaging and clustering, and we believe this technique can be useful for additional aggregation tasks. Initial results show that it can effectively reduce utility loss when we deploy DP aggregations. To learn more, and see how we apply it for estimating the covariance matrix of a Gaussian distribution, see our paper.

Acknowledgements

This work was led by Eliad Tsfadia in collaboration with Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer, Avinatan Hassidim and Yossi Matias.

Google Research, 2022 & beyond: Robotics

(This is Part 6 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people.

In 2022, we focused on challenges that come with enabling robots to be more helpful to people: 1) allowing robots and humans to communicate more efficiently and naturally; 2) enabling robots to understand and apply common sense knowledge in real-world situations; and 3) scaling the number of low-level skills robots need to effectively perform tasks in unstructured environments.

An undercurrent this past year has been the exploration of how large, generalist models, like PaLM, can work alongside other approaches to surface capabilities allowing robots to learn from a breadth of human knowledge and allowing people to engage with robots more naturally. As we do this, we’re transforming robot learning into a scalable data problem so that we can scale learning of generalized low-level skills, like manipulation. In this blog post, we’ll review key learnings and themes from our explorations in 2022.

Bringing the capabilities of LLMs to robotics

An incredible feature of large language models (LLMs) is their ability to encode descriptions and context into a format that’s understandable by both people and machines. When applied to robotics, LLMs let people task robots more easily — just by asking — with natural language. When combined with vision models and robotics learning approaches, LLMs give robots a way to understand the context of a person’s request and make decisions about what actions should be taken to complete it.

One of the underlying concepts is using LLMs to prompt other pretrained models for information that can build context about what is happening in a scene and make predictions about multimodal tasks. This is similar to the Socratic method in teaching, where a teacher asks students questions to lead them through a rational thought process. In “Socratic Models”, we showed that this approach can achieve state-of-the-art performance in zero-shot image captioning and video-to-text retrieval tasks. It also enables new capabilities, like answering free-form questions about and predicting future activity from video, multimodal assistive dialogue, and, as we’ll discuss next, robot perception and planning.

In “Towards Helpful Robots: Grounding Language in Robotic Affordances”, we partnered with Everyday Robots to ground the PaLM language model in a robotics affordance model to plan long horizon tasks. In previous machine-learned approaches, robots were limited to short, hard-coded commands, like “Pick up the sponge,” because they struggled with reasoning about the steps needed to complete a task — which is even harder when the task is given as an abstract goal like, “Can you help clean up this spill?”

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

For this approach to work, one needs to have both an LLM that can predict the sequence of steps to complete long-horizon tasks and an affordance model representing the skills a robot can actually do in a given situation. In “Extracting Skill-Centric State Abstractions from Value Functions”, we showed that the value function in reinforcement learning (RL) models can be used to build the affordance model — an abstract representation of the actions a robot can perform under different states. This lets us connect long-horizon real-world tasks, like “tidy the living room”, to the short-horizon skills needed to complete them, like correctly picking, placing, and arranging items.
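
As a loose illustration (the function names below are placeholders, not the PaLM-SayCan API), a SayCan-style planner scores each candidate skill by combining the language model’s preference for the skill with the affordance model’s estimate that the skill will succeed in the current state:

    # Loose sketch of SayCan-style skill selection: weigh the language model's
    # score for each skill description by an affordance (value-function) score
    # for the current state. `llm_score` and `affordance_score` are hypothetical
    # stand-ins for the real models.
    def select_next_skill(instruction, history, state, skills,
                          llm_score, affordance_score):
        best_skill, best_score = None, float("-inf")
        for skill in skills:
            usefulness = llm_score(instruction, history, skill)   # "is it useful?"
            feasibility = affordance_score(state, skill)          # "is it possible?"
            if usefulness * feasibility > best_score:
                best_skill, best_score = skill, usefulness * feasibility
        return best_skill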

Having both an LLM and an affordance model doesn’t mean that the robot will actually be able to complete the task successfully. However, with Inner Monologue, we closed the loop on LLM-based task planning with other sources of information, like human feedback or scene understanding, to detect when the robot fails to complete the task correctly. Using a robot from Everyday Robots, we show that LLMs can effectively replan if the current or previous plan steps failed, allowing the robot to recover from failures and complete complex tasks like “Put a coke in the top drawer,” as shown in the video below.


An emergent capability from closing the loop on LLM-based task planning that we saw with Inner Monologue is that the robot can react to changes in the high-level goal mid-task. For example, a person might tell the robot to change its behavior as it is happening, by offering quick corrections or redirecting the robot to another task. This behavior is especially useful to let people interactively control and customize robot tasks when robots are working near people.

While natural language makes it easier for people to specify and modify robot tasks, one of the challenges is being able to react in real time to the full vocabulary people can use to describe tasks that a robot is capable of doing. In “Talking to Robots in Real Time”, we demonstrated a large-scale imitation learning framework for producing real-time, open-vocabulary, language-conditionable robots. With one policy we were able to address over 87,000 unique instructions, with an estimated average success rate of 93.5%. As part of this project, we released Language-Table, the largest available language-annotated robot dataset, which we hope will drive further research focused on real-time language-controllable robots.

Examples of long horizon goals reached under real time human language guidance.

We’re also excited about the potential for LLMs to write code that can control robot actions. Code-writing approaches, like in “Robots That Write Their Own Code”, show promise in increasing the complexity of tasks robots can complete by autonomously generating new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.
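
For an instruction like “move the red block next to the bowl,” the generated policy code might look roughly like the snippet below; the perception and action primitives are hypothetical stand-ins for a robot’s existing APIs rather than functions from the released codebase.

    # Hypothetical example of the kind of policy code a code-writing LLM might
    # generate. `detect_object` and `pick_place` stand in for existing
    # perception and action APIs the model is allowed to call.
    def move_red_block_next_to_bowl(detect_object, pick_place, offset=0.1):
        block_pose = detect_object("red block")          # perception API call
        bowl_pose = detect_object("bowl")
        target = (bowl_pose[0] + offset, bowl_pose[1])   # place just beside the bowl
        pick_place(pick_pose=block_pose, place_pose=target)  # action API call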

Turning robot learning into a scalable data problem

Large language and multimodal models help robots understand the context in which they’re operating, like what’s happening in a scene and what the robot is expected to do. But robots also need low-level physical skills to complete tasks in the physical world, like picking up and precisely placing objects.

While we often take these physical skills for granted, executing them hundreds of times every day without even thinking, they present significant challenges to robots. For example, to pick up an object, the robot needs to perceive and understand the environment, reason about the spatial relation and contact dynamics between its gripper and the object, actuate the high degrees-of-freedom arm precisely, and exert the right amount of force to stably grasp the object without breaking it. The difficulty of learning these low-level skills is known as Moravec’s paradox: reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources.

Inspired by the recent success of LLMs, which shows that the generalization and performance of large Transformer-based models scale with the amount of data, we are taking a data-driven approach, turning the problem of learning low-level physical skills into a scalable data problem. With Robotics Transformer-1 (RT-1), we trained a robot manipulation policy on a large-scale, real-world robotics dataset of 130k episodes that cover 700+ tasks using a fleet of 13 robots from Everyday Robots and showed the same trend for robotics — increasing the scale and diversity of data improves the model’s ability to generalize to new tasks, environments, and objects.

Example PaLM-SayCan-RT1 executions of long-horizon tasks in real kitchens.

Behind both language models and many of our robotics learning approaches, like RT-1, are Transformers, which allow models to make sense of Internet-scale data. Unlike LLMs, robotics is challenged by multimodal representations of constantly changing environments and limited compute. In 2020, we introduced Performers as an approach to make Transformers more computationally efficient, which has implications for many applications beyond robotics. In Performer-MPC, we applied this to introduce a new class of implicit control policies combining the benefits of imitation learning with the robust handling of system constraints from Model Predictive Control (MPC). We show a >40% improvement on the robot reaching its goal and a >65% improvement on social metrics when navigating around humans in comparison to a standard MPC policy. Performer-MPC provides 8 ms latency for the 8.3M parameter model, making on-robot deployment of Transformers practical.

Navigation robot maneuvering through highly constrained spaces using: Regular MPC, Explicit Policy, and Performer-MPC.

In the last year, our team has shown that data-driven approaches are generally applicable on different robotic platforms in diverse environments to learn a wide range of tasks, including mobile manipulation, navigation, locomotion and table tennis. This shows us a clear path forward for learning low-level robot skills: scalable data collection. Unlike video and text data that is abundant on the Internet, robotic data is extremely scarce and hard to acquire. Finding approaches to collect and efficiently use rich datasets representative of real-world interactions is the key for our data-driven approaches.

Simulation is a fast, safe, and easily parallelizable option, but it is difficult to replicate the full environment, especially physics and human-robot interactions, in simulation. In i-Sim2Real, we showed an approach to address the sim-to-real gap and learn to play table tennis with a human opponent by bootstrapping from a simple model of human behavior and alternating between training in simulation and deploying in the real world. In each iteration, both the human behavior model and the policy are refined.

Learning to play table tennis with a human opponent.

While simulation helps, collecting data in the real world is essential for fine-tuning simulation policies or adapting existing policies in new environments. While learning, robots are prone to failures that can damage the robot and its surroundings — especially in the early stages of learning, when they are exploring how to interact with the world. We need to collect training data safely, even while the robot is learning, and enable the robot to autonomously recover from failure. In “Learning Locomotion Skills Safely in the Real World”, we introduced a safe RL framework that switches between a “learner policy” optimized to perform the desired task and a “safe recovery policy” that prevents the robot from unsafe states. In “Legged Robots that Keep on Learning”, we trained a reset policy so the robot can recover from failures, like learning to stand up by itself after falling.

Automatic reset policies enable the robot to continue learning in a lifelong fashion without human supervision.
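
A bare-bones sketch of the learner/recovery switching loop described above could look like this; the policy and safety-check interfaces are hypothetical placeholders for the trained components.

    # Bare-bones sketch of safe data collection: follow the task-learning policy
    # while the state is judged safe, and hand control to the recovery policy
    # otherwise. `learner_policy`, `recovery_policy`, and `is_state_safe` are
    # hypothetical interfaces.
    def safe_rollout(env, learner_policy, recovery_policy, is_state_safe, steps=1000):
        state = env.reset()
        for _ in range(steps):
            policy = learner_policy if is_state_safe(state) else recovery_policy
            state, reward, done, info = env.step(policy(state))
            if done:
                state = env.reset()  # e.g., via an automatic reset policy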

While robot data is scarce, videos of people performing different tasks are abundant. Of course, robots aren’t built like people — so the idea of robotic learning from people raises the problem of transferring learning across different embodiments. In “Robot See, Robot Do”, we developed Cross-Embodiment Inverse Reinforcement Learning to learn new tasks by watching people. Instead of trying to replicate the task exactly as a person would, we learn the high-level task objective, and summarize that knowledge in the form of a reward function. This type of demonstration learning could allow robots to learn skills by watching videos readily available on the internet.

We’re also progressing towards making our learning algorithms more data efficient so that we’re not relying only on scaling data collection. We improved the efficiency of RL approaches by incorporating prior information, including predictive information, adversarial motion priors, and guide policies. Further improvements are gained by utilizing a novel structured dynamical systems architecture and combining RL with trajectory optimization, supported by novel solvers. These types of prior information helped alleviate the exploration challenges, served as good regularizers, and significantly reduced the amount of data required. Furthermore, our team has invested heavily in more data-efficient imitation learning. We showed that a simple imitation learning approach, BC-Z, can enable zero-shot generalization to new tasks that were not seen during training. We also introduced an iterative imitation learning algorithm, GoalsEye, which combined Learning from Play and Goal-Conditioned Behavior Cloning for high-speed and high-precision table tennis games. On the theoretical front, we investigated dynamical-systems stability for characterizing the sample complexity of imitation learning, and the role of capturing failure-and-recovery within demonstration data to better condition offline learning from smaller datasets.

Closing

Advances in large models across the field of AI have spurred a leap in capabilities for robot learning. This past year, we’ve seen the sense of context and sequencing of events captured in LLMs help solve long-horizon planning for robotics and make robots easier for people to interact with and task. We’ve also seen a scalable path to learning robust and generalizable robot behaviors by applying a transformer model architecture to robot learning. We continue to open source data sets, like “Scanned Objects: A Dataset of 3D-Scanned Common Household Items”, and models, like RT-1, in the spirit of participating in the broader research community. We’re excited about building on these research themes in the coming year to enable helpful robots.

Acknowledgements

We would like to thank everyone who supported our research. This includes the entire Robotics at Google team, and collaborators from Everyday Robots and Google Research. We also want to thank our external collaborators, including UC Berkeley, Stanford, Gatech, University of Washington, MIT, CMU and U Penn.


Google Research, 2022 & beyond

This was the sixth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | ML & Computer Systems
Efficient Deep Learning | Algorithmic Advances | Robotics
Health* | General Science & Quantum | Community Engagement
* Articles will be linked as they are released.

Google Research, 2022 & beyond: Algorithmic advances

(This is Part 5 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

Robust algorithm design is the backbone of systems across Google, particularly for our ML and AI models. Hence, developing algorithms with improved efficiency, performance and speed remains a high priority as it empowers services ranging from Search and Ads to Maps and YouTube. Google Research has been at the forefront of this effort, developing many innovations from privacy-safe recommendation systems to scalable solutions for large-scale ML. In 2022, we continued this journey, and advanced the state-of-the-art in several related areas. Here we highlight our progress in a subset of these, including scalability, privacy, market algorithms, and algorithmic foundations.

Scalable algorithms: Graphs, clustering, and optimization

As the need to handle large-scale datasets increases, scalability and reliability of complex algorithms that also exhibit improved explainability, robustness, and speed remain a high priority. We continued our efforts in developing new algorithms for handling large datasets in various areas, including unsupervised and semi-supervised learning, graph-based learning, clustering, and large-scale optimization.

An important component of such systems is to build a similarity graph — a nearest-neighbor graph that represents similarities between objects. For scalability and speed, this graph should be sparse without compromising quality. We proposed a 2-hop spanner technique, called STAR, as an efficient and distributed graph building strategy, and showed how it significantly decreases the number of similarity computations in theory and practice, building much sparser graphs while producing high-quality graph learning or clustering outputs. As an example, for graphs with 10T edges, we demonstrate ~100-fold improvements in pairwise similarity comparisons and significant running time speedups with negligible quality loss. We had previously applied this idea to develop massively parallel algorithms for metric and minimum-size clustering. More broadly in the context of clustering, we developed the first linear-time hierarchical agglomerative clustering (HAC) algorithm, a parallel DBSCAN algorithm, and the first parallel algorithm for HAC with logarithmic depth, which achieves a 50x speedup on 100B-edge graphs. We also designed improved sublinear algorithms for different flavors of clustering problems such as geometric linkage clustering, constant-round correlation clustering, and fully dynamic k-clustering.

Inspired by the success of multi-core processing (e.g., GBBS), we embarked on a mission to develop graph mining algorithms that can handle graphs with 100B edges on a single multi-core machine. The big challenge here is to achieve fast (e.g., sublinear) parallel running time (i.e., depth). Following our previous work for community detection and correlation clustering, we developed an algorithm for HAC, called ParHAC, which has provable polylogarithmic depth and near-linear work and achieves a 50x speedup. As an example, it took ParHAC only ~10 minutes to find an approximate affinity hierarchy over a graph of over 100B edges, and ~3 hours to find the full HAC on a single machine. Following our previous work on distributed HAC, we use these multi-core algorithms as a subroutine within our distributed algorithms in order to handle tera-scale graphs.

We also had a number of interesting results on graph neural networks (GNN) in 2022. We provided a model-based taxonomy that unified many graph learning methods. In addition, we discovered insights for GNN models from their performance across thousands of graphs with varying structure (shown below). We also proposed a new hybrid architecture to overcome the depth requirements of existing GNNs for solving fundamental graph problems, such as shortest paths and the minimum spanning tree.

Relative performance results of three GNN variants (GCN, APPNP, FiLM) across 50,000 distinct node classification datasets in GraphWorld. We find that academic GNN benchmark datasets exist in regions where model rankings do not change. GraphWorld can discover previously unexplored graphs that reveal new insights about GNN architectures.

Furthermore, to bring some of these many advances to the broader community, we had three releases of our flagship modeling library for building graph neural networks in TensorFlow (TF-GNN). Highlights include a model library and model orchestration API to make it easy to compose GNN solutions. Following our NeurIPS’20 workshop on Mining and Learning with Graphs at Scale, we ran a workshop on graph-based learning at ICML’22, and a tutorial for GNNs in TensorFlow at NeurIPS’22.

In “Robust Routing Using Electrical Flows”, we presented a Google Maps solution that efficiently computes alternate paths in road networks that are resistant to failures (e.g., closures, incidents). We demonstrate that it significantly outperforms the state-of-the-art plateau and penalty methods on real-world road networks.

Example of how we construct the electrical circuit corresponding to the road network. The current can be decomposed into three flows, i1, i2 and i3, each of which corresponds to a viable alternate path from Fremont, CA to San Rafael, CA.

On the optimization front, we open-sourced Vizier, our flagship blackbox optimization and hyperparameter tuning library at Google. We also developed new techniques for linear programming (LP) solvers that address scalability limits caused by their reliance on matrix factorizations, which restricts the opportunity for parallelism and distributed approaches. To this end, we open-sourced a primal-dual hybrid gradient (PDHG) solution for LP called primal-dual linear programming (PDLP), a new first-order solver for large-scale LP problems. PDLP has been used to solve real-world problems with as many as 12B non-zeros (and an internal distributed version scaled to 92B non-zeros). PDLP’s effectiveness is due to a combination of theoretical developments and algorithm engineering.

With OSS Vizier, multiple clients each send a “Suggest” request to the Service API, which produces Suggestions for the clients using Pythia policies. The clients evaluate these suggestions and return measurements. All transactions are stored to allow fault-tolerance.
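
To give a flavor of the first-order approach behind PDLP, below is a toy primal-dual hybrid gradient (PDHG) loop for a standard-form LP; the production solver adds restarts, adaptive step sizes, presolve, and other engineering that this sketch omits.

    # Toy primal-dual hybrid gradient (PDHG) iteration for a standard-form LP:
    #   minimize c^T x  subject to  A x = b,  x >= 0.
    # No matrix factorization is needed; the loop only uses matrix-vector products.
    import numpy as np

    def pdhg_lp(c, A, b, iters=20000):
        m, n = A.shape
        x, y = np.zeros(n), np.zeros(m)
        step = 0.9 / np.linalg.norm(A, 2)  # tau = sigma, with tau*sigma*||A||^2 < 1
        for _ in range(iters):
            x_new = np.maximum(0.0, x - step * (c - A.T @ y))  # primal step + projection
            y = y + step * (b - A @ (2 * x_new - x))           # dual step with extrapolation
            x = x_new
        return x

    # Tiny example: minimize x0 + 2*x1 with x0 + x1 = 1, x >= 0 (optimum is [1, 0]).
    print(pdhg_lp(np.array([1.0, 2.0]), np.array([[1.0, 1.0]]), np.array([1.0])))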


Privacy and federated learning

Respecting user privacy while providing high-quality services remains a top priority for all Google systems. Research in this area spans many products and uses principles from differential privacy (DP) and federated learning.

First of all, we have made a variety of algorithmic advances to address the problem of training large neural networks with DP. Building on our earlier work, which enabled us to launch a DP neural network based on the DP-FTRL algorithm, we developed the matrix factorization DP-FTRL approach. This work demonstrates that one can design a mathematical program to optimize over a large set of possible DP mechanisms to find those best suited for specific learning problems. We also establish margin guarantees that are independent of the input feature dimension for DP learning of neural networks and kernel-based methods. We further extend this concept to a broader range of ML tasks, matching baseline performance with 300x less computation. For fine-tuning of large models, we argued that once pre-trained, these models (even with DP) essentially operate over a low-dimensional subspace, hence circumventing the curse of dimensionality that DP imposes.

On the algorithmic front, for estimating the entropy of a high-dimensional distribution, we obtained local DP mechanisms (that work even when as little as one bit per sample is available) and efficient shuffle DP mechanisms. We proposed a more accurate method to simultaneously estimate the top-k most popular items in the database in a private manner, which we employed in the Plume library. Moreover, we showed a near-optimal approximation algorithm for DP clustering in the massively parallel computing (MPC) model, which further improves on our previous work for scalable and distributed settings.

Another exciting research direction is the intersection of privacy and streaming. We obtained a near-optimal approximation-space trade-off for the private frequency moments and a new algorithm for privately counting distinct elements in the sliding window streaming model. We also presented a general hybrid framework for studying adversarial streaming.

Addressing applications at the intersection of security and privacy, we developed new algorithms that are secure, private, and communication-efficient, for measuring cross-publisher reach and frequency. The World Federation of Advertisers has adopted these algorithms as part of their measurement system. In subsequent work, we developed new protocols that are secure and private for computing sparse histograms in the two-server model of DP. These protocols are efficient from both computation and communication points of view, are substantially better than what standard methods would yield, and combine tools and techniques from sketching, cryptography and multiparty computation, and DP.

While we have trained BERT and transformers with DP, understanding training example memorization in large language models (LLMs) is a heuristic way to evaluate their privacy. In particular, we investigated when and why LLMs forget (potentially memorized) training examples during training. Our findings suggest that earlier-seen examples may observe privacy benefits at the expense of examples seen later. We also quantified the degree to which LLMs emit memorized training data.


Market algorithms and causal inference

We also continued our research in improving online marketplaces in 2022. For example, an important recent area in ad auction research is the study of auto-bidding online advertising where the majority of bidding happens via proxy bidders that optimize higher-level objectives on behalf of advertisers. The complex dynamics of users, advertisers, bidders, and ad platforms leads to non-trivial problems in this space. Following our earlier work in analyzing and improving mechanisms under auto-bidding auctions, we continued our research in improving online marketplaces in the context of automation while taking different aspects into consideration, such as user experience and advertiser budgets. Our findings suggest that properly incorporating ML advice and randomization techniques, even in non-truthful auctions, can robustly improve the overall welfare at equilibria among auto-bidding algorithms.

Structure of auto-bidding online ads system.

Beyond auto-bidding systems, we also studied auction improvements in complex environments, e.g., settings where buyers are represented by intermediaries, and with Rich Ads where each ad can be shown in one of several possible variants. We summarize our work in this area in a recent survey. Beyond auctions, we also investigate the use of contracts in multi-agent and adversarial settings.

Online stochastic optimization remains an important part of online advertising systems with application in optimal bidding and budget pacing. Building on our long-term research in online allocation, we recently blogged about dual mirror descent, a new algorithm for online allocation problems that is simple, robust, and flexible. This state-of-the-art algorithm is robust against a wide range of adversarial and stochastic input distributions and can optimize important objectives beyond economic efficiency, such as fairness. We also show that by tailoring dual mirror descent to the special structure of the increasingly popular return-on-spend constraints, we can optimize advertiser value. Dual mirror descent has a wide range of applications and has been used over time to help advertisers obtain more value through better algorithmic decision making.

An overview of the dual mirror descent algorithm.
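
A simplified sketch of dual mirror descent for a single-resource online allocation problem is shown below: each arriving request is accepted when its value exceeds the current dual price of its cost, and the dual variable is then nudged toward the target per-period spend. The real algorithm supports general mirror maps and multiple resources.

    # Simplified dual mirror descent for online allocation with one resource,
    # using the Euclidean (projected-gradient) mirror map.
    def dual_mirror_descent(requests, budget, eta=0.05):
        """requests: list of (value, cost) pairs arriving online."""
        rho = budget / len(requests)   # target spend per period
        mu = 0.0                       # dual price of the resource
        remaining, accepted = budget, []
        for value, cost in requests:
            take = value - mu * cost > 0 and cost <= remaining
            if take:
                remaining -= cost
                accepted.append((value, cost))
            spend = cost if take else 0.0
            mu = max(0.0, mu - eta * (rho - spend))  # mirror-descent dual update
        return accepted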

Furthermore, following our recent work at the interplay of ML, mechanism design and markets, we investigated transformers for asymmetric auction design, designed utility-maximizing strategies for no-regret learning buyers, and developed new learning algorithms to bid or to price in auctions.

An overview of bipartite experimental design to reduce causal interactions between entities.

A critical component of any sophisticated online service is the ability to experimentally measure the response of users and other players to new interventions. A major challenge of estimating these causal effects accurately is handling complex interactions — or interference — between the control and treatment units of these experiments. We combined our graph clustering and causal inference expertise to expand the results of our previous work in this area, with improved results under a flexible response model and a new experimental design that is more effective at reducing these interactions when treatment assignments and metric measurements occur on the same side of a bipartite platform. We also showed how synthetic control and optimization techniques can be combined to design more powerful experiments, especially in small data regimes.


Algorithmic foundations and theory

Finally, we continued our fundamental algorithmic research by tackling long-standing open problems. A surprisingly concise paper affirmatively resolved a four-decade-old open question on whether there is a mechanism that guarantees a constant fraction of the gains-from-trade attainable whenever the buyer’s value weakly exceeds the seller’s cost. Another recent paper obtained the state-of-the-art approximation for the classic and highly studied k-means problem. We also improved the best-known approximation for correlation clustering, breaking the barrier of an approximation factor of 2. In addition, our work on dynamic data structures to solve min-cost and other network flow problems has contributed to a breakthrough line of work in adapting continuous optimization techniques to solve classic discrete optimization problems.


Concluding thoughts

Designing effective algorithms and mechanisms is a critical component of many Google systems that need to handle tera-scale data robustly with critical privacy and safety considerations. Our approach is to develop algorithms with solid theoretical foundations that can be deployed effectively in our product systems. In addition, we are bringing many of these advances to the broader community by open-sourcing some of our most novel developments and by publishing the advanced algorithms behind them. In this post, we covered a subset of algorithmic advances in privacy, market algorithms, scalable algorithms, graph-based learning, and optimization. As we move toward an AI-first Google with further automation, developing robust, scalable, and privacy-safe ML algorithms remains a high priority. We are excited about developing new algorithms and deploying them more broadly.

Acknowledgements

This post summarizes research from a large number of teams and benefited from input from several researchers including Gagan Aggarwal, Amr Ahmed, David Applegate, Santiago Balseiro, Vincent Cohen-addad, Yuan Deng, Alessandro Epasto, Matthew Fahrbach, Badih Ghazi, Sreenivas Gollapudi, Rajesh Jayaram, Ravi Kumar, Sanjiv Kumar, Silvio Lattanzi, Kuba Lacki, Brendan McMahan, Aranyak Mehta, Bryan Perozzi, Daniel Ramage, Ananda Theertha Suresh, Andreas Terzis, Sergei Vassilvitskii, Di Wang, and Song Zuo. Special thanks to Ravi Kumar for his contributions to this post.

Google Research, 2022 & beyond

This was the fifth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | ML & Computer Systems
Efficient Deep Learning | Algorithmic Advances | Robotics*
Health | General Science & Quantum | Community Engagement
* Articles will be linked as they are released.

Amplification at the Quantum limit

The Google Quantum AI team is building quantum computers with superconducting microwave circuits, but, much as in a classical computer, the superconducting processor at the heart of these computers is only part of the story. An entire technology stack of peripheral hardware is required to make the quantum computer work properly. In many cases these parts must be custom designed, requiring extensive research and development to reach the highest levels of performance.

In this post, we highlight one aspect of this supplemental hardware: our superconducting microwave amplifiers. In “Readout of a Quantum Processor with High Dynamic Range Josephson Parametric Amplifiers”, published in Applied Physics Letters, we describe how we increased the maximum output power of our superconducting microwave amplifiers by more than a factor of 100. We discuss how this work can pave the way for the operation of larger quantum processor chips with improved performance.

Why microwave amplifiers?

One of the challenges of operating a superconducting quantum processor is measuring the state of a qubit without disturbing its operation. Fundamentally, this comes down to a microwave engineering problem, where we need to be able to measure the energy inside the qubit resonator without exposing it to noisy or lossy wiring. This can be accomplished by adding an additional microwave resonator to the system that is coupled to the qubit, but far from the qubit’s resonance frequency. The resonator acts as a filter that isolates the qubit from the control lines but also picks up a state-dependent frequency shift from the qubit. Just like in the binary phase shift keying (BPSK) encoding technique, the digital state of the qubit (0 or 1) is translated into a phase for a probe tone (microwave signal) reflecting off of this auxiliary resonator. Measuring the phase of this probe tone allows us to infer the state of the qubit without directly interfacing with the qubit itself.

While this sounds simple, the qubit actually imposes a severe cap on how much power can be used for this probe tone. In normal operation, a qubit should be in the 0 state or the 1 state or some superposition of the two. A measurement pulse should collapse the qubit into one of these two states, but using too much power can push it into a higher excited state and corrupt the computation. A safe measurement power is typically around -125 dBm, which amounts to only a handful of microwave photons interacting with the processor during the measurement. Typically, small signals are measured using microwave amplifiers, which increase the signal level, but also add their own noise. How much noise is acceptable? If the measurement process takes too long, the qubit state can change due to energy loss in the circuit. This means that these very small signals must be measured in just a few hundred nanoseconds with very high (>99%) fidelity. We therefore cannot afford to average the signal over a longer time to reduce the noise. Unfortunately, even the best semiconductor low-noise amplifiers are still almost a factor of 10 too noisy.
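
For a rough sense of scale, the back-of-the-envelope calculation below converts a -125 dBm probe tone into a photon arrival rate, assuming a readout frequency near 6 GHz (an illustrative value for superconducting qubit readout).

    # Back-of-the-envelope check of the "handful of photons" claim.
    h = 6.626e-34                            # Planck constant, J*s
    power_watts = 1e-3 * 10 ** (-125 / 10)   # -125 dBm -> ~3.2e-16 W
    photon_energy = h * 6e9                  # one 6 GHz photon, ~4e-24 J

    rate = power_watts / photon_energy       # photons per second
    print(f"{rate:.1e} photons/s, i.e. ~{rate * 100e-9:.0f} photons per 100 ns")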

The solution is to design our own custom amplifiers based on the same circuit elements as the qubits themselves. These amplifiers typically consist of Josephson junctions to provide a tunable inductance wired into a superconducting resonant circuit. By constructing a resonant circuit out of these elements, you can create a parametric amplifier where amplification is achieved by modulating the tunable inductance at twice the frequency you want to amplify. Additionally, because all of the wiring is made of lossless superconductors, these devices operate near the quantum limit of added noise, where the only noise in the signal is coming from amplification of the zero point quantum voltage fluctuations.

The one downside to these devices is that the Josephson junctions constrain the power of the signals we can measure. If the signal is too large, the drive current can approach the junction critical current and degrade the amplifier performance. Even if this limit was sufficient to measure a single qubit, our goal was to increase efficiency by measuring up to six qubits at a time using the same amplifier. Some groups get around this limit by making traveling wave amplifiers, where the signals are distributed across thousands of junctions. This increases the saturation power, but the amplifiers get very complicated to produce and take up a lot of space on the chip. Our goal was to create an amplifier that could handle as much power as a traveling wave amplifier but with the same simple and compact design we were used to.

Results

The critical current of each Josephson junction limits our amplifier’s power handling. However, increasing this critical current also changes the inductance and, thus, the operating frequency of the amplifier. To avoid these constraints, we replaced a standard 2-junction DC SQUID with a nonlinear tunable inductor made up of two RF-SQUID arrays in parallel, which we call a snake inductor. Each RF-SQUID consists of a Josephson junction and geometric inductances L1 and L2, and each array contains 20 RF-SQUIDs. In this case, each junction of a standard DC SQUID is replaced by one of these RF-SQUID arrays. While the critical current of each RF-SQUID is much higher, we chain them together to keep the inductance and operating frequency the same. While this is a relatively modest increase in device complexity, it enables us to increase the power handling of each amplifier by roughly a factor of 100. It is also fully compatible with existing designs that use impedance matching circuits to provide large measurement bandwidth.

Circuit diagram of our superconducting microwave amplifier. A split bias coil allows both DC and RF modulation of the snake inductor, while a shunt capacitor sets the frequency range. The flow of current is illustrated in the animation where an applied current (blue) on the bias line causes a circulating current (red) in the snake. A tapered impedance transformer lowers the loaded Q of the device. Since the Q is defined as frequency divided by bandwidth, lowering the Q at a constant frequency increases the bandwidth of the amplifier. Example circuit parameters used for a real device are Cs=6.0 pF, L1=2.6 pH, L2=8.0 pH, Lb=30 pH, M=50 pH, Z0=50 ohms, and Zfinal=18 ohms. The device operation is illustrated with a small signal (magenta) reflecting off the input of the amplifier. When the large pump tone (blue) is applied to the bias port, it generates amplified versions of the signal (gold) and a secondary tone known as an idler (also gold).
Microscope image of the nonlinear resonator showing the resonant circuit that consists of a large parallel plate capacitor, nonlinear snake inductor, and a current bias transformer to tune the inductance.

We quantify this performance improvement by measuring the saturation power of the amplifier, or the point at which the gain is compressed by 1 dB. We also measure this power as a function of frequency to see how it scales with amplifier gain and with distance from the center of the amplifier bandwidth. Since the amplifier gain is symmetric about its center frequency, we report this in terms of absolute detuning: the absolute value of the difference between the amplifier's center frequency and the probe tone frequency.
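
Below is a minimal sketch of how a 1-dB compression point can be extracted from a gain-versus-input-power sweep; the gain curve is synthetic placeholder data, not a measurement.

```python
import numpy as np

# Sketch of extracting the 1-dB compression (saturation) point from a measured
# gain-vs-input-power sweep. The data below are synthetic placeholders.

p_in_dbm = np.linspace(-140, -100, 81)                 # probe power sweep
gain_db = 20.0 - np.maximum(0, (p_in_dbm + 120) / 4)   # toy compressing gain curve

small_signal_gain = gain_db[0]                          # gain well below saturation
compressed = np.where(gain_db <= small_signal_gain - 1.0)[0]
p_sat_in = p_in_dbm[compressed[0]]                      # input 1-dB compression point
p_sat_out = p_sat_in + gain_db[compressed[0]]           # corresponding output power

print(f"input P_1dB ~ {p_sat_in:.1f} dBm, output P_1dB ~ {p_sat_out:.1f} dBm")
```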

Input and output saturation power (1-dB gain compression point), calibrated using a superconducting quantum processor vs. absolute detuning from the amplifier center frequency.

Conclusion and future directions

The new microwave amplifiers represent a big step forward for our qubit measurement system. They will allow us to measure more qubits using a single device, and enable techniques that require higher power for each measurement tone. However, there are still quite a few areas we would like to explore. For example, we are currently investigating the application of snake inductors in amplifiers with advanced impedance matching techniques, directional amplifiers, and non-reciprocal devices like microwave circulators.

Acknowledgements

We would like to thank the Quantum AI team for the infrastructure and support that enabled the creation and measurement of our microwave amplifier devices. Thanks to our cohort of talented Google Research Interns that contributed to the future work mentioned above: Andrea Iorio for developing algorithms that automatically tune amplifiers and provide a snapshot of the local parameter space, Ryan Kaufman for measuring a new class of amplifiers using multi-pole impedance matching networks, and Randy Kwende for designing and testing a range of parametric devices based on snake inductors. With their contributions, we are gaining a better understanding of our amplifiers and designing the next generation of parametrically-driven devices.


Unsupervised and semi-supervised anomaly detection with data-centric ML

Anomaly detection (AD), the task of distinguishing anomalies from normal data, plays a vital role in many real-world applications, such as detecting faulty products from vision sensors in manufacturing, fraudulent behavior in financial transactions, or network security threats. Depending on which type of data is available, negative (normal) vs. positive (anomalous), and whether labels are available, the task of AD involves different challenges.

(a) Fully supervised anomaly detection, (b) normal-only anomaly detection, (c, d, e) semi-supervised anomaly detection, (f) unsupervised anomaly detection.

While most previous works were shown to be effective for cases with fully labeled data (either (a) or (b) in the above figure), such settings are less common in practice because labels are particularly tedious to obtain. In most scenarios, users have a limited labeling budget, and sometimes there aren’t even any labeled samples during training. Furthermore, even when labeled data are available, there can be biases in the way samples are labeled, causing distribution differences. Such real-world data challenges limit the achievable accuracy of prior methods in detecting anomalies.

This post covers two of our recent papers on AD, published in Transactions on Machine Learning Research (TMLR), that address the above challenges in unsupervised and semi-supervised settings. Using data-centric approaches, we show state-of-the-art results in both. In “Self-supervised, Refine, Repeat: Improving Unsupervised Anomaly Detection”, we propose a novel unsupervised AD framework that relies on the principles of self-supervised learning without labels and iterative data refinement based on the agreement of one-class classifier (OCC) outputs. In “SPADE: Semi-supervised Anomaly Detection under Distribution Mismatch”, we propose a novel semi-supervised AD framework that yields robust performance even under distribution mismatch with limited labeled samples.

Unsupervised anomaly detection with SRR: Self-supervised, Refine, Repeat

Discovering a decision boundary for a one-class (normal) distribution (i.e., OCC training) is challenging in fully unsupervised settings because the unlabeled training data include two classes (normal and abnormal). The challenge is further exacerbated as the anomaly ratio of the unlabeled data gets higher. To construct a robust OCC with unlabeled data, it is critical to exclude likely-positive (anomalous) samples from the unlabeled data, a process referred to as data refinement. The refined data, with a lower anomaly ratio, are shown to yield superior anomaly detection models.

SRR first refines data from an unlabeled dataset, then iteratively trains deep representations using refined data while improving the refinement of unlabeled data by excluding likely-positive samples. For data refinement, an ensemble of OCCs is employed, each of which is trained on a disjoint subset of unlabeled training data. If there is consensus among all the OCCs in the ensemble, the data that are predicted to be negative (normal) are included in the refined data. Finally, the refined training data are used to train the final OCC to generate the anomaly predictions.
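
Below is a minimal sketch of the refinement step using scikit-learn one-class SVMs as the OCC ensemble. It is a simplification that omits SRR's iterative representation learning, and all model and data choices are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Minimal sketch of SRR-style data refinement (a simplification of the paper):
# train an ensemble of one-class classifiers on disjoint splits of the unlabeled
# data and keep only the samples that *every* member predicts to be normal.

def refine(unlabeled: np.ndarray, n_members: int = 5) -> np.ndarray:
    splits = np.array_split(np.random.permutation(len(unlabeled)), n_members)
    votes = []
    for idx in splits:
        occ = OneClassSVM(nu=0.1, gamma="scale").fit(unlabeled[idx])
        votes.append(occ.predict(unlabeled) == 1)      # +1 -> predicted normal
    consensus = np.logical_and.reduce(votes)           # unanimous "normal" votes
    return unlabeled[consensus]

# Example: mostly-normal Gaussian data with a small cluster of anomalies.
rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(950, 8))
anomalies = rng.normal(6, 1, size=(50, 8))
unlabeled = np.vstack([normal, anomalies])

refined = refine(unlabeled)
final_occ = OneClassSVM(nu=0.05, gamma="scale").fit(refined)   # final detector
print(f"kept {len(refined)} of {len(unlabeled)} samples after refinement")
```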

Training SRR with a data refinement module (OCCs ensemble), representation learner, and final OCC. (Green/red dots represent normal/abnormal samples, respectively).

SRR results

We conduct extensive experiments across various datasets from different domains, including semantic AD (CIFAR-10, Dog-vs-Cat), real-world manufacturing visual AD (MVTec), and real-world tabular AD benchmarks such as detecting medical (Thyroid) or network security (KDD 1999) anomalies. We consider methods with both shallow (e.g., OC-SVM) and deep (e.g., GOAD, CutPaste) models. Since the anomaly ratio of real-world data can vary, we evaluate models at different anomaly ratios of unlabeled training data and show that SRR significantly boosts AD performance. For example, SRR improves average precision (AP) by more than 15.0 points at a 10% anomaly ratio compared to a state-of-the-art one-class deep model on CIFAR-10. Similarly, on MVTec, SRR retains solid performance, dropping less than 1.0 AUC at a 10% anomaly ratio, while the best existing OCC drops more than 6.0 AUC. Lastly, on Thyroid (tabular data), SRR outperforms a state-of-the-art one-class classifier by 22.9 F1 points at a 2.5% anomaly ratio.

Across various domains, SRR (blue line) significantly boosts AD performance with various anomaly ratios in fully unsupervised settings.

SPADE: Semi-supervised Pseudo-labeler Anomaly Detection with Ensembling

Most semi-supervised learning methods (e.g., FixMatch, VIME) assume that the labeled and unlabeled data come from the same distributions. However, in practice, distribution mismatch commonly occurs, with labeled and unlabeled data coming from different distributions. One such case is positive and unlabeled (PU) or negative and unlabeled (NU) settings, where the distributions between labeled (either positive or negative) and unlabeled (both positive and negative) samples are different. Another cause of distribution shift is additional unlabeled data being gathered after labeling. For example, manufacturing processes may keep evolving, causing the corresponding defects to change and the defect types at labeling to differ from the defect types in unlabeled data. In addition, for applications like financial fraud detection and anti-money laundering, new anomalies can appear after the data labeling process, as criminal behavior may adapt. Lastly, labelers are more confident labeling easy samples; thus, easy samples are more likely to end up in the labeled data and difficult samples in the unlabeled data. For example, with some crowd-sourcing–based labeling, only the samples with some consensus on the labels (as a measure of confidence) are included in the labeled set.

Three common real-world scenarios with distribution mismatches (blue box: normal samples, red box: known/easy anomaly samples, yellow box: new/difficult anomaly samples).

Standard semi-supervised learning methods assume that labeled and unlabeled data come from the same distribution, so they are sub-optimal for semi-supervised AD under distribution mismatch. SPADE utilizes an ensemble of OCCs to estimate the pseudo-labels of the unlabeled data; it does this independently of the given positive labeled data, thus reducing the dependency on the labels. This is especially beneficial when there is a distribution mismatch. In addition, SPADE employs partial matching to automatically select the critical hyperparameters for pseudo-labeling without relying on labeled validation data, a crucial capability given limited labeled data.
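
A hedged sketch of the pseudo-labeling idea follows, using isolation forests as stand-in OCCs. The unanimous-agreement threshold is a simple placeholder for the paper's partial-matching-based hyperparameter selection, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hedged sketch of SPADE-style pseudo-labeling (a simplification, not the paper's
# exact recipe): an ensemble of one-class models scores the unlabeled data, and
# samples on which the ensemble confidently agrees receive pseudo-labels; the
# rest stay unlabeled. The agreement threshold is a made-up stand-in for the
# paper's partial-matching-based hyperparameter selection.

def pseudo_label(unlabeled: np.ndarray, n_members: int = 5, agree: float = 1.0):
    votes = np.zeros((n_members, len(unlabeled)), dtype=bool)
    for m in range(n_members):
        occ = IsolationForest(n_estimators=100, random_state=m).fit(unlabeled)
        votes[m] = occ.predict(unlabeled) == -1        # -1 -> flagged anomalous
    frac_anomalous = votes.mean(axis=0)
    pseudo = np.full(len(unlabeled), -1)               # -1 means "still unlabeled"
    pseudo[frac_anomalous >= agree] = 1                # confident anomaly
    pseudo[frac_anomalous <= 1 - agree] = 0            # confident normal
    return pseudo

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (900, 4)), rng.normal(5, 1, (100, 4))])
labels = pseudo_label(data)
print((labels == 1).sum(), "pseudo-anomalies,", (labels == 0).sum(), "pseudo-normals")
```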

Block diagram of SPADE, with a zoomed-in view of the detailed block diagram of the proposed pseudo-labelers.

SPADE results

We conduct extensive experiments to showcase the benefits of SPADE in various real-world settings of semi-supervised learning with distribution mismatch. We consider multiple AD datasets for image (including MVTec) and tabular (including Covertype, Thyroid) data.

SPADE shows state-of-the-art semi-supervised anomaly detection performance across a wide range of scenarios: (i) new types of anomalies, (ii) easy-to-label samples, and (iii) positive-unlabeled examples. As shown below, with new types of anomalies, SPADE outperforms the state-of-the-art alternatives by 5% AUC on average.

AD performance in three different scenarios across various datasets (Covertype, MVTec, Thyroid) in terms of AUC. Some baselines are only applicable to some scenarios. More results with other baselines and datasets can be found in the paper.

We also evaluate SPADE on real-world financial fraud detection datasets: Kaggle credit card fraud and Xente fraud detection. In these datasets, anomalies evolve (i.e., their distributions change over time), so identifying them would normally require continually labeling new anomalies and retraining the AD model. However, labeling is costly and time-consuming. Even without additional labeling, SPADE can improve AD performance by using both the labeled data and the newly gathered unlabeled data.

AD performance under time-varying distributions on two real-world fraud detection datasets with a 10% labeling ratio. More baselines can be found in the paper.

As shown above, SPADE consistently outperforms alternatives on both datasets, taking advantage of the unlabeled data and showing robustness to evolving distributions.

Conclusions

AD has a wide range of use cases with significant importance in real-world applications, from detecting security threats in financial systems to identifying faulty behaviors of manufacturing machines.

One challenging and costly aspect of building an AD system is that anomalies are rare and not easily detectable by people. To this end, we have proposed SRR, a canonical AD framework that enables high-performance AD without manual labels for training. SRR can be flexibly integrated with any OCC, and applied to raw data or to trainable representations.

Semi-supervised AD is another highly important challenge; in many scenarios, the distributions of labeled and unlabeled samples don’t match. SPADE introduces a robust pseudo-labeling mechanism using an ensemble of OCCs and a judicious way of combining supervised and self-supervised learning. In addition, SPADE introduces an efficient approach to picking critical hyperparameters without a validation set, a crucial component for data-efficient AD.

Overall, we demonstrate that SRR and SPADE consistently outperform the alternatives in various scenarios across multiple types of datasets.

Acknowledgements

We gratefully acknowledge the contributions of Kihyuk Sohn, Chun-Liang Li, Chen-Yu Lee, Kyle Ziegler, Nate Yoder, and Tomas Pfister.


Google Research, 2022 & beyond: Algorithms for efficient deep learning

(This is Part 4 in our series of posts covering different topical areas of research at Google. You can find other posts in the series here.)

The explosion in deep learning a decade ago was catapulted in part by the convergence of new algorithms and architectures, a marked increase in data, and access to greater compute. In the last 10 years, AI and ML models have become bigger and more sophisticated — they’re deeper, more complex, with more parameters, and trained on much more data, resulting in some of the most transformative outcomes in the history of machine learning.

As these models increasingly find themselves deployed in production and business applications, their efficiency and costs have gone from a minor consideration to a primary constraint. In response, Google has continued to invest heavily in ML efficiency, taking on the biggest challenges in (a) efficient architectures, (b) training efficiency, (c) data efficiency, and (d) inference efficiency. Beyond efficiency, there are a number of other challenges around factuality, security, privacy, and freshness in these models. Below, we highlight a panoply of works that demonstrate Google Research’s efforts in developing new algorithms to address the above challenges.

Efficient architectures

A fundamental question is “Are there better ways of parameterizing a model to allow for greater efficiency?” In 2022, we focused on new techniques for infusing external knowledge by augmenting models with retrieved context; on mixture-of-experts models; and on making transformers (which lie at the heart of most large ML models) more efficient.

Context-augmented models

In the quest for higher quality and efficiency, neural models can be augmented with external context from large databases or trainable memory. By leveraging retrieved context, a neural network may not have to memorize the huge amount of world knowledge within its internal parameters, leading to better parameter efficiency, interpretability and factuality.

In “Decoupled Context Processing for Context Augmented Language Modeling”, we explored a simple architecture for incorporating external context into language models based on a decoupled encoder-decoder architecture. This led to significant computational savings while giving competitive results on auto-regressive language modeling and open-domain question answering tasks. Meanwhile, pre-trained large language models (LLMs) consume a significant amount of information through self-supervision on big training sets, but it is unclear precisely how the “world knowledge” of such models interacts with the presented context. With knowledge aware fine-tuning (KAFT), we strengthen both the controllability and robustness of LLMs by incorporating counterfactual and irrelevant contexts into standard supervised datasets.

One of the questions in the quest for a modular deep network is how a database of concepts with corresponding computational modules could be designed. We proposed a theoretical architecture that would “remember events” in the form of sketches stored in an external LSH table with pointers to modules that process such sketches.

Another challenge in context-augmented models is fast retrieval of information from a large database on accelerators. We have developed a TPU-based similarity search algorithm that aligns with the performance model of TPUs and gives analytical guarantees on expected recall, achieving peak performance. Search algorithms typically involve a large number of hyperparameters and design choices that make them hard to tune on new tasks. We have proposed a new constrained optimization algorithm for automating hyperparameter tuning. Taking the desired cost or recall as input, the proposed algorithm generates tunings that are empirically very close to the speed-recall Pareto frontier and give leading performance on standard benchmarks.

Mixture-of-experts models

Mixture-of-experts (MoE) models have proven to be an effective means of increasing neural network model capacity without overly increasing their computational cost. The basic idea of MoEs is to construct a network from a number of expert sub-networks, where each input is processed by a suitable subset of experts. Thus, compared to a standard neural network, MoEs invoke only a small portion of the overall model, resulting in high efficiency as shown in language model applications such as GLaM.

The decision of which experts should be active for a given input is determined by a routing function, the design of which is challenging, since one would like to prevent both under- and over-utilization of each expert. In a recent work, we proposed Expert Choice Routing, a new routing mechanism that, instead of assigning each input token to the top-k experts, assigns each expert to the top-k tokens. This automatically ensures load-balancing of experts while also naturally allowing for an input token to be handled by multiple experts.
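
A small sketch of the routing rule follows (shapes and capacity are illustrative): each expert column selects its top-k tokens, so every expert receives exactly the same number of tokens while the number of experts handling any given token can vary.

```python
import numpy as np

# Sketch of expert-choice routing: instead of each token picking its top-k
# experts, each expert picks its top-k tokens from the router scores.
# Shapes and the capacity value below are illustrative.

def expert_choice_route(router_logits: np.ndarray, capacity: int):
    """router_logits: [num_tokens, num_experts] affinity scores."""
    num_tokens, num_experts = router_logits.shape
    # Each expert (column) selects the `capacity` tokens with the highest score.
    chosen = np.argsort(-router_logits, axis=0)[:capacity]    # [capacity, num_experts]
    dispatch = np.zeros((num_tokens, num_experts), dtype=bool)
    for e in range(num_experts):
        dispatch[chosen[:, e], e] = True
    return dispatch    # dispatch[t, e] is True if token t is sent to expert e

rng = np.random.default_rng(0)
logits = rng.normal(size=(16, 4))           # 16 tokens, 4 experts
dispatch = expert_choice_route(logits, capacity=4)
print("tokens per expert:", dispatch.sum(axis=0))   # exactly `capacity` each
print("experts per token:", dispatch.sum(axis=1))   # varies from token to token
```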

Efficient transformers

Transformers are popular sequence-to-sequence models that have shown remarkable success in a range of challenging problems from vision to natural language understanding. A central component of such models is the attention layer, which identifies the similarity between “queries” and “keys”, and uses these to construct a suitable weighted combination of “values”. While effective, attention mechanisms have poor (i.e., quadratic) scaling with sequence length.
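
For reference, here is a plain scaled-dot-product attention implementation that makes the quadratic cost explicit; the [n, n] score matrix is the term that grows quadratically with sequence length. Sizes are arbitrary.

```python
import numpy as np

# Plain scaled-dot-product attention: the score matrix Q @ K^T has shape [n, n],
# so memory and compute grow as n^2 with sequence length n.

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # [n, n]  <- the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # [n, d_v]

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape, "score-matrix entries:", n * n)
```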

As the scale of transformers continues to grow, it is interesting to study if there are any naturally occurring structures or patterns in the learned models that may help us decipher how they work. Towards that, we studied the learned embeddings in intermediate MLP layers, revealing that they are very sparse — e.g., T5-Large models have <1% nonzero entries. This sparsity further suggests that we can potentially reduce FLOPs without affecting model performance.

We recently proposed Treeformer, an alternative to standard attention computation that relies on decision trees. Intuitively, this quickly identifies a small subset of keys that are relevant for a query and only performs the attention operation on this set. Empirically, the Treeformer can lead to a 30x reduction in FLOPs for the attention layer. We also introduced Sequential Attention, a differentiable feature selection method that combines attention with a greedy algorithm. This technique has strong provable guarantees for linear models and scales seamlessly to large embedding models.

Another way to make transformers efficient is by making the softmax computations faster in the attention layer. Building on our previous work on low-rank approximation of the softmax kernel, we proposed a new class of random features that provides the first “positive and bounded” random feature approximation of the softmax kernel and is computationally linear in the sequence length. We also proposed the first approach for incorporating various attention masking mechanisms, such as causal and relative position encoding, in a scalable manner (i.e., sub-quadratic with respect to the input sequence length).
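
As a point of reference, here is a sketch of positive random features for the softmax kernel in the spirit of the earlier low-rank-approximation work this builds on, not the new bounded construction itself. The dimensions and number of features are arbitrary.

```python
import numpy as np

# Sketch of positive random features for the softmax kernel (Performer-style,
# i.e., the earlier low-rank approach referenced above). With phi defined below,
# E[phi(q) . phi(k)] ~ exp(q . k), so attention can be computed in time linear
# in the sequence length by evaluating phi(Q) (phi(K)^T V) instead of QK^T.

def positive_random_features(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    m = W.shape[0]
    sq_norms = (X ** 2).sum(axis=-1, keepdims=True) / 2.0
    return np.exp(X @ W.T - sq_norms) / np.sqrt(m)      # all entries positive

d, m = 16, 2048
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, d)) * 0.3
W = rng.normal(size=(m, d))                              # random projection matrix

approx = positive_random_features(q[None], W) @ positive_random_features(k[None], W).T
print(f"exact exp(q.k) = {np.exp(q @ k):.4f}, approximation = {approx[0, 0]:.4f}")
```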


Training efficiency

Efficient optimization methods are the cornerstone of modern ML applications and are particularly crucial in large-scale settings. In such settings, even first-order adaptive methods like Adam are often expensive, and training stability becomes challenging. In addition, these approaches are often agnostic to the architecture of the neural network, thereby ignoring its rich structure and leading to inefficient training. This motivates new techniques to more efficiently and effectively optimize modern neural network models. We are developing new architecture-aware training techniques, e.g., for training transformer networks, including new scale-invariant transformer networks and novel clipping methods that, when combined with vanilla stochastic gradient descent (SGD), result in faster training. Using this approach, for the first time, we were able to effectively train BERT using simple SGD without the need for adaptivity.

Moreover, with LocoProp we proposed a new method that achieves performance similar to that of a second-order optimizer while using the same computational and memory resources as a first-order optimizer. LocoProp takes a modular view of neural networks by decomposing them into a composition of layers. Each layer is then allowed to have its own loss function as well as output target and weight regularizer. With this setup, after a suitable forward-backward pass, LocoProp performs parallel updates that minimize each layer’s “local loss”. In fact, these updates can be shown to resemble those of higher-order optimizers, both theoretically and empirically. On a deep autoencoder benchmark, LocoProp achieves performance comparable to that of higher-order optimizers while being significantly faster.

One key assumption in optimizers like SGD is that each data point is sampled independently and identically from a distribution. This is unfortunately hard to satisfy in practical settings such as reinforcement learning, where the model (or agent) has to learn from data generated based on its own predictions. We proposed a new algorithmic approach named SGD with reverse experience replay, which finds optimal solutions in several settings, including linear dynamical systems, non-linear dynamical systems, and Q-learning for reinforcement learning. Furthermore, an enhanced version of this method — IER — turns out to be the state of the art and is the most stable experience replay technique on a variety of popular RL benchmarks.


Data efficiency

For many tasks, deep neural networks heavily rely on large datasets. In addition to the storage costs and potential security/privacy concerns that come along with large datasets, training modern deep neural networks on such datasets incurs high computational costs. One promising way to solve this problem is with data subset selection, where the learner aims to find the most informative subset from a large number of training samples to approximate (or even improve upon) training with the entire training set.

We analyzed a subset selection framework designed to work with arbitrary model families in a practical batch setting. In such a setting, a learner can sample examples one at a time, accessing both the context and true label, but in order to limit overhead costs, is only able to update its state (i.e., further train model weights) once a large enough batch of examples is selected. We developed an algorithm, called IWeS, that selects examples by importance sampling where the sampling probability assigned to each example is based on the entropy of models trained on previously selected batches. We provide a theoretical analysis, proving generalization and sampling rate bounds.
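
The sketch below shows entropy-weighted batch selection in the spirit of IWeS, as a simplification rather than the paper's exact algorithm; the predicted probabilities are random placeholders standing in for a model trained on previously selected batches.

```python
import numpy as np

# Simplified sketch of entropy-based subset selection: score each candidate
# example by the predictive entropy of the current model, then sample the next
# batch with probability proportional to that entropy.

def entropy(probs: np.ndarray) -> np.ndarray:
    eps = 1e-12
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def select_batch(pred_probs: np.ndarray, batch_size: int, rng) -> np.ndarray:
    """pred_probs: [num_candidates, num_classes] from the current model."""
    scores = entropy(pred_probs)
    p = scores / scores.sum()                      # sampling distribution
    return rng.choice(len(pred_probs), size=batch_size, replace=False, p=p)

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))               # placeholder model outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
chosen = select_batch(probs, batch_size=64, rng=rng)
print("selected indices:", chosen[:8], "...")
```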

Another concern with training large networks is that they can be highly sensitive to distribution shifts between training data and data seen at deployment time, especially when working with limited amounts of training data that might not cover all deployment-time scenarios. A recent line of work has hypothesized “extreme simplicity bias” as the key issue behind this brittleness of neural networks. Our latest work makes this hypothesis actionable, leading to two new complementary approaches — DAFT and FRR — that, when combined, provide significantly more robust neural networks. In particular, these two approaches use adversarial fine-tuning along with inverse feature predictions to make the learned network robust.


Inference efficiency

Increasing the size of neural networks has proven surprisingly effective in improving their predictive accuracy. However, it is challenging to realize these gains in the real world, as the inference costs of large models may be prohibitively high for deployment. This motivates strategies to improve serving efficiency without sacrificing accuracy. In 2022, we studied different strategies to achieve this, notably those based on knowledge distillation and adaptive computation.

Distillation

Distillation is a simple yet effective method for model compression, which greatly expands the potential applicability of large neural models. Distillation has proved widely effective in a range of practical applications, such as ads recommendation. Most use-cases of distillation involve a direct application of the basic recipe to the given domain, with limited understanding of when and why this ought to work. Our research this year has looked at tailoring distillation to specific settings and formally studying the factors that govern the success of distillation.

On the algorithmic side, by carefully modeling the noise in the teacher labels, we developed a principled approach to reweight the training examples, along with a robust method for selecting a subset of data for the teacher to label. In “Teacher Guided Training”, we presented a new distillation framework: rather than passively using the teacher to annotate a fixed dataset, we actively use the teacher to guide the selection of informative samples to annotate. This makes the distillation process shine in limited-data or long-tail settings.

We also researched new recipes for distillation from a cross-encoder (e.g., BERT) to a factorized dual-encoder, an important setting for the task of scoring the relevance of a [query, document] pair. We studied the reasons for the performance gap between cross- and dual-encoders, noting that this can be the result of generalization rather than capacity limitation in dual-encoders. The careful construction of the loss function for distillation can mitigate this and reduce the gap between cross- and dual-encoder performance. Subsequently, in EmbedDistil, we looked at further improving dual-encoder distillation by matching embeddings from the teacher model. This strategy can also be used to distill from a large to small dual-encoder model, wherein inheriting and freezing the teacher’s document embeddings can prove highly effective.

On the theoretical side, we provided a new perspective on distillation through the lens of supervision complexity, a measure of how well the student can predict the teacher labels. Drawing on neural tangent kernel (NTK) theory, this offers conceptual insights, such as the fact that a capacity gap may affect distillation because such teachers’ labels may appear akin to purely random labels to the student. We further demonstrated that distillation can cause the student to underfit points the teacher model finds “hard” to model. Intuitively, this may help the student focus its limited capacity on those samples that it can reasonably model.

Adaptive computation

While distillation is an effective means of reducing inference cost, it does so uniformly across all samples. Intuitively, however, some “easy” samples may inherently require less compute than the “hard” samples. The goal of adaptive compute is to design mechanisms that enable such sample-dependent computation.

Confident Adaptive Language Modeling introduced a controlled early-exit functionality to Transformer-based text generators such as T5. In this form of adaptive computation, the model dynamically modifies the number of transformer layers that it uses per decoding step. The early-exit gates use a confidence measure with a decision threshold that is calibrated to satisfy statistical performance guarantees. In this way, the model needs to compute the full stack of decoder layers for only the most challenging predictions. Easier predictions only require computing a few decoder layers. In practice, the model uses about a third of the layers for prediction on average, yielding 2–3x speed-ups while preserving the same level of generation quality.
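
The exit logic itself is simple; the toy sketch below applies a calibrated confidence threshold to made-up per-layer confidence scores for an “easy” and a “hard” decoding step.

```python
# Toy sketch of the exit rule: stop at the first decoder layer whose confidence
# clears a calibrated threshold, otherwise fall back to the full stack.
# The per-layer confidence values below are made up for illustration.

def exit_layer(confidences, threshold):
    for i, c in enumerate(confidences, start=1):
        if c >= threshold:
            return i            # exit early after i layers
    return len(confidences)     # use the full decoder stack

easy_step = [0.42, 0.71, 0.93, 0.97, 0.99]   # an "easy" decoding step
hard_step = [0.12, 0.18, 0.25, 0.41, 0.88]   # a "hard" decoding step
print(exit_layer(easy_step, threshold=0.9))  # -> 3 (exits early)
print(exit_layer(hard_step, threshold=0.9))  # -> 5 (needs every layer)
```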

One popular adaptive compute mechanism is a cascade of two or more base models. A key issue in using cascades is deciding whether to simply use the current model’s predictions, or whether to defer prediction to a downstream model. Learning when to defer requires designing a suitable loss function, which can leverage appropriate signals to act as supervision for the deferral decision. We formally studied existing loss functions for this goal, demonstrating that they may underfit the training sample owing to an implicit application of label smoothing. We showed that one can mitigate this with post-hoc training of a deferral rule, which does not require modifying the model internals in any way.
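
Here is a minimal sketch of a two-model cascade with a confidence-threshold deferral rule applied post hoc; the models and threshold are placeholders rather than the specific method studied in the paper.

```python
import numpy as np

# Minimal sketch of a two-model cascade with a post-hoc deferral rule: the small
# model answers when its confidence clears a tuned threshold; otherwise the
# example is deferred to the large model. Models and threshold are placeholders.

def cascade_predict(x, small_model, large_model, threshold: float):
    probs = small_model(x)                      # [num_classes] class probabilities
    if probs.max() >= threshold:
        return int(probs.argmax()), "small"
    return int(large_model(x).argmax()), "deferred to large"

# Toy stand-in models returning class probabilities.
small = lambda x: np.array([0.55, 0.30, 0.15]) if x.sum() > 0 else np.array([0.40, 0.35, 0.25])
large = lambda x: np.array([0.10, 0.80, 0.10])

print(cascade_predict(np.ones(4), small, large, threshold=0.5))    # small model answers
print(cascade_predict(-np.ones(4), small, large, threshold=0.5))   # deferred
```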

For retrieval applications, standard semantic search techniques use a fixed representation for each embedding generated by a large model. That is, irrespective of the downstream task and its associated compute environment or constraints, the representation size and capability are mostly fixed. Matryoshka representation learning (MRL) introduces flexibility to adapt representations to the deployment environment: it forces representations to have a natural ordering within their coordinates, such that for resource-constrained environments we can use only the top few coordinates of the representation, while for richer and precision-critical settings we can use more of them. When combined with standard approximate nearest neighbor search techniques like ScaNN, MRL is able to provide up to 16x lower compute with the same recall and accuracy metrics.
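
A sketch of how such nested embeddings might be used at serving time: truncate to the first m coordinates chosen by the compute budget, re-normalize, and search as usual. The embeddings below are random placeholders standing in for an MRL-trained encoder.

```python
import numpy as np

# Sketch of serving with Matryoshka-style nested embeddings: keep only the first
# m coordinates (m chosen per deployment budget), re-normalize, and run standard
# nearest-neighbor search. Embeddings here are random placeholders.

def truncate(embeddings: np.ndarray, m: int) -> np.ndarray:
    sliced = embeddings[:, :m]
    return sliced / np.linalg.norm(sliced, axis=1, keepdims=True)

rng = np.random.default_rng(0)
corpus = rng.normal(size=(10_000, 768))
query = rng.normal(size=(1, 768))

for m in (64, 256, 768):                     # low-, mid-, full-budget settings
    scores = truncate(query, m) @ truncate(corpus, m).T
    top = np.argsort(-scores[0])[:5]
    print(f"m={m:4d}: top-5 neighbors {top.tolist()}")
```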


Concluding thoughts

Large ML models are showing transformational outcomes in several domains, but efficiency in both training and inference is emerging as a critical need to make these models practical in the real world. Google Research has been investing significantly in making large ML models efficient by developing new foundational techniques. This is an ongoing effort, and over the next several months we will continue to explore core challenges to make ML models even more robust and efficient.

Acknowledgements

The work in efficient deep learning is a collaboration among many researchers from Google Research, including Amr Ahmed, Ehsan Amid, Rohan Anil, Mohammad Hossein Bateni, Gantavya Bhatt, Srinadh Bhojanapalli, Zhifeng Chen, Felix Chern, Gui Citovsky, Andrew Dai, Andy Davis, Zihao Deng, Giulia DeSalvo, Nan Du, Avi Dubey, Matthew Fahrbach, Ruiqi Guo, Blake Hechtman, Yanping Huang, Prateek Jain, Wittawat Jitkrittum, Seungyeon Kim, Ravi Kumar, Aditya Kusupati, James Laudon, Quoc Le, Daliang Li, Zonglin Li, Lovish Madaan, David Majnemer, Aditya Menon, Don Metzler, Vahab Mirrokni, Vaishnavh Nagarajan, Harikrishna Narasimhan, Rina Panigrahy, Srikumar Ramalingam, Ankit Singh Rawat, Sashank Reddi, Aniket Rege, Afshin Rostamizadeh, Tal Schuster, Si Si, Apurv Suman, Phil Sun, Erik Vee, Chong You, Felix Yu, Manzil Zaheer, and Yanqi Zhou.

Google Research, 2022 & beyond

This was the fourth blog post in the “Google Research, 2022 & Beyond” series. Other posts in this series are listed in the table below:

Language Models | Computer Vision | Multimodal Models
Generative Models | Responsible AI | ML & Computer Systems
Efficient Deep Learning | Algorithmic Advances* | Robotics
Health | General Science & Quantum | Community Engagement

* Articles will be linked as they are released.
