Google AI – Page 50

Introducing Google’s Secure AI Framework

June 8, 2023

by Google AI

Today Google released the Secure AI Framework to help collaboratively secure AI technology.

Evaluating speech synthesis in many languages with SQuId

June 7, 2023

by Google AI

Posted by Thibault Sellam, Research Scientist, Google

Previously, we presented the 1,000 languages initiative and the Universal Speech Model with the goal of making speech and language technologies available to billions of users around the world. Part of this commitment involves developing high-quality speech synthesis technologies, which build upon projects such as VDTTS and AudioLM, for users that speak many different languages.

After developing a new model, one must evaluate whether the speech it generates is accurate and natural: the content must be relevant to the task, the pronunciation correct, the tone appropriate, and there should be no acoustic artifacts such as cracks or signal-correlated noise. Such evaluation is a major bottleneck in the development of multilingual speech systems.

The most popular method to evaluate the quality of speech synthesis models is human evaluation: a text-to-speech (TTS) engineer produces a few thousand utterances from the latest model, sends them for human evaluation, and receives results a few days later. This evaluation phase typically involves listening tests, during which dozens of annotators listen to the utterances one after the other to determine how natural they sound. While humans are still unbeaten at detecting whether a piece of text sounds natural, this process can be impractical — especially in the early stages of research projects, when engineers need rapid feedback to test and restrategize their approach. Human evaluation is expensive, time consuming, and may be limited by the availability of raters for the languages of interest.

Another barrier to progress is that different projects and institutions typically use various ratings, platforms and protocols, which makes apples-to-apples comparisons impossible. In this regard, speech synthesis technologies lag behind text generation, where researchers have long complemented human evaluation with automatic metrics such as BLEU or, more recently, BLEURT.

In “SQuId: Measuring Speech Naturalness in Many Languages”, to be presented at ICASSP 2023, we introduce SQuId (Speech Quality Identification), a 600M parameter regression model that describes to what extent a piece of speech sounds natural. SQuId is based on mSLAM (a pre-trained speech-text model developed by Google), fine-tuned on over a million quality ratings across 42 languages and tested in 65. We demonstrate how SQuId can be used to complement human ratings for evaluation of many languages. This is the largest published effort of this type to date.

Evaluating TTS with SQuId

The main hypothesis behind SQuId is that training a regression model on previously collected ratings can provide us with a low-cost method for assessing the quality of a TTS model. The model can therefore be a valuable addition to a TTS researcher’s evaluation toolbox, providing a near-instant, albeit less accurate, alternative to human evaluation.

SQuId takes an utterance as input and an optional locale tag (i.e., a localized variant of a language, such as “Brazilian Portuguese” or “British English”). It returns a score between 1 and 5 that indicates how natural the waveform sounds, with a higher value indicating a more natural waveform.

Internally, the model includes three components: (1) an encoder, (2) a pooling / regression layer, and (3) a fully connected layer. First, the encoder takes a spectrogram as input and embeds it into a smaller 2D matrix that contains 3,200 vectors of size 1,024, where each vector encodes a time step. The pooling / regression layer aggregates the vectors, appends the locale tag, and feeds the result into a fully connected layer that returns a score. Finally, we apply application-specific post-processing that rescales or normalizes the score so it is within the [1, 5] range, which is common for naturalness human ratings. We train the whole model end-to-end with a regression loss.
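The pooling, locale tag, and fully connected steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: mean pooling, a one-hot locale tag, a single linear layer, and a sigmoid rescaling into [1, 5] are all choices made here for the example, and the function and parameter names are hypothetical. The toy inputs are also far smaller than the 3,200 vectors of size 1,024 produced by the real encoder.

```python
import math

def squid_head(frames, locale_id, num_locales, weights, bias):
    """Pool encoder frames, append a locale tag, and regress a score.

    frames: list of per-time-step vectors (stand-in for the encoder output).
    weights, bias: parameters of one fully connected layer (hypothetical).
    """
    dim = len(frames[0])
    # Mean pooling over time (an assumption; the post only says "aggregates").
    pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    # One-hot locale tag (an assumption about how the tag is encoded).
    locale = [1.0 if i == locale_id else 0.0 for i in range(num_locales)]
    features = pooled + locale
    raw = sum(w * x for w, x in zip(weights, features)) + bias
    # Post-processing: a sigmoid rescaled to [1, 5] (one possible choice).
    return 1.0 + 4.0 / (1.0 + math.exp(-raw))

# Two time steps of size 2, locale 1 out of 3, and made-up weights.
score = squid_head([[0.1, 0.2], [0.3, 0.4]], 1, 3, [0.5] * 5, 0.0)
```

Because the rescaling is a sigmoid, the returned score always lies strictly between 1 and 5, matching the range used for human naturalness ratings.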

The encoder is by far the largest and most important piece of the model. We used mSLAM, a pre-existing 600M-parameter Conformer pre-trained on both speech (51 languages) and text (101 languages).

The SQuId model.

To train and evaluate the model, we created the SQuId corpus: a collection of 1.9 million rated utterances across 66 languages, collected for over 2,000 research and product TTS projects. The SQuId corpus covers a diverse array of systems, including concatenative and neural models, for a broad range of use cases, such as driving directions and virtual assistants. Manual inspection reveals that SQuId is exposed to a vast range of TTS errors, such as acoustic artifacts (e.g., cracks and pops), incorrect prosody (e.g., questions without rising intonations in English), text normalization errors (e.g., verbalizing “7/7” as “seven divided by seven” rather than “July seventh”), or pronunciation mistakes (e.g., verbalizing “tough” as “toe”).

A common issue that arises when training multilingual systems is that the training data may not be uniformly available for all the languages of interest. SQuId was no exception. The following figure illustrates the size of the corpus for each locale. We see that the distribution is largely dominated by US English.

Locale distribution in the SQuId dataset.

How can we provide good performance for all languages when there are such variations? Inspired by previous work on machine translation, as well as past work from the speech literature, we decided to train one model for all languages, rather than using separate models for each language. The hypothesis is that if the model is large enough, then cross-locale transfer can occur: the model’s accuracy on each locale improves as a result of jointly training on the others. As our experiments show, cross-locale transfer proves to be a powerful driver of performance.

Experimental results

To understand SQuId’s overall performance, we compare it to a custom Big-SSL-MOS model (described in the paper), a competitive baseline inspired by MOS-SSL, a state-of-the-art TTS evaluation system. Big-SSL-MOS is based on w2v-BERT and was trained on the VoiceMOS’22 Challenge dataset, the most popular dataset at the time of evaluation. We experimented with several variants of the model, and found that SQuId is up to 50.0% more accurate.

SQuId versus state-of-the-art baselines. We measure agreement with human ratings using the Kendall Tau, where a higher value represents better accuracy.
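For readers unfamiliar with the metric, the following is a minimal sketch of Kendall Tau as a rank-agreement measure between human and model scores. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant. Which variant the paper uses is not stated in this post, so treat that choice, and the example ratings, as assumptions.

```python
from itertools import combinations

def kendall_tau(human, predicted):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(human)), 2):
        sign = (human[i] - human[j]) * (predicted[i] - predicted[j])
        if sign > 0:
            concordant += 1   # both rankings order the pair the same way
        elif sign < 0:
            discordant += 1   # the rankings disagree on this pair
    pairs = len(human) * (len(human) - 1) // 2
    return (concordant - discordant) / pairs

# Made-up ratings for four utterances.
human = [4.5, 3.0, 2.0, 4.0]
model = [4.2, 3.1, 2.5, 3.0]
tau = kendall_tau(human, model)  # 5 concordant, 1 discordant -> 4/6
```

A value of 1 means the model ranks every pair of utterances in the same order as the human raters, and -1 means it reverses every pair.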

To understand the impact of cross-locale transfer, we run a series of ablation studies. We vary the number of locales introduced in the training set and measure the effect on SQuId’s accuracy. In English, which is already over-represented in the dataset, the effect of adding locales is negligible.

SQuId’s performance on US English, using 1, 8, and 42 locales during fine-tuning.

However, cross-locale transfer is much more effective for most other locales:

SQuId’s performance on four selected locales (Korean, French, Thai, and Tamil), using 1, 8, and 42 locales during fine-tuning. For each locale, we also provide the training set size.

To push transfer to its limit, we held 24 locales out during training and used them for testing exclusively. Thus, we measure to what extent SQuId can deal with languages that it has never seen before. The plot below shows that although the effect is not uniform, cross-locale transfer works.

SQuId’s performance on four “zero-shot” locales; using 1, 8, and 42 locales during fine-tuning.

When does cross-locale transfer operate, and how? We present many more ablations in the paper, and show that while language similarity plays a role (e.g., training on Brazilian Portuguese helps European Portuguese) it is surprisingly far from being the only factor that matters.

Conclusion and future work

We introduce SQuId, a 600M parameter regression model that leverages the SQuId dataset and cross-locale learning to evaluate speech quality and describe how natural it sounds. We demonstrate that SQuId can complement human raters in the evaluation of many languages. Future work includes accuracy improvements, expanding the range of languages covered, and tackling new error types.

Acknowledgements

The author of this post is now part of Google DeepMind. Many thanks to all authors of the paper: Ankur Bapna, Joshua Camp, Diana Mackinnon, Ankur P. Parikh, and Jason Riesa.

Bard is getting better at logic and reasoning

June 7, 2023

by Google AI

Two improvements are launching in Bard to help you get more accurate responses and export to Google Sheets.

Visual captions: Using large language models to augment video conferences with dynamic visuals

June 6, 2023

by Google AI

Posted by Ruofei Du, Research Scientist, and Alex Olwal, Senior Staff Research Scientist, Google Augmented Reality

Recent advances in video conferencing have significantly improved remote video communication through features like live captioning and noise cancellation. However, there are various situations where dynamic visual augmentation would be useful to better convey complex and nuanced information. For example, when discussing what to order at a Japanese restaurant, your friends could share visuals that would help you feel more confident about ordering the “Sukiyaki”. Or when talking about your recent family trip to San Francisco, you may want to show a photo from your personal album.

In “Visual Captions: Augmenting Verbal Communication With On-the-fly Visuals”, presented at ACM CHI 2023, we introduce a system that uses verbal cues to augment synchronous video communication with real-time visuals. We fine-tuned a large language model to proactively suggest relevant visuals in open-vocabulary conversations using a dataset we curated for this purpose. We open sourced Visual Captions as part of the ARChat project, which is designed for rapid prototyping of augmented communication with real-time transcription.

Visual Captions facilitates verbal communication with real-time visuals. The system is even robust against typical mistakes that may often appear in real-time speech-to-text transcription. For example, out of context, the transcription model misunderstood the word “pier” as “pair”, but Visual Captions still recommends images of the Santa Monica Pier.

Design space for augmenting verbal communication with dynamic visuals

We invited 10 internal participants, each with various technical and non-technical backgrounds, including software engineers, researchers, UX designers, visual artists, students, etc., to discuss their particular needs and desires for a potential real-time visual augmentation service. In two sessions, we introduced low-fidelity prototypes of the envisioned system, followed by video demos of the existing text-to-image systems. These discussions informed a design space with eight dimensions for visual augmentation of real-time conversations, labeled below as D1 to D8.

Visual augmentations could be synchronous or asynchronous with the conversation (D1: Temporal), could be used for both expressing and understanding speech content (D2: Subject), and could be applied using a wide range of different visual content, visual types, and visual sources (D3: Visual). Such visual augmentation might vary depending on the scale of the meetings (D4: Scale) and whether a meeting is in co-located or remote settings (D5: Space). These factors also influence whether the visuals should be displayed privately, shared between participants, or public to everyone (D6: Privacy). Participants also identified different ways in which they would like to interact with the system while having conversations (D7: Initiation). For example, people proposed different levels of “proactivity”, which indicates the degree to which users would like the model to take the initiative. Finally, participants envisioned different methods of interaction, for example, using speech or gestures for input (D8: Interaction).

Design space for augmenting verbal communication with dynamic visuals.

Informed by this initial feedback, we designed Visual Captions to focus on generating synchronous visuals of semantically relevant visual content, type, and source. While participants in these initial exploratory sessions were participating in one-to-one remote conversations, deployment of Visual Captions in the wild will often be in one-to-many (e.g., an individual giving a presentation to an audience) and many-to-many scenarios (e.g., a discussion among multiple people in a meeting).

Because the visual that best complements a conversation depends strongly on the context of the discussion, we needed a training set specific to this purpose. So, we collected a dataset of 1595 quadruples of language (1), visual content (2), type (3), and source (4) across a variety of contexts, including daily conversations, lectures, and travel guides. For example, “I would love to see it!” corresponds to visual content of “face smiling”, a visual type of “emoji”, and visual source of “public search”. “Did she tell you about our trip to Mexico?” corresponds to visual content of “a photo from the trip to Mexico”, a visual type of “photo”, and visual source of “personal album”. We publicly released this VC1.5K dataset for the research community.

Visual intent prediction model

To predict what visuals could supplement a conversation, we trained a visual intent prediction model based on a large language model using the VC1.5K dataset. For training, we parsed each visual intent into the format of “<Visual Type> of <Visual Content> from <Visual Source>”.

{"prompt": "<Previous Two Sentences> →",
  "completion":
"<Visual Type 1> of <Visual Content 1> from <Visual Source 1>;
 <Visual Type 2> of <Visual Content 2> from <Visual Source 2>;
  ... 𝑛"}

Using this format, this system can handle open-vocabulary conversations and contextually predict visual content, visual source, and visual type. Anecdotally, we found that it outperforms keyword-based approaches, which fail to handle open-vocabulary examples like “Your aunt Amy will be visiting this Saturday,” and cannot suggest relevant visual types or visual sources.
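To make the “<Visual Type> of <Visual Content> from <Visual Source>” format concrete, here is a small sketch that builds a training example and parses a completion back into its parts. The helper names are hypothetical and this is not the released ARChat code. It assumes intents are separated by semicolons, as in the template above, and that the visual type itself never contains “ of ”.

```python
def format_intent(visual_type, content, source):
    return f"{visual_type} of {content} from {source}"

def build_example(previous_sentences, intents):
    """Build one prompt/completion record from (type, content, source) tuples."""
    completion = "; ".join(format_intent(*intent) for intent in intents)
    return {"prompt": f"{previous_sentences} →", "completion": completion}

def parse_completion(completion):
    """Recover (type, content, source) tuples from a completion string."""
    intents = []
    for part in completion.split(";"):
        part = part.strip()
        if not part:
            continue
        # The type comes before the first " of "; the source follows the
        # last " from ", so content may itself contain the word "from".
        visual_type, rest = part.split(" of ", 1)
        content, source = rest.rsplit(" from ", 1)
        intents.append((visual_type, content, source))
    return intents

example = build_example(
    "Did she tell you about our trip to Mexico?",
    [("photo", "a photo from the trip to Mexico", "personal album")],
)
```

Splitting on the last “ from ” matters for content such as “a photo from the trip to Mexico”, where a naive left-to-right split would cut the content short.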

Examples of visual intent predictions by our model.

We used 1276 (80%) examples from the VC1.5K dataset for fine-tuning the large language model and the remaining 319 (20%) examples as test data. We measured the performance of the fine-tuned model with the token accuracy metric, i.e., the percentage of tokens in a batch that were correctly predicted by the model. During training, our model reached a training token accuracy of 97% and a validation token accuracy of 87%.
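The split sizes and the token accuracy metric are easy to reproduce. The sketch below assumes predictions and references are already aligned token by token and uses whitespace tokens purely for illustration; the actual tokenizer is not described in this post.

```python
def token_accuracy(predicted, reference):
    """Fraction of positions where the predicted token equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must be aligned token by token")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# 80/20 split of the 1595 examples in VC1.5K.
total = 1595
train_size = round(total * 0.8)   # 1276
test_size = total - train_size    # 319

# Made-up prediction with one wrong token out of seven.
reference = "emoji of face smiling from public search".split()
predicted = "emoji of face laughing from public search".split()
accuracy = token_accuracy(predicted, reference)  # 6 of 7 tokens match
```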

Performance

To evaluate the utility of the trained Visual Captions model, we invited 89 participants to perform 846 tasks. They were asked to provide feedback on a scale of “1 — Strongly Disagree” to “7 — Strongly Agree” for six qualitative statements. Most participants preferred to have the visual during a conversation (Q1, 83% ≥ 5–Somewhat Agree). Moreover, they considered the displayed visuals to be useful and informative (Q2, 82% ≥ 5–Somewhat Agree), high-quality (Q3, 82% ≥ 5–Somewhat Agree), and relevant to the original speech (Q4, 84% ≥ 5–Somewhat Agree). Participants also found the predicted visual type (Q5, 87% ≥ 5–Somewhat Agree) and visual source (Q6, 86% ≥ 5–Somewhat Agree) to be accurate given the context of the corresponding conversation.

Technical evaluation results of the visual prediction model rated by study participants.

With this fine-tuned visual intent prediction model, we developed Visual Captions on the ARChat platform, which can add new interactive widgets directly on the camera streams of video conferencing platforms, such as Google Meet. As shown in the system workflow below, Visual Captions automatically captures the user’s speech, retrieves the last sentences, feeds them into the visual intent prediction model every 100 ms, retrieves relevant visuals, and then suggests visuals in real time.

System workflow of Visual Captions.

Visual Captions provides three levels of proactivity when suggesting visuals:

Auto-display (high-proactivity): The system autonomously searches and displays visuals publicly to all meeting participants. No user interaction required.
Auto-suggest (medium-proactivity): The suggested visuals are shown in a private scrolling view. A user then clicks a visual to display it publicly. In this mode, the system is proactively recommending visuals, but the user decides when and what to display.
On-demand-suggest (low-proactivity): The system will only suggest visuals if a user presses the spacebar.
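One way to picture the three proactivity levels is as a small dispatch function that decides which visuals stay in the private view and which are shown to everyone. This is a hypothetical sketch, not the ARChat implementation. In particular, it assumes that on-demand-suggest behaves like auto-suggest once the spacebar has been pressed, which the description above does not spell out.

```python
from enum import Enum

class Proactivity(Enum):
    AUTO_DISPLAY = "high"
    AUTO_SUGGEST = "medium"
    ON_DEMAND_SUGGEST = "low"

def handle_suggestions(mode, visuals, spacebar_pressed=False, clicked=None):
    """Return (privately_suggested, publicly_displayed) lists of visuals."""
    if mode is Proactivity.AUTO_DISPLAY:
        # Everything is displayed publicly with no user interaction.
        return [], list(visuals)
    if mode is Proactivity.ON_DEMAND_SUGGEST and not spacebar_pressed:
        # Nothing is suggested until the user asks for it.
        return [], []
    # Suggestions appear privately; only a clicked visual becomes public.
    displayed = [clicked] if clicked in visuals else []
    return list(visuals), displayed
```

For example, `handle_suggestions(Proactivity.AUTO_SUGGEST, ["pier.jpg"], clicked="pier.jpg")` keeps the suggestion in the private view and also displays the clicked image publicly.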

Quantitative and qualitative evaluation: User studies

We evaluated Visual Captions in both a controlled lab study (n = 26) and in-the-wild deployment studies (n = 10). Participants found that real-time visuals facilitated live conversations by helping explain unfamiliar concepts, resolve language ambiguities, and make conversations more engaging. Participants also reported different preferences for interacting with the system in-situ, and that varying levels of proactivity were preferred in different social scenarios.

Participants’ Task Load Index and Likert scale ratings (from 1 – Strongly Disagree to 7 – Strongly Agree) of four conversations without Visual Captions (“No VC”) and the three Visual Captions modes: auto-display, auto-suggest, and on-demand suggest.

Conclusions and future directions

This work proposes a system for real-time visual augmentation of verbal communication, called Visual Captions, that was trained using a dataset of 1595 visual intents collected from 246 participants, covering 15 topic categories. We publicly release the training dataset, VC1.5K, to the research community to support further research in this space. We have also deployed Visual Captions in ARChat, which facilitates video conferences in Google Meet by transcribing meetings and augmenting the camera video streams.

Visual Captions represents a significant step towards enhancing verbal communication with on-the-fly visuals. By understanding the importance of visual cues in everyday conversations, we can create more effective communication tools and improve how people connect.

Acknowledgements

This work is a collaboration across multiple teams at Google. Key contributors to the project include Xingyu “Bruce” Liu, Vladimir Kirilyuk, Xiuxiu Yuan, Peggy Chi, Alex Olwal, and Ruofei Du.

We would like to extend our thanks to those on the ARChat team who provided assistance, including Jason Mayes, Max Spear, Na Li, Jun Zhang, Jing Jin, Yuan Ren, Adarsh Kowdle, Ping Yu, Darcy Philippon, and Ezgi Oztelcan. We would also like to thank the many people with whom we’ve had insightful discussions and those who provided feedback on the manuscript, including Eric Turner, Yinda Zhang, Feitong Tan, Danhang Tang, and Shahram Izadi. We would also like to thank our CHI reviewers for their insightful feedback.

AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR

June 2, 2023

by Google AI

Posted by Arsha Nagrani and Paul Hongsuck Seo, Research Scientists, Google Research

Automatic speech recognition (ASR) is a well-established technology that is widely adopted for various applications such as conference calls, streamed video transcription and voice commands. While the challenges for this technology are centered around noisy audio inputs, the visual stream in multimodal videos (e.g., TV, online edited videos) can provide strong cues for improving the robustness of ASR systems — this is called audiovisual ASR (AV-ASR).

Although lip motion can provide strong signals for speech recognition and is the most common area of focus for AV-ASR, the mouth is often not directly visible in videos in the wild (e.g., due to egocentric viewpoints, face coverings, and low resolution). An emerging area of research is therefore unconstrained AV-ASR (e.g., AVATAR), which investigates the contribution of entire visual frames, and not just the mouth region.

Building audiovisual datasets for training AV-ASR models, however, is challenging. Datasets such as How2 and VisSpeech have been created from instructional videos online, but they are small in size. In contrast, the models themselves are typically large and consist of both visual and audio encoders, and so they tend to overfit on these small datasets. Nonetheless, there have been a number of recently released large-scale audio-only models that are heavily optimized via large-scale training on massive audio-only data obtained from audio books, such as LibriLight and LibriSpeech. These models contain billions of parameters, are readily available, and show strong generalization across domains.

With the above challenges in mind, in “AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR”, we present a simple method for augmenting existing large-scale audio-only models with visual information, at the same time performing lightweight domain adaptation. AVFormer injects visual embeddings into a frozen ASR model (similar to how Flamingo injects visual information into large language models for vision-text tasks) using lightweight trainable adaptors that can be trained on a small amount of weakly labeled video data with minimum additional training time and parameters. We also introduce a simple curriculum scheme during training, which we show is crucial to enable the model to jointly process audio and visual information effectively. The resulting AVFormer model achieves state-of-the-art zero-shot performance on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (i.e., LibriSpeech).

Unconstrained audiovisual speech recognition. We inject vision into a frozen speech model (BEST-RQ, in grey) for zero-shot audiovisual ASR via lightweight modules to create a parameter- and data-efficient model called AVFormer (blue). The visual context can provide helpful clues for robust speech recognition especially when the audio signal is noisy (the visual loaf of bread helps correct the audio-only mistake “clove” to “loaf” in the generated transcript).

Injecting vision using lightweight modules

Our goal is to add visual understanding capabilities to an existing audio-only ASR model while maintaining its generalization performance to various domains (both AV and audio-only domains).

To achieve this, we augment an existing state-of-the-art ASR model (BEST-RQ) with the following two components: (i) a linear visual projector and (ii) lightweight adapters. The former projects visual features into the audio token embedding space. This process allows the model to properly connect separately pre-trained visual feature and audio input token representations. The latter then minimally modifies the model to add understanding of multimodal inputs from videos. We then train these additional modules on unlabeled web videos from the HowTo100M dataset, along with the outputs of an ASR model as pseudo ground truth, while keeping the rest of the BEST-RQ model frozen. Such lightweight modules enable data-efficiency and strong generalization of performance.
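The two added components can be illustrated with plain Python lists. This is a rough sketch under assumptions not confirmed by the post: the adapter uses a ReLU bottleneck with a residual connection, the projected visual tokens are prepended to the audio tokens, and all names and dimensions are made up.

```python
def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def project_visual(visual_feature, projection):
    """Linear projector: map a visual feature into the audio token space."""
    return matvec(projection, visual_feature)

def bottleneck_adapter(token, down, up):
    """Residual adapter: token + up(relu(down(token)))."""
    hidden = [max(0.0, h) for h in matvec(down, token)]
    return [t + u for t, u in zip(token, matvec(up, hidden))]

def build_input(visual_features, audio_tokens, projection):
    """Prepend projected visual tokens to the audio token sequence."""
    return [project_visual(v, projection) for v in visual_features] + audio_tokens

# Visual features of size 3 are projected to audio tokens of size 2.
projection = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sequence = build_input([[0.5, -1.0, 2.0]], [[1.0, 2.0]], projection)
adapted = bottleneck_adapter([1.0, 2.0], down=[[1.0, 1.0]], up=[[1.0], [0.0]])
```

With the up-projection initialised to zeros, such an adapter leaves its input unchanged, so the frozen model's behaviour is preserved at the start of training.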

We evaluated our extended model on AV-ASR benchmarks in a zero-shot setting, where the model is never trained on a manually annotated AV-ASR dataset.

Curriculum learning for vision injection

After the initial evaluation, we discovered empirically that with a naïve single round of joint training, the model struggles to learn both the adapters and the visual projectors in one go. To mitigate this issue, we introduced a two-phase curriculum learning strategy that decouples these two factors — domain adaptation and visual feature integration — and trains the network in a sequential manner. In the first phase, the adapter parameters are optimized without feeding visual tokens at all. Once the adapters are trained, we add the visual tokens and train the visual projection layers alone in the second phase while the trained adapters are kept frozen.

The first stage focuses on audio domain adaptation. By the second phase, the adapters are completely frozen and the visual projector must simply learn to generate visual prompts that project the visual tokens into the audio space. In this way, our curriculum learning strategy allows the model to incorporate visual inputs as well as adapt to new audio domains in AV-ASR benchmarks. We apply each phase just once, as an iterative application of alternating phases leads to performance degradation.
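The two-phase curriculum amounts to a schedule of which parameter groups receive gradient updates and whether visual tokens are fed to the model. A minimal sketch follows, with hypothetical group names; the actual training loop is not shown in this post.

```python
# Hypothetical names for the four parameter groups in the architecture.
PARAMETER_GROUPS = ("asr_backbone", "visual_encoder", "adapters", "visual_projector")

def curriculum_phase(phase):
    """Return the trainable groups and whether visual tokens are fed."""
    if phase == 1:
        # Phase 1: audio domain adaptation, no visual tokens at all.
        return {"trainable": {"adapters"}, "use_visual_tokens": False}
    if phase == 2:
        # Phase 2: adapters are frozen; only the visual projector learns.
        return {"trainable": {"visual_projector"}, "use_visual_tokens": True}
    raise ValueError("the curriculum has exactly two phases")

def frozen_groups(phase):
    return set(PARAMETER_GROUPS) - curriculum_phase(phase)["trainable"]
```

Each phase is run once, in order, which matches the observation above that alternating the phases repeatedly degrades performance.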

Overall architecture and training procedure for AVFormer. The architecture consists of a frozen Conformer encoder-decoder model, and a frozen CLIP encoder (frozen layers shown in gray with a lock symbol), in conjunction with two lightweight trainable modules: (i) a visual projection layer (orange) and (ii) bottleneck adapters (blue), which enable multimodal domain adaptation. We propose a two-phase curriculum learning strategy: the adapters (blue) are first trained without any visual tokens, after which the visual projection layer (orange) is tuned while all the other parts are kept frozen.

The plots below show that without curriculum learning, our AV-ASR model is worse than the audio-only baseline across all datasets, with the gap increasing as more visual tokens are added. In contrast, when the proposed two-phase curriculum is applied, our AV-ASR model performs significantly better than the baseline audio-only model.

Effects of curriculum learning. Red and blue lines are for audiovisual models and are shown on 3 datasets in the zero-shot setting (lower WER % is better). Using the curriculum helps on all 3 datasets (for How2 (a) and Ego4D (c) it is crucial for outperforming audio-only performance). Performance improves up until 4 visual tokens, at which point it saturates.
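For reference, word error rate (WER) is the word-level edit distance between the reference transcript and the hypothesis, divided by the number of reference words. The sketch below uses the standard dynamic-programming edit distance; the example sentence is made up around the “clove” versus “loaf” mistake mentioned earlier, and the function assumes a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)

# One substituted word out of four reference words.
wer = word_error_rate("a loaf of bread", "a clove of bread")  # 0.25
```

Multiplying by 100 gives the WER % reported in the figures.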

Results in zero-shot AV-ASR

We compare AVFormer to BEST-RQ, the audio version of our model, and AVATAR, the state of the art in AV-ASR, for zero-shot performance on the three AV-ASR benchmarks: How2, VisSpeech and Ego4D. AVFormer outperforms AVATAR and BEST-RQ on all, even outperforming both AVATAR and BEST-RQ when they are trained on LibriSpeech and the full set of HowTo100M. This is notable because for BEST-RQ, this involves training 600M parameters, while AVFormer only trains 4M parameters and therefore requires only a small fraction of the training dataset (5% of HowTo100M). Moreover, we also evaluate performance on LibriSpeech, which is audio-only, and AVFormer outperforms both baselines.

Comparison to state-of-the-art methods for zero-shot performance across different AV-ASR datasets. We also show performances on LibriSpeech which is audio-only. Results are reported as WER % (lower is better). AVATAR and BEST-RQ are finetuned end-to-end (all parameters) on HowTo100M whereas AVFormer works effectively even with 5% of the dataset thanks to the small set of finetuned parameters.

Conclusion

We introduce AVFormer, a lightweight method for adapting existing, frozen state-of-the-art ASR models for AV-ASR. Our approach is practical and efficient, and achieves impressive zero-shot performance. As ASR models get larger and larger, tuning the entire parameter set of pre-trained models becomes impractical (even more so for different domains). Our method seamlessly allows both domain transfer and visual input mixing in the same, parameter-efficient model.

Acknowledgements

This research was conducted by Paul Hongsuck Seo, Arsha Nagrani and Cordelia Schmid.

Retrieval-augmented visual-language pre-training

June 1, 2023

by Google AI

Posted by Ziniu Hu, Student Researcher, and Alireza Fathi, Research Scientist, Google Research, Perception Team

Large-scale models, such as T5, GPT-3, PaLM, Flamingo and PaLI, have demonstrated the ability to store substantial amounts of knowledge when scaled to tens of billions of parameters and trained on large text and image datasets. These models achieve state-of-the-art results on downstream tasks, such as image captioning, visual question answering and open vocabulary recognition. Despite such achievements, these models require a massive volume of data for training and end up with a tremendous number of parameters (billions in many cases), resulting in significant computational requirements. Moreover, the data used to train these models can become outdated, requiring re-training every time the world’s knowledge is updated. For example, a model trained just two years ago might yield outdated information about the current president of the United States.

In the fields of natural language processing (RETRO, REALM) and computer vision (KAT), researchers have attempted to address these challenges using retrieval-augmented models. Typically, these models use a backbone that is able to process a single modality at a time, e.g., only text or only images, to encode and retrieve information from a knowledge corpus. However, these retrieval-augmented models are unable to leverage all available modalities in a query and knowledge corpora, and may not find the information that is most helpful for generating the model’s output.

To address these issues, in “REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory”, to appear at CVPR 2023, we introduce a visual-language model that learns to utilize a multi-source multi-modal “memory” to answer knowledge-intensive queries. REVEAL employs neural representation learning to encode and convert diverse knowledge sources into a memory structure consisting of key-value pairs. The keys serve as indices for the memory items, while the corresponding values store pertinent information about those items. During training, REVEAL learns the key embeddings, value tokens, and the ability to retrieve information from this memory to address knowledge-intensive queries. This approach allows the model parameters to focus on reasoning about the query, rather than being dedicated to memorization.

We augment a visual-language model with the ability to retrieve multiple knowledge entries from a diverse set of knowledge sources, which helps generation.

Memory construction from multimodal knowledge corpora

Our approach is similar to REALM in that we precompute key and value embeddings of knowledge items from different sources and index them in a unified knowledge memory, where each knowledge item is encoded into a key-value pair. Each key is a d-dimensional embedding vector, while each value is a sequence of token embeddings representing the knowledge item in more detail. In contrast to previous work, REVEAL leverages a diverse set of multimodal knowledge corpora, including the WikiData knowledge graph, Wikipedia passages and images, web image-text pairs and visual question answering data. Each knowledge item could be text, an image, a combination of both (e.g., pages in Wikipedia) or a relationship or attribute from a knowledge graph (e.g., Barack Obama is 6’ 2” tall). During training, we continuously re-compute the memory key and value embeddings as the model parameters get updated. We update the memory asynchronously every thousand training steps.
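The key-value memory can be pictured as a list of (key, value) pairs searched by similarity to a query embedding. The following toy sketch assumes inner-product scoring and a brute-force search; the post does not give the scoring function or index structure, and strings stand in for the value token embeddings.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve_top_k(query, memory, k):
    """Return the k (score, value) pairs whose keys best match the query."""
    scored = [(dot(query, key), value) for key, value in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Each entry is (key embedding, value token sequence); contents are made up.
memory = [
    ([1.0, 0.0], ["knowledge", "graph", "attribute"]),
    ([0.0, 1.0], ["image", "caption"]),
    ([0.7, 0.7], ["Wikipedia", "passage"]),
]
top = retrieve_top_k([0.9, 0.1], memory, k=2)
```

Here the query is closest to the first key and then the third, so those two values are returned, in that order, for the model to fuse with the query.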

Scaling memory using compression

A naïve solution for encoding a memory value is to keep the whole sequence of tokens for each knowledge item. Then, the model could fuse the input query and the top-k retrieved memory values by concatenating all their tokens together and feeding them into a transformer encoder-decoder pipeline. This approach has two issues: (1) storing hundreds of millions of knowledge items in memory is impractical if each memory value consists of hundreds of tokens and (2) self-attention in the transformer encoder has quadratic complexity with respect to the total number of tokens, which grows linearly with k. Therefore, we propose to use the Perceiver architecture to encode and compress knowledge items. The Perceiver model uses a transformer decoder to compress the full token sequence into a sequence of arbitrary length. This lets us retrieve top-k memory entries for k as large as a hundred.

The following figure illustrates the procedure of constructing the memory key-value pairs. Each knowledge item is processed through a multi-modal visual-language encoder, resulting in a sequence of image and text tokens. The key head then transforms these tokens into a compact embedding vector. The value head (perceiver) condenses these tokens into fewer ones, retaining the pertinent information about the knowledge item within them.

We encode the knowledge entries from different corpora into unified key and value embedding pairs, where the keys are used to index the memory and values contain information about the entries.

Large-scale pre-training on image-text pairs

To train the REVEAL model, we begin with the large-scale corpus, collected from the public Web with three billion image alt-text caption pairs, introduced in LiT. Since the dataset is noisy, we add a filter to remove data points with captions shorter than 50 characters, which yields roughly 1.3 billion image caption pairs. We then take these pairs, combined with the text generation objective used in SimVLM, to train REVEAL. Given an image-text example, we randomly sample a prefix containing the first few tokens of the text. We feed the text prefix and image to the model as input with the objective of generating the rest of the text as output. The training goal is to condition on the prefix and autoregressively generate the remaining text sequence.
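The caption filter and the prefix objective described above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization as a stand-in for the real tokenizer and a uniformly sampled split point; the function names are hypothetical.

```python
import random

MIN_CAPTION_CHARS = 50  # threshold stated in the post


def keep_pair(caption: str) -> bool:
    """Drop image-caption pairs whose caption is shorter than 50 characters."""
    return len(caption) >= MIN_CAPTION_CHARS


def split_prefix(tokens: list[str], rng: random.Random) -> tuple[list[str], list[str]]:
    """Split a caption into (prefix given as input, remainder to generate).

    Assumption: the split point is uniform and the target is never empty.
    """
    if not tokens:
        return [], []
    cut = rng.randint(0, len(tokens) - 1)
    return tokens[:cut], tokens[cut:]


rng = random.Random(0)
caption = "A red double-decker bus parked outside a stone railway station"
if keep_pair(caption):
    tokens = caption.split()  # stand-in for a real tokenizer
    prefix, target = split_prefix(tokens, rng)
    print("input prefix:", prefix)
    print("to generate :", target)
```

The model would receive the image together with `prefix` and be trained to emit `target` one token at a time.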

To train all components of the REVEAL model end-to-end, we need to warm start the model to a good state (setting initial values to model parameters). Otherwise, if we were to start with random weights (cold-start), the retriever would often return irrelevant memory items that would never generate useful training signals. To avoid this cold-start problem, we construct an initial retrieval dataset with pseudo–ground-truth knowledge to give the pre-training a reasonable head start.

We create a modified version of the WIT dataset for this purpose. Each image-caption pair in WIT also comes with a corresponding Wikipedia passage (words surrounding the text). We put together the surrounding passage with the query image and use it as the pseudo ground-truth knowledge that corresponds to the input query. The passage provides rich information about the image and caption, which is useful for initializing the model.

To prevent the model from relying on low-level image features for retrieval, we apply random data augmentation to the input query image. Given this modified dataset that contains pseudo-retrieval ground-truth, we train the query and memory key embeddings to warm start the model.

REVEAL workflow

The overall workflow of REVEAL consists of four primary steps. First, REVEAL encodes a multimodal input into a sequence of token embeddings along with a condensed query embedding. Then, the model translates each multi-source knowledge entry into unified pairs of key and value embeddings, with the key being utilized for memory indexing and the value encompassing the entire information about the entry. Next, REVEAL retrieves the top-k most related knowledge pieces from multiple knowledge sources, returns the pre-processed value embeddings stored in memory, and re-encodes the values. Finally, REVEAL fuses the top-k knowledge pieces through an attentive knowledge fusion layer by injecting the retrieval score (dot product between query and key embeddings) as a prior during attention calculation. This structure is instrumental in enabling the memory, encoder, retriever and the generator to be concurrently trained in an end-to-end fashion.

Overall workflow of REVEAL.
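The retrieval and fusion steps of the workflow can be illustrated with a small pure-Python sketch. It assumes the retrieval score is added to the attention logits before the softmax, which is one plausible way to inject it as a prior; the post does not spell out the exact form, and all function names here are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(query, keys, k):
    """Return (index, score) for the k keys with the largest dot product."""
    scored = [(i, dot(query, key)) for i, key in enumerate(keys)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_with_prior(content_logits, retrieval_scores):
    """Assumed fusion rule: add each retrieval score to its attention logit."""
    return softmax([c + r for c, r in zip(content_logits, retrieval_scores)])


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
top = retrieve_top_k(query, keys, k=2)
scores = [score for _, score in top]
weights = attention_with_prior([0.0, 0.0], scores)
print(top, weights)
```

Because the retrieval score takes part in the attention computation, the generation loss can influence the query and key embeddings, which is consistent with the end-to-end training described above.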

Results

We evaluate REVEAL on knowledge-based visual question answering tasks using the OK-VQA and A-OKVQA datasets. We fine-tune our pre-trained model on the VQA tasks using the same generative objective, where the model takes in an image-question pair as input and generates the text answer as output. We demonstrate that REVEAL achieves better results on the A-OKVQA dataset than earlier attempts that incorporate fixed knowledge or works that utilize large language models (e.g., GPT-3) as an implicit source of knowledge.

Visual question answering results on A-OKVQA. REVEAL achieves higher accuracy in comparison to previous works including ViLBERT, LXMERT, ClipCap, KRISP and GPV-2.

We also evaluate REVEAL on image captioning benchmarks using the MSCOCO and NoCaps datasets. We directly fine-tune REVEAL on the MSCOCO training split via the cross-entropy generative objective. We measure our performance on the MSCOCO test split and NoCaps evaluation set using the CIDEr metric, which is based on the idea that good captions should be similar to reference captions in terms of word choice, grammar, meaning, and content. Our results on MSCOCO caption and NoCaps datasets are shown below.

Image Captioning results on MSCOCO and NoCaps using the CIDEr metric. REVEAL achieves a higher score in comparison to Flamingo, VinVL, SimVLM and CoCa.

Below we show a couple of qualitative examples of how REVEAL retrieves relevant documents to answer visual questions.

REVEAL can use knowledge from different sources to correctly answer the question.

Conclusion

We present an end-to-end retrieval-augmented visual language (REVEAL) model, which contains a knowledge retriever that learns to utilize a diverse set of knowledge sources with different modalities. We train REVEAL on a massive image-text corpus with four diverse knowledge corpora, and achieve state-of-the-art results on knowledge-intensive visual question answering and image captioning tasks. In the future we would like to explore the ability of this model for attribution, and apply it to a broader class of multimodal tasks.

Acknowledgements

This research was conducted by Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross and Alireza Fathi.

Large sequence models for software development activities

May 31, 2023

by Google AI

Posted by Petros Maniatis and Daniel Tarlow, Research Scientists, Google

Software isn’t created in one dramatic step. It improves bit by bit, one little step at a time — editing, running unit tests, fixing build errors, addressing code reviews, editing some more, appeasing linters, and fixing more errors — until finally it becomes good enough to merge into a code repository. Software engineering isn’t an isolated process, but a dialogue among human developers, code reviewers, bug reporters, software architects and tools, such as compilers, unit tests, linters and static analyzers.

Today we describe DIDACT (Dynamic Integrated Developer ACTivity), which is a methodology for training large machine learning (ML) models for software development. The novelty of DIDACT is that it uses the process of software development as the source of training data for the model, rather than just the polished end state of that process, the finished code. By exposing the model to the contexts that developers see as they work, paired with the actions they take in response, the model learns about the dynamics of software development and is more aligned with how developers spend their time. We leverage instrumentation of Google’s software development to scale up the quantity and diversity of developer-activity data beyond previous works. Results are extremely promising along two dimensions: usefulness to professional software developers, and as a potential basis for imbuing ML models with general software development skills.

DIDACT is a multi-task model trained on development activities that include editing, debugging, repair, and code review.

We built and deployed internally three DIDACT tools, Comment Resolution (which we recently announced), Build Repair, and Tip Prediction, each integrated at different stages of the development workflow. All three of these tools received enthusiastic feedback from thousands of internal developers. We see this as the ultimate test of usefulness: do professional developers, who are often experts on the code base and who have carefully honed workflows, leverage the tools to improve their productivity?

Perhaps most excitingly, we demonstrate how DIDACT is a first step towards a general-purpose developer-assistance agent. We show that the trained model can be used in a variety of surprising ways, via prompting with prefixes of developer activities, and by chaining together multiple predictions to roll out longer activity trajectories. We believe DIDACT paves a promising path towards developing agents that can generally assist across the software development process.

A treasure trove of data about the software engineering process

Google’s software engineering toolchains store every operation related to code as a log of interactions among tools and developers, and have done so for decades. In principle, one could use this record to replay in detail the key episodes in the “software engineering video” of how Google’s codebase came to be, step-by-step — one code edit, compilation, comment, variable rename, etc., at a time.

Google code lives in a monorepo, a single repository of code for all tools and systems. A software developer typically experiments with code changes in a local copy-on-write workspace managed by a system called Clients in the Cloud (CitC). When the developer is ready to package a set of code changes together for a specific purpose (e.g., fixing a bug), they create a changelist (CL) in Critique, Google’s code-review system. As with other types of code-review systems, the developer engages in a dialog with a peer reviewer about functionality and style. The developer edits their CL to address reviewer comments as the dialog progresses. Eventually, the reviewer declares “LGTM!” (“looks good to me”), and the CL is merged into the code repository.

Of course, in addition to a dialog with the code reviewer, the developer also maintains a “dialog” of sorts with a plethora of other software engineering tools, such as the compiler, the testing framework, linters, static analyzers, fuzzers, etc.

An illustration of the intricate web of activities involved in developing software: small actions by the developer, interactions with a code reviewer, and invocations of tools such as compilers.

A multi-task model for software engineering

DIDACT utilizes interactions among engineers and tools to power ML models that assist Google developers, by suggesting or enhancing actions developers take — in context — while pursuing their software-engineering tasks. To do that, we have defined a number of tasks about individual developer activities: repairing a broken build, predicting a code-review comment, addressing a code-review comment, renaming a variable, editing a file, etc. We use a common formalism for each activity: it takes some State (a code file), some Intent (annotations specific to the activity, such as code-review comments or compiler errors), and produces an Action (the operation taken to address the task). This Action is like a mini programming language, and can be extended for newly added activities. It covers things like editing, adding comments, renaming variables, marking up code with errors, etc. We call this language DevScript.

The DIDACT model is prompted with a task, code snippets, and annotations related to that task, and produces development actions, e.g., edits or comments.

This state-intent-action formalism enables us to capture many different tasks in a general way. What’s more, DevScript is a concise way to express complex actions, without the need to output the whole state (the original code) as it would be after the action takes place; this makes the model more efficient and more interpretable. For example, a rename might touch a file in dozens of places, but a model can predict a single rename action.
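The state-intent-action formalism can be pictured with a small data structure. The sketch below is purely illustrative: DevScript's actual syntax is not shown in the post, so the `Rename` action, the `Example` record and `apply_action` are hypothetical names chosen for this example.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rename:
    """Hypothetical action: rename one identifier everywhere in the file."""
    old: str
    new: str


@dataclass(frozen=True)
class Example:
    state: str      # the code file before the action
    intent: str     # e.g. a code-review comment or a compiler error
    action: Rename  # the compact operation the model is asked to predict


def apply_action(state: str, action: Rename) -> str:
    """Replay a rename on the state, matching whole identifiers only."""
    pattern = rf"\b{re.escape(action.old)}\b"
    return re.sub(pattern, lambda _match: action.new, state)


example = Example(
    state="def area(w, h):\n    return w * h\n",
    intent="Reviewer: please use a descriptive name instead of 'w'.",
    action=Rename(old="w", new="width"),
)
print(apply_action(example.state, example.action))
```

The point of the design is visible even at this scale: the action names the change once, and replaying it touches every occurrence, so the model does not need to emit the whole edited file.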

An ML peer programmer

DIDACT does a good job on individual assistive tasks. For example, below we show DIDACT doing code clean-up after functionality is mostly done. It looks at the code along with some final comments by the code reviewer (marked with “human” in the animation), and predicts edits to address those comments (rendered as a diff).

Given an initial snippet of code and the comments that a code reviewer attached to that snippet, the Pre-Submit Cleanup task of DIDACT produces edits (insertions and deletions of text) that address those comments.

The multimodal nature of DIDACT also gives rise to some surprising capabilities, reminiscent of behaviors emerging with scale. One such capability is history augmentation, which can be enabled via prompting. Knowing what the developer did recently enables the model to make a better guess about what the developer should do next.

An illustration of history-augmented code completion in action.

One powerful task that exemplifies this capability is history-augmented code completion. In the accompanying figure, the developer adds a new function parameter (1), and moves the cursor into the documentation (2). Conditioned on the history of developer edits and the cursor position, the model completes the line (3) by correctly predicting the docstring entry for the new parameter.

An illustration of edit prediction, over multiple chained iterations.

In an even more powerful history-augmented task, edit prediction, the model can choose where to edit next in a fashion that is historically consistent. If the developer deletes a function parameter (1), the model can use history to correctly predict an update to the docstring (2) that removes the deleted parameter (without the human developer manually placing the cursor there) and to update a statement in the function (3) in a syntactically (and — arguably — semantically) correct way. With history, the model can unambiguously decide how to continue the “editing video” correctly. Without history, the model wouldn’t know whether the missing function parameter is intentional (because the developer is in the process of a longer edit to remove it) or accidental (in which case the model should re-add it to fix the problem).

The model can go even further. For example, we started with a blank file and asked the model to successively predict what edits would come next until it had written a full code file. The astonishing part is that the model developed code in a step-by-step way that would seem natural to a developer: It started by first creating a fully working skeleton with imports, flags, and a basic main function. It then incrementally added new functionality, like reading from a file and writing results, and added functionality to filter out some lines based on a user-provided regular expression, which required changes across the file, like adding new flags.

Conclusion

DIDACT turns Google’s software development process into training demonstrations for ML developer assistants, and uses those demonstrations to train models that construct code in a step-by-step fashion, interactively with tools and code reviewers. These innovations are already powering tools enjoyed by Google developers every day. The DIDACT approach complements the great strides taken by large language models at Google and elsewhere, towards technologies that ease toil, improve productivity, and enhance the quality of work of software engineers.

Acknowledgements

This work is the result of a multi-year collaboration among Google Research, Google Core Systems and Experiences, and DeepMind. We would like to acknowledge our colleagues Jacob Austin, Pascal Lamblin, Pierre-Antoine Manzagol, and Daniel Zheng, who join us as the key drivers of this project. This work could not have happened without the significant and sustained contributions of our partners at Alphabet (Peter Choy, Henryk Michalewski, Subhodeep Moitra, Malgorzata Salawa, Vaibhav Tulsyan, and Manushree Vijayvergiya), as well as the many people who collected data, identified tasks, built products, strategized, evangelized, and helped us execute on the many facets of this agenda (Ankur Agarwal, Paige Bailey, Marc Brockschmidt, Rodrigo Damazio Bovendorp, Satish Chandra, Savinee Dancs, Matt Frazier, Alexander Frömmgen, Nimesh Ghelani, Chris Gorgolewski, Chenjie Gu, Vincent Hellendoorn, Franjo Ivančić, Marko Ivanković, Emily Johnston, Luka Kalinovcic, Lera Kharatyan, Jessica Ko, Markus Kusano, Kathy Nix, Sara Qu, Marc Rasi, Marcus Revaj, Ballie Sandhu, Michael Sloan, Tom Small, Gabriela Surita, Maxim Tabachnyk, David Tattersall, Sara Toth, Kevin Villela, Sara Wiltberger, and Donald Duo Zhao) and our extremely supportive leadership (Martín Abadi, Joelle Barral, Jeff Dean, Madhura Dudhgaonkar, Douglas Eck, Zoubin Ghahramani, Hugo Larochelle, Chandu Thekkath, and Niranjan Tulpule). Thank you!

Foundation models for reasoning on charts

May 26, 2023

by Google AI

Posted by Julian Eisenschlos, Research Software Engineer, Google Research

Visual language is the form of communication that relies on pictorial symbols outside of text to convey information. It is ubiquitous in our digital life in the form of iconography, infographics, tables, plots, and charts, extending to the real world in street signs, comic books, food labels, etc. For that reason, having computers better understand this type of media can help with scientific communication and discovery, accessibility, and data transparency.

While computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all sorts of tasks, such as classification, visual question answering (VQA), captioning, detection and segmentation, have been defined, studied and in some cases advanced to reach human performance. However, visual language has not garnered a similar level of attention, possibly because of the lack of large-scale training sets in this space. But over the last few years, new academic datasets have been created with the goal of evaluating question answering systems on visual language images, like PlotQA, InfographicsVQA, and ChartQA.

Example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.

Existing models built for these tasks relied on integrating optical character recognition (OCR) information and their coordinates into larger pipelines, but the process is error prone, slow, and generalizes poorly. These methods prevailed because existing end-to-end computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. Such models are ill-prepared for the challenges in answering questions on charts, including reading the relative height of bars or the angle of slices in pie charts, understanding axis scales, correctly mapping pictograms with their legend values with colors, sizes and textures, and finally performing numerical operations with the extracted numbers.

In light of these challenges, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for multiple applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the input into images, which the image-to-text model needs to decode for answers. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art in ChartQA by more than 20% and match the best summarization systems that have 1000 times more parameters. Both papers will be presented at ACL 2023.

Chart de-rendering

Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme) and the underlying data table establishes the actual numbers and their groupings. Both the data and code are sent to a compiler/rendering engine to create the final image. To understand a chart, one needs to discover the visual patterns in the image and effectively parse and group them to extract the key information. Reversing the plot rendering process demands all such capabilities and can thus serve as an ideal pre-training task.

A chart created from a table in the Airbus A380 Wikipedia page using random plotting options. The pre-training task for MatCha consists of recovering the source table or the source code from the image.

In practice, it is challenging to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code], we crawl all GitHub IPython notebooks with appropriate licenses and extract blocks with figures. A figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explored two sources. For the first source, synthetic data, we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase to charts. We sampled from and combined several plotting options depending on the column types. In addition, we also add [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs. We directly use the [chart, table] pairs crawled in the ChartQA training set, containing around 20k pairs in total from four websites: Statista, Pew, Our World in Data, and OECD.
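For a [chart, table] pair, the table has to be turned into a text sequence that an image-to-text model can be trained to emit. The sketch below shows one simple way to do that; the pipe-separated layout, the `TITLE` row and the function names are assumptions made for illustration, and the table contents are toy data.

```python
def linearize_table(title, header, rows):
    """Flatten a table into one text target: a title line, a header line, then rows."""
    lines = [f"TITLE | {title}", " | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)


def make_pair(chart_path, title, header, rows):
    """Bundle a rendered chart image with the text the model should decode from it."""
    return {"chart": chart_path, "target": linearize_table(title, header, rows)}


# Toy data; the file name is a placeholder for a chart rendered from this table.
pair = make_pair(
    chart_path="toy_chart.png",
    title="Toy fruit counts",
    header=["Fruit", "Count"],
    rows=[["Apples", 12], ["Pears", 7]],
)
print(pair["target"])
```

A [chart, code] pair would look the same, except that the target is the source of the plotting cell rather than a flattened table.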

Math reasoning

We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP for pre-training. MATH is synthetically created, containing two million training examples per module (type) of questions. DROP is a reading-comprehension–style QA dataset where the input is a paragraph context and a question.

To solve questions in DROP, the model needs to read the paragraph, extract relevant numbers and perform numerical computation. We found both datasets to be complementary. MATH contains a large number of questions across different categories, which helps us identify math operations needed to explicitly inject into the model. DROP’s reading-comprehension format resembles the typical QA format wherein models simultaneously perform information extraction and reasoning. In practice, we render inputs of both datasets into images. The model is trained to decode the answer.

To improve the math reasoning skills of MatCha we incorporate examples from MATH and DROP into the pre-training objective, by rendering the input text as images.

End-to-end results

We use a Pix2Struct model backbone, which is an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks — tasks involving charts and plots for question answering and summarization where no access to the underlying table is possible. MatCha surpasses previous models’ performance by a large margin and also outperforms the previous state of the art, which assumes access to underlying tables.

In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts. The first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, which is a large (~1000 times larger than the other models) image plus text-to-text transformer trained on a diverse set of tasks but with limited capabilities for reading text and other forms of visual language. Finally, we report the Pix2Struct and MatCha model results.

Experimental results on two chart QA benchmarks ChartQA & PlotQA (using relaxed accuracy) and a chart summarization benchmark chart-to-text (using BLEU4). MatCha surpasses the state of the art by a large margin on QA, compared to larger models, and matches these larger models on summarization.

For QA datasets, we use the official relaxed accuracy metric that allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves noticeably improved results compared to baselines for question answering, and comparable results to PaLI in summarization, where large size and extensive long text/captioning generation pre-training are advantageous for this kind of long-form text generation.
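A relaxed accuracy metric of this kind can be sketched as follows. The 5% relative tolerance and the case-insensitive exact match for non-numeric answers are assumptions for illustration; the post only says that small relative errors in numerical outputs are allowed.

```python
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers may deviate by a small relative error; text must match exactly."""
    try:
        predicted_value = float(prediction)
        target_value = float(target)
    except ValueError:
        # Non-numeric answer: fall back to a case-insensitive string comparison.
        return prediction.strip().lower() == target.strip().lower()
    if target_value == 0.0:
        return predicted_value == 0.0
    return abs(predicted_value - target_value) / abs(target_value) <= tolerance


def relaxed_accuracy(predictions, targets) -> float:
    """Fraction of predictions that match their target under relaxed_match."""
    matches = [relaxed_match(p, t) for p, t in zip(predictions, targets)]
    return sum(matches) / len(matches)


print(relaxed_accuracy(["104", "106", "Blue"], ["100", "100", "blue"]))
```

In this toy call, the first and third predictions count as correct and the second falls outside the assumed tolerance.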

Derendering plus large language model chains

While fine-tuned MatCha models are extremely performant for their number of parameters, particularly on extractive tasks, we observed that they could still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or multiple steps). Thus, we also propose a two-step method to tackle this: 1) a model reads a chart, then outputs the underlying table, 2) a large language model (LLM) reads this output and then tries to answer the question solely based on the textual input.

For the first model, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee it could recover all or most of the information in the chart. DePlot is the resulting model. In the second stage, any LLM (such as FlanPaLM or Codex) can be used for the task, and we can rely on the standard methods to increase performance on LLMs, for example chain-of-thought and self-consistency. We also experimented with program-of-thoughts where the model produces executable Python code to offload complex computations.
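The program-of-thoughts idea can be illustrated with a toy pipeline in which the arithmetic is done by executing code rather than by the language model itself. In this sketch the table text stands in for a chart-to-table model's output and `GENERATED_PROGRAM` stands in for code an LLM might write; both are hard-coded assumptions, and the helper names are hypothetical.

```python
def parse_table(text: str) -> list[dict[str, str]]:
    """Parse pipe-separated lines (header first) into a list of row dictionaries."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    return [
        dict(zip(header, (cell.strip() for cell in line.split("|"))))
        for line in lines[1:]
    ]


def run_generated_program(program: str, rows: list[dict[str, str]]):
    """Execute model-written code, which must assign its result to `answer`.

    Limiting the builtins narrows what the code can call, but it is not a
    security sandbox; untrusted code needs real isolation.
    """
    allowed = {"sum": sum, "max": max, "min": min, "float": float, "len": len}
    namespace = {"__builtins__": allowed, "rows": rows}
    exec(program, namespace)
    return namespace["answer"]


TABLE_TEXT = """
Fruit | Count
Apples | 12
Pears | 7
"""

# Stand-in for code an LLM might produce for the question
# "How many more apples than pears are there?"
GENERATED_PROGRAM = """
counts = {row["Fruit"]: float(row["Count"]) for row in rows}
answer = counts["Apples"] - counts["Pears"]
"""

rows = parse_table(TABLE_TEXT)
answer = run_generated_program(GENERATED_PROGRAM, rows)
print(answer)
```

Offloading the subtraction to the interpreter means the result does not depend on the language model carrying out arithmetic correctly, only on it writing the right program.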

An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are input to the LLM and the red boxes contain the answer generated by the LLMs. We highlight some of the key reasoning steps in each answer.

As shown in the example above, the DePlot model in combination with LLMs outperforms fine-tuned models by a significant margin, especially so in the human-sourced portion of ChartQA, where the questions are more natural but demand more difficult reasoning. Furthermore, DePlot+LLM can do so without access to any training data.

We have released the new models and code at our GitHub repo, where you can try them out yourself in Colab. Check out the papers for MatCha and DePlot for more details on the experimental results. We hope that our results can benefit the research community and make the information in charts and plots more accessible to everyone.

Acknowledgements

This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our Language Team as part of Fangyu’s internship project. Nigel Collier from Cambridge also was a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia and William Cohen for their valuable comments and suggestions.

Barkour: Benchmarking animal-level agility with quadruped robots

May 26, 2023

by Google AI

Posted by Ken Caluwaerts and Atil Iscen, Research Scientists, Google

Creating robots that exhibit robust and dynamic locomotion capabilities, similar to animals or humans, has been a long-standing goal in the robotics community. In addition to completing tasks quickly and efficiently, agility allows legged robots to move through complex environments that are otherwise difficult to traverse. Researchers at Google have been pursuing agility for multiple years and across various form factors. Yet, while researchers have enabled robots to hike or jump over some obstacles, there is still no generally accepted benchmark that comprehensively measures robot agility or mobility. In contrast, benchmarks are driving forces behind the development of machine learning, such as ImageNet for computer vision, and OpenAI Gym for reinforcement learning (RL).

In “Barkour: Benchmarking Animal-level Agility with Quadruped Robots”, we introduce the Barkour agility benchmark for quadruped robots, along with a Transformer-based generalist locomotion policy. Inspired by dog agility competitions, a legged robot must sequentially display a variety of skills, including moving in different directions, traversing uneven terrains, and jumping over obstacles within a limited timeframe to successfully complete the benchmark. By providing a diverse and challenging obstacle course, the Barkour benchmark encourages researchers to develop locomotion controllers that move fast in a controllable and versatile way. Furthermore, by tying the performance metric to real dog performance, we provide an intuitive metric to understand the robot performance with respect to their animal counterparts.

We invited a handful of dooglers to try the obstacle course to ensure that our agility objectives were realistic and challenging. Small dogs complete the obstacle course in approximately 10s, whereas our robot’s typical performance hovers around 20s.

Barkour benchmark

The Barkour scoring system uses per-obstacle and overall course target times based on the target speed of small dogs in novice agility competitions (about 1.7 m/s). Barkour scores range from 0 to 1, with 1 corresponding to the robot successfully traversing all the obstacles along the course within the allotted time of approximately 10 seconds, the average time needed for a similar-sized dog to traverse the course. The robot receives penalties for skipping or failing obstacles, or for moving too slowly.
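A score of this shape can be sketched as a starting value of 1 minus penalties, clipped to the range from 0 to 1. The penalty weights below (0.1 per skipped or failed obstacle, 0.01 per second over the allotted time) and the 17 m course length are assumptions chosen for illustration; the post states only the score range, the target speed and the kinds of penalties.

```python
def allotted_time(course_length_m: float, target_speed_mps: float = 1.7) -> float:
    """Target time for a course, derived from the target speed of small dogs."""
    return course_length_m / target_speed_mps


def barkour_score(
    obstacles_cleared: int,
    total_obstacles: int,
    elapsed_s: float,
    allotted_s: float,
    obstacle_penalty: float = 0.1,         # assumed weight per missed obstacle
    overtime_penalty_per_s: float = 0.01,  # assumed weight per second too slow
) -> float:
    """Start from 1.0, subtract penalties, and clip the result to [0, 1]."""
    missed = total_obstacles - obstacles_cleared
    overtime_s = max(0.0, elapsed_s - allotted_s)
    score = 1.0 - obstacle_penalty * missed - overtime_penalty_per_s * overtime_s
    return min(1.0, max(0.0, score))


target_s = allotted_time(17.0)  # hypothetical course length in metres
print(round(target_s, 2))
print(barkour_score(4, 4, elapsed_s=9.0, allotted_s=10.0))
print(barkour_score(3, 4, elapsed_s=20.0, allotted_s=10.0))
```

Finishing faster than the allotted time earns no bonus in this sketch, so a perfect run simply scores 1.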

Our standard course consists of four unique obstacles in a 5m x 5m area. This is a denser and smaller setup than a typical dog competition to allow for easy deployment in a robotics lab. Beginning at the start table, the robot needs to weave through a set of poles, climb an A-frame, clear a 0.5m broad jump and then step onto the end table. We chose this subset of obstacles because they test a diverse set of skills while keeping the setup within a small footprint. As is the case for real dog agility competitions, the Barkour benchmark can be easily adapted to a larger course area and may incorporate a variable number of obstacles and course configurations.

Overview of the Barkour benchmark’s obstacle course setup, which consists of weave poles, an A-frame, a broad jump, and pause tables. The intuitive scoring mechanism, inspired by dog agility competitions, balances speed, agility and performance and can be easily modified to incorporate other types of obstacles or course configurations.

Learning agile locomotion skills

The Barkour benchmark features a diverse set of obstacles and a delayed reward system, which pose a significant challenge when training a single policy that can complete the entire obstacle course. So in order to set a strong performance baseline and demonstrate the effectiveness of the benchmark for robotic agility research, we adopt a student-teacher framework combined with a zero-shot sim-to-real approach. First, we train individual specialist locomotion skills (teacher) for different obstacles using on-policy RL methods. In particular, we leverage recent advances in large-scale parallel simulation to equip the robot with individual skills, including walking, slope climbing, and jumping policies.

Next, we train a single policy (student) that performs all the skills and transitions in between by using a student-teacher framework, based on the specialist skills we previously trained. We use simulation rollouts to create datasets of state-action pairs for each one of the specialist skills. This dataset is then distilled into a single Transformer-based generalist locomotion policy, which can handle various terrains and adjust the robot’s gait based on the perceived environment and the robot’s state.

During deployment, we pair the locomotion transformer policy that is capable of performing multiple skills with a navigation controller that provides velocity commands based on the robot’s position. Our trained policy controls the robot based on the robot’s surroundings represented as an elevation map, velocity commands, and on-board sensory information provided by the robot.

Deployment pipeline for the locomotion transformer architecture. At deployment time, a high-level navigation controller guides the real robot through the obstacle course by sending commands to the locomotion transformer policy.
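As a rough illustration of a navigation controller that turns the robot's position into velocity commands, here is a minimal waypoint follower. The function names, gains, and reach radius are assumptions made for this sketch and are not taken from the Barkour system.

```python
import math

def velocity_command(position, heading, waypoint, max_speed=1.0, turn_gain=2.0):
    """Return (forward_speed, yaw_rate) steering the robot toward a waypoint.

    position and waypoint are (x, y) in metres; heading is in radians.
    max_speed and turn_gain are illustrative values, not tuned ones.
    """
    dx = waypoint[0] - position[0]
    dy = waypoint[1] - position[1]
    distance = math.hypot(dx, dy)
    # Heading error wrapped to [-pi, pi].
    error = math.atan2(dy, dx) - heading
    error = math.atan2(math.sin(error), math.cos(error))
    # Slow down when pointing away from the target or when close to it.
    forward = max_speed * max(0.0, math.cos(error)) * min(1.0, distance)
    return forward, turn_gain * error

def next_waypoint_index(position, waypoints, index, reach_radius=0.2):
    """Advance to the next waypoint once the current one has been reached."""
    wx, wy = waypoints[index]
    reached = math.hypot(wx - position[0], wy - position[1]) < reach_radius
    if reached and index < len(waypoints) - 1:
        return index + 1
    return index
```

In this sketch the controller only produces a forward speed and a yaw rate; everything about how the legs realize that command is left to the locomotion policy, which mirrors the split described in the text.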

Robustness and repeatability are difficult to achieve when we aim for peak performance and maximum speed. Sometimes, the robot might fail when overcoming an obstacle in an agile way. To handle failures we train a recovery policy that quickly gets the robot back on its feet, allowing it to continue the episode.

Evaluation

We evaluate the Transformer-based generalist locomotion policy using custom-built quadruped robots and show that by optimizing for the proposed benchmark, we obtain agile, robust, and versatile skills for our robot in the real world. We further provide analysis for various design choices in our system and their impact on the system performance.

Model of the custom-built robots used for evaluation.

We deploy both the specialist and generalist policies to hardware (zero-shot sim-to-real). The robot’s target trajectory is provided by a set of waypoints along the various obstacles. In the case of the specialist policies, we switch between specialist policies by using a hand-tuned policy switching mechanism that selects the most suitable policy given the robot’s position.
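A position-based switching mechanism of this kind can be as simple as a lookup table. The sketch below assumes a straight course measured along one axis; the interval boundaries and policy names are hypothetical and chosen only to show the idea.

```python
# Hypothetical layout: half-open x-intervals (metres) along the course,
# each mapped to the specialist that should be active there.
SWITCH_TABLE = [
    (0.0, 2.0, "walking"),          # start table and weave poles
    (2.0, 3.5, "slope_climbing"),   # A-frame
    (3.5, 4.5, "jumping"),          # broad jump
    (4.5, 5.0, "walking"),          # end table
]

def select_policy(x, table=SWITCH_TABLE, default="walking"):
    """Return the name of the specialist policy for the robot's x position."""
    for start, end, name in table:
        if start <= x < end:
            return name
    return default  # outside the course: fall back to plain walking
```

Every boundary in such a table has to be tuned by hand, which is one practical motivation for replacing the switcher with a single generalist policy.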

Typical performance of our agile locomotion policies on the Barkour benchmark. Our custom-built quadruped robot robustly navigates the terrain’s obstacles by leveraging various skills learned using RL in simulation.

We find that very often our policies can handle unexpected events or even hardware degradation, resulting in good average performance, but failures are still possible. As illustrated in the image below, in case of failures, our recovery policy quickly gets the robot back on its feet, allowing it to continue the episode. By combining the recovery policy with a simple walk-back-to-start policy, we are able to run repeated experiments with minimal human intervention to measure the robustness.

Qualitative example of robustness and recovery behaviors. The robot trips and rolls over after heading down the A-frame. This triggers the recovery policy, which enables the robot to get back up and continue the course.

We find that across a large number of evaluations, the single generalist locomotion transformer policy and the specialist policies with the policy switching mechanism achieve similar performance. The locomotion transformer policy has a slightly lower average Barkour score, but exhibits smoother transitions between behaviors and gaits.

Measuring robustness of the different policies across a large number of runs on the Barkour benchmark.

Histogram of the agility scores for the locomotion transformer policy. The highest scores shown in blue (0.75 – 0.9) represent the runs where the robot successfully completes all obstacles.

Conclusion

We believe that developing a benchmark for legged robotics is an important first step in quantifying progress toward animal-level agility. To establish a strong baseline, we investigated a zero-shot sim-to-real approach, taking advantage of large-scale parallel simulation and recent advancements in training Transformer-based architectures. Our findings demonstrate that Barkour is a challenging benchmark that can be easily customized, and that our learning-based method for solving the benchmark provides a quadruped robot with a single low-level policy that can perform a variety of agile low-level skills.

Acknowledgments

The authors of this post are now part of Google DeepMind. We would like to thank our co-authors at Google DeepMind and our collaborators at Google Research: Wenhao Yu, J. Chase Kew, Tingnan Zhang, Daniel Freeman, Kuang-Hei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Yuheng Kuang, Edward Lee, Linda Luu, Ofir Nachum, Ken Oslund, Jason Powell, Diego Reyes, Francesco Romano, Feresteh Sadeghi, Ron Sloat, Baruch Tabanpour, Daniel Zheng, Michael Neunert, Raia Hadsell, Nicolas Heess, Francesco Nori, Jeff Seto, Carolina Parada, Vikas Sindhwani, Vincent Vanhoucke, and Jie Tan. We would also like to thank Marissa Giustina, Ben Jyenis, Gus Kouretas, Nubby Lee, James Lubin, Sherry Moore, Thinh Nguyen, Krista Reymann, Satoshi Kataoka, Trish Blazina, and the members of the robotics team at Google DeepMind for their contributions to the project.

Differentially private clustering for large-scale datasets

May 25, 2023

by Google AI

Posted by Vincent Cohen-Addad and Alessandro Epasto, Research Scientists, Google Research, Graph Mining team

Clustering is a central problem in unsupervised machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its core, clustering consists of the following problem: given a set of data elements, the goal is to partition the data elements into groups such that similar objects are in the same group, while dissimilar objects are in different groups. This problem has been studied in math, computer science, operations research and statistics for more than 60 years in its myriad variants. Two common forms of clustering are metric clustering, in which the elements are points in a metric space, like in the k-means problem, and graph clustering, where the elements are nodes of a graph whose edges represent similarity among them.

In the k-means clustering problem, we are given a set of points in a metric space with the objective to identify k representative points, called centers (here depicted as triangles), so as to minimize the sum of the squared distances from each point to its closest center. Source, rights: CC-BY-SA-4.0
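The k-means objective described in the caption can be written down directly. This is a minimal sketch that assumes points and centers are tuples of numbers in Euclidean space; the function name is ours.

```python
def kmeans_cost(points, centers):
    """Sum, over all points, of the squared Euclidean distance to the closest center."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total
```

For example, with points (0, 0), (0, 2) and (10, 0) and centers (0, 1) and (10, 0), the first two points each contribute 1 and the third contributes 0, for a total cost of 2.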

Despite the extensive literature on algorithm design for clustering, few practical works have focused on rigorously protecting the user’s privacy during clustering. When clustering is applied to personal data (e.g., the queries a user has made), it is necessary to consider the privacy implications of using a clustering solution in a real system and how much information the output solution reveals about the input data.

To ensure privacy in a rigorous sense, one solution is to develop differentially private (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (e.g., whether a user has made a given query) or sensitive data about the input graph (e.g., a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on theory and practice of differentially private metric or graph clustering, and differential privacy in a variety of contexts, e.g., heatmaps or tools to design DP algorithms.

Today we are excited to announce two important updates: 1) a new differentially private algorithm for hierarchical graph clustering, which we’ll be presenting at ICML 2023, and 2) the open-source release of the code of a scalable differentially private k-means algorithm. This code brings differentially private k-means clustering to large-scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent launch in the health domain for informing public health authorities.

Differentially private hierarchical clustering

Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well known example of hierarchical clustering is the phylogenetic tree in biology in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm receives as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. Yet at the time of our research no algorithm was known to compute hierarchical clustering of a graph with edge privacy, i.e., preserving the privacy of the vertex interactions.

In “Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees”, we consider how well the problem can be approximated in a DP context and establish firm upper and lower bounds on the privacy guarantee. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n^2.5) and a multiplicative approximation of O(log^½ n), with the multiplicative error identical to the non-private setting. We further provide a new lower bound on the additive error (of order n²) for any private algorithm (irrespective of its running time) and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model, a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm that returns a solution with an additive cost over the optimum that is negligible for larger and larger graphs, again matching the non-private state-of-the-art approaches. We believe this work expands the understanding of privacy preserving algorithms on graph data and will enable new applications in such settings.

Large-scale differentially private clustering

We now switch gears and discuss our work on metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms on the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ or k-means// can be made differentially private without drastically sacrificing either the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published multiple papers that address the problem of designing efficient differentially private algorithms for clustering that can scale to massive datasets. Moreover, our goal is to scale to large input datasets even when the target number of centers, k, is large.

We work in the massively parallel computation (MPC) model, which is a computation model representative of modern distributed computation architectures. The model consists of several machines, each holding only part of the input data, that work together with the goal of solving a global problem while minimizing the amount of communication between machines. We present a differentially private constant factor approximation algorithm for k-means that only requires a constant number of rounds of synchronization. Our algorithm builds upon our previous work on the problem (with code available here), which was the first differentially-private clustering algorithm with provable approximation guarantees that can work in the MPC model.
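To make the communication pattern of the MPC model concrete, the sketch below shows a single synchronization round in which each machine summarizes its own shard and only those summaries are combined. It is a plain, non-private k-means update step used purely as an illustration; it is not the differentially private algorithm from the paper, and all function names are ours.

```python
def nearest(point, centers):
    """Index of the center closest to the point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
    )

def local_summary(shard, centers):
    """Runs on one machine: per-center coordinate sums and counts for its shard."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in shard:
        j = nearest(p, centers)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts

def merge_round(shards, centers):
    """One synchronization round: combine per-machine summaries into new centers."""
    dim = len(centers[0])
    total_sums = [[0.0] * dim for _ in centers]
    total_counts = [0] * len(centers)
    for shard in shards:  # in a real deployment each call runs on a different machine
        sums, counts = local_summary(shard, centers)
        for j in range(len(centers)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # Centers that attracted no points are left where they were.
    return [
        [s / total_counts[j] for s in total_sums[j]] if total_counts[j] else list(centers[j])
        for j in range(len(centers))
    ]
```

The message each machine sends has size proportional to the number of centers times the dimension, independent of how many points the machine holds, which is why limiting the number of such rounds matters for scalability.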

The DP constant factor approximation algorithm drastically improves on the previous work using a two phase approach. In an initial phase it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-step approximation, the second phase relies on results from the Coreset literature to subsample a relevant set of input points and find a good differentially private clustering solution for the input points. We then prove that this solution generalizes with approximately the same guarantee to the entire input.

Vaccination search insights via DP clustering

We then apply these advances in differentially private clustering to real-world applications. One example is the use of our differentially private clustering solution to publish COVID vaccine-related queries while providing strong privacy protections for the users.

The goal of Vaccination Search Insights (VSI) is to help public health decision makers (health authorities, government agencies and nonprofits) identify and respond to communities’ information needs regarding COVID vaccines. In order to achieve this, the tool allows users to explore at different geolocation granularities (zip-code, county and state level in the U.S.) the top themes searched by users regarding COVID queries. In particular, the tool visualizes statistics on trending queries rising in interest in a given locale and time.

Screenshot of the output of the tool. Displayed on the left, the top searches related to Covid vaccines during the period Oct 10-16 2022. On the right, the queries that have had rising importance during the same period and compared to the previous week.

To help identify the themes of the trending searches, the tool clusters the search queries based on their semantic similarity. This is done by applying a custom-designed k-means–based algorithm run over search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially private clustering). The method ensures strong differential privacy guarantees for the protection of the user data.
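The noise-and-threshold idea can be sketched as follows. This is an illustration only, not the VSI implementation: it assumes each user contributes to at most one query count (sensitivity 1), uses the classical Gaussian-mechanism noise scale sigma = sensitivity * sqrt(2 ln(1.25/delta)) / epsilon, which holds for 0 < epsilon < 1, and treats the threshold as a free parameter. Calibrating the threshold so that the whole release carries a formal guarantee is outside the scope of the sketch.

```python
import math
import random

def gaussian_sigma(epsilon, delta, sensitivity=1.0):
    """Classical Gaussian-mechanism noise scale (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def noisy_thresholded_counts(counts, epsilon, delta, threshold, rng):
    """Add Gaussian noise to each query count, then drop low noisy counts.

    counts maps a query string to its raw count. Assumption: each user
    affects at most one count by at most 1, so the sensitivity is 1.
    """
    sigma = gaussian_sigma(epsilon, delta)
    released = {}
    for query, count in counts.items():
        noisy = count + rng.gauss(0.0, sigma)
        if noisy >= threshold:
            released[query] = noisy
    return released

if __name__ == "__main__":
    raw = {"vaccine near me": 1000, "rare personal query": 1}
    print(noisy_thresholded_counts(raw, 0.5, 1e-5, 100, random.Random(0)))
```

With these illustrative parameters the noise scale is a little under 10, so a query seen once is dropped while a query seen a thousand times is released with a small relative error.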

This tool provided fine-grained data on COVID vaccine perception in the population at unprecedented scales of granularity, something that is especially relevant to understanding the needs of the marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in research in differential privacy and unsupervised ML methods. We are looking to other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, like search queries on climate change–related challenges such as air quality or extreme heat.

Acknowledgements

We thank our co-authors Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, Peilin Zhong, and our colleagues from the Health AI team that made the VSI launch possible: Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.

For more information on the Graph Mining team (part of Algorithm and Optimization) visit our pages.

