LIMoE: Learning Multiple Modalities with One Sparse Mixture of Experts Model

Sparse models stand out among the most promising approaches for the future of deep learning. Instead of every part of a model processing every input (“dense” modeling), sparse models employing conditional computation learn to route individual inputs to different “experts” in a potentially huge network. This has many benefits. First, model size can increase while keeping computational cost constant — an effective and environmentally friendlier way to scale models, which is often key to high performance. Sparsity also naturally compartmentalizes neural networks. Dense models that learn many different tasks simultaneously (multitask) or sequentially (continual learning) often suffer negative interference, where too much task variety means it is better to just train one model per task, or catastrophic forgetting, where the model becomes worse at earlier tasks as new ones are added. Sparse models help avoid both these phenomena — by not applying the whole model to all inputs, “experts” in the model can specialize on different tasks or data types while still taking advantage of shared parts of the model.

Research on sparsity has long been pursued at Google Research. Pathways summarizes the research vision of building one single large model that diligently handles thousands of tasks and numerous data modalities. So far there has been considerable progress in sparse unimodal models for language (Switch, Task-MoE, GLaM) and computer vision (Vision MoE). Today, we take another important step towards the Pathways vision by studying large sparse models that simultaneously handle images and text with modality-agnostic routing. A relevant approach is multimodal contrastive learning, which requires a solid understanding of both images and text in order to align pictures with their correct text description. The strongest models that tackle this task to date rely on independent networks for each modality (a “two-tower” approach).

In “Multimodal Contrastive Learning with LIMoE: the Language Image Mixture of Experts”, we present the first large-scale multimodal architecture using a sparse mixture of experts. It simultaneously processes both images and text, but uses sparsely activated experts that naturally specialize. On zero-shot image classification, LIMoE outperforms both comparable dense multimodal models and two-tower approaches. The largest LIMoE achieves 84.1% zero-shot ImageNet accuracy, comparable to more expensive state-of-the-art models. Sparsity enables LIMoE to scale up gracefully and learn to handle very different inputs, addressing the tension between being a jack-of-all-trades generalist and a master-of-one specialist.

The LIMoE architecture contains many “experts” and routers decide which tokens (parts of an image or sentence) go to which experts. After being processed by expert layers (gray) and shared dense layers (brown), a final output layer computes a single vector representation for either an image or a text.

Sparse Mixture of Expert Models
Transformers represent data as a sequence of vectors (or tokens). Though originally developed for text, they can be applied to most things that are representable as a sequence of tokens, e.g., images, videos, and audio. Recent large-scale MoE models add expert layers to the Transformer architecture (e.g., gShard and ST-MoE in natural language processing, and Vision MoE for vision tasks).

A standard Transformer consists of many “blocks”, each containing various different layers. One of these layers is a feed-forward network (FFN). For LIMoE and the works cited above, this single FFN is replaced by an expert layer that contains many parallel FFNs, each of which is an expert. Given a sequence of tokens to process, a simple router learns to predict which experts should handle which tokens. Only a small number of experts are activated per token, meaning although the model capacity is significantly increased by virtue of having so many experts, the actual computational cost is controlled by using them sparsely. If only one expert is activated, the model’s cost is roughly equivalent to the standard Transformer model.

LIMoE does precisely that, activating one expert per example, thereby matching the computational cost of the dense baselines. What’s different is that the LIMoE router might see tokens of either image or text data.

A unique failure mode of MoE models occurs when they try to send all tokens to the same expert. Typically this is addressed with auxiliary losses, extra training objectives that encourage balanced expert usage. We found that dealing with multiple modalities interacted with sparsity to cause new failure modes that existing auxiliary losses could not address. To overcome this, we developed new auxiliary losses (more details in the paper) and used routing prioritization (BPR) during training, two innovations that resulted in stable and high performance multimodal models.

The new auxiliary losses (LIMoE aux) and routing prioritization (BPR) stabilized and improved overall performance (left) and increased the success rate of routing behavior (middle and right). A low success rate means the router does not use all the experts available and drops many tokens due to individual expert capacity being reached, which usually indicates the sparse model is not learning well. The combination introduced for LIMoE ensures high routing success rates for both images and text and consequently leads to significantly better performance.

Contrastive Learning with LIMoE
In multimodal contrastive learning, models are trained on paired image-text data (e.g., a photo and its caption). Typically, an image model extracts a representation of images, and a different text model extracts a representation of text. The contrastive learning objective encourages the image and text representations to be close for the same image-text pair and far away for content from different pairs. Such models with aligned representations can be adapted to new tasks without extra training data (“zero-shot”), e.g., an image will be classified as a dog if its representation is closer to the representation of the word “dog” than the word “cat”. This idea scales to thousands of classes and is referred to as zero-shot image classification.

CLIP and ALIGN (both two-tower models) scaled this process to achieve 76.2% and 76.4% zero-shot classification accuracy on the popular ImageNet dataset. We study one-tower models which compute both image and text representations. We find this reduces performance for dense models, likely due to negative interference or insufficient capacity. However, a compute-matched LIMoE not only improves over the one-tower dense model, but also outperforms two-tower dense models. We trained a series of models in a comparable training regimen to CLIP. Our dense L/16 model achieves 73.5% zero-shot accuracy, whereas LIMoE-L/16 gets to 78.6%, even outperforming CLIP’s more expensive, two-tower L/14 model (76.2%). As shown below, LIMoE’s use of sparsity provides a remarkable performance boost over dense models with equivalent cost.

For a given compute cost (x-axis), LIMoE models (circles, solid line) are significantly better than their dense baselines (triangles, dashed line). The architecture indicates the size of the underlying transformer, increasing from left (S/32) to right (L/16). Following standard convention, S (small), B (base), and L (large) refer to model scale. The number refers to the patch size, where smaller patches imply a larger architecture.

LiT and BASIC pushed zero-shot accuracy for dense two-tower models to 84.5% and 85.6% respectively. In addition to scaling, these approaches made use of specialized pre-training methods, repurposing image models that were already of exceptionally high quality. LIMoE-H/14 does not benefit from any pre-training or modality-specific components, but still achieved a comparable 84.1% zero-shot accuracy training from scratch. The scale of these models is also interesting to compare: LiT and BASIC are 2.1B and 3B parameter models. LIMoE-H/14 has 5.6B parameters in total, but via sparsity it only applies 675M parameters per token making it significantly more lightweight.

Data seen during training
Model   Pre-training     Image-text     Total      Parameters per token     ImageNet accuracy  
CLIP 12.8B 12.8B ~200M 76.2%
ALIGN 19.8B 19.8B ~410M 76.4%
LiT 25.8B 18.2B 44.0B 1.1B 84.5%
BASIC 19.7B 32.8B 52.5B 1.5B 85.6%
LIMoE H/14    23.3B 23.3B 675M 84.1%

Understanding LIMoE’s Behavior
LIMoE was motivated by the intuition that sparse conditional computation enables a generalist multimodal model to still develop the specialization needed to excel at understanding each modality. We analyzed LIMoE’s expert layers and uncovered a few interesting phenomena.

First, we see the emergence of modality-specialized experts. In our training setup there are many more image tokens than text tokens, so all experts tend to process at least some images, but some experts process either mostly images, mostly text, or both.

Distributions for an eight expert LIMoE; percentages indicate the amount of image tokens processed by the expert. There are one or two experts clearly specialized on text (shown by the mostly blue experts), usually two to four image specialists (mostly red), and the remainder are somewhere in the middle.

There are also some clear qualitative patterns among the image experts — e.g., in most LIMoE models, there is an expert that processes all image patches that contain text. In the example below, one expert processes fauna and greenery, and another processes human hands.

LIMoE chooses an expert for each token. Here we show which image tokens go to which experts on one of the layers of LIMoE-H/14. Despite not being trained to do so, we observe the emergence of semantic experts that specialize in specific topics such as plants or wheels.

Moving Forward
Multimodal models that handle many tasks are a promising route forward, and there are two key ingredients for success: scale, and the ability to avoid interference between distinct tasks and modalities while taking advantage of synergies. Sparse conditional computation is an excellent way of doing both. It enables performant and efficient generalist models that also have the capacity and flexibility for the specialization necessary to excel at individual tasks, as demonstrated by LIMoE’s solid performance with less compute.

Acknowledgements
We thank our co-authors on this work: Joan Puigcerver, Rodolphe Jenatton and Neil Houlsby. We also thank Andreas Steiner, Xiao Wang and Xiaohua Zhai, who led early explorations into dense single-tower models for contrastive multimodal learning, and also were instrumental in providing data access. We enjoyed useful discussions with André Susano Pinto, Maxim Neumann, Barret Zoph, Liam Fedus, Wei Han and Josip Djolonga. Finally, we would also like to thank and acknowledge Tom Small for the awesome animated figure used in this post.

Read More

Building a more helpful browser with machine learning

At Google we use technologies like machine learning (ML) to build more useful products — from filtering out email spam, to keeping maps up to date, to offering more relevant search results. Chrome is no exception: We use ML to make web images more accessible to people who are blind or have low vision, and we also generate real-time captions for online videos, in service of people in noisy environments, and those who are hard of hearing.

This work in Chrome continues, so we wanted to share some recent and future ML improvements that offer a safer, more accessible and more personalized browsing experience. Importantly: these updates are powered by on-device ML models, which means your data stays private, and never leaves your device.

More peace of mind, less annoying prompts

Safe Browsing in Chrome helps protect billions of devices every day, by showing warnings when people try to navigate to dangerous sites or download dangerous files (see the big red example below). Starting in March of this year, we rolled out a new ML model that identifies 2.5 times more potentially malicious sites and phishing attacks as the previous model – resulting in a safer and more secure web.

To further improve the browsing experience, we’re also evolving how people interact with web notifications. On the one hand, page notifications help deliver updates from sites you care about; on the other hand, notification permission prompts can become a nuisance. To help people browse the web with minimal interruption, Chrome predicts when permission prompts are unlikely to be granted based on how the user previously interacted with similar permission prompts, and silences these undesired prompts. In the next release of Chrome, we’re launching an ML model that makes these predictions entirely on-device.

Two separate images side by side. The first on the left is a smartphone showing a red screen and a warning message about phishing. The image on the right shows a Chrome browser window showing a pop-up message saying “Notifications blocked”.

With the next release of Chrome, this is what you will see if a phishing attempt is detected (Left) and Chrome will show permission requests quietly when the user is unlikely to grant them (Right).

Finding what’s important, always in your language

Earlier this year we launched Journeys to help people retrace their steps online. For example: You might spend weeks planning a national park visit – researching attractions, comparing flights and shopping for gear. With ML and Journeys, Chrome brings together the pages you’ve visited about a given topic, and makes it easy to pick up where you left off (vs. scr o o o l l ling through your browser history).

When you return to those hiking boots and camping guides, we’re also using ML to make those websites available in your preferred language. In particular, we’ve launched an updated language identification model to figure out the language of the page, and whether it needs to be translated to match your preferences. As a result, we’re seeing tens of millions more successful translations every day.

A Chrome browser showing Journeys related to travel. The user can see a cluster of recent searches they did related to a trip to Yosemite.

The Journeys feature of Chrome groups together your search history based on topic or intent.

A browser built just for you

Maybe you like to read news articles in the morning – phone in one hand, cereal spoon in the other – so you share lots of links from Chrome. Or maybe voice search is more your thing, as you sneak in a few questions during your transit ride to work. Either way, we want to make sure Chrome is meeting you where you’re at, so in the near future, we’ll be using ML to adjust the toolbar in real-time – highlighting the action that’s most useful in that moment (e.g., share link, voice search, etc.). Of course, you’ll be able to customize it manually as well.

A Chrome browser with a highlighted square around an icon to the right of the address bar. At the top is a share icon, and at the bottom is a microphone icon.

The toolbar in Chrome on Android will adapt based on your needs.

Our goal is to build a browser that’s genuinely and continuously helpful, and we’re excited about the possibilities that ML provides. At the end of the day, though, your experience is what really matters, so please tweet @googlechrome to send us your feedback.

Read More

End-to-end Generative Pre-training for Multimodal Video Captioning

Multimodal video captioning systems utilize both the video frames and speech to generate natural language descriptions (captions) of videos. Such systems are stepping stones towards the longstanding goal of building multimodal conversational systems that effortlessly communicate with users while perceiving environments through multimodal input streams.

Unlike video understanding tasks (e.g., video classification and retrieval) where the key challenge lies in processing and understanding multimodal input videos, the task of multimodal video captioning includes the additional challenge of generating grounded captions. The most widely adopted approach for this task is to train an encoder-decoder network jointly using manually annotated data. However, due to a lack of large-scale, manually annotated data, the task of annotating grounded captions for videos is labor intensive and, in many cases, impractical. Previous research such as VideoBERT and CoMVT pre-train their models on unlabelled videos by leveraging automatic speech recognition (ASR). However, such models often cannot generate natural language sentences because they lack a decoder, and thus only the video encoder is transferred to the downstream tasks.

In “End-to-End Generative Pre-training for Multimodal Video Captioning”, published at CVPR 2022, we introduce a novel pre-training framework for multimodal video captioning. This framework, which we call multimodal video generative pre-training or MV-GPT, jointly trains a multimodal video encoder and a sentence decoder from unlabelled videos by leveraging a future utterance as the target text and formulating a novel bi-directional generation task. We demonstrate that MV-GPT effectively transfers to multimodal video captioning, achieving state-of-the-art results on various benchmarks. Additionally, the multimodal video encoder is competitive for multiple video understanding tasks, such as VideoQA, text-video retrieval, and action recognition.

Future Utterance as an Additional Text Signal
Typically, each training video clip for multimodal video captioning is associated with two different texts: (1) a speech transcript that is aligned with the clip as a part of the multimodal input stream, and (2) a target caption, which is often manually annotated. The encoder learns to fuse information from the transcript with visual contents, and the target caption is used to train the decoder for generation. However, in the case of unlabelled videos, each video clip comes only with a transcript from ASR, without a manually annotated target caption. Moreover, we cannot use the same text (the ASR transcript) for the encoder input and decoder target, since the generation of the target would then be trivial.

MV-GPT circumvents this challenge by leveraging a future utterance as an additional text signal and enabling joint pre-training of the encoder and decoder. However, training a model to generate future utterances that are often not grounded in the input content is not ideal. So we apply a novel bi-directional generation loss to reinforce the connection to the input.

Bi-directional Generation Loss
The issue of non-grounded text generation is mitigated by formulating a bi-directional generation loss that includes forward and backward generation. Forward generation produces future utterances given visual frames and their corresponding transcripts and allows the model to learn to fuse the visual content with its corresponding transcript. Backward generation takes the visual frames and future utterances to train the model to generate a transcript that contains more grounded text of the video clip. Bi-directional generation loss in MV-GPT allows the encoder and the decoder to be trained to handle visually grounded texts.

Bi-directional generation in MV-GPT. A model is trained with two generation losses. In forward generation, the model generates a future utterance (blue boxes) given the frames and the present utterance (red boxes), whereas the present is generated from the future utterance in backward generation. Two special beginning-of-sentence tokens ([BOS-F] and [BOS-B]) initiate forward and backward generation for the decoder.

Results on Multimodal Video Captioning
We compare MV-GPT to existing pre-training losses using the same model architecture, on YouCook2 with standard evaluation metrics (Bleu-4, Cider, Meteor and Rouge-L). While all pre-training techniques improve captioning performances, it is critical to pre-train the decoder jointly to improve model performance. We demonstrate that MV-GPT outperforms the previous state-of-the-art joint pre-training method by over 3.5% with relative gains across all four metrics.

Pre-training Loss Pre-trained Parts Bleu-4 Cider Meteor Rouge-L
No Pre-training N/A 13.25 1.03 17.56 35.48
CoMVT Encoder 14.46 1.24 18.46 37.17
UniVL Encoder + Decoder 19.95 1.98 25.27 46.81
MV-GPT (ours) Encoder + Decoder 21.26 2.14 26.36 48.58
MV-GPT performance across four metrics (Bleu-4, Cider, Meteor and Rouge-L) of different pre-training losses on YouCook2. “Pre-trained parts” indicates which parts of the model are pre-trained — only the encoder or both the encoder and decoder. We reimplement the loss functions of existing methods but use our model and training strategies for a fair comparison.

We transfer a model pre-trained by MV-GPT to four different captioning benchmarks: YouCook2, MSR-VTT, ViTT and ActivityNet-Captions. Our model achieves state-of-the-art performance on all four benchmarks by significant margins. For instance on the Meteor metric, MV-GPT shows over 12% relative improvements in all four benchmarks.

YouCook2 MSR-VTT ViTT ActivityNet-Captions
Best Baseline 22.35 29.90 11.00 10.90
MV-GPT (ours) 27.09 38.66 26.75 12.31
Meteor metric scores of the best baseline methods and MV-GPT on four benchmarks.

Results on Non-generative Video Understanding Tasks
Although MV-GPT is designed to train a generative model for multimodal video captioning, we also find that our pre-training technique learns a powerful multimodal video encoder that can be applied to multiple video understanding tasks, including VideoQA, text-video retrieval and action classification. When compared to the best comparable baseline models, the model transferred from MV-GPT shows superior performance in five video understanding benchmarks on their primary metrics — i.e., top-1 accuracy for VideoQA and action classification benchmarks, and recall at 1 for the retrieval benchmark.

Task Benchmark Best Comparable Baseline MV-GPT
VideoQA MSRVTT-QA 41.5 41.7
ActivityNet-QA 38.9 39.1
Text-Video Retrieval MSR-VTT 33.7 37.3
Action Recognition Kinetics-400 78.9 80.4
Kinetics-600 80.6 82.4
Comparisons of MV-GPT to best comparable baseline models on five video understanding benchmarks. For each dataset we report the widely used primary metric, i.e., MSRVTT-QA and ActivityNet-QA: Top-1 answer accuracy; MSR-VTT: Recall at 1; and Kinetics: Top-1 classification accuracy.

Summary
We introduce MV-GPT, a new generative pre-training framework for multimodal video captioning. Our bi-directional generative objective jointly pre-trains a multimodal encoder and a caption decoder by using utterances sampled at different times in unlabelled videos. Our pre-trained model achieves state-of-the-art results on multiple video captioning benchmarks and other video understanding tasks, namely VideoQA, video retrieval and action classification.

Acknowledgements
This research was conducted by Paul Hongsuck Seo, Arsha Nagrani, Anurag Arnab and Cordelia Schmid.

Read More

Deep Learning with Label Differential Privacy

Over the last several years, there has been an increased focus on developing differential privacy (DP) machine learning (ML) algorithms. DP has been the basis of several practical deployments in industry — and has even been employed by the U.S. Census — because it enables the understanding of system and algorithm privacy guarantees. The underlying assumption of DP is that changing a single user’s contribution to an algorithm should not significantly change its output distribution.

In the standard supervised learning setting, a model is trained to make a prediction of the label for each input given a training set of example pairs {[input1,label1], …, [inputn, labeln]}. In the case of deep learning, previous work introduced a DP training framework, DP-SGD, that was integrated into TensorFlow and PyTorch. DP-SGD protects the privacy of each example pair [input, label] by adding noise to the stochastic gradient descent (SGD) training algorithm. Yet despite extensive efforts, in most cases, the accuracy of models trained with DP-SGD remains significantly lower than that of non-private models.

DP algorithms include a privacy budget, ε, which quantifies the worst-case privacy loss for each user. Specifically, ε reflects how much the probability of any particular output of a DP algorithm can change if one replaces any example of the training set with an arbitrarily different one. So, a smaller ε corresponds to better privacy, as the algorithm is more indifferent to changes of a single example. However, since smaller ε tends to hurt model utility more, it is not uncommon to consider ε up to 8 in deep learning applications. Notably, for the widely used multiclass image classification dataset, CIFAR-10, the highest reported accuracy (without pre-training) for DP models with ε = 3 is 69.3%, a result that relies on handcrafted visual features. In contrast, non-private scenarios (ε = ∞) with learned features have shown to achieve >95% accuracy while using modern neural network architectures. This performance gap remains a roadblock for many real-world applications to adopt DP. Moreover, despite recent advances, DP-SGD often comes with increased computation and memory overhead due to slower convergence and the need to compute the norm of the per-example gradient.

In “Deep Learning with Label Differential Privacy”, presented at NeurIPS 2021, we consider a more relaxed, but important, special case called label differential privacy (LabelDP), where we assume the inputs (input1, …, inputn) are public, and only the privacy of the training labels (label1, …, labeln) needs to be protected. With this relaxed guarantee, we can design novel algorithms that utilize a prior understanding of the labels to improve the model utility. We demonstrate that LabelDP achieves 20% higher accuracy than DP-SGD on the CIFAR-10 dataset. Our results across multiple tasks confirm that LabelDP could significantly narrow the performance gap between private models and their non-private counterparts, mitigating the challenges in real world applications. We also present a multi-stage algorithm for training deep neural networks with LabelDP. Finally, we are excited to release the code for this multi-stage training algorithm.

LabelDP
The notion of LabelDP has been studied in the Probably Approximately Correct (PAC) learning setting, and captures several practical scenarios. Examples include: (i) computational advertising, where impressions are known to the advertiser and thus considered non-sensitive, but conversions reveal user interest and are thus private; (ii) recommendation systems, where the choices are known to a streaming service provider, but the user ratings are considered sensitive; and (iii) user surveys and analytics, where demographic information (e.g., age, gender) is non-sensitive, but income is sensitive.

We make several key observations in this scenario. (i) When only the labels need to be protected, much simpler algorithms can be applied for data preprocessing to achieve LabelDP without any modifications to the existing deep learning training pipeline. For example, the classic Randomized Response (RR) algorithm, designed to eliminate evasive answer biases in survey aggregation, achieves LabelDP by simply flipping the label to a random one with a probability that depends on ε. (ii) Conditioned on the (public) input, we can compute a prior probability distribution, which provides a prior belief of the likelihood of the class labels for the given input. With a novel variant of RR, RR-with-prior, we can incorporate prior information to reduce the label noise while maintaining the same privacy guarantee as classical RR.

The figure below illustrates how RR-with-prior works. Assume a model is built to classify an input image into 10 categories. Consider a training example with the label “airplane”. To guarantee LabelDP, classical RR returns a random label sampled according to a given distribution (see the top-right panel of the figure below). The smaller the targeted privacy budget ε is, the larger the probability of sampling an incorrect label has to be. Now assume we have a prior probability showing that the given input is “likely an object that flies” (lower left panel). With the prior, RR-with-prior will discard all labels with small prior and only sample from the remaining labels. By dropping these unlikely labels, the probability of returning the correct label is significantly increased, while maintaining the same privacy budget ε (lower right panel).

Randomized response: If no prior information is given (top-left), all classes are sampled with equal probability. The probability of sampling the true class (P[airplane] ≈ 0.5) is higher if the privacy budget is higher (top-right). RR-with-prior: Assuming a prior distribution (bottom-left), unlikely classes are “suppressed” from the sampling distribution (bottom-right). So the probability of sampling the true class (P[airplane] ≈ 0.9) is increased under the same privacy budget.

A Multi-stage Training Algorithm
Based on the RR-with-prior observations, we present a multi-stage algorithm for training deep neural networks with LabelDP. First, the training set is randomly partitioned into multiple subsets. An initial model is then trained on the first subset using classical RR. Finally, the algorithm divides the data into multiple parts, and at each stage, a single part is used to train the model. The labels are produced using RR-with-prior, and the priors are based on the prediction of the model trained so far.

An illustration of the multi-stage training algorithm. The training set is partitioned into t disjoint subsets. An initial model is trained on the first subset using classical RR. Then the trained model is used to provide prior predictions in the RR-with-prior step and in the training of the later stages.

Results
We benchmark the multi-stage training algorithm’s empirical performance on multiple datasets, domains, and architectures. On the CIFAR-10 multi-class classification task for the same privacy budget ε, the multi-stage training algorithm (blue in the figure below) guaranteeing LabelDP achieves 20% higher accuracy than DP-SGD. We emphasize that LabelDP protects only the labels while DP-SGD protects both the inputs and labels, so this is not a strictly fair comparison. Nonetheless, this result demonstrates that for specific application scenarios where only the labels need to be protected, LabelDP could lead to significant improvements in the model utility while narrowing the performance gap between private models and public baselines.

Comparison of the model utility (test accuracy) of different algorithms under different privacy budgets.

In some domains, prior knowledge is naturally available or can be built using publicly available data only. For example, many machine learning systems have historical models which could be evaluated on new data to provide label priors. In domains where unsupervised or self-supervised learning algorithms work well, priors could also be built from models pre-trained on unlabeled (therefore public with respect to LabelDP) data. Specifically, we demonstrate two self-supervised learning algorithms in our CIFAR-10 evaluation (orange and green traces in the figure above). We use self-supervised learning models to compute representations for the training examples and run k-means clustering on the representations. Then, we spend a small amount of privacy budget (ε ≤ 0.05) to query a histogram of the label distribution of each cluster and use that as the label prior for the points in each cluster. This prior significantly boosts the model utility in the low privacy budget regime (ε < 1).

Similar observations hold across multiple datasets such as MNIST, Fashion-MNIST and non-vision domains, such as the MovieLens-1M movie rating task. Please see our paper for the full report on the empirical results.

The empirical results suggest that protecting the privacy of the labels can be significantly easier than protecting the privacy of both the inputs and labels. This can also be mathematically proven under specific settings. In particular, we can show that for convex stochastic optimization, the sample complexity of algorithms privatizing the labels is much smaller than that of algorithms privatizing both labels and inputs. In other words, to achieve the same level of model utility under the same privacy budget, LabelDP requires fewer training examples.

Conclusion
We demonstrated that both empirical and theoretical results suggest that LabelDP is a promising relaxation of the full DP guarantee. In applications where the privacy of the inputs does not need to be protected, LabelDP could reduce the performance gap between a private model and the non-private baseline. For future work, we plan to design better LabelDP algorithms for other tasks beyond multi-class classification. We hope that the release of the multi-stage training algorithm code provides researchers with a useful resource for DP research.

Acknowledgements
This work was carried out in collaboration with Badih Ghazi, Noah Golowich, and Ravi Kumar. We also thank Sami Torbey for valuable feedback on our work.

Read More

Image-Text Pre-training with Contrastive Captioners

Oftentimes, machine learning (ML) model developers begin their design using a generic backbone model that is trained at scale and with capabilities transferable to a wide range of downstream tasks. In natural language processing, a number of popular backbone models, including BERT, T5, GPT-3 (sometimes also referred to as “foundation models”), are pre-trained on web-scale data and have demonstrated generic multi-tasking capabilities through zero-shot, few-shot or transfer learning. Compared with training over-specialized individual models, pre-training backbone models for a large number of downstream tasks can amortize the training costs, allowing one to overcome resource limitations when building large scale models.

In computer vision, pioneering work has shown the effectiveness of single-encoder models pre-trained for image classification to capture generic visual representations that are effective for other downstream tasks. More recently, contrastive dual-encoder (CLIP, ALIGN, Florence) and generative encoder-decoder (SimVLM) approaches trained using web-scale noisy image-text pairs have been explored. Dual-encoder models exhibit remarkable zero-shot image classification capabilities but are less effective for joint vision-language understanding. On the other hand, encoder-decoder methods are good at image captioning and visual question answering but cannot perform retrieval-style tasks.

In “CoCa: Contrastive Captioners are Image-Text Foundation Models”, we present a unified vision backbone model called Contrastive Captioner (CoCa). Our model is a novel encoder-decoder approach that simultaneously produces aligned unimodal image and text embeddings and joint multimodal representations, making it flexible enough to be directly applicable for all types of downstream tasks. Specifically, CoCa achieves state-of-the-art results on a series of vision and vision-language tasks spanning vision recognition, cross-modal alignment, and multimodal understanding. Furthermore, it learns highly generic representations so that it can perform as well or better than fully fine-tuned models with zero-shot learning or frozen encoders.

Overview of Contrastive Captioners (CoCa) compared to single-encoder, dual-encoder and encoder-decoder models.

Method
We propose CoCa, a unified training framework that combines contrastive loss and captioning loss on a single training data stream consisting of image annotations and noisy image-text pairs, effectively merging single-encoder, dual-encoder and encoder-decoder paradigms.

To this end, we present a novel encoder-decoder architecture where the encoder is a vision transformer (ViT), and the text decoder transformer is decoupled into two parts, a unimodal text decoder and a multimodal text decoder. We skip cross-attention in unimodal decoder layers to encode text-only representations for contrastive loss, and cascade multimodal decoder layers with cross-attention to image encoder outputs to learn multimodal image-text representations for captioning loss. This design maximizes the model’s flexibility and universality in accommodating a wide spectrum of tasks, and at the same time, it can be efficiently trained with a single forward and backward propagation for both training objectives, resulting in minimal computational overhead. Thus, the model can be trained end-to-end from scratch with training costs comparable to a naïve encoder-decoder model.

Illustration of forward propagation used by CoCa for both contrastive and captioning losses.

Benchmark Results
The CoCa model can be directly fine-tuned on many tasks with minimal adaptation. By doing so, our model achieves a series of state-of-the-art results on popular vision and multimodal benchmarks, including (1) visual recognition: ImageNet, Kinetics-400/600/700, and MiT; (2) cross-modal alignment: MS-COCO, Flickr30K, and MSR-VTT; and (3) multimodal understanding: VQA, SNLI-VE, NLVR2, and NoCaps.

Comparison of CoCa with other image-text backbone models (without task-specific customization) and multiple state-of-the-art task-specialized models.

It is noteworthy that CoCa attains these results as a single model adapted for all tasks while often lighter than prior top-performing specialized models. For example, CoCa obtains 91.0% ImageNet top-1 accuracy while using less than half the parameters of prior state-of-the-art models. In addition, CoCa also obtains strong generative capability of high-quality image captions.

Image classification scaling performance comparing fine-tuned ImageNet top-1 accuracy versus model size.
Text captions generated by CoCa with NoCaps images as input.

Zero-Shot Performance
Besides achieving excellent performance with fine-tuning, CoCa also outperforms previous state-of-the-art models on zero-shot learning tasks, including image classification,and cross-modal retrieval. CoCa obtains 86.3% zero-shot accuracy on ImageNet while also robustly outperforming prior models on challenging variant benchmarks, such as ImageNet-A, ImageNet-R, ImageNet-V2, and ImageNet-Sketch. As shown in the figure below, CoCa obtains better zero-shot accuracy with smaller model sizes compared to prior methods.

Image classification scaling performance comparing zero-shot ImageNet top-1 accuracy versus model size.

Frozen Encoder Representation
One particularly exciting observation is that CoCa achieves results comparable to the best fine-tuned models using only a frozen visual encoder, in which features extracted after model training are used to train a classifier, rather than the more computationally intensive effort of fine-tuning a model. On ImageNet, a frozen CoCa encoder with a learned classification head obtains 90.6% top-1 accuracy, which is better than the fully fine-tuned performance of existing backbone models (90.1%). We also find this setup to work extremely well for video recognition. We feed sampled video frames into the CoCa frozen image encoder individually, and fuse output features by attentional pooling before applying a learned classifier. This simple approach using a CoCa frozen image encoder achieves video action recognition top-1 accuracy of 88.0% on Kinetics-400 dataset and demonstrates that CoCa learns a highly generic visual representation with the combined training objectives.

Comparison of Frozen CoCa visual encoder with (multiple) best-performing fine-tuned models.

Conclusion
We present Contrastive Captioner (CoCa), a novel pre-training paradigm for image-text backbone models. This simple method is widely applicable to many types of vision and vision-language downstream tasks, and obtains state-of-the-art performance with minimal or even no task-specific adaptations.

Acknowledgements
We would like to thank our co-authors Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu who have been involved in all aspects of the project. We also would like to thank Yi-Ting Chen, Kaifeng Chen, Ye Xia, Zhen Li, Chao Jia, Yinfei Yang, Zhengdong Zhang, Wei Han, Yuan Cao, Tao Zhu, Futang Peng, Soham Ghosh, Zihang Dai, Xin Li, Anelia Angelova, Jason Baldridge, Izhak Shafran, Shengyang Dai, Abhijit Ogale, Zhifeng Chen, Claire Cui, Paul Natsev, Tom Duerig for helpful discussions, Andrew Dai for help with contrastive models, Christopher Fifty and Bowen Zhang for help with video models, Yuanzhong Xu for help with model scaling, Lucas Beyer for help with data preparation, Andy Zeng for help with MSR-VTT evaluation, Hieu Pham and Simon Kornblith for help with zero-shot evaluations, Erica Moreira and Victor Gomes for help with resource coordination, Liangliang Cao for proofreading, Tom Small for creating the animations used in this blogpost, and others in the Google Brain team for support throughout this project.

Read More

How we build with and for people with disabilities

Editor’s note: Today is Global Accessibility Awareness Day. We’re also sharing how we’re making education more accessibleand launching a newAndroid accessibility feature.

Over the past nine years, my job has focused on building accessible products and supporting Googlers with disabilities. Along the way, I’ve been constantly reminded of how vast and diverse the disability community is, and how important it is to continue working alongside this community to build technology and solutions that are truly helpful.

Before delving into some of the accessibility features our teams have been building, I want to share how we’re working to be more inclusive of people with disabilities to create more accessible tools overall.

Nothing about us, without us

In the disability community, people often say “nothing about us without us.” It’s a sentiment that I find sums up what disability inclusion means. The types of barriers that people with disabilities face in society vary depending on who they are, where they live and what resources they have access to. No one’s experience is universal. That’s why it’s essential to include a wide array of people with disabilities at every stage of the development process for any of our accessibility products, initiatives or programs.

We need to work to make sure our teams at Google are reflective of the people we’re building for. To do so, last year we launched our hiring site geared toward people with disabilities — including our Autism Career Program to further grow and strengthen our autistic community. Most recently, we helped launch the Neurodiversity Career Connector along with other companies to create a job portal that connects neurodiverse candidates to companies that are committed to hiring more inclusively.

Beyond our internal communities, we also must partner with communities outside of Google so we can learn what is truly useful to different groups and parlay that understanding into the improvement of current products or the creation of new ones. Those partnerships have resulted in the creation of Project Relate, a communication tool for people with speech impairments, the development of a completely new TalkBack, Android’s built-in screen reader, and the improvement of Select-to-Speak, a Chromebook tool that lets you hear selected text on your screen spoken out loud.

Equitable experiences for everyone

Engaging and listening to these communities — inside and outside of Google — make it possible to create tools and features like the ones we’re sharing today.

The ability to add alt-text, which is a short description of an image that is read aloud by screen readers, directly to images sent through Gmail starts rolling out today. With this update, people who use screen readers will know what’s being sent to them, whether it’s a GIF celebrating the end of the week or a screenshot of an important graph.

Communication tools that are inclusive of everyone are especially important as teams have shifted to fully virtual or hybrid meetings. Again, everyone experiences these changes differently. We’ve heard from some people who are deaf or hard of hearing, that this shift has made it easier to identify who is speaking — something that is often more difficult in person. But, in the case of people who use ASL, we’ve heard that it can be difficult to be in a virtual meeting and simultaneously see their interpreter and the person speaking to them.

Multi-pin, a new feature in Google Meet, helps solve this. Now you can pin multiple video tiles at once, for example, the presenter’s screen and the interpreter’s screen. And like many accessibility features, the usefulness extends beyond people with disabilities. The next time someone is watching a panel and wants to pin multiple people to the screen, this feature makes that possible.

We’ve also been working to make video content more accessible to those who are blind or low-vision through audio descriptions that describe verbally what is on the screen visually. All of our English language YouTube Originals content from the past year — and moving forward — will now have English audio descriptions available globally. To turn on the audio description track, at the bottom right of the video player, click on “Settings”, select “Audio track”, and choose “English descriptive”.

For many people with speech impairments, being understood by the technology that powers tools like voice typing or virtual assistants can be difficult. In 2019, we started work to change that through Project Euphonia, a research initiative that works with community organizations and people with speech impairments to create more inclusive speech recognition models. Today, we’re expanding Project Euphonia’s research to include four more languages: French, Hindi, Japanese and Spanish. With this expansion, we can create even more helpful technology for more people — no matter where they are or what language they speak.

I’ve learned so much in my time working in this space and among the things I’ve learned is the absolute importance of building right alongside the very people who will most use these tools in the end. We’ll continue to do that as we work to create a more inclusive and accessible world, both physically and digitally.

Read More

Vector-Quantized Image Modeling with Improved VQGAN

In recent years, natural language processing models have dramatically improved their ability to learn general-purpose representations, which has resulted in significant performance gains for a wide range of natural language generation and natural language understanding tasks. In large part, this has been accomplished through pre-training language models on extensive unlabeled text corpora.

This pre-training formulation does not make assumptions about input signal modality, which can be language, vision, or audio, among others. Several recent papers have exploited this formulation to dramatically improve image generation results through pre-quantizing images into discrete integer codes (represented as natural numbers), and modeling them autoregressively (i.e., predicting sequences one token at a time). In these approaches, a convolutional neural network (CNN) is trained to encode an image into discrete tokens, each corresponding to a small patch of the image. A second stage CNN or Transformer is then trained to model the distribution of encoded latent variables. The second stage can also be applied to autoregressively generate an image after the training. But while such models have achieved strong performance for image generation, few studies have evaluated the learned representation for downstream discriminative tasks (such as image classification).

In “Vector-Quantized Image Modeling with Improved VQGAN”, we propose a two-stage model that reconceives traditional image quantization techniques to yield improved performance on image generation and image understanding tasks. In the first stage, an image quantization model, called VQGAN, encodes an image into lower-dimensional discrete latent codes. Then a Transformer model is trained to model the quantized latent codes of an image. This approach, which we call Vector-quantized Image Modeling (VIM), can be used for both image generation and unsupervised image representation learning. We describe multiple improvements to the image quantizer and show that training a stronger image quantizer is a key component for improving both image generation and image understanding.

Vector-Quantized Image Modeling with ViT-VQGAN
One recent, commonly used model that quantizes images into integer tokens is the Vector-quantized Variational AutoEncoder (VQVAE), a CNN-based auto-encoder whose latent space is a matrix of discrete learnable variables, trained end-to-end. VQGAN is an improved version of this that introduces an adversarial loss to promote high quality reconstruction. VQGAN uses transformer-like elements in the form of non-local attention blocks, which allows it to capture distant interactions using fewer layers.

In our work, we propose taking this approach one step further by replacing both the CNN encoder and decoder with ViT. In addition, we introduce a linear projection from the output of the encoder to a low-dimensional latent variable space for lookup of the integer tokens. Specifically, we reduced the encoder output from a 768-dimension vector to a 32- or 8-dimension vector per code, which we found encourages the decoder to better utilize the token outputs, improving model capacity and efficiency.

Overview of the proposed ViT-VQGAN (left) and VIM (right), which, when working together, is capable of both image generation and image understanding. In the first stage, ViT-VQGAN converts images into discrete integers, which the autoregressive Transformer (Stage 2) then learns to model. Finally, the Stage 1 decoder is applied to these tokens to enable generation of high quality images from scratch.

With our trained ViT-VQGAN, images are encoded into discrete tokens represented by integers, each of which encompasses an 8×8 patch of the input image. Using these tokens, we train a decoder-only Transformer to predict a sequence of image tokens autoregressively. This two-stage model, VIM, is able to perform unconditioned image generation by simply sampling token-by-token from the output softmax distribution of the Transformer model.

VIM is also capable of performing class-conditioned generation, such as synthesizing a specific image of a given class (e.g., a dog or a cat). We extend the unconditional generation to class-conditioned generation by prepending a class-ID token before the image tokens during both training and sampling.

Uncurated set of dog samples from class-conditioned image generation trained on ImageNet. Conditioned classes: Irish terrier, Norfolk terrier, Norwich terrier, Yorkshire terrier, wire-haired fox terrier, Lakeland terrier.

To test the image understanding capabilities of VIM, we also fine-tune a linear projection layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Similar to ImageGPT, we take a layer output at a specific block, average over the sequence of token features (frozen) and insert a softmax layer (learnable) projecting averaged features to class logits. This allows us to capture intermediate features that provide more information useful for representation learning.

Experimental Results
We train all ViT-VQGAN models with a training batch size of 256 distributed across 128 CloudTPUv4 cores. All models are trained with an input image resolution of 256×256. On top of the pre-learned ViT-VQGAN image quantizer, we train Transformer models for unconditional and class-conditioned image synthesis and compare with previous work.

We measure the performance of our proposed methods for class-conditioned image synthesis and unsupervised representation learning on the widely used ImageNet benchmark. In the table below we demonstrate the class-conditioned image synthesis performance measured by the Fréchet Inception Distance (FID). Compared to prior work, VIM improves the FID to 3.07 (lower is better), a relative improvement of 58.6% over the VQGAN model (FID 7.35). VIM also improves the capacity for image understanding, as indicated by the Inception Score (IS), which goes from 188.6 to 227.4, a 20.6% improvement relative to VQGAN.

Model Acceptance
Rate
FID IS
Validation data 1.0 1.62 235.0
DCTransformer 1.0 36.5 N/A
BigGAN 1.0 7.53 168.6
BigGAN-deep 1.0 6.84 203.6
IDDPM 1.0 12.3 N/A
ADM-G, 1.0 guid. 1.0 4.59 186.7
VQVAE-2 1.0 ~31 ~45
VQGAN 1.0 17.04 70.6
VQGAN 0.5 10.26 125.5
VQGAN 0.25 7.35 188.6
ViT-VQGAN (Ours) 1.0 4.17 175.1
ViT-VQGAN (Ours) 0.5 3.04 227.4
Fréchet Inception Distance (FID) comparison between different models for class-conditional image synthesis and Inception Score (IS) for image understanding, both on ImageNet with resolution 256×256. The acceptance rate shows results filtered by a ResNet-101 classification model, similar to the process in VQGAN.

After training a generative model, we test the learned image representations by fine-tuning a linear layer to perform ImageNet classification, a standard benchmark for measuring image understanding abilities. Our model outperforms previous generative models on the image understanding task, improving classification accuracy through linear probing (i.e., training a single linear classification layer, while keeping the rest of the model frozen) from 60.3% (iGPT-L) to 73.2%. These results showcase VIM’s strong generation results as well as image representation learning abilities.

Conclusion
We propose Vector-quantized Image Modeling (VIM), which pretrains a Transformer to predict image tokens autoregressively, where discrete image tokens are produced from improved ViT-VQGAN image quantizers. With our proposed improvements on image quantization, we demonstrate superior results on both image generation and understanding. We hope our results can inspire future work towards more unified approaches for image generation and understanding.

Acknowledgements
We would like to thank Xin Li, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, Yonghui Wu for the preparation of the VIM paper. We thank Wei Han, Yuan Cao, Jiquan Ngiam‎, Vijay Vasudevan, Zhifeng Chen and Claire Cui for helpful discussions and feedback, and others on the Google Research and Brain Team for support throughout this project.

Read More

Contextual Rephrasing in Google Assistant

When people converse with one another, context and references play a critical role in driving their conversation more efficiently. For instance, if one asks the question “Who wrote Romeo and Juliet?” and, after receiving an answer, asks “Where was he born?”, it is clear that ‘he’ is referring to William Shakespeare without the need to explicitly mention him. Or if someone mentions “python” in a sentence, one can use the context from the conversation to determine whether they are referring to a type of snake or a computer language. If a virtual assistant cannot robustly handle context and references, users would be required to adapt to the limitation of the technology by repeating previously shared contextual information in their follow-up queries to ensure that the assistant understands their requests and can provide relevant answers.

In this post, we present a technology currently deployed on Google Assistant that allows users to speak in a natural manner when referencing context that was defined in previous queries and answers. The technology, based on the latest machine learning (ML) advances, rephrases a user’s follow-up query to explicitly mention the missing contextual information, thus enabling it to be answered as a stand-alone query. While Assistant considers many types of context for interpreting the user input, in this post we are focusing on short-term conversation history.

Context Handling by Rephrasing
One of the approaches taken by Assistant to understand contextual queries is to detect if an input utterance is referring to previous context and then rephrase it internally to explicitly include the missing information. Following on from the previous example in which the user asked who wrote Romeo and Juliet, one may ask follow-up questions like “When?”. Assistant recognizes that this question is referring to both the subject (Romeo and Juliet) and answer from the previous query (William Shakespeare) and can rephrase “When?” to “When did William Shakespeare write Romeo and Juliet?”

While there are other ways to handle context, for instance, by applying rules directly to symbolic representations of the meaning of queries, like intents and arguments, the advantage of the rephrasing approach is that it operates horizontally at the string level across any query answering, parsing, or action fulfillment module.

Conversation on a smart display device, where Assistant understands multiple contextual follow-up queries, allowing the user to have a more natural conversation. The phrases appearing at the bottom of the display are suggestions for follow-up questions that the user can select. However, the user can still ask different questions.

A Wide Variety of Contextual Queries
The natural language processing field, traditionally, has not put much emphasis on a general approach to context, focusing on the understanding of stand-alone queries that are fully specified. Accurately incorporating context is a challenging problem, especially when considering the large variety of contextual query types. The table below contains example conversations that illustrate query variability and some of the many contextual challenges that Assistant’s rephrasing method can resolve (e.g., differentiating between referential and non-referential cases or identifying what context a query is referencing). We demonstrate how Assistant is now able to rephrase follow-up queries, adding contextual information before providing an answer.

System Architecture
At a high level, the rephrasing system generates rephrasing candidates by using different types of candidate generators. Each rephrasing candidate is then scored based on a number of signals, and the one with the highest score is selected.

High level architecture of Google Assistant contextual rephraser.

Candidate Generation
To generate rephrasing candidates we use a hybrid approach that applies different techniques, which we classify into three categories:

  1. Generators based on the analysis of the linguistic structure of the queries use grammatical and morphological rules to perform specific operations — for instance, the replacement of pronouns or other types of referential phrases with antecedents from the context.
  2. Generators based on query statistics combine key terms from the current query and its context to create candidates that match popular queries from historical data or common query patterns.
  3. Generators based on Transformer technologies, such as MUM, learn to generate sequences of words according to a number of training samples. LaserTagger and FELIX are technologies suitable for tasks with high overlap between the input and output texts, are very fast at inference time, and are not vulnerable to hallucination (i.e., generating text that is not related to the input texts). Once presented with a query and its context, they can generate a sequence of text edits to transform the input queries into a rephrasing candidate by indicating which portions of the context should be preserved and which words should be modified.

Candidate Scoring
We extract a number of signals for each rephrasing candidate and use an ML model to select the most promising candidate. Some of the signals depend only on the current query and its context. For example, is the topic of the current query similar to the topic of the previous query? Or, is the current query a good stand-alone query or does it look incomplete? Other signals depend on the candidate itself: How much of the information of the context does the candidate preserve? Is the candidate well-formed from a linguistic point of view? Etc.

Recently, new signals generated by BERT and MUM models have significantly improved the performance of the ranker, fixing about one-third of the recall headroom while minimizing false positives on query sequences that are not contextual (and therefore do not require a rephrasing).

Example conversation on a phone where Assistant understands a sequence of contextual queries.

Conclusion
The solution described here attempts to resolve contextual queries by rephrasing them in order to make them fully answerable in a stand-alone manner, i.e., without having to relate to other information during the fulfillment phase. The benefit of this approach is that it is agnostic to the mechanisms that would fulfill the query, thus making it usable as a horizontal layer to be deployed before any further processing.

Given the variety of contexts naturally used in human languages, we adopted a hybrid approach that combines linguistic rules, large amounts of historic data through logs, and ML models based on state-of-the-art Transformer approaches. By generating a number of rephrasing candidates for each query and its context, and then scoring and ranking them using a variety of signals, Assistant can rephrase and thus correctly interpret most contextual queries. As Assistant can handle most types of linguistic references, we are empowering users to have more natural conversations. To make such multi-turn conversations even less cumbersome, Assistant users can turn on Continued Conversation mode to enable asking follow-up queries without the need to repeat “Hey Google” between each query. We are also using this technology in other virtual assistant settings, for instance, interpreting context from something shown on a screen or playing on a speaker.

Acknowledgements
This post reflects the combined work of Aliaksei Severyn, André Farias, Cheng-Chun Lee, Florian Thöle, Gabriel Carvajal, Gyorgy Gyepesi, Julien Cretin, Liana Marinescu, Martin Bölle, Patrick Siegler, Sebastian Krause, Victor Ähdel, Victoria Fossum, Vincent Zhao. We also thank Amar Subramanya, Dave Orr, Yury Pinsky for helpful discussions and support.

Read More

Challenges in Multi-objective Optimization for Automatic Wireless Network Planning

Economics, combinatorics, physics, and signal processing conspire to make it difficult to design, build, and operate high-quality, cost-effective wireless networks. The radio transceivers that communicate with our mobile phones, the equipment that supports them (such as power and wired networking), and the physical space they occupy are all expensive, so it’s important to be judicious in choosing sites for new transceivers. Even when the set of available sites is limited, there are exponentially many possible networks that can be built. For example, given only 50 sites, there are 250 (over a million billion) possibilities!

Further complicating things, for every location where service is needed, one must know which transceiver provides the strongest signal and how strong it is. However, the physical characteristics of radio propagation in an environment containing buildings, hills, foliage, and other clutter are incredibly complex, so accurate predictions require sophisticated, computationally-intensive models. Building all possible sites would yield the best coverage and capacity, but even if this were not prohibitively expensive, it would create unacceptable interference among nearby transceivers. Balancing these trade-offs is a core mathematical difficulty.

The goal of wireless network planning is to decide where to place new transceivers to maximize coverage and capacity while minimizing cost and interference. Building an automatic network planning system (a.k.a., auto-planner) that quickly solves national-scale problems at fine-grained resolution without compromising solution quality has been among the most important and difficult open challenges in telecom research for decades.

To address these issues, we are piloting network planning tools built using detailed geometric models derived from high-resolution geographic data, that feed into radio propagation models powered by distributed computing. This system provides fast, high-accuracy predictions of signal strength. Our optimization algorithms then intelligently sift through the exponential space of possible networks to output a small menu of candidate networks that each achieve different desirable trade-offs among cost, coverage, and interference, while ensuring enough capacity to meet demand.

Example auto-planning project in Charlotte, NC. Blue dots denote selected candidate sites. The heat map indicates signal strength from the propagation engine. The inset emphasizes the non-isotropic path loss in downtown.

<!–

Example auto-planning project in Charlotte, NC. Blue dots denote selected candidate sites. The heat map indicates signal strength from the propagation engine. The inset emphasizes the non-isotropic path loss in downtown.

–>

Radio Propagation
The propagation of radio waves near Earth’s surface is complicated. Like ripples in a pond, they decay with distance traveled, but they can also penetrate, bounce off, or bend around obstacles, further weakening the signal. Computing radio wave attenuation across a real-world landscape (called path loss) is a hybrid process combining traditional physics-based calculations with learned corrections accounting for obstruction, diffraction, reflection, and scattering of the signal by clutter (e.g., trees and buildings).

We have developed a radio propagation modeling engine that leverages the same high-res geodata that powers Google Earth, Maps and Street View to map the 3D distribution of vegetation and buildings. While accounting for signal origin, frequency, broadcast strength, etc., we train signal correction models using extensive real-world measurements, which account for diverse propagation environments — from flat to hilly terrain and from dense urban to sparse rural areas.

While such hybrid approaches are common, using detailed geodata enables accurate path loss predictions below one-meter resolution. Our propagation engine provides fast point-to-point path loss calculations and scales massively via distributed computation. For instance, computing coverage for 25,000 transceivers scattered across the continental United States can be done at 4 meter resolution in only 1.5 hours, using 1000 CPU cores.

Photorealistic 3D model in Google Earth (top-left) and corresponding clutter height model (top-right). Path profile through buildings and trees from a source to destination in the clutter model (bottom). Gray denotes buildings and green denotes trees.

Auto-Planning Inputs
Once accurate coverage estimates are available, we can use them to optimize network planning, for example, deciding where to place hundreds of new sites to maximize network quality. The auto-planning solver addresses large-scale combinatorial optimization problems such as these, using a fast, robust, scalable approach.

Formally, an auto-planning input instance contains a set of demand points (usually a square grid) where service is to be provided, a set of candidate transceiver sites, predicted signal strengths from candidate sites to demand points (supplied by the propagation model), and a cost budget. Each demand point includes a demand quantity (e.g., estimated from the population of wireless users), and each site includes a cost and capacity. Signal strengths below some threshold are omitted. Finally, the input may include an overall cost budget.

Data Summarization for Large Instances
Auto-planning inputs can be huge, not just because of the number of candidate sites (tens of thousands), and demand points (billions), but also because it requires signal strengths to all demand points from all nearby candidate sites. Simple downsampling is insufficient because population density may vary widely over a given region. Therefore, we apply methods like priority sampling to shrink the data. This technique produces a low-variance, unbiased estimate of the original data, preserving an accurate view of the network traffic and interference statistics, and shrinking the input data enough that a city-size instance fits into memory on one machine.

Multi-objective Optimization via Local Search
Combinatorial optimization remains a difficult task, so we created a domain-specific local search algorithm to optimize network quality. The local search algorithmic paradigm is widely applied to address computationally-hard optimization problems. Such algorithms move from one solution to another through a search space of candidate solutions by applying small local changes, stopping at a time limit or when the solution is locally optimal. To evaluate the quality of a candidate network, we combine the different objective functions into a single one, as described in the following section.

The number of local steps to reach a local optimum, number of candidate moves we evaluate per step, and time to evaluate each candidate can all be large when dealing with realistic networks. To achieve a high-quality algorithm that finishes within hours (rather than days), we must address each of these components. Fast candidate evaluation benefits greatly from dynamic data structures that maintain the mapping between each demand point and the site in the candidate solution that provides the strongest signal to it. We update this “strongest-signal” map efficiently as the candidate solution evolves during local search. The following observations help limit both the number of steps to convergence and evaluations per step.

Bipartite graph representing candidate sites (left) and demand points (right). Selected sites are circled in red, and each demand point is assigned to its strongest available connection. The topmost demand point has no service because the only site that can reach it was not selected. The third and fourth demand points from the top may have high interference if the signal strengths attached to their gray edges are only slightly lower than the ones on their red edges. The bottommost site has high congestion because many demand points get their service from that site, possibly exceeding its capacity.

Selecting two nearby sites is usually not ideal because they interfere. Our algorithm explicitly forbids such pairs of sites, thereby steering the search toward better solutions while greatly reducing the number of moves considered per step. We identify pairs of forbidden sites based on the demand points they cover, as measured by the weighted Jaccard index. This captures their functional proximity much better than simple geographic distance does, especially in urban or hilly areas where radio propagation is highly non-isotropic.

Breaking the local search into epochs also helps. The first epoch mostly adds sites to increase the coverage area while avoiding forbidden pairs. As we approach the cost budget, we begin a second epoch that includes swap moves between forbidden pairs to fine-tune the interference. This restriction limits the number of candidate moves per step, while focusing on those that improve interference with less change to coverage.

Three candidate local search moves. Red circles indicate selected sites and the orange edge indicates a forbidden pair.

Outputting a Diverse Set of Good Solutions
As mentioned before, auto-planning must balance three competing objectives: maximizing coverage, while minimizing interference and capacity violations, subject to a cost budget. There is no single correct tradeoff, so our algorithm delegates the final decision to the user by providing a small menu of candidate networks with different emphases. We apply a multiplier to each objective and optimize the sum. Raising the multiplier for a component guides the algorithm to emphasize it. We perform grid search over multipliers and budgets, generating a large number of solutions, filter out any that are worse than another solution along all four components (including cost), and finally select a small subset that represent different tradeoffs.

Menu of candidate solutions, one per row, displaying metrics. Clicking on a solution turns the selected sites pink and displays a plot of the interference distribution across covered area and demand. Sites not selected are blue.

Conclusion
We described our efforts to address the most vexing challenges facing telecom network operators. Using combinatorial optimization in concert with geospatial and radio propagation modeling, we built a scalable auto-planner for wireless telecommunication networks. We are actively exploring how to expand these capabilities to best meet the needs of our customers. Stay tuned!

For questions and other inquiries, please reach out to wireless-network-interest@google.com.

Acknowledgements
These technological advances were enabled by the tireless work of our collaborators: Aaron Archer, Serge Barbosa Da Torre, Imad Fattouch, Danny Liberty, Pishoy Maksy, Zifei Tong, and Mat Varghese. Special thanks to Corinna Cortes, Mazin Gilbert, Rob Katcher, Michael Purdy, Bea Sebastian, Dave Vadasz, Josh Williams, and Aaron Yonas, along with Serge and especially Aaron Archer for their assistance with this blog post.

Read More

Language Models Perform Reasoning via Chain of Thought

In recent years, scaling up the size of language models has been shown to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today’s language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with little or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks?

In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the reasoning abilities of language models. Called chain of thought prompting, this method enables models to decompose multi-step problems into intermediate steps. With chain of thought prompting, language models of sufficient scale (~100B parameters) can solve complex reasoning problems that are not solvable with standard prompting methods.

Comparison to Standard Prompting
With standard prompting (popularized by GPT-3) the model is given examples of input–output pairs (formatted as questions and answers) before being asked to predict the answer for a test-time example (shown below on the left). In chain of thought prompting (below, right), the model is prompted to produce intermediate reasoning steps before giving the final answer to a multi-step problem. The idea is that a model-generated chain of thought would mimic an intuitive thought process when working through a multi-step reasoning problem. While producing a thought process has been previously accomplished via fine-tuning, we show that such thought processes can be elicited by including a few examples of chain of thought via prompting only, which does not require a large training dataset or modifying the language model’s weights.

Whereas standard prompting asks the model to directly give the answer to a multi-step reasoning problem, chain of thought prompting induces the model to decompose the problem into intermediate reasoning steps, in this case leading to a correct final answer.

Chain of thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually. Moreover, the language-based nature of chain of thought makes it applicable to any task that a person could solve via language. We find through empirical experiments that chain of thought prompting can improve performance on various reasoning tasks, and that successful chain of thought reasoning is an emergent property of model scale — that is, the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).

Arithmetic Reasoning
One class of tasks where language models typically struggle is arithmetic reasoning (i.e., solving math word problems). Two benchmarks in arithmetic reasoning are MultiArith and GSM8K, which test the ability of language models to solve multi-step math problems similar to the one shown in the figure above. We evaluate both the LaMDA collection of language models ranging from 422M to 137B parameters, as well as the PaLM collection of language models ranging from 8B to 540B parameters. We manually compose chains of thought to include in the examples for chain of thought prompting.

For these two benchmarks, using standard prompting leads to relatively flat scaling curves: increasing the scale of the model does not substantially improve performance (shown below). However, we find that when using chain of thought prompting, increasing model scale leads to improved performance that substantially outperforms standard prompting for large model sizes.

Employing chain of thought prompting enables language models to solve arithmetic reasoning problems for which standard prompting has a mostly flat scaling curve.

On the GSM8K dataset of math word problems, PaLM shows remarkable performance when scaled to 540B parameters. As shown in the table below, combining chain of thought prompting with the 540B parameter PaLM model leads to new state-of-the-art performance of 58%, surpassing the prior state of the art of 55% achieved by fine-tuning GPT-3 175B on a large training set and then ranking potential solutions via a specially trained verifier. Moreover, follow-up work on self-consistency shows that the performance of chain of thought prompting can be improved further by taking the majority vote of a broad set of generated reasoning processes, which results in 74% accuracy on GSM8K.

Chain of thought prompting with PaLM achieves a new state of the art on the GSM8K benchmark of math word problems. For a fair comparison against fine-tuned GPT-3 baselines, the chain of thought prompting results shown here also use an external calculator to compute basic arithmetic functions (i.e., addition, subtraction, multiplication and division).

Commonsense Reasoning
In addition to arithmetic reasoning, we consider whether the language-based nature of chain of thought prompting also makes it applicable to commonsense reasoning, which involves reasoning about physical and human interactions under the presumption of general background knowledge. For these evaluations, we use the CommonsenseQA and StrategyQA benchmarks, as well as two domain-specific tasks from BIG-Bench collaboration regarding date understanding and sports understanding. Example questions are below:

As shown below, for CommonsenseQA, StrategyQA, and Date Understanding, performance improved with model scale, and employing chain of thought prompting led to additional small improvements. Chain of thought prompting had the biggest improvement on sports understanding, for which PaLM 540B’s chain of thought performance surpassed that of an unaided sports enthusiast (95% vs 84%).

Chain of thought prompting also improves performance on various types of commonsense reasoning tasks.

Conclusions
Chain of thought prompting is a simple and broadly applicable method for improving the ability of language models to perform various reasoning tasks. Through experiments on arithmetic and commonsense reasoning, we find that chain of thought prompting is an emergent property of model scale. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.

Acknowledgements
It was an honor and privilege to work with Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Quoc Le on this project.

Read More