Inspired by progress in large-scale language modelling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.Read More
Coding for the cosmos
The post Coding for the cosmos appeared first on The AI Blog.
Using Machine Learning to Help Protect the Great Barrier Reef in Partnership with Australia’s CSIRO
Posted by Megha Malpani, Google Product Manager and Ard Oerlemans, Google Software Engineer
Coral reefs are some of the most diverse and important ecosystems in the world, both for marine life and society more broadly. Not only are healthy reefs critical to fisheries and food security, they also protect coastlines from storm surge, support tourism-based economies, and advance drug discovery research, among other countless benefits.
Reefs face a number of rising threats, most notably climate change, pollution, and overfishing. In the past 30 years alone, there have been dramatic losses in coral cover and habitat in the Great Barrier Reef (GBR), with other reefs experiencing similar declines. In Australia, outbreaks of the coral-eating crown of thorns starfish (COTS) have been shown to cause major coral loss. While COTS naturally exist in the Indo-Pacific, reductions in the abundance of natural predators and excess run-off nutrients have led to massive outbreaks that are devastating already vulnerable coral communities. Controlling COTS populations is critical to promoting coral growth and resilience.
The Great Barrier Reef Foundation established an innovation program to develop new survey and intervention methods that radically improve COTS control. Google teamed up with CSIRO, Australia’s national science agency, to develop innovative machine learning technology that can analyze video sequences accurately, efficiently, and in near real-time. The goal is to transform the underwater survey, monitoring and mapping reefs at scale to help rapidly identify and prioritize COTS outbreaks. This project is part of a broader partnership with CSIRO under Google’s Digital Future Initiative in Australia.
CSIRO developed an edge ML platform (built on top of the NVIDIA Jetson AGX Xavier) that can analyze underwater image sequences and map out detections in near real-time. Our goal was to use the annotated dataset CSIRO had built over multiple field trips to develop the most accurate object detection model (across a variety of environments, weather conditions, and COTS populations) within a set of performance constraints, most notably, processing more than 10 frames per second (FPS) on a <30 watt device.
We hosted a Kaggle competition, leveraging insights from the open source community to drive our experimentation plan. With over 2,000 teams and 61,000 submissions, we were able to learn from the successes and failures of far more experiments than we could hope to execute on our own. We used these insights to define our experimentation roadmap and ended up running hundreds of experiments on Google TPUs.
We used TensorFlow 2’s Model Garden library as our foundation, making use of its scaled YOLOv4 model and corresponding training pipeline implementations. Our team of modeling experts then got to work, modifying the pipeline, experimenting with different image resolutions and model sizes, and applying various data augmentation and quantization techniques to create the most accurate model within our performance constraints.
Due to the limited amount of annotated data, a key part of this problem was figuring out the most effective data augmentation techniques. We ran hundreds of experiments based on what we learned from the Kaggle submissions to determine which techniques in combination were most effective in increasing our model’s accuracy.
In parallel with our modeling workstream, we experimented with batching, XLA, and auto mixed precision (which converts parts of the model to fp16) to try and improve our performance, all of which resulted in increasing our FPS by 3x. We found however, that on the Jetson module, using TensorFlow-TensorRT (converting the entire model to fp16) by itself actually resulted in a 4x total speed up, so we used TF-TRT exclusively moving forward.
After the starfish are detected in specific frames, a tracker is applied that links detections over time. This means that every detected starfish will be assigned a unique ID that it keeps as long as it stays visible in the video. We link detections in subsequent frames to each other by first using optical flow to predict where the starfish will be in the next frame, and then matching detections to predictions based on their Intersection over Union (IoU) score.
In a task like this where recall is more important than precision (i.e. we care more about not missing COTS than false positives), it is useful to consider the F2 metric to assess model accuracy. This metric can be used to evaluate a model’s performance on individual frames. However, our ultimate goal was to determine the total number of COTS present in the video stream. Thus, we cared more about evaluating the entire pipeline’s accuracy (model + tracker) than frame-by-frame performance (i.e. it’s okay if the model has inaccurate predictions on a frame or two as long as the pipeline correctly identifies the starfish’s overall existence and location). We ended up using a sequence-based F2 metric that determines how many “tracks” are found at a certain average IoU threshold.
Our current 1080p model using TensorFlow TensorRT runs at 11 FPS on the Jetson AGX Xavier, reaching a sequence-based F2 score of 0.80! We additionally trained a 720p model that runs at 22 FPS on the Jetson module, with a sequence-based F2 score of 0.78.
Google & CSIRO are thrilled to announce that we are open-sourcing both COTS Object Detection models and have created a Colab notebook to demonstrate the server-side inference workflow. Our Colab tutorial allows students, marine researchers, or data scientists to evaluate our COTS ML models on image sequences with zero configuration/ML knowledge. Additionally, it provides a blueprint for implementing an optimized inference pipeline for edge ML platforms, such as the Jetson module. Please stay tuned as we plan to continue updating our models & trackers, ultimately open-sourcing a full TFX pipeline and dataset so that conservation organizations and other governments around the world can retrain and modify our model with their own datasets. Please reach out to us if you have a specific use case you’d like to collaborate on!
Acknowledgements
A huge thank you to everyone who’s hard work made this project possible!
We couldn’t have done this without our partners at CSIRO – Brano Kusy, Jiajun Liu, Yang Li, Lachlan Tychsen-Smith, David Ahmedt-Aristizabal, Ross Marchant, Russ Babcock, Mick Haywood, Brendan Do, Jeremy Oorloff, Lars Andersson, and Joey Crosswell, the amazing Kaggle community, and last but not least, the team at Google – Glenn Cameron, Scott Riddle, Di Zhu, Abdullah Rashwan, Rebecca Borgo, Evan Rosen, Wolff Dobson, Tei Jeong, Addison Howard, Will Cukierski, Sohier Dane, Mark McDonald, Phil Culliton, Ryan Holbrook, Khanh LeViet, Mark Daoust, George Karpenkov, and Swati Singh.
Language Models Perform Reasoning via Chain of Thought
In recent years, scaling up the size of language models has been shown to be a reliable way to improve performance on a range of natural language processing (NLP) tasks. Today’s language models at the scale of 100B or more parameters achieve strong performance on tasks like sentiment analysis and machine translation, even with little or no training examples. Even the largest language models, however, can still struggle with certain multi-step reasoning tasks, such as math word problems and commonsense reasoning. How might we enable language models to perform such reasoning tasks?
In “Chain of Thought Prompting Elicits Reasoning in Large Language Models,” we explore a prompting method for improving the reasoning abilities of language models. Called chain of thought prompting, this method enables models to decompose multi-step problems into intermediate steps. With chain of thought prompting, language models of sufficient scale (~100B parameters) can solve complex reasoning problems that are not solvable with standard prompting methods.
Comparison to Standard Prompting
With standard prompting (popularized by GPT-3) the model is given examples of input–output pairs (formatted as questions and answers) before being asked to predict the answer for a test-time example (shown below on the left). In chain of thought prompting (below, right), the model is prompted to produce intermediate reasoning steps before giving the final answer to a multi-step problem. The idea is that a model-generated chain of thought would mimic an intuitive thought process when working through a multi-step reasoning problem. While producing a thought process has been previously accomplished via fine-tuning, we show that such thought processes can be elicited by including a few examples of chain of thought via prompting only, which does not require a large training dataset or modifying the language model’s weights.
Chain of thought reasoning allows models to decompose complex problems into intermediate steps that are solved individually. Moreover, the language-based nature of chain of thought makes it applicable to any task that a person could solve via language. We find through empirical experiments that chain of thought prompting can improve performance on various reasoning tasks, and that successful chain of thought reasoning is an emergent property of model scale — that is, the benefits of chain of thought prompting only materialize with a sufficient number of model parameters (around 100B).
Arithmetic Reasoning
One class of tasks where language models typically struggle is arithmetic reasoning (i.e., solving math word problems). Two benchmarks in arithmetic reasoning are MultiArith and GSM8K, which test the ability of language models to solve multi-step math problems similar to the one shown in the figure above. We evaluate both the LaMDA collection of language models ranging from 422M to 137B parameters, as well as the PaLM collection of language models ranging from 8B to 540B parameters. We manually compose chains of thought to include in the examples for chain of thought prompting.
For these two benchmarks, using standard prompting leads to relatively flat scaling curves: increasing the scale of the model does not substantially improve performance (shown below). However, we find that when using chain of thought prompting, increasing model scale leads to improved performance that substantially outperforms standard prompting for large model sizes.
Employing chain of thought prompting enables language models to solve arithmetic reasoning problems for which standard prompting has a mostly flat scaling curve. |
On the GSM8K dataset of math word problems, PaLM shows remarkable performance when scaled to 540B parameters. As shown in the table below, combining chain of thought prompting with the 540B parameter PaLM model leads to new state-of-the-art performance of 58%, surpassing the prior state of the art of 55% achieved by fine-tuning GPT-3 175B on a large training set and then ranking potential solutions via a specially trained verifier. Moreover, follow-up work on self-consistency shows that the performance of chain of thought prompting can be improved further by taking the majority vote of a broad set of generated reasoning processes, which results in 74% accuracy on GSM8K.
Chain of thought prompting with PaLM achieves a new state of the art on the GSM8K benchmark of math word problems. For a fair comparison against fine-tuned GPT-3 baselines, the chain of thought prompting results shown here also use an external calculator to compute basic arithmetic functions (i.e., addition, subtraction, multiplication and division). |
Commonsense Reasoning
In addition to arithmetic reasoning, we consider whether the language-based nature of chain of thought prompting also makes it applicable to commonsense reasoning, which involves reasoning about physical and human interactions under the presumption of general background knowledge. For these evaluations, we use the CommonsenseQA and StrategyQA benchmarks, as well as two domain-specific tasks from BIG-Bench collaboration regarding date understanding and sports understanding. Example questions are below:
As shown below, for CommonsenseQA, StrategyQA, and Date Understanding, performance improved with model scale, and employing chain of thought prompting led to additional small improvements. Chain of thought prompting had the biggest improvement on sports understanding, for which PaLM 540B’s chain of thought performance surpassed that of an unaided sports enthusiast (95% vs 84%).
Chain of thought prompting also improves performance on various types of commonsense reasoning tasks. |
Conclusions
Chain of thought prompting is a simple and broadly applicable method for improving the ability of language models to perform various reasoning tasks. Through experiments on arithmetic and commonsense reasoning, we find that chain of thought prompting is an emergent property of model scale. Broadening the range of reasoning tasks that language models can perform will hopefully inspire further work on language-based approaches to reasoning.
Acknowledgements
It was an honor and privilege to work with Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Quoc Le on this project.
On-device Text-to-Image Search with TensorFlow Lite Searcher Library
Posted by Zonglin Li, Lu Wang, Maxime Brénon, and Yuqi Li, Software Engineers
Today, we’re excited to announce a new on-device embedding-based search library that allows you to quickly find similar images, text or audio from millions of data samples in a few milliseconds.
It works by using a model to embed the search query into a high-dimensional vector representing the semantic meaning of the query. Then it uses ScaNN (Scalable Nearest Neighbors) to search for similar items from a predefined database. In order to apply it to your dataset, you need to use Model Maker Searcher API (vision/text) to build a custom TFLite Searcher model, and then deploy it onto devices using Task Library Searcher API (vision/text).
For example, with the Searcher model trained on COCO, searching the query, A passenger plane on the runway
, will return the following images:
Figure 1: All images are from COCO 2014 train and validation datasets. Image 1 by Mark Jones Jr. under Attribution License. Image 2 by 305 Seahill under Attribution-NoDerivs License. Image 3 by tataquax under Attribution-ShareAlike License. |
- Train a dual encoder model for image and text query encoding using the COCO dataset.
- Create a text-to-image Searcher model using the Model Maker Searcher API.
- Retrieve images with text queries using the Task Library Searcher API.
Train a Dual Encoder Model
Figure 2: Train the dual encoder model with dot product similarity distance. The loss encourages related images and text to have larger dot products (the shaded green squares). |
The dual encoder model consists of an image encoder and a text encoder. The two encoders map the images and text, respectively, to embeddings in a high-dimensional space. The model computes the dot product between the image and text embeddings, and the loss encourages relevant image and text to have larger dot product (closer), and unrelated ones to have smaller dot product (farther apart).
The training procedure is inspired by the CLIP paper and this Keras example. The image encoder is based on a pre-trained EfficientNet model and the text encoder is based on a pre-trained Universal Sentence Encoder model. The outputs from both encoders are then projected to a 128 dimensional space and are L2 normalized. For the dataset, we chose to use COCO, as its train and validation splits have human generated captions for each image. Please take a look at the companion Colab notebook for the details of the training process.
The dual encoder model makes it possible to retrieve images from a database without captions because once trained, the image embedder can directly extract the semantic meaning from the image without any need for human-generated captions.
Create the text-to-image Searcher model using Model Maker
Figure 3: Generate image embeddings using the image encoder and use Model Maker to create the TFLite Searcher model. |
Once the dual encoder model is trained, we can use it to create the TFLite Searcher model that searches for the most relevant images from an image dataset based on the text queries. This can be done by the following three steps:
- Generate the embeddings of the image dataset using the TensorFlow image encoder. ScaNN is capable of searching through a very large dataset, hence we combined the train and validation splits of COCO 2014 totaling 123K+ images to demonstrate its capabilities. See the code here.
- Convert the TensorFlow text encoder model into TFLite format. See the code here.
- Use Model Maker to create the TFLite Searcher model from the TFLite text encoder and the image embeddings using the code below:
# Configure ScaNN options. See the API doc for how to configure ScaNN.
scann_options = searcher.ScaNNOptions(
distance_measure='dot_product',
tree=searcher.Tree(num_leaves=351, num_leaves_to_search=4),
score_ah=searcher.ScoreAH(1, anisotropic_quantization_threshold=0.2))
# Load the image embeddings and corresponding metadata if any.
data = searcher.DataLoader(tflite_embedder_path, image_embeddings, metadata)
# Create the TFLite Searcher model.
model = searcher.Searcher.create_from_data(data, scann_options)
# Export the TFLite Searcher model.
model.export(
export_filename='searcher.tflite',
userinfo='',
export_format=searcher.ExportFormat.TFLITE)
When creating the Searcher model, Model Maker leverages ScaNN to index the embedding vectors. The embedding dataset is first partitioned into multiple subsets. In each of the subsets, ScaNN stores the quantized representation of the embedding vectors. At retrieval time, ScaNN selects a few most relevant partitions and scores the quantized representations with fast, approximate distances. This procedure saves both the model size (through quantization) and achieves speed up (through partition selection). See the in-depth examination to learn more about the ScaNN algorithm.
In the above example, we divide the dataset into 351 partitions (roughly the square root of the number of embeddings we have), and search 4 of them during retrieval, which is roughly 1% of the dataset. We also quantize the 128 dimensional float embeddings to 128 int8 values to save space.
Run inference using Task Library
Figure 4: Run inference using Task Library with the TFLite Searcher model. It takes the query text and returns the top neighbor’s metadata. From there we can find the corresponding images. |
To query images using the Searcher model, you only need a couple of lines of code like the following using Task Library:
from tflite_support.task import text
# Initialize a TextSearcher object
searcher = text.TextSearcher.create_from_file('searcher.tflite')
# Search the input query
results = searcher.search(query_text)
# Show the results
for rank in range(len(results.nearest_neighbors)):
print('Rank #', rank, ':')
image_id = results.nearest_neighbors[rank].metadata
print('image_id: ', image_id)
print('distance: ', results.nearest_neighbors[rank].distance)
show_image_by_id(image_id)
Try the code from the Colab. Also, see more information on how to integrate the model using the Task Library Java and C++ API, especially on Android. Each query in general takes only 6 milliseconds on Pixel 6.
Here are some example results:
Query: A man riding a bike
Results are ranked according to the approximate similarity distance. Here is a sample of retrieved images. Note that we are only showing images if their licenses allow.
Figure 5: All images are from COCO 2014 train and validation datasets. Image 1 by Reuel Mark Delez under Attribution License. Image 2 by Richard Masoner / Cyclelicious under Attribution-ShareAlike License. Image 3 by Julia under Attribution-ShareAlike License. Image 4 by Aaron Fulkerson under Attribution-ShareAlike License. Image 5 by Richard Masoner / Cyclelicious under Attribution-ShareAlike License. Image 6 by Richard Masoner / Cyclelicious under Attribution-ShareAlike License. |
Future work
We’ll be working on enabling more search types beyond image and text, such as audio clips.
Contact odml-pipelines-team@google.com if you want to leave any feedback. Our goal is to make on-device ML even easier for you and we value your input!
In this post, we will walk you through an end-to-end example of building a text-to-image search feature (retrieve the images given textual queries) using the new TensorFlow Lite Searcher Library. Here are the major steps:
Acknowledgements
We would like to thank Khanh LeViet, Chuo-Ling Chang, Ruiqi Guo, Lawrence Chan, Laurence Moroney, Yu-Cheng Ling, Matthias Grundmann, as well as Robby Neale, Chung-Ching Chang, Tom Small and Khalid Salama for their active support of this work. We would also like to thank the entire ScaNN team: David Simcha, Erik Lindgren, Felix Chern, Phil Sun and Sanjiv Kumar.
Improving skin tone representation across Google
Seeing yourself reflected in the world around you — in real life, media or online — is so important. And we know that challenges with image-based technologies and representation on the web have historically left people of color feeling overlooked and misrepresented. Last year, we announced Real Tone for Pixel, which is just one example of our efforts to improve representation of diverse skin tones across Google products.
Today, we’re introducing a next step in our commitment to image equity and improving representation across our products. In partnership with Harvard professor and sociologist Dr. Ellis Monk, we’re releasing a new skin tone scale designed to be more inclusive of the spectrum of skin tones we see in our society. Dr. Monk has been studying how skin tone and colorism affect people’s lives for more than 10 years.
The culmination of Dr. Monk’s research is the Monk Skin Tone (MST) Scale, a 10-shade scale that will be incorporated into various Google products over the coming months. We’re openly releasing the scale so anyone can use it for research and product development. Our goal is for the scale to support inclusive products and research across the industry — we see this as a chance to share, learn and evolve our work with the help of others.
The 10 shades of the Monk Skin Tone Scale.
This scale was designed to be easy-to-use for development and evaluation of technology while representing a broader range of skin tones. In fact, our research found that amongst participants in the U.S., people found the Monk Skin Tone Scale to be more representative of their skin tones compared to the current tech industry standard. This was especially true for people with darker skin tones.
“In our research, we found that a lot of the time people feel they’re lumped into racial categories, but there’s all this heterogeneity with ethnic and racial categories,” Dr. Monk says. “And many methods of categorization, including past skin tone scales, don’t pay attention to this diversity. That’s where a lack of representation can happen…we need to fine-tune the way we measure things, so people feel represented.”
Using the Monk Skin Tone Scale to improve Google products
Updating our approach to skin tone can help us better understand representation in imagery, as well as evaluate whether a product or feature works well across a range of skin tones. This is especially important for computer vision, a type of AI that allows computers to see and understand images. When not built and tested intentionally to include a broad range of skin-tones, computer vision systems have been found to not perform as well for people with darker skin.
The MST Scale will help us and the tech industry at large build more representative datasets so we can train and evaluate AI models for fairness, resulting in features and products that work better for everyone — of all skin tones. For example, we use the scale to evaluate and improve the models that detect faces in images.
Here are other ways you’ll see this show up in Google products.
Improving skin tone representation in Search
Every day, millions of people search the web expecting to find images that reflect their specific needs. That’s why we’re also introducing new features using the MST Scale to make it easier for people of all backgrounds to find more relevant and helpful results.
For example, now when you search for makeup related queries in Google Images, you’ll see an option to further refine your results by skin tone. So if you’re looking for “everyday eyeshadow” or “bridal makeup looks” you’ll more easily find results that work better for your needs.
Seeing yourself represented in results can be key to finding information that’s truly relevant and useful, which is why we’re also rolling out improvements to show a greater range of skin tones in image results for broad searches about people, or ones where people show up in the results. In the future, we’ll incorporate the MST Scale to better detect and rank images to include a broader range of results, so everyone can find what they’re looking for.
Creating a more representative Search experience isn’t something we can do alone, though. How content is labeled online is a key factor in how our systems surface relevant results. In the coming months, we’ll also be developing a standardized way to label web content. Creators, brands and publishers will be able to use this new inclusive schema to label their content with attributes like skin tone, hair color and hair texture. This will make it possible for content creators or online businesses to label their imagery in a way that search engines and other platforms can easily understand.
Improving skin tone representation in Google Photos
We’ll also be using the MST Scale to improve Google Photos. Last year, we introduced an improvement to our auto enhance feature in partnership with professional image makers. Now we’re launching a new set of Real Tone filters that are designed to work well across skin tones and evaluated using the MST Scale. We worked with a diverse range of renowned image makers, like Kennedi Carter and Joshua Kissi, who are celebrated for beautiful and accurate depictions of their subjects, to evaluate, test and build these filters. These new Real Tone filters allow you to choose from a wider assortment of looks and find one that reflects your style. Real Tone filters will be rolling out on Google Photos across Android, iOS and Web in the coming weeks.
What’s next?
We’re openly releasing the Monk Skin Tone Scale so that others can use it in their own products, and learn from this work —and so that we can partner with and learn from them. We want to get feedback, drive more interdisciplinary research, and make progress together. We encourage you to share your thoughts here. We’re continuing to collaborate with Dr. Monk to evaluate the MST Scale across different regions and product applications, and we’ll iterate and improve on it to make sure the scale works for people and use cases all over the world. And, we’ll continue our efforts to make Google’s products work even better for every user.
The best part of working on this project is that it isn’t just ours — while we’re committed to making Google products better and more inclusive, we’re also excited about all the possibilities that exist as we work together to build for everyone across the web.
Unlocking Zero-Resource Machine Translation to Support New Languages in Google Translate
Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT have soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people world wide, they only include around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas.
There are two key bottlenecks towards building functioning translation models for the long tail of languages. The first arises from data scarcity; digitized data for many languages is limited and can be difficult to find on the web due to quality issues with Language Identification (LangID) models. The second challenge arises from modeling limitations. MT models usually train on large amounts of parallel (translated) text, but without such data, models must learn to translate from limited amounts of monolingual text, which is a novel area of research. Both of these challenges need to be addressed for translation models to reach sufficient quality.
In “Building Machine Translation Systems for the Next Thousand Languages”, we describe how to build high-quality monolingual datasets for over a thousand languages that do not have translation datasets available and demonstrate how one can use monolingual data alone to train MT models. As part of this effort, we are expanding Google Translate to include 24 under-resourced languages. For these languages, we created monolingual datasets by developing and using specialized neural language identification models combined with novel filtering approaches. The techniques we introduce supplement massively multilingual models with a self supervised task to enable zero-resource translation. Finally, we highlight how native speakers have helped us realize this accomplishment.
Meet the Data
Automatically gathering usable textual data for under-resourced languages is much more difficult than it may seem. Tasks like LangID, which work well for high-resource languages, are unsuccessful for under-resourced languages, and many publicly available datasets crawled from the web often contain more noise than usable data for the languages they attempt to support. In our early attempts to identify under-resourced languages on the web by training a standard Compact Language Detector v3 (CLD3) LangID model, we too found that the dataset was too noisy to be usable.
As an alternative, we trained a Transformer-based, semi-supervised LangID model on over 1000 languages. This model supplements the LangID task with the MAsked Sequence-to-Sequence (MASS) task to better generalize over noisy web data. MASS simply garbles the input by randomly removing sequences of tokens from it, and trains the model to predict these sequences. We applied the Transformer-based model to a dataset that had been filtered with a CLD3 model and trained to recognize clusters of similar languages.
We then applied the open sourced Term Frequency-Inverse Internet Frequency (TF-IIF) filtering to the resulting dataset to find and discard sentences that were actually in related high-resource languages, and developed a variety of language-specific filters to eliminate specific pathologies. The result of this effort was a dataset with monolingual text in over 1000 languages, of which 400 had over 100,000 sentences. We performed human evaluations on samples of 68 of these languages and found that the majority (>70%) reflected high-quality, in-language content.
Meet the Models
Once we had a dataset of monolingual text in over 1000 languages, we then developed a simple yet practical approach for zero-resource translation, i.e., translation for languages with no in-language parallel text and no language-specific translation examples. Rather than limiting our model to an artificial scenario with only monolingual text, we also include all available parallel text data with millions of examples for higher resource languages to enable the model to learn the translation task. Simultaneously, we train the model to learn representations of under-resourced languages directly from monolingual text using the MASS task. In order to solve this task, the model is forced to develop a sophisticated representation of the language in question, developing a complex understanding of how words relate to other words in a sentence.
Relying on the benefits of transfer learning in massively multilingual models, we train a single giant translation model on all available data for over 1000 languages. The model trains on monolingual text for all 1138 languages and on parallel text for a subset of 112 of the higher-resourced languages.
At training time, any input the model sees has a special token indicating which language the output should be in, exactly like the standard formulation for multilingual translation. Our additional innovation is to use the same special tokens for both the monolingual MASS task and the translation task. Therefore, the token translate_to_french may indicate that the source is in English and needs to be translated to French (the translation task), or it may mean that the source is in garbled French and needs to be translated to fluent French (the MASS task). By using the same tags for both tasks, a translate_to_french tag takes on the meaning, “Produce a fluent output in French that is semantically close to the input,” regardless of whether the input is garbled in the same language or in another language entirely. From the model’s perspective, there is not much difference between the two.
Surprisingly, this simple procedure produces high quality zero-shot translations. The BLEU and ChrF scores for the resulting model are in the 10–40 and 20–60 ranges respectively, indicating mid- to high-quality translation. We observed meaningful translations even for highly inflected languages like Quechua and Kalaallisut, despite these languages being linguistically dissimilar to all other languages in the model. However, we only computed these metrics on the small subset of languages with human-translated evaluation sets. In order to understand the quality of translation for the remaining languages, we developed an evaluation metric based on round-trip translation, which allowed us to see that several hundred languages are reaching high translation quality.
To further improve quality, we use the model to generate large amounts of synthetic parallel data, filter the data based on round-trip translation (comparing a sentence translated into another language and back again), and continue training the model on this filtered synthetic data via back-translation and self-training. Finally, we fine-tune the model on a smaller subset of 30 languages and distill it into a model small enough to be served.
Contributions from Native Speakers
Regular communication with native speakers of these languages was critical for our research. We collaborated with over 100 people at Google and other institutions who spoke these languages. Some volunteers helped develop specialized filters to remove out-of-language content overlooked by automatic methods, for instance Hindi mixed with Sanskrit. Others helped with transliterating between different scripts used by the languages, for instance between Meetei Mayek and Bengali, for which sufficient tools didn’t exist; and yet others helped with a gamut of tasks related to evaluation. Native speakers were also key for advising in matters of political sensitivity, like the appropriate name for the language, and the appropriate writing system to use for it. And only native speakers could answer the ultimate question: given the current quality of translation, would it be valuable to the community for Google Translate to support this language?
Closing Notes
This advance is an exciting first step toward supporting more language technologies in under-resourced languages. Most importantly, we want to stress that the quality of translations produced by these models still lags far behind that of the higher-resource languages supported by Google Translate. These models are certainly a useful first tool for understanding content in under-resourced languages, but they will make mistakes and exhibit their own biases. As with any ML-driven tool, one should consider the output carefully.
The complete list of new languages added to Google Translate in this update:
Acknowledgements
We would like to thank Julia Kreutzer, Orhan Firat, Daan van Esch, Aditya Siddhant, Mengmeng Niu, Pallavi Baljekar, Xavier Garcia, Wolfgang Macherey, Theresa Breiner, Vera Axelrod, Jason Riesa, Yuan Cao, Mia Xu Chen, Klaus Macherey, Maxim Krikun, Pidong Wang, Alexander Gutkin, Apurva Shah, Yanping Huang, Zhifeng Chen, Yonghui Wu, and Macduff Hughes for their contributions to the research, engineering, and leadership of this project.
We would also like to extend our deepest gratitude to the following native speakers and members of affected communities, who helped us in a wide variety of ways: Yasser Salah Eddine Bouchareb (Algerian Arabic); Mfoniso Ukwak (Anaang); Bhaskar Borthakur, Kishor Barman, Rasika Saikia, Suraj Bharech (Assamese); Ruben Hilare Quispe (Aymara); Devina Suyanto (Balinese); Allahserix Auguste Tapo, Bakary Diarrassouba, Maimouna Siby (Bambara); Mohammad Jahangir (Baluchi); Subhajit Naskar (Bengali); Animesh Pathak, Ankur Bapna, Anup Mohan, Chaitanya Joshi, Chandan Dubey, Kapil Kumar, Manish Katiyar, Mayank Srivastava, Neeharika, Saumya Pathak, Tanya Sinha, Vikas Singh (Bhojpuri); Bowen Liang, Ellie Chio, Eric Dong, Frank Tang, Jeff Pitman, John Wong, Kenneth Chang, Manish Goregaokar, Mingfei Lau, Ryan Li, Yiwen Luo (Cantonese); Monang Setyawan (Caribbean Javanese); Craig Cornelius (Cherokee); Anton Prokopyev (Chuvash); Rajat Dogra, Sid Dogra (Dogri); Mohamed Kamagate (Dyula); Chris Assigbe, Dan Ameme, Emeafa Doe, Irene Nyavor, Thierry Gnanih, Yvonne Dumor (Ewe); Abdoulaye Barry, Adama Diallo, Fauzia van der Leeuw, Ibrahima Barry (Fulfulde); Isabel Papadimitriou (Greek); Alex Rudnick (Guarani); Mohammad Khdeir (Gulf Arabic); Paul Remollata (Hiligaynon); Ankur Bapna (Hindi); Mfoniso Ukwak (Ibibio); Nze Lawson (Igbo); D.J. Abuy, Miami Cabansay (Ilocano); Archana Koul, Shashwat Razdan, Sujeet Akula (Kashmiri); Jatin Kulkarni, Salil Rajadhyaksha, Sanjeet Hegde Desai, Sharayu Shenoy, Shashank Shanbhag, Shashi Shenoy (Konkani); Ryan Michael, Terrence Taylor (Krio); Bokan Jaff, Medya Ghazizadeh, Roshna Omer Abdulrahman, Saman Vaisipour, Sarchia Khursheed (Kurdish (Sorani));Suphian Tweel (Libyan Arabic); Doudou Kisabaka (Lingala); Colleen Mallahan, John Quinn (Luganda); Cynthia Mboli (Luyia); Abhishek Kumar, Neeraj Mishra, Priyaranjan Jha, Saket Kumar, Snehal Bhilare (Maithili); Lisa Wang (Mandarin Chinese); Cibu Johny (Malayalam); Viresh Ratnakar (Marathi); Abhi Sanoujam, Gautam Thockchom, Pritam Pebam, Sam Chaomai, Shangkar Mayanglambam, Thangjam Hindustani Devi (Meiteilon (Manipuri)); Hala Ajil (Mesopotamian Arabic); Hamdanil Rasyid (Minangkabau); Elizabeth John, Remi Ralte, S Lallienkawl Gangte,Vaiphei Thatsing, Vanlalzami Vanlalzami (Mizo); George Ouais (MSA); Ahmed Kachkach, Hanaa El Azizi (Morrocan Arabic); Ujjwal Rajbhandari (Newari); Ebuka Ufere, Gabriel Fynecontry, Onome Ofoman, Titi Akinsanmi (Nigerian Pidgin); Marwa Khost Jarkas (North Levantine Arabic); Abduselam Shaltu, Ace Patterson, Adel Kassem, Mo Ali, Yonas Hambissa (Oromo); Helvia Taina, Marisol Necochea (Quechua); AbdelKarim Mardini (Saidi Arabic); Ishank Saxena, Manasa Harish, Manish Godara, Mayank Agrawal, Nitin Kashyap, Ranjani Padmanabhan, Ruchi Lohani, Shilpa Jindal, Shreevatsa Rajagopalan, Vaibhav Agarwal, Vinod Krishnan (Sanskrit); Nabil Shahid (Saraiki); Ayanda Mnyakeni (Sesotho, Sepedi); Landis Baker (Seychellois Creole); Taps Matangira (Shona); Ashraf Elsharif (Sudanese Arabic); Sakhile Dlamini (Swati); Hakim Sidahmed (Tamazight); Melvin Johnson (Tamil); Sneha Kudugunta (Telugu); Alexander Tekle, Bserat Ghebremicael, Nami Russom, Naud Ghebre (Tigrinya); Abigail Annkah, Diana Akron, Maame Ofori, Monica Opoku-Geren, Seth Duodu-baah, Yvonne Dumor (Twi); Ousmane Loum (Wolof); and Daniel Virtheim (Yiddish).
Google I/O 2022: Advancing knowledge and computing
[TL;DR]
Nearly 24 years ago, Google started with two graduate students, one product, and a big mission: to organize the world’s information and make it universally accessible and useful. In the decades since, we’ve been developing our technology to deliver on that mission.
The progress we’ve made is because of our years of investment in advanced technologies, from AI to the technical infrastructure that powers it all. And once a year — on my favorite day of the year 🙂 — we share an update on how it’s going at Google I/O.
Today, I talked about how we’re advancing two fundamental aspects of our mission — knowledge and computing — to create products that are built to help. It’s exciting to build these products; it’s even more exciting to see what people do with them.
Thank you to everyone who helps us do this work, and most especially our Googlers. We are grateful for the opportunity.
– Sundar
Editor’s note: Below is an edited transcript of Sundar Pichai’s keynote address during the opening of today’s Google I/O Developers Conference.
Hi, everyone, and welcome. Actually, let’s make that welcome back! It’s great to return to Shoreline Amphitheatre after three years away. To the thousands of developers, partners and Googlers here with us, it’s great to see all of you. And to the millions more joining us around the world — we’re so happy you’re here, too.
Last year, we shared how new breakthroughs in some of the most technically challenging areas of computer science are making Google products more helpful in the moments that matter. All this work is in service of our timeless mission: to organize the world’s information and make it universally accessible and useful.
I’m excited to show you how we’re driving that mission forward in two key ways: by deepening our understanding of information so that we can turn it into knowledge; and advancing the state of computing, so that knowledge is easier to access, no matter who or where you are.
Today, you’ll see how progress on these two parts of our mission ensures Google products are built to help. I’ll start with a few quick examples. Throughout the pandemic, Google has focused on delivering accurate information to help people stay healthy. Over the last year, people used Google Search and Maps to find where they could get a COVID vaccine nearly two billion times.
Google’s flood forecasting technology sent flood alerts to 23 million people in India and Bangladesh last year.
We’ve also expanded our flood forecasting technology to help people stay safe in the face of natural disasters. During last year’s monsoon season, our flood alerts notified more than 23 million people in India and Bangladesh. And we estimate this supported the timely evacuation of hundreds of thousands of people.
In Ukraine, we worked with the government to rapidly deploy air raid alerts. To date, we’ve delivered hundreds of millions of alerts to help people get to safety. In March I was in Poland, where millions of Ukrainians have sought refuge. Warsaw’s population has increased by nearly 20% as families host refugees in their homes, and schools welcome thousands of new students. Nearly every Google employee I spoke with there was hosting someone.
Adding 24 more languages to Google Translate
In countries around the world, Google Translate has been a crucial tool for newcomers and residents trying to communicate with one another. We’re proud of how it’s helping Ukrainians find a bit of hope and connection until they are able to return home again.
With machine learning advances, we’re able to add languages like Quechua to Google Translate.
Real-time translation is a testament to how knowledge and computing come together to make people’s lives better. More people are using Google Translate than ever before, but we still have work to do to make it universally accessible. There’s a long tail of languages that are underrepresented on the web today, and translating them is a hard technical problem. That’s because translation models are usually trained with bilingual text — for example, the same phrase in both English and Spanish. However, there’s not enough publicly available bilingual text for every language.
So with advances in machine learning, we’ve developed a monolingual approach where the model learns to translate a new language without ever seeing a direct translation of it. By collaborating with native speakers and institutions, we found these translations were of sufficient quality to be useful, and we’ll continue to improve them.
We’re adding 24 new languages to Google Translate.
Today, I’m excited to announce that we’re adding 24 new languages to Google Translate, including the first indigenous languages of the Americas. Together, these languages are spoken by more than 300 million people. Breakthroughs like this are powering a radical shift in how we access knowledge and use computers.
Taking Google Maps to the next level
So much of what’s knowable about our world goes beyond language — it’s in the physical and geospatial information all around us. For more than 15 years, Google Maps has worked to create rich and useful representations of this information to help us navigate. Advances in AI are taking this work to the next level, whether it’s expanding our coverage to remote areas, or reimagining how to explore the world in more intuitive ways.
Advances in AI are helping to map remote and rural areas.
Around the world, we’ve mapped around 1.6 billion buildings and over 60 million kilometers of roads to date. Some remote and rural areas have previously been difficult to map, due to scarcity of high-quality imagery and distinct building types and terrain. To address this, we’re using computer vision and neural networks to detect buildings at scale from satellite images. As a result, we have increased the number of buildings on Google Maps in Africa by 5X since July 2020, from 60 million to nearly 300 million.
We’ve also doubled the number of buildings mapped in India and Indonesia this year. Globally, over 20% of the buildings on Google Maps have been detected using these new techniques. We’ve gone a step further, and made the dataset of buildings in Africa publicly available. International organizations like the United Nations and the World Bank are already using it to better understand population density, and to provide support and emergency assistance.
Immersive view in Google Maps fuses together aerial and street level images.
We’re also bringing new capabilities into Maps. Using advances in 3D mapping and machine learning, we’re fusing billions of aerial and street level images to create a new, high-fidelity representation of a place. These breakthrough technologies are coming together to power a new experience in Maps called immersive view: it allows you to explore a place like never before.
Let’s go to London and take a look. Say you’re planning to visit Westminster with your family. You can get into this immersive view straight from Maps on your phone, and you can pan around the sights… here’s Westminster Abbey. If you’re thinking of heading to Big Ben, you can check if there’s traffic, how busy it is, and even see the weather forecast. And if you’re looking to grab a bite during your visit, you can check out restaurants nearby and get a glimpse inside.
What’s amazing is that isn’t a drone flying in the restaurant — we use neural rendering to create the experience from images alone. And Google Cloud Immersive Stream allows this experience to run on just about any smartphone. This feature will start rolling out in Google Maps for select cities globally later this year.
Another big improvement to Maps is eco-friendly routing. Launched last year, it shows you the most fuel-efficient route, giving you the choice to save money on gas and reduce carbon emissions. Eco-friendly routes have already rolled out in the U.S. and Canada — and people have used them to travel approximately 86 billion miles, helping save an estimated half million metric tons of carbon emissions, the equivalent of taking 100,000 cars off the road.
Eco-friendly routes will expand to Europe later this year.
I’m happy to share that we’re expanding this feature to more places, including Europe later this year. In this Berlin example, you could reduce your fuel consumption by 18% taking a route that’s just three minutes slower. These small decisions have a big impact at scale. With the expansion into Europe and beyond, we estimate carbon emission savings will double by the end of the year.
And we’ve added a similar feature to Google Flights. When you search for flights between two cities, we also show you carbon emission estimates alongside other information like price and schedule, making it easy to choose a greener option. These eco-friendly features in Maps and Flights are part of our goal to empower 1 billion people to make more sustainable choices through our products, and we’re excited about the progress here.
New YouTube features to help people easily access video content
Beyond Maps, video is becoming an even more fundamental part of how we share information, communicate, and learn. Often when you come to YouTube, you are looking for a specific moment in a video and we want to help you get there faster.
Last year we launched auto-generated chapters to make it easier to jump to the part you’re most interested in.
This is also great for creators because it saves them time making chapters. We’re now applying multimodal technology from DeepMind. It simultaneously uses text, audio and video to auto-generate chapters with greater accuracy and speed. With this, we now have a goal to 10X the number of videos with auto-generated chapters, from eight million today, to 80 million over the next year.
Often the fastest way to get a sense of a video’s content is to read its transcript, so we’re also using speech recognition models to transcribe videos. Video transcripts are now available to all Android and iOS users.
Auto-translated captions on YouTube.
Next up, we’re bringing auto-translated captions on YouTube to mobile. Which means viewers can now auto-translate video captions in 16 languages, and creators can grow their global audience. We’ll also be expanding auto-translated captions to Ukrainian YouTube content next month, part of our larger effort to increase access to accurate information about the war.
Helping people be more efficient with Google Workspace
Just as we’re using AI to improve features in YouTube, we’re building it into our Workspace products to help people be more efficient. Whether you work for a small business or a large institution, chances are you spend a lot of time reading documents. Maybe you’ve felt that wave of panic when you realize you have a 25-page document to read ahead of a meeting that starts in five minutes.
At Google, whenever I get a long document or email, I look for a TL;DR at the top — TL;DR is short for “Too Long, Didn’t Read.” And it got us thinking, wouldn’t life be better if more things had a TL;DR?
That’s why we’ve introduced automated summarization for Google Docs. Using one of our machine learning models for text summarization, Google Docs will automatically parse the words and pull out the main points.
This marks a big leap forward for natural language processing. Summarization requires understanding of long passages, information compression and language generation, which used to be outside of the capabilities of even the best machine learning models.
And docs are only the beginning. We’re launching summarization for other products in Workspace. It will come to Google Chat in the next few months, providing a helpful digest of chat conversations, so you can jump right into a group chat or look back at the key highlights.
We’re bringing summarization to Google Chat in the coming months.
And we’re working to bring transcription and summarization to Google Meet as well so you can catch up on some important meetings you missed.
Visual improvements on Google Meet
Of course there are many moments where you really want to be in a virtual room with someone. And that’s why we continue to improve audio and video quality, inspired by Project Starline. We introduced Project Starline at I/O last year. And we’ve been testing it across Google offices to get feedback and improve the technology for the future. And in the process, we’ve learned some things that we can apply right now to Google Meet.
Starline inspired machine learning-powered image processing to automatically improve your image quality in Google Meet. And it works on all types of devices so you look your best wherever you are.
Machine learning-powered image processing automatically improves image quality in Google Meet.
We’re also bringing studio quality virtual lighting to Meet. You can adjust the light position and brightness, so you’ll still be visible in a dark room or sitting in front of a window. We’re testing this feature to ensure everyone looks like their true selves, continuing the work we’ve done with Real Tone on Pixel phones and the Monk Scale.
These are just some of the ways AI is improving our products: making them more helpful, more accessible, and delivering innovative new features for everyone.
Today at I/O Prabhakar Raghavan shared how we’re helping people find helpful information in more intuitive ways on Search.
Making knowledge accessible through computing
We’ve talked about how we’re advancing access to knowledge as part of our mission: from better language translation to improved Search experiences across images and video, to richer explorations of the world using Maps.
Now we’re going to focus on how we make that knowledge even more accessible through computing. The journey we’ve been on with computing is an exciting one. Every shift, from desktop to the web to mobile to wearables and ambient computing has made knowledge more useful in our daily lives.
As helpful as our devices are, we’ve had to work pretty hard to adapt to them. I’ve always thought computers should be adapting to people, not the other way around. We continue to push ourselves to make progress here.
Here’s how we’re making computing more natural and intuitive with the Google Assistant.
Introducing LaMDA 2 and AI Test Kitchen
A demo of LaMDA, our generative language model for dialogue application, and the AI Test Kitchen.
We’re continually working to advance our conversational capabilities. Conversation and natural language processing are powerful ways to make computers more accessible to everyone. And large language models are key to this.
Last year, we introduced LaMDA, our generative language model for dialogue applications that can converse on any topic. Today, we are excited to announce LaMDA 2, our most advanced conversational AI yet.
We are at the beginning of a journey to make models like these useful to people, and we feel a deep responsibility to get it right. To make progress, we need people to experience the technology and provide feedback. We opened LaMDA up to thousands of Googlers, who enjoyed testing it and seeing its capabilities. This yielded significant quality improvements, and led to a reduction in inaccurate or offensive responses.
That’s why we’ve made AI Test Kitchen. It’s a new way to explore AI features with a broader audience. Inside the AI Test Kitchen, there are a few different experiences. Each is meant to give you a sense of what it might be like to have LaMDA in your hands and use it for things you care about.
The first is called “Imagine it.” This demo tests if the model can take a creative idea you give it, and generate imaginative and relevant descriptions. These are not products, they are quick sketches that allow us to explore what LaMDA can do with you. The user interfaces are very simple.
Say you’re writing a story and need some inspirational ideas. Maybe one of your characters is exploring the deep ocean. You can ask what that might feel like. Here LaMDA describes a scene in the Mariana Trench. It even generates follow-up questions on the fly. You can ask LaMDA to imagine what kinds of creatures might live there. Remember, we didn’t hand-program the model for specific topics like submarines or bioluminescence. It synthesized these concepts from its training data. That’s why you can ask about almost any topic: Saturn’s rings or even being on a planet made of ice cream.
Staying on topic is a challenge for language models. Say you’re building a learning experience — you want it to be open-ended enough to allow people to explore where curiosity takes them, but stay safely on topic. Our second demo tests how LaMDA does with that.
In this demo, we’ve primed the model to focus on the topic of dogs. It starts by generating a question to spark conversation, “Have you ever wondered why dogs love to play fetch so much?” And if you ask a follow-up question, you get an answer with some relevant details: it’s interesting, it thinks it might have something to do with the sense of smell and treasure hunting.
You can take the conversation anywhere you want. Maybe you’re curious about how smell works and you want to dive deeper. You’ll get a unique response for that too. No matter what you ask, it will try to keep the conversation on the topic of dogs. If I start asking about cricket, which I probably would, the model brings the topic back to dogs in a fun way.
This challenge of staying on-topic is a tricky one, and it’s an important area of research for building useful applications with language models.
These experiences show the potential of language models to one day help us with things like planning, learning about the world, and more.
Of course, there are significant challenges to solve before these models can truly be useful. While we have improved safety, the model might still generate inaccurate, inappropriate, or offensive responses. That’s why we are inviting feedback in the app, so people can help report problems.
We will be doing all of this work in accordance with our AI Principles. Our process will be iterative, opening up access over the coming months, and carefully assessing feedback with a broad range of stakeholders — from AI researchers and social scientists to human rights experts. We’ll incorporate this feedback into future versions of LaMDA, and share our findings as we go.
Over time, we intend to continue adding other emerging areas of AI into AI Test Kitchen. You can learn more at: g.co/AITestKitchen.
Advancing AI language models
LaMDA 2 has incredible conversational capabilities. To explore other aspects of natural language processing and AI, we recently announced a new model. It’s called Pathways Language Model, or PaLM for short. It’s our largest model to date and trained on 540 billion parameters.
PaLM demonstrates breakthrough performance on many natural language processing tasks, such as generating code from text, answering a math word problem, or even explaining a joke.
It achieves this through greater scale. And when we combine that scale with a new technique called chain-of- thought prompting, the results are promising. Chain-of-thought prompting allows us to describe multi-step problems as a series of intermediate steps.
Let’s take an example of a math word problem that requires reasoning. Normally, how you use a model is you prompt it with a question and answer, and then you start asking questions. In this case: How many hours are in the month of May? So you can see, the model didn’t quite get it right.
In chain-of-thought prompting, we give the model a question-answer pair, but this time, an explanation of how the answer was derived. Kind of like when your teacher gives you a step-by-step example to help you understand how to solve a problem. Now, if we ask the model again — how many hours are in the month of May — or other related questions, it actually answers correctly and even shows its work.
Chain-of-thought prompting leads to better reasoning and more accurate answers.
Chain-of-thought prompting increases accuracy by a large margin. This leads to state-of-the-art performance across several reasoning benchmarks, including math word problems. And we can do it all without ever changing how the model is trained.
PaLM is highly capable and can do so much more. For example, you might be someone who speaks a language that’s not well-represented on the web today — which makes it hard to find information. Even more frustrating because the answer you are looking for is probably out there. PaLM offers a new approach that holds enormous promise for making knowledge more accessible for everyone.
Let me show you an example in which we can help answer questions in a language like Bengali — spoken by a quarter billion people. Just like before we prompt the model with two examples of questions in Bengali with both Bengali and English answers.
That’s it, now we can start asking questions in Bengali: “What is the national song of Bangladesh?” The answer, by the way, is “Amar Sonar Bangla” — and PaLM got it right, too. This is not that surprising because you would expect that content to exist in Bengali.
You can also try something that is less likely to have related information in Bengali such as: “What are popular pizza toppings in New York City?” The model again answers correctly in Bengali. Though it probably just stirred up a debate amongst New Yorkers about how “correct” that answer really is.
What’s so impressive is that PaLM has never seen parallel sentences between Bengali and English. Nor was it ever explicitly taught to answer questions or translate at all! The model brought all of its capabilities together to answer questions correctly in Bengali. And we can extend the techniques to more languages and other complex tasks.
We’re so optimistic about the potential for language models. One day, we hope we can answer questions on more topics in any language you speak, making knowledge even more accessible, in Search and across all of Google.
Introducing the world’s largest, publicly available machine learning hub
The advances we’ve shared today are possible only because of our continued innovation in our infrastructure. Recently we announced plans to invest $9.5 billion in data centers and offices across the U.S.
One of our state-of-the-art data centers is in Mayes County, Oklahoma. I’m excited to announce that, there, we are launching the world’s largest, publicly-available machine learning hub for our Google Cloud customers.
One of our state-of-the-art data centers in Mayes County, Oklahoma.
This machine learning hub has eight Cloud TPU v4 pods, custom-built on the same networking infrastructure that powers Google’s largest neural models. They provide nearly nine exaflops of computing power in aggregate — bringing our customers an unprecedented ability to run complex models and workloads. We hope this will fuel innovation across many fields, from medicine to logistics, sustainability and more.
And speaking of sustainability, this machine learning hub is already operating at 90% carbon-free energy. This is helping us make progress on our goal to become the first major company to operate all of our data centers and campuses globally on 24/7 carbon-free energy by 2030.
Even as we invest in our data centers, we are working to innovate on our mobile platforms so more processing can happen locally on device. Google Tensor, our custom system on a chip, was an important step in this direction. It’s already running on Pixel 6 and Pixel 6 Pro, and it brings our AI capabilities — including the best speech recognition we’ve ever deployed — right to your phone. It’s also a big step forward in making those devices more secure. Combined with Android’s Private Compute Core, it can run data-powered features directly on device so that it’s private to you.
People turn to our products every day for help in moments big and small. Core to making this possible is protecting your private information each step of the way. Even as technology grows increasingly complex, we keep more people safe online than anyone else in the world, with products that are secure by default, private by design and that put you in control.
We also spent time today sharing updates to platforms like Android. They’re delivering access, connectivity, and information to billions of people through their smartphones and other connected devices like TVs, cars and watches.
And we shared our new Pixel Portfolio, including the Pixel 6a, Pixel Buds Pro, Google Pixel Watch, Pixel 7, and Pixel tablet all built with ambient computing in mind. We’re excited to share a family of devices that work better together — for you.
The next frontier of computing: augmented reality
Today we talked about all the technologies that are changing how we use computers and access knowledge. We see devices working seamlessly together, exactly when and where you need them and with conversational interfaces that make it easier to get things done.
Looking ahead, there’s a new frontier of computing, which has the potential to extend all of this even further, and that is augmented reality. At Google, we have been heavily invested in this area. We’ve been building augmented reality into many Google products, from Google Lens to multisearch, scene exploration, and Live and immersive views in Maps.
These AR capabilities are already useful on phones and the magic will really come alive when you can use them in the real world without the technology getting in the way.
That potential is what gets us most excited about AR: the ability to spend time focusing on what matters in the real world, in our real lives. Because the real world is pretty amazing!
It’s important we design in a way that is built for the real world — and doesn’t take you away from it. And AR gives us new ways to accomplish this.
Let’s take language as an example. Language is just so fundamental to connecting with one another. And yet, understanding someone who speaks a different language, or trying to follow a conversation if you are deaf or hard of hearing can be a real challenge. Let’s see what happens when we take our advancements in translation and transcription and deliver them in your line of sight in one of the early prototypes we’ve been testing.
You can see it in their faces: the joy that comes with speaking naturally to someone. That moment of connection. To understand and be understood. That’s what our focus on knowledge and computing is all about. And it’s what we strive for every day, with products that are built to help.
Each year we get a little closer to delivering on our timeless mission. And we still have so much further to go. At Google, we genuinely feel a sense of excitement about that. And we are optimistic that the breakthroughs you just saw will help us get there. Thank you to all of the developers, partners and customers who joined us today. We look forward to building the future with all of you.
Achieve in-vehicle comfort using personalized machine learning and Amazon SageMaker
This blog post is co-written by Rudra Hota and Esaias Pech from Continental AG.
Many drivers have had the experience of trying to adjust temperature settings in their vehicle while attempting to keep their eyes on the road. Whether the previous driver preferred a warmer cabin temperature, or you’re now wearing warmer clothing, or the sun just emerged from the clouds, multiple conditions can make a driver uncomfortable and force their attention to the vehicle temperature dial. Wouldn’t it be convenient if your vehicle’s heating, ventilation, and air conditioning (HVAC) system could learn your individual preferences, and automatically make these adjustments for you?
Continental AG, a German multinational automotive technology conglomerate and Tier 1 supplier of automotive parts and technology, recently embarked on an initiative to develop in-vehicle human machine interface (HMI) capabilities enabled by machine learning (ML) technologies that will deliver personalization features for its OEM (original equipment manufacturer) automotive customers.
To help bring that vision to life, Continental Automotive Systems partnered with the Amazon Machine Learning Solutions Lab to develop personalization algorithms that learn from user behavior and automatically adjust the temperature for the driver to experience optimal in-vehicle thermal comfort. This is a challenging task to accomplish because people have different thermal preferences, and these preferences can also vary significantly depending on external environmental factors and multiple thermal loads affecting the temperature of the vehicle. The multi-contextual personalization system developed during the ML Solutions Lab engagement uses Amazon SageMaker, and is a first step on a broader journey by Continental AG toward transforming the automotive driving experience using ML to create and deliver an innovative suite of on-board personalization features for drivers.
Exploring the data
To prototype this solution, Continental Automotive Systems collected several hours of real-world data using a test vehicle equipped with multiple sensors that measured, among other things, the outside temperature, cabin temperature and humidity, and sunlight. Subjects were asked to adjust the temperature as they drove the test vehicle and record their level of comfort on a 7-point scale, where 0 indicates thermal comfort and -3/+3 indicate very cold/hot, respectively. The following figure shows an example session with nine sensor measurements, the temperature setting (hvac_set
), and comfort status (comfort
).
Because subjects were asked to explore the temperature range, they didn’t remain at a steady-state for long. Consequently, if we look at the correlation matrix across all timepoints from all sessions, there isn’t a clear relationship between the sensor readings and the set temperature (hvac_set
). To learn a cleaner relationship, we detected and extracted the steady-state periods, defined as the periods of thermal comfort during which the sensor readings are stable within some threshold. We then used these steady-state periods to regenerate the correlation plot. By doing so, the amount of clothing worn (clothing
) and cloudiness (cloudiness
) or degree of darkness (light_voltage
) reveal themselves as well-correlated variables. (Note that light_voltage
is actually a measure of the voltage required to control ambient light and headlights, so higher values mean it’s darker outside.) Intuitively, as clothing
increases, the temperature is set lower in order to achieve comfort, and as cloudiness
and light_voltage
increase, the temperature is set higher.
The following table shows the correlation between set temperature (hvac_set
) and sensor readings, for all data (left) and steady-states only (right).
In the following figure, steady-states are detected (bottom; value=1.0) during periods of thermal comfort (top), and when sensor measurements are relatively stable.
Given this finding, we chose to focus on clothing and solar load (collapsing cloudiness
and light_voltage
into a single variable, and inverting the sign such that higher values mean more sunlight) as the primary input variables to the ML model. Given that only a small amount of real-world data was collected for this prototype, we decided to first develop this modeling approach using a simulated environment, where we could generate as much data as we want and more robustly evaluate the advantages and disadvantages of each approach.
For this data exploration, as well as the simulation and modeling described in later sections, we used SageMaker. With SageMaker, we could securely access the provided data stored on Amazon Simple Storage Service (Amazon S3) and quickly explore it within fully managed Jupyter notebooks via one-click notebook instances. Next, we used a pre-loaded data science environment within our notebooks (complete with ML frameworks like PyTorch and MXNet) to prototype and implement the model and simulator, which are described in the next sections.
Simulating in-vehicle temperature dynamics
Our aim with the simulator was to capture the key variables and dynamics governing thermal comfort. Our assumption was if our chosen model was sophisticated enough to learn the relevant relationships within this environment, it could also learn the true relationships in the real-world data (an assumption that is validated by the real-world experimentation at the end of the post).
The simulation scheme we devised is pictured in the following figure. Crucially, as will be the case in production, this is a closed loop system where the temperature-adjusting behavior affects the cabin temperature, which then indirectly affects the temperature-setting behavior. Both exogenous factors (orange) and endogenous factors (green) play a role, and exponential temporal dynamics determine how these impact the cabin air temperature, near-skin temperature, and ultimately the set temperature.
The interactions across these variables through time allow for a complex and dynamic system that we can use to generate sessions. Moreover, we found that our system could be tuned to mimic real-world sessions, which gives us confidence that we captured the relevant variables and dynamics. In the following figure, the simulator (solid) is able to mimic real-world (dashed) dynamics.
Building a contextual, personalized model
Before setting about building our personalization system, we first established some baselines of increasing complexity. The simplest baseline (“non-personalized baseline”) is to use the average temperature setting as the prediction. A slightly more sophisticated baseline uses the average temperature per person as the prediction for that person (“personalized baseline”). Finally, we have a trained baseline (“non-personalized model”) that learns from user behavior in an undifferentiated way. It learns about situations that are generally true (such as the more clothing a driver or passenger is wearing, the lower the temperature should be to achieve comfort), but doesn’t learn any of the personal preferences. This model can be used for new or guest users who don’t have any data.
In building the personalization system, we wanted to leverage the commonalities across users, as well as the differences among them. In that regard, we adopted a two-step approach. First, we trained a neural network using all data in an undifferentiated way. The inputs included both the exogenous variables as well as demographic information. Second, we fine-tuned the final layer of the neural network for each person, using just their data.
In addition to making sense intuitively, this approach (“personalized model”) outperformed all of the baselines, as measured by the mean-squared error (MSE) between the predicted and actual temperature setting in the steady-state periods. This approach was also more accurate than similar setups, like training from scratch per person without the initial pre-training or fine-tuning all parameters, rather than just the final layer. It was computationally more efficient than these approaches, because only the final layers need to be trained and stored per person, rather than having an entirely distinct model per person. For both stages, we detected and used just the steady-state periods for training, because these have a clearer relationship between the sensor readings and the temperature to set to achieve comfort. Steady-state detection was done using only sensor measurements, because the comfort variable won’t actually be available in production. We also experimented with temporal sequence modeling using the entire time series, but found that performance was consistently worse with this approach.
The following table shows the personalized model beats baselines on simulated data, in terms of MSE.
Model | MSE (lower is better) |
Non-personalized baseline | 9.681 |
Personalized baseline | 7.270 |
Non-personalized model | 1.518 |
Personalized model | 0.555 |
For the simulated data, we could also evaluate the model as if it were in production, by replacing the manual control of the temperature with the automatic control of the model (model-as-controller). With this setup, we can simulate sessions with each model as the controller and evaluate how often the user is in a comfortable state with this automatic control, where a comfort ratio of 1.0 indicates that the user is perfectly comfortable throughout the whole session. As with the MSE metric, the personalized model outperforms the non-personalized baseline (see the following table). Moreover, the personalized model outperforms manual control, because the automated system responds immediately to changing conditions, whereas the persona takes some time to react.
Model | Comfort Ratio (higher is better) |
Manual control | .575 |
Non-personalized model | .531 |
Personalized model | .700 |
Having established our modeling approach on the simulated data, we returned to the real-world data to see if we could beat the baselines. As before, we pre-trained the model on all data and then fine-tuned the last layer of the model for the two participants with greater than seven steady-state periods (one participant was excluded because they seldom adjusted the temperature setting). Once again, the personalized model beat the baselines (see the following table), reinforcing the conclusion that the personalized model is best.
Model | MSE (lower is better) |
Non-personalized baseline | 60.885 |
Personalized baseline | 69.902 |
Non-personalized model | 24.823 |
Personalized model | 18.059 |
Conclusion
In this post, we demonstrated how to apply machine learning to achieve personalized in-vehicle thermal comfort. With the simulation environment that was developed, we were able to prototype and evaluate different modeling approaches, which were then successfully applied to real-world data.
Naturally, to scale this solution to production roll-out, more real-world data is needed, and we can use the simulation environment to generate estimates on the amount of data needed for collection. In addition, for the system to perform end to end, ancillary modules are needed to identify the driver (and other occupants) and load their personalized profiles, including detecting the type of clothing on the individuals. These modules will require their own data and ML pipelines and, like the thermal comfort system, can use a cloud-edge architecture, where model training workloads are run in the cloud, while inference and minor updates can be performed at the edge in a privacy-preserving way.
This setup—with smart, ML-powered personalization applications delivered within the vehicle through a hybrid cloud and edge communication architecture—is a powerful paradigm, which can be replicated to bring ever-increasing intelligence at scale to the automotive driving experience.
About the Authors
Joshua Levy is a Senior Applied Scientist in the Amazon Machine Learning Solutions lab, where he helps customers design and build AI/ML solutions to solve key business problems.
Yifu Hu is an Applied Scientist in the Amazon Machine Learning Solutions lab, where he helps design creative ML solutions to address customers’ business problems in various industries.
Shane Rai is a Sr. ML Strategist at the Amazon Machine Learning Solutions Lab. He works with customers across a diverse spectrum of industries to solve their most pressing and innovative business needs using AWS’s breadth of cloud-based AI/ML services.
Boris Aronchik is a Manager in the Amazon AI Machine Learning Solutions Lab, where he leads a team of ML scientists and engineers to help AWS customers realize business goals leveraging AI/ML solutions.
Jennifer Zhu is an Applied Scientist from the Amazon AI Machine Learning Solutions Lab. She works with AWS’s customers to build AI/ML solutions for their high-priority business needs.
Ivan Sosnovik is an Applied Scientist in the Amazon Machine Learning Solutions Lab. He develops ML solutions to help customers to achieve their business goals.
Rudra N. Hota is an Artificial Intelligence Engineer at Holistic Engineering and Technologies at Continental Automotive Systems. With expertise in the field of Computer vision and machine learning, he is working with diverse teams to collaboratively define problem statements and explore fitting solutions.
Esaias Pech is a Software Engineering Manager in the Continental Engineering Services group, where he gets the opportunity to work with Automotive OEMs on Displays, Driver Monitoring Cameras and Infotainment High Performance Computers to improve User Experience in the next generation of vehicles.
Understanding the world through language
Language is at the heart of how people communicate with each other. It’s also proving to be powerful in advancing AI and building helpful experiences for people worldwide.
From the beginning, we set out to connect words in your search to words on a page so we could make the web’s information more accessible and useful. Over 20 years later, as the web changes, and the ways people consume information expand from text to images to videos and more — the one constant is that language remains a surprisingly powerful tool for understanding information.
In recent years, we’ve seen an incredible acceleration in the field of natural language understanding. While our systems still don’t understand language the way people do, they’re increasingly able to spot patterns in information, identify complex concepts and even draw implicit connections between them. We’re even finding that many of our advanced models can understand information across languages or in non-language-based formats like images and videos.
Building the next generation of language models
In 2017, Google researchers developed the Transformer, the neural network that underlies major advancements like MUM and LaMDA. Last year, we shared our thinking on a new architecture called Pathways, which is loosely inspired by the sparse patterns of neural activity in the brain. When you read a blog post like this one, only the critical parts of your brain needed to process this information fire up — not every single neuron. With Pathways, we’re now able to train AI models to be similarly effective.
Using this system, we recently introduced PaLM, a new model that achieves state-of-the-art performance on challenging language modeling tasks. It can solve complex math word problems, and answer questions in new languages with very little additional training data.
PaLM also shows improvements in understanding and expressing logic. This is significant because it allows the model to express its reasoning through words. Remember your algebra problem sets? It wasn’t enough to just get the right answer — you had to explain how you got there. PaLM is able to prompt a “Chain of Thought” to explain its thought process, step-by-step. This emerging capability helps improve accuracy and our understanding of how a model arrives at answers.
Translating the languages of the world
Pathways-related models are enabling us to break down language barriers in a way never before possible. Nowhere is this clearer than in our recently added support for 24 new languages in Google Translate, spoken by over 300 million people worldwide — including the first indigenous languages of the Americas. The amazing part is that the neural model did this using only monolingual text with no translation pairs — which allows us to help communities and languages underrepresented by technology. Machine translation at this level helps the world feel a bit smaller, while allowing us to dream bigger.
Unlocking knowledge about the world across modalities
Today, people consume information through webpages, images, videos, and more. Our advanced language and Pathways-related models are learning to make sense of information stemming from these different modalities through language. With these multimodal capabilities, we’re expanding multisearch in the Google app so you can search more naturally than ever before. As the saying goes — “a picture is worth a thousand words” — it turns out, words are really the key to sharing information about the world.
Improving conversational AI
Despite these advancements, human language continues to be one of the most complex undertakings for computers.
In everyday conversation, we all naturally say “um,” pause to find the right words, or correct ourselves — and yet other people have no trouble understanding what we’re saying. That’s because people can react to conversational cues in as little as 200 milliseconds. Moving our speech model from data centers to run on the device made things faster, but we wanted to push the envelope even more.
Computers aren’t there yet — so we’re introducing improvements to responsiveness on the Assistant with unified neural networks, combining many models into smarter ones capable of understanding more — like when someone pauses but is not finished speaking. Getting closer to the fluidity of real-time conversation is finally possible with Google’s Tensor chip, which is custom-engineered to handle on-device machine learning tasks super fast.
We’re also investing in building models that are capable of carrying more natural, sensible and specific conversations. Since introducing LaMDA to the world last year, we’ve made great progress, improving the model in key areas of quality, safety and groundedness — areas where we know conversational AI models can struggle. We’ll be releasing the next iteration, LaMDA 2, as a part of the AI Test Kitchen, which we’ll be opening up to small groups of people gradually. Our goal with AI Test Kitchen is to learn, improve, and innovate responsibly on this technology together. It’s still early days for LaMDA, but we want to continue to make progress and do so responsibly with feedback from the community.
Responsible development of AI models
While language is a remarkably powerful and versatile tool for understanding the world around us, we also know it comes with its limitations and challenges. In 2018, we published our AI Principles as guidelines to help us avoid bias, test rigorously for safety, design with privacy top of mind and make technology accountable to people. We’re investing in research across disciplines to understand the types of harms language models can affect, and to develop the frameworks and methods to ensure we bring in a diversity of perspectives and make meaningful improvements. We also build and use tools that can help us better understand our models (e.g., identifying how different words affect a prediction, tracing an error back to training data and even measuring correlations within a model). And while we work to improve underlying models, we also test rigorously before and after any kind of product deployment.
We’ve come a long way since introducing the world to the Transformer. We’re proud of the tremendous value that it and its predecessors have brought not only to everyday Google products like Search and Translate, but also the breakthroughs they’ve powered in natural language understanding. Our work advancing the future of AI is driven by something as old as time: the power language has to bring people together.