Using AutoML for Time Series Forecasting

Posted by Chen Liang and Yifeng Lu, Software Engineers, Google Research, Brain Team

Time series forecasting is an important research area for machine learning (ML), particularly in domains where accurate forecasting is critical, such as retail, supply chain, energy, and finance. For example, in the consumer goods domain, improving the accuracy of demand forecasting by 10-20% can reduce inventory by 5% and increase revenue by 2-3%. Current ML-based forecasting solutions are usually built by experts and require significant manual effort, including model construction, feature engineering and hyperparameter tuning. However, such expertise may not be broadly available, which can limit the benefits of applying ML to time series forecasting challenges.

Automated machine learning (AutoML) addresses this by making ML more widely accessible: it automates the process of creating ML models, and has recently accelerated both ML research and the application of ML to real-world problems. For example, the initial work on neural architecture search enabled breakthroughs in computer vision, such as NasNet, AmoebaNet, and EfficientNet, and in natural language processing, such as Evolved Transformer. More recently, AutoML has also been applied to tabular data.

Today we introduce a scalable end-to-end AutoML solution for time series forecasting, which meets three key criteria:

  • Fully automated: The solution takes in data as input, and produces a servable TensorFlow model as output with no human intervention.
  • Generic: The solution works for most time series forecasting tasks and automatically searches for the best model configuration for each task.
  • High-quality: The produced models have competitive quality compared to those manually crafted for specific tasks.

We demonstrate the success of this approach through participation in the M5 forecasting competition, where this AutoML solution achieved competitive performance against hand-crafted models with moderate compute cost.

Challenges in Time Series Forecasting
Time series forecasting presents several challenges to machine learning models. First, the uncertainty is often high since the goal is to predict the future based on historical data. Unlike other machine learning problems, the test set, for example, future product sales, might have a different distribution from the training and validation set, which are extracted from the historical data. Second, the time series data from the real world often suffers from missing data and high intermittency (i.e., when a high fraction of the time series has the value of zero). Some time series tasks may not have historical data available and suffer from the cold start problem, for example, when predicting the sales of a new product. Third, since we aim to build a fully automated generic solution, the same solution needs to apply to a variety of datasets, which can vary significantly in the domain (product sales, web traffic, etc), the granularity (daily, hourly, etc), the history length, the types of features (categorical, numerical, date time, etc), and so on.

An AutoML Solution
To tackle these challenges, we designed an end-to-end TensorFlow pipeline with a specialized search space for time series forecasting. It is based on an encoder-decoder architecture, in which an encoder transforms the historical information in a time series into a set of vectors, and a decoder generates the future predictions based on these vectors. Inspired by the state-of-the-art sequence models, such as Transformer and WaveNet, and best practices in time series forecasting, our search space included components such as attention, dilated convolution, gating, skip connections, and different feature transformations. The resulting AutoML solution searches for the best combination of these components as well as core hyperparameters.
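To make the idea concrete, here is a minimal sketch of how such a search space might be represented and sampled. The component names and value ranges below are illustrative assumptions for demonstration, not the actual space used in this work.

```python
import random

# Illustrative forecasting search space; the components and ranges below are
# assumptions for demonstration, not the actual space described in the post.
SEARCH_SPACE = {
    "encoder_block": ["self_attention", "dilated_conv", "gated_conv"],
    "num_layers": [2, 4, 6, 8],
    "hidden_size": [64, 128, 256],
    "skip_connections": [True, False],
    "feature_transform": ["identity", "log1p", "standardize"],
    "learning_rate": [1e-4, 3e-4, 1e-3],
}

def sample_configuration(space, rng=random):
    """Draw one candidate model configuration uniformly at random."""
    return {name: rng.choice(choices) for name, choices in space.items()}

if __name__ == "__main__":
    candidate = sample_configuration(SEARCH_SPACE)
    print(candidate)  # e.g. {'encoder_block': 'dilated_conv', 'num_layers': 4, ...}
```

In practice the search would evaluate many such candidates and keep the top performers for the ensemble described below.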

To combat the uncertainty in predicting the future of a time series, an ensemble of the top models discovered in the search is used to make final predictions. The diversity in the top models made the predictions more robust to uncertainty and less prone to overfitting the historical data. To handle time series with missing data, we fill in the gaps with a trainable vector and let the model learn to adapt to the missing time steps. To address intermittency, we predict, for each future time step, not only the value, but also the probability that the value at this time step is non-zero, and combine the two predictions. Finally, we found that the automated search is able to adjust the architecture and hyperparameter choices for different datasets, which makes the AutoML solution generic and automates the modeling efforts.
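For the intermittency handling, one plausible way to combine the two predictions (an assumption for illustration; the post does not specify the exact combination rule) is to treat them as a two-part model and take the expected value.

```python
import numpy as np

def combine_intermittent_forecast(value_if_nonzero, prob_nonzero):
    """Combine a conditional value forecast with a non-zero probability.

    One plausible combination (an assumption, not necessarily the exact rule
    used in the work) is the expected value under a two-part model:
    E[y] = P(y != 0) * E[y | y != 0].
    """
    return np.asarray(prob_nonzero) * np.asarray(value_if_nonzero)

# Example: a 4-step-ahead forecast for a slow-moving product.
values = [12.0, 10.5, 11.0, 9.0]      # predicted sales if the step is non-zero
p_nonzero = [0.9, 0.2, 0.05, 0.6]     # predicted probability of any sale
print(combine_intermittent_forecast(values, p_nonzero))  # [10.8, 2.1, 0.55, 5.4]
```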

Benchmarking in Forecasting Competitions
To benchmark our AutoML solution, we participated in the M5 forecasting competition, the latest in the M-competition series, which is one of the most important competitions in the forecasting community, with a long history spanning nearly 40 years. This most recent competition was hosted on Kaggle and used a dataset from Walmart product sales, the real-world nature of which makes the problem quite challenging.

We participated in the competition with our fully automated solution and achieved a rank of 138 out of 5558 participants (top 2.5%) on the final leaderboard, which is in the silver medal zone. Participants in the competition had almost four months to produce their models. While many of the competitive forecasting models required months of manual effort to create, our AutoML solution found the model in a short time with only a moderate compute cost (500 CPUs for 2 hours) and no human intervention.

We also benchmarked our AutoML forecasting solution on several other Kaggle datasets and found that on average it outperforms 92% of hand-crafted models, despite its limited resource use.

Evaluation of the AutoML Forecasting solution on other Kaggle Datasets (Rossman Store Sales, Web Traffic, Favorita Grocery Sales) besides M5.

This work demonstrates the strength of an end-to-end AutoML solution for time series forecasting, and we are excited about its potential impact on real-world applications.

Acknowledgements
This project was a joint effort of Google Brain team members Chen Liang, Da Huang, Yifeng Lu and Quoc V. Le. We also thank Junwei Yuan, Xingwei Yang, Dawei Jia, Chenyu Zhao, Tin-yun Ho, Meng Wang, Yaguang Li, Nicolas Loeff, Manish Kurse, Kyle Anderson and Nishant Patil for their collaboration.

Transformers for Image Recognition at Scale

Posted by Neil Houlsby and Dirk Weissenborn, Research Scientists, Google Research

While convolutional neural networks (CNNs) have been used in computer vision since the 1980s, they were not at the forefront until 2012 when AlexNet surpassed the performance of contemporary state-of-the-art image recognition methods by a large margin. Two factors helped enable this breakthrough: (i) the availability of training sets like ImageNet, and (ii) the use of commoditized GPU hardware, which provided significantly more compute for training. As such, since 2012, CNNs have become the go-to model for vision tasks.

The benefit of using CNNs was that they avoided the need for hand-designed visual features, instead learning to perform tasks directly from data “end to end”. However, while CNNs avoid hand-crafted feature-extraction, the architecture itself is designed specifically for images and can be computationally demanding. Looking forward to the next generation of scalable vision models, one might ask whether this domain-specific design is necessary, or if one could successfully leverage more domain agnostic and computationally efficient architectures to achieve state-of-the-art results.

As a first step in this direction, we present the Vision Transformer (ViT), a vision model based as closely as possible on the Transformer architecture originally designed for text-based tasks. ViT represents an input image as a sequence of image patches, similar to the sequence of word embeddings used when applying Transformers to text, and directly predicts class labels for the image. ViT demonstrates excellent performance when trained on sufficient data, outperforming a comparable state-of-the-art CNN with four times fewer computational resources. To foster additional research in this area, we have open-sourced both the code and models.

The Vision Transformer treats an input image as a sequence of patches, akin to a series of word embeddings generated by a natural language processing (NLP) Transformer.

The Vision Transformer
The original text Transformer takes as input a sequence of words, which it then uses for classification, translation, or other NLP tasks. For ViT, we make the fewest possible modifications to the Transformer design to make it operate directly on images instead of words, and observe how much about image structure the model can learn on its own.

ViT divides an image into a grid of square patches. Each patch is flattened into a single vector by concatenating the channels of all pixels in a patch and then linearly projecting it to the desired input dimension. Because Transformers are agnostic to the structure of the input elements, we add learnable position embeddings to each patch, which allow the model to learn about the structure of the images. A priori, ViT does not know about the relative location of patches in the image, or even that the image has a 2D structure; it must learn such relevant information from the training data and encode structural information in the position embeddings.
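The patch-embedding step can be written in a few lines. The NumPy sketch below is a simplified illustration: the batch dimension and the class token are omitted, and the projection matrix and position embeddings are random stand-ins for learned parameters.

```python
import numpy as np

def patchify_and_embed(image, patch_size, w_proj, pos_embed):
    """Split an image into square patches, flatten, project, add position embeddings.

    image:     (H, W, C) array, H and W divisible by patch_size
    w_proj:    (patch_size * patch_size * C, D) learned projection matrix
    pos_embed: (num_patches, D) learned position embeddings
    """
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Rearrange into a grid of patches, then flatten each patch into one vector.
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    tokens = patches @ w_proj          # linear projection to the model dimension
    return tokens + pos_embed          # position information is learned, not hard-coded

# Toy example: a 224x224 RGB image, 16x16 patches, model dimension 64.
rng = np.random.default_rng(0)
img = rng.normal(size=(224, 224, 3))
d = 64
w = rng.normal(size=(16 * 16 * 3, d)) * 0.02
pos = rng.normal(size=((224 // 16) ** 2, d)) * 0.02
print(patchify_and_embed(img, 16, w, pos).shape)  # (196, 64)
```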

Scaling Up

We first train ViT on ImageNet, where it achieves a best score of 77.9% top-1 accuracy. While this is decent for a first attempt, it falls far short of the state of the art — the current best CNN trained on ImageNet with no extra data reaches 85.8%. Despite mitigation strategies (e.g., regularization), ViT overfits the ImageNet task due to its lack of inbuilt knowledge about images.

To investigate the impact of dataset size on model performance, we train ViT on ImageNet-21k (14M images, 21k classes) and JFT (300M images, 18k classes), and compare the results to a state-of-the-art CNN, Big Transfer (BiT), trained on the same datasets. As previously observed, ViT performs significantly worse than the CNN equivalent (BiT) when trained on ImageNet (1M images). However, on ImageNet-21k (14M images) performance is comparable, and on JFT (300M images), ViT now outperforms BiT.

Finally, we investigate the impact of the amount of computation involved in training the models. For this, we train several different ViT models and CNNs on JFT. These models span a range of model sizes and training durations. As a result, they require varying amounts of compute for training. We observe that, for a given amount of compute, ViT yields better performance than the equivalent CNNs.

Left: Performance of ViT when pre-trained on different datasets. Right: ViT yields a good performance/compute trade-off.

High-Performing Large-Scale Image Recognition
Our data suggest that (1) with sufficient training ViT can perform very well, and (2) ViT yields an excellent performance/compute trade-off at both smaller and larger compute scales. Therefore, to see if performance improvements carried over to even larger scales, we trained a 600M-parameter ViT model.

This large ViT model attains state-of-the-art performance on multiple popular benchmarks, including 88.55% top-1 accuracy on ImageNet and 99.50% on CIFAR-10. ViT also performs well on the cleaned-up version of the ImageNet evaluation set, “ImageNet-ReaL”, attaining 90.72% top-1 accuracy. Finally, ViT works well on diverse tasks, even with few training data points. For example, on the VTAB-1k suite (19 tasks with 1,000 data points each), ViT attains 77.63%, significantly ahead of the single-model state of the art (SOTA) (76.3%), and even matching SOTA attained by an ensemble of multiple models (77.6%). Most importantly, these results are obtained using fewer compute resources compared to previous SOTA CNNs, e.g., 4x fewer than the pre-trained BiT models.

Vision Transformer matches or outperforms state-of-the-art CNNs on popular benchmarks. Left: Popular image classification tasks (ImageNet, including new validation labels ReaL, and CIFAR, Pets, and Flowers). Right: Average across 19 tasks in the VTAB classification suite.

Visualizations
To gain some intuition into what the model learns, we visualize some of its internal workings. First, we look at the position embeddings — parameters that the model learns to encode the relative location of patches — and find that ViT is able to reproduce an intuitive image structure. Each position embedding is most similar to others in the same row and column, indicating that the model has recovered the grid structure of the original images. Second, we examine the average spatial distance between one element attending to another for each transformer block. At higher layers (depths of 10-20) only global features are used (i.e., large attention distances), but the lower layers (depths 0-5) capture both global and local features, as indicated by a large range in the mean attention distance. By contrast, only local features are present in the lower layers of a CNN. These experiments indicate that ViT can learn features hard-coded into CNNs (such as awareness of grid structure), but is also free to learn more generic patterns, such as a mix of local and global features at lower layers, that can aid generalization.
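For readers who want to reproduce the attention-distance analysis, the sketch below shows one way to compute it for a single attention head. Treatment of the class token and averaging over heads are simplifications relative to the full analysis in the paper.

```python
import numpy as np

def mean_attention_distance(attn, grid_size, patch_size):
    """Average spatial distance between a query patch and the patches it attends to.

    attn: (num_patches, num_patches) attention weights for one head (rows sum to 1).
    Distances are measured in pixels between corresponding patch positions.
    """
    ys, xs = np.meshgrid(np.arange(grid_size), np.arange(grid_size), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1) * patch_size
    # Pairwise Euclidean distances between patch positions.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted distance per query, then averaged over all queries.
    return float((attn * dists).sum(axis=1).mean())

# Toy example: uniform attention over a 14x14 grid of 16-pixel patches.
n = 14 * 14
uniform = np.full((n, n), 1.0 / n)
print(mean_attention_distance(uniform, grid_size=14, patch_size=16))
```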

Left: ViT learns the grid like structure of the image patches via its position embeddings. Right: The lower layers of ViT contain both global and local features, the higher layers contain only global features.

Summary
While CNNs have revolutionized computer vision, our results indicate that models tailor-made for imaging tasks may be unnecessary, or even sub-optimal. With ever-increasing dataset sizes, and the continued development of unsupervised and semi-supervised methods, the development of new vision architectures that train more efficiently on these datasets becomes increasingly important. We believe ViT is a preliminary step towards generic, scalable architectures that can solve many vision tasks, or even tasks from many domains, and are excited for future developments.

A preprint of our work, along with code and models, is publicly available.

Acknowledgements
We would like to thank our co-authors in Berlin, Zürich, and Amsterdam: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, and Jakob Uszkoreit. We would like to thank Andreas Steiner for crucial help with infrastructure and open-sourcing, Joan Puigcerver and Maxim Neumann for work on large-scale training infrastructure, and Dmitry Lepikhin, Aravindh Mahendran, Daniel Keysers, Mario Lučić, Noam Shazeer, and Colin Raffel for useful discussions. Finally, we thank Tom Small for creating the Visual Transformer animation in this post.

Can AI make me trendier?

As a software engineer and generally analytic type, I like to craft theories for everything. Theories on how to build software, how to stay productive, how to be creative…and even how to dress well. For help with that last one, I decided to hire a personal stylist. As it turned out, I was not my stylist’s first software engineer client. “The problem with you people in tech is that you’re always looking for some sort of theory of fashion,” she told me. “But there is no formula–it’s about taste.”

Unfortunately my stylist’s taste was a bit outside of my price range (I drew the line at a $300 hoodie). But I knew she was right. It’s true that computers (and maybe the people who program them) are better at solving problems with clear-cut answers than they are at navigating touchy-feely matters, like taste. Fashion trends are not set by data-crunching CPUs; they’re made by human tastemakers and fashionistas and their modern-day equivalents, social media influencers.

I found myself wondering if I could build an app that combined trendsetters’ sense of style with AI’s efficiency to help me out a little. I started getting fashion inspiration from Instagram influencers who matched my style. When I saw an outfit I liked, I’d try to recreate it using items I already owned. It was an effective strategy, so I set out to automate it with AI.

First, I partnered up with one of my favorite programmers, who just so happened to also be an Instagram influencer, Laura Medalia (or codergirl_ on Instagram). With her permission, I uploaded all of Laura’s pictures to Google Cloud to serve as my outfit inspiration.

Image showing a screenshot of the Instagram profile of "codergirl."

Next, I painstakingly photographed every single item of clothing I owned, creating a digital archive of my closet.

Animated GIF showing a woman in a white room placing different clothing items on a mannequin and taking photos of them.

To compare my closet with Laura’s, I used Google Cloud Vision Product Search API, which uses computer vision to identify similar products. If you’ve ever seen a “See Similar Items” tab when you’re online shopping, it’s probably powered by a similar technology. I used this API to look through all of Laura’s outfits and all of my clothes to figure out which looks I could recreate. I bundled up all of the recommendations into a web app so that I could browse them on my phone, and voila: I had my own AI-powered stylist. It looks like this:

Animated GIF showing different screens that display items of clothing that can be paired together to create an outfit.
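For reference, a product-search query against an indexed product set looks roughly like the sketch below. The project, location, and product-set IDs are placeholders, and the helper methods follow the pattern in the Cloud Vision client samples, so check the current library documentation before relying on the exact names.

```python
from google.cloud import vision

# Sketch of a "find similar items" query; IDs are placeholders and the
# product_search helper mirrors the Cloud Vision client samples -- verify
# against the current library documentation before use.
product_search_client = vision.ProductSearchClient()
image_annotator_client = vision.ImageAnnotatorClient()

product_set_path = product_search_client.product_set_path(
    project="my-project", location="us-east1", product_set="my-closet")

with open("influencer_outfit.jpg", "rb") as f:
    image = vision.Image(content=f.read())

params = vision.ProductSearchParams(
    product_set=product_set_path,
    product_categories=["apparel-v2"],   # clothing category
)
response = image_annotator_client.product_search(
    image, image_context=vision.ImageContext(product_search_params=params))

for result in response.product_search_results.results:
    print(result.product.display_name, result.score)
```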

Thanks to Laura’s sense of taste, I have lots of new ideas for styling my own wardrobe. Here’s one look I was able to recreate:

Image showing two screens; on the left, a woman is standing in a room wearing a fashionable outfit with the items that make up that outfit in two panels below her. In the other is another woman, wearing a similar outfit.

If you want to see the rest of my newfound outfits, check out the YouTube video at the top of this post, where I go into all of the details of how I built the app, or read my blog post.

No, I didn’t end up with a Grand Unified Theory of Fashion—but at least I have something stylish to wear while I’m figuring it out.

Navigating Recorder Transcripts Easily, with Smart Scrolling

Posted by Itay Inbar, Senior Software Engineer, Google Research

Last year we launched Recorder, a new kind of recording app that made audio recording smarter and more useful by leveraging on-device machine learning (ML) to transcribe the recording, highlight audio events, and suggest appropriate tags for titles. Recorder makes editing, sharing and searching through transcripts easier. Yet because Recorder can transcribe very long recordings (up to 18 hours!), it can still be difficult for users to find specific sections, necessitating a new solution to quickly navigate such long transcripts.

To increase the navigability of content, we introduce Smart Scrolling, a new ML-based feature in Recorder that automatically marks important sections in the transcript, chooses the most representative keywords from each section, and then surfaces those keywords on the vertical scrollbar, like chapter headings. The user can then scroll through the keywords or tap on them to quickly navigate to the sections of interest. The models used are lightweight enough to be executed on-device without the need to upload the transcript, thus preserving user privacy.

Smart Scrolling feature UX

Under the hood
The Smart Scrolling feature is composed of two distinct tasks. The first extracts representative keywords from each section and the second picks which sections in the text are the most informative and unique.

For each task, we utilize two different natural language processing (NLP) approaches: a distilled bidirectional transformer (BERT) model pre-trained on data sourced from a Wikipedia dataset, alongside a modified extractive term frequency–inverse document frequency (TF-IDF) model. By using the bidirectional transformer and the TF-IDF-based models in parallel for both the keyword extraction and important section identification tasks, alongside aggregation heuristics, we were able to harness the advantages of each approach and mitigate their respective drawbacks (more on this in the next section).

The bidirectional transformer is a neural network architecture that employs a self-attention mechanism to achieve context-aware processing of the input text in a non-sequential fashion. This enables parallel processing of the input text to identify contextual clues both before and after a given position in the transcript.

Bidirectional Transformer-based model architecture

The extractive TF-IDF approach rates terms based on their frequency in the text relative to their inverse frequency in the training dataset, enabling it to find unique, representative terms in the text.

Both models were trained on publicly available conversational datasets that were labeled and evaluated by independent raters. The conversational datasets were from the same domains as the expected product use cases, focusing on meetings, lectures, and interviews, thus ensuring the same word frequency distribution (Zipf’s law).

Extracting Representative Keywords
The TF-IDF-based model detects informative keywords by giving each word a score, which corresponds to how representative this keyword is within the text. The model does so, much like a standard TF-IDF model, by utilizing the ratio of the number of occurrences of a given word in the text compared to the whole of the conversational dataset, but it also takes into account the specificity of the term, i.e., how broad or specific it is. Furthermore, the model then aggregates these features into a score using a pre-trained function curve. In parallel, the bidirectional transformer model, which was fine-tuned on the task of extracting keywords, provides a deep semantic understanding of the text, enabling it to extract precise context-aware keywords.

The TF-IDF approach is conservative in the sense that it is prone to finding uncommon keywords in the text (high bias), while the drawback for the bidirectional transformer model is the high variance of the possible keywords that can be extracted. But when used together, these two models complement each other, forming a balanced bias-variance tradeoff.

Once the keyword scores are retrieved from both models, we normalize and combine them by utilizing NLP heuristics (e.g., the weighted average), removing duplicates across sections, and eliminating stop words and verbs. The output of this process is an ordered list of suggested keywords for each of the sections.
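A toy sketch of this two-signal keyword scoring is shown below. The stop-word list, the smoothing, and the weighted-average heuristic are illustrative assumptions rather than the production logic.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "for", "of", "in", "is", "it", "we"}

def tfidf_keyword_scores(section, doc_freq, num_docs):
    """Score each word in a section by term frequency times inverse document
    frequency over a background conversational corpus (doc_freq, num_docs)."""
    words = re.findall(r"[a-z']+", section.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    total = sum(counts.values()) or 1
    return {w: (c / total) * math.log((1 + num_docs) / (1 + doc_freq.get(w, 0)))
            for w, c in counts.items()}

def combine_scores(tfidf_scores, transformer_scores, weight=0.5):
    """Normalize both score sets to [0, 1] and take a weighted average.
    (The heuristic here is an illustrative assumption, not the production rule.)"""
    def norm(scores):
        hi = max(scores.values(), default=1.0) or 1.0
        return {k: v / hi for k, v in scores.items()}
    a, b = norm(tfidf_scores), norm(transformer_scores)
    keys = set(a) | set(b)
    combined = {k: weight * a.get(k, 0.0) + (1 - weight) * b.get(k, 0.0) for k in keys}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

# Example with a tiny background corpus and stand-in transformer scores.
background = {"meeting": 40, "agenda": 15, "great": 300}
section = "Let's review the launch agenda and the migration timeline for the meeting"
tfidf = tfidf_keyword_scores(section, background, num_docs=1000)
bert_like = {"launch": 0.9, "migration": 0.8, "agenda": 0.6}
print(combine_scores(tfidf, bert_like)[:3])
```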

Rating A Section’s Importance
The next task is to determine which sections should be highlighted as informative and unique. To solve this task, we again combine the two models mentioned above, which yield two distinct importance scores for each of the sections. We compute the first score by taking the TF-IDF scores of all the keywords in the section and weighting them by their respective number of appearances in the section, followed by a summation of these individual keyword scores. We compute the second score by running the section text through the bidirectional transformer model, which was also trained on the sections rating task. The scores from both models are normalized and then combined to yield the section score.
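Sketched in code, the blending might look like the following; the squashing function and weights are assumptions for illustration only.

```python
def section_importance(keyword_scores, keyword_counts, transformer_score,
                       tfidf_weight=0.5):
    """Illustrative section score: a count-weighted sum of keyword TF-IDF scores,
    normalized and blended with a transformer-derived score (assumed in [0, 1])."""
    tfidf_sum = sum(keyword_scores.get(w, 0.0) * n for w, n in keyword_counts.items())
    tfidf_norm = tfidf_sum / (1.0 + tfidf_sum)   # squash to [0, 1); an assumption
    return tfidf_weight * tfidf_norm + (1 - tfidf_weight) * transformer_score

# Example: two sections, the first with stronger keywords.
s1 = section_importance({"quarterly": 0.8, "roadmap": 0.6},
                        {"quarterly": 3, "roadmap": 2}, transformer_score=0.7)
s2 = section_importance({"okay": 0.1}, {"okay": 5}, transformer_score=0.2)
print(s1 > s2)  # True: the first section would be highlighted first
```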

Smart Scrolling pipeline architecture

Some Challenges
A significant challenge in the development of Smart Scrolling was how to identify whether a section or keyword is important – what is of great importance to one person can be of less importance to another. The key was to highlight sections only when it is possible to extract helpful keywords from them.

To do this, we configured the solution to select the top-scored sections that also have highly rated keywords, with the number of sections highlighted proportional to the length of the recording. In the context of the Smart Scrolling feature, a keyword was more highly rated if it better represented the unique information of the section.

To train the model to understand these criteria, we needed to prepare a labeled training dataset tailored to this task. In collaboration with a team of skilled raters, we applied this labeling objective to a small batch of examples to establish an initial dataset in order to evaluate the quality of the labels and instruct the raters in cases where there were deviations from what was intended. Once the labeling process was complete, we reviewed the labeled data manually and made corrections to the labels as necessary to align them with our definition of importance.

Using this limited labeled dataset, we ran automated model evaluations to establish initial metrics on model quality, which were used as a less-accurate proxy to the model quality, enabling us to quickly assess the model performance and apply changes in the architecture and heuristics. Once the solution metrics were satisfactory, we utilized a more accurate manual evaluation process over a closed set of carefully chosen examples that represented expected Recorder use cases. Using these examples, we tweaked the model heuristics parameters to reach the desired level of performance using a reliable model quality evaluation.

Runtime Improvements
After the initial release of Recorder, we conducted a series of user studies to learn how to improve the usability and performance of the Smart Scrolling feature. We found that many users expect the navigational keywords and highlighted sections to be available as soon as the recording is finished. Because the computation pipeline described above can take a considerable amount of time to compute on long recordings, we devised a partial processing solution that amortizes this computation over the whole duration of the recording. During recording, each section is processed as soon as it is captured, and then the intermediate results are stored in memory. When the recording is done, Recorder aggregates the intermediate results.

When running on a Pixel 5, this approach reduced the average processing time of an hour-long recording (~9K words) from 1 minute 40 seconds to only 9 seconds, while outputting the same results.
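A minimal sketch of this amortized, per-section processing is shown below, with a trivial stand-in scorer in place of the keyword and section models described above.

```python
class SmartScrollingPipeline:
    """Sketch of amortizing transcript analysis over the recording itself.

    Each transcript section is scored as soon as it is available; finishing the
    recording then only requires a cheap aggregation step. The scoring function
    is a stand-in for the keyword and section models described above.
    """

    def __init__(self, score_section):
        self._score_section = score_section   # expensive per-section model call
        self._partial_results = []

    def on_section_transcribed(self, section_text):
        # Called during recording, as each section of the transcript arrives.
        self._partial_results.append(self._score_section(section_text))

    def on_recording_finished(self, max_highlights=5):
        # Cheap aggregation: pick the top-scoring sections for the scrollbar.
        ranked = sorted(self._partial_results, key=lambda r: r["score"], reverse=True)
        return ranked[:max_highlights]

# Usage with a trivial stand-in scorer (word count as "importance").
pipeline = SmartScrollingPipeline(lambda text: {"text": text, "score": len(text.split())})
for section in ["short note", "a much longer and more detailed discussion of the roadmap"]:
    pipeline.on_section_transcribed(section)
print(pipeline.on_recording_finished(max_highlights=1))
```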

Summary
The goal of Recorder is to improve users’ ability to access their recorded content and navigate it with ease. We have already made substantial progress in this direction with the existing ML features that automatically suggest title words for recordings and enable users to search recordings for sounds and text. Smart Scrolling provides additional text navigation abilities that will further improve the utility of Recorder, enabling users to rapidly surface sections of interest, even for long recordings.

Acknowledgments
Bin Zhang, Sherry Lin, Isaac Blankensmith, Henry Liu‎, Vincent Peng‎, Guilherme Santos‎, Tiago Camolesi, Yitong Lin, James Lemieux, Thomas Hall‎, Kelly Tsai‎, Benny Schlesinger, Dror Ayalon, Amit Pitaru, Kelsie Van Deman, Console Chen, Allen Su, Cecile Basnage, Chorong Johnston‎, Shenaz Zack, Mike Tsao, Brian Chen, Abhinav Rastogi, Tracy Wu, Yvonne Yang‎.

Find your inner poet with help from America’s greats

Behold! the living thrilling lines

That course the blood like madd’ning wines,

And leap with scintillating spray

Across the guards of ecstasy.

The flame that lights the lurid spell

Springs from the soul’s artesian well,

Its fairy filament of art

Entwines the fragments of a heart.

Poetry by Georgia Douglas Johnson

When you write the living thrilling lines of a poem, you put yourself into each verse. Whether you’re writing for family, friends or an audience of thousands, each poem carries a part of you. When composing such a poem, each line is carefully crafted, which requires a lot of creative energy. Verse by Verse can help get those creative juices flowing: it’s our experiment using AI to augment the creative process of composing a poem. It will offer ideas that you can use, alter, or reject as you see fit. Verse by Verse is a creative helper, an inspiration—not a replacement. Here’s how it works.

Your muses

Using Verse by Verse, you can compose a poem with suggestions coming from some of America’s classic poets: Dickinson, Whitman, Poe, Wheatley, Longfellow and others. In order to make this possible, we’ve trained AI systems that provide suggestions in the style of each individual poet to act as your muses while you compose a poem of your own.

Poets featured in this tool

Composing

After choosing which poets to act as your muses and the structure of your poem, you can begin composing. Once you’ve written the first line of verse, Verse by Verse will start to suggest possible next verses.

Writing a poem

We give you full control of this creative process. You can choose to continue writing your own verses, use one of the suggestions, or even edit one of the suggestions to make it more personal. Once you’re satisfied with your poem, give it a title and finalize it. We give you two options: copy the text itself, or download the poem as an image. In either case, you can easily save the poem and share it with others.

Finished poem

Verse suggestions

Verse by Verse’s suggestions are not the original lines of verse the poets had written, but novel verses generated to sound like lines of verse the poets could have written. We did this by first training our generative models on a large collection of classic poetry, then fine-tuning the models on each individual poet’s body of work to try to capture their style of writing.

Additionally, to be able to suggest relevant verses, the system was trained to have a general semantic understanding of what lines of verse would best follow a previous line of verse. So even if you write on topics not commonly seen in classic poetry, the system will try its best to make suggestions that are relevant.
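Verse by Verse uses its own generative and ranking models, but the general idea of scoring candidate next lines by semantic fit can be illustrated with an off-the-shelf sentence encoder. The library and model name below are examples for illustration, not what the product uses.

```python
# Illustrative only: ranks candidate next lines by semantic similarity to the
# previous line using a generic sentence encoder, to show the general idea of
# relevance-based suggestion ranking.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example model name

previous_line = "The city hums beneath a winter moon"
candidates = [
    "And every window wears a veil of frost",
    "My spreadsheet has a broken pivot table",
    "While streetlights write their gold on silent snow",
]

prev_vec = encoder.encode(previous_line, convert_to_tensor=True)
cand_vecs = encoder.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(prev_vec, cand_vecs)[0]

# Show candidates from best to worst semantic fit.
for score, line in sorted(zip(scores.tolist(), candidates), reverse=True):
    print(f"{score:.2f}  {line}")
```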

Get writing

Verse by Verse can be used as a tool for inspiration, offering suggestions for ways of writing you may have never thought of. You can use it as an aid to learn about these various poets and the styles that they wrote in.

Have fun, and see where it takes you—perhaps down the road less traveled.

The Language Interpretability Tool (LIT): Interactive Exploration and Analysis of NLP Models

Posted by James Wexler, Software Developer and Ian Tenney, Software Engineer, Google Research

As natural language processing (NLP) models become more powerful and are deployed in more real-world contexts, understanding their behavior is becoming increasingly critical. While advances in modeling have brought unprecedented performance on many NLP tasks, many research questions remain about not only the behavior of these models under domain shift and adversarial settings, but also their tendencies to behave according to social biases or shallow heuristics.

For any new model, one might want to know in which cases a model performs poorly, why a model makes a particular prediction, or whether a model will behave consistently under varying inputs, such as changes to textual style or pronoun gender. But, despite the recent explosion of work on model understanding and evaluation, there is no “silver bullet” for analysis. Practitioners must often experiment with many techniques, looking at local explanations, aggregate metrics, and counterfactual variations of the input to build a better understanding of model behavior, with each of these techniques often requiring its own software package or bespoke tool. Our previously released What-If Tool was built to address this challenge by enabling black-box probing of classification and regression models, thus enabling researchers to more easily debug performance and analyze the fairness of machine learning models through interaction and visualization. But there was still a need for a toolkit that would address challenges specific to NLP models.

With these challenges in mind, we built and open-sourced the Language Interpretability Tool (LIT), an interactive platform for NLP model understanding. LIT builds upon the lessons learned from the What-If Tool with greatly expanded capabilities, which cover a wide range of NLP tasks including sequence generation, span labeling, classification and regression, along with customizable and extensible visualizations and model analysis.

LIT supports local explanations, including salience maps, attention, and rich visualizations of model predictions, as well as aggregate analysis including metrics, embedding spaces, and flexible slicing. It allows users to easily hop between visualizations to test local hypotheses and validate them over a dataset. LIT provides support for counterfactual generation, in which new data points can be added on the fly, and their effect on the model visualized immediately. Side-by-side comparison allows for two models, or two individual data points, to be visualized simultaneously. More details about LIT can be found in our system demonstration paper, which was presented at EMNLP 2020.

Exploring a sentiment classifier with LIT.

Customizability
In order to better address the broad range of users with different interests and priorities that we hope will use LIT, we’ve built the tool to be easily customizable and extensible from the start. Using LIT on a particular NLP model and dataset only requires writing a small bit of Python code. Custom components, such as task-specific metrics calculations or counterfactual generators, can be written in Python and added to a LIT instance through our provided APIs. Also, the front end itself can be customized, with new modules that integrate directly into the UI. For more on extending the tool, check out our documentation on GitHub.
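As a rough sketch, wiring a toy classifier and dataset into LIT follows the pattern below. The class and method names mirror the LIT examples but should be verified against the current lit-nlp documentation, and my_classifier is a stand-in for a real model.

```python
# Rough sketch of a custom LIT setup; names follow the patterns in the LIT
# examples but should be checked against the current lit-nlp documentation.
from lit_nlp import dev_server, server_flags
from lit_nlp.api import dataset as lit_dataset
from lit_nlp.api import model as lit_model
from lit_nlp.api import types as lit_types

LABELS = ["0", "1"]  # negative / positive

def my_classifier(sentence):
    """Placeholder model: returns [p_negative, p_positive]."""
    p = min(0.9, 0.1 * sentence.lower().count("great") + 0.5)
    return [1 - p, p]

class ToyDataset(lit_dataset.Dataset):
    def __init__(self):
        self._examples = [
            {"sentence": "A great, heartfelt film.", "label": "1"},
            {"sentence": "Flat characters and a dull plot.", "label": "0"},
        ]
    def spec(self):
        return {"sentence": lit_types.TextSegment(),
                "label": lit_types.CategoryLabel(vocab=LABELS)}

class ToyModel(lit_model.Model):
    def input_spec(self):
        return {"sentence": lit_types.TextSegment()}
    def output_spec(self):
        return {"probas": lit_types.MulticlassPreds(parent="label", vocab=LABELS)}
    def predict_minibatch(self, inputs):
        return [{"probas": my_classifier(ex["sentence"])} for ex in inputs]

if __name__ == "__main__":
    dev_server.Server({"toy": ToyModel()}, {"toy_data": ToyDataset()},
                      **server_flags.get_flags()).serve()
```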

Demos
To illustrate some of the capabilities of LIT, we have created a few demos using pre-trained models. The full list is available on the LIT website, and we describe two of them here:

  • Sentiment analysis: In this example, a user can explore a BERT-based binary classifier that predicts if a sentence has positive or negative sentiment. The demo uses the Stanford Sentiment Treebank of sentences from movie reviews to demonstrate model behavior. One can examine local explanations using saliency maps provided by a variety of techniques (such as LIME and integrated gradients), and can test model behavior with perturbed (counterfactual) examples using techniques such as back-translation, word replacement, or adversarial attacks. These techniques can help pinpoint under what scenarios a model fails, and whether those failures are generalizable, which can then be used to inform how best to improve a model.
    Analyzing token-based salience of an incorrect prediction. The word “laughable” seems to be incorrectly raising the positive sentiment score of this example.
  • Masked word prediction: Masked language modeling is a “fill-in-the-blank” task, where the model predicts different words that could complete a sentence. For example, given the prompt, “I took my ___ for a walk”, the model might predict a high score for “dog.” In LIT one can explore this interactively by typing in sentences or choosing from a pre-loaded corpus, and then clicking specific tokens to see what a model like BERT understands about language, or about the world.
    Interactively selecting a token to mask, and viewing a language model’s predictions.

LIT in Practice and Future Work
Although LIT is a new tool, we have already seen the value that it can provide for model understanding. Its visualizations can be used to find patterns in model behavior, such as outlying clusters in embedding space, or words with outsized importance to the predictions. Exploration in LIT can test for potential biases in models, as demonstrated in our case study of LIT exploring gender bias in a coreference model. This type of analysis can inform next steps in improving model performance, such as applying MinDiff to mitigate systemic bias. It can also be used as an easy and fast way to create an interactive demo for any NLP model.

Check out the tool either through our provided demos, or by bringing up a LIT server for your own models and datasets. The work on LIT has just started, and there are a number of new capabilities and refinements planned, including the addition of new interpretability techniques from cutting edge ML and NLP research. If there are other techniques that you’d like to see added to the tool, please let us know! Join our mailing list to stay up to date as LIT evolves. And as the code is open-source, we welcome feedback on and contributions to the tool.

Acknowledgments
LIT is a collaborative effort between the Google Research PAIR and Language teams. This post represents the work of the many contributors across Google, including Andy Coenen, Ann Yuan, Carey Radebaugh, Ellen Jiang, Emily Reif, Jasmijn Bastings, Kristen Olson, Leslie Lai, Mahima Pushkarna, Sebastian Gehrmann, and Tolga Bolukbasi. We would like to thank all those who contributed to the project, both inside and outside Google, and the teams that have piloted its use and provided valuable feedback.

Music from the heart, with an AI assist

The next time you hear a popular song on the radio, listen to the beat behind the lyrics. Usually, a high-powered production team came up with it—but in the future, that beat could be created with help from artificial intelligence. That’s what Googler MJ Jacob predicts, as he combines his job as an engineer with his love for writing and performing rap music. 

Usually based in Google’s offices in New York City, MJ is working from his Manhattan apartment these days as a customer engineer for Google Cloud, helping companies figure out how to use machine learning and AI to accomplish their business goals. But in his free time, he’s writing lyrics, producing hip-hop tracks and creating YouTube videos detailing how he does it all. 

MJ has balanced an interest in technology with a love for hip-hop since he was a 13-year-old living in Virginia. His family was struggling financially, and he found rappers’ rags-to-riches lyrics to be inspirational. “Almost every rapper I listened to was broke and then they made it,” he recalls. “These rappers had very hard childhoods, whether it was because of money, parental issues or anger from insecurities, and all of that is what I felt in that moment.”

His favorite rappers felt like personal mentors, and he decided to imitate them and try rapping himself. He recorded songs using the microphone on his MP3 player; he says they were a crucial way for him to vent. “From when I was 13 until today, being able to write about my life and how I’m feeling, it’s the most therapeutic thing for me,” he says. 

Around the same time he discovered hip-hop, MJ became fascinated by technology. His family couldn’t afford a computer, but someone at his local church built a computer for them, complete with a see-through CPU tower. MJ first used it just to edit music, but always loved looking at the computer parts light up. One day, he spent six hours taking the tower apart and putting the pieces back together. “It was very overwhelming but exciting the entire time,” he says, “and I think that’s a similar emotion I feel when I make music.”

Most recently, he posted a video showcasing how he used AI to create a hip-hop beat. He collected instrumental tracks that he and his producer friends had created over the years, and uploaded the files to Google Cloud. Then he used Magenta, Google’s open-source tool that uses machine learning to help create music and art. (Musicians like YACHT have used Magenta to create entire albums.) Based on how he identified “hip-hop” in his dataset, the machine learning model created entirely new melodies and drum beats. MJ then used those new sounds to craft his track, and wrote and performed lyrics to go along with it.

Even though it was made with the help of machine learning, the finished product still sounded like his music. And that’s the whole point: MJ wants to show that AI doesn’t take away the human side of his art—it adds to it. “AI never replaced anything,” he explains. “It only assisted.”

Authenticity is important to MJ (whose musical alias is MJx Music), because he sees music as an important emotional outlet. His most popular song, “Time Will Heal,” which has more than a million streams, is inspired by his sister, a survivor of sexual abuse. The lyrics are written from her perspective. “She taught me so much about what it means to be a strong human, to go through hell and back and still be able to make it,” MJ says. “We decided it would be a cool opportunity to not only share her story, but also help anyone who’s ever been abused or felt they’ve been taken advantage of.”

Next, MJ is hoping to take his experiments with music and machine learning to a new level. In fact, he’s so inspired by the combination that he’s looking to create a three or four-track EP co-produced by AI. 

“Both music and tech are so fulfilling for me that they have the ability to intertwine so well,” he says. “Now I’m pushing myself even more musically, and I’m pushing myself even more technically. It’s cool to be able to contribute to a new concept in the world.”

Haptics with Input: Using Linear Resonant Actuators for Sensing

Posted by Artem Dementyev, Hardware Engineer, Google Research

As wearables and handheld devices decrease in size, haptics become an increasingly vital channel for feedback, be it through silent alerts or a subtle “click” sensation when pressing buttons on a touch screen. Haptic feedback, ubiquitous in nearly all wearable devices and mobile phones, is commonly enabled by a linear resonant actuator (LRA), a small linear motor that leverages resonance to provide a strong haptic signal in a small package. However, the touch and pressure sensing needed to activate the haptic feedback tend to depend on additional, separate hardware which increases the price, size and complexity of the system.

In “Haptics with Input: Back-EMF in Linear Resonant Actuators to Enable Touch, Pressure and Environmental Awareness”, presented at ACM UIST 2020, we demonstrate that widely available LRAs can sense a wide range of external information, such as touch, tap and pressure, in addition to being able to relay information about contact with the skin, objects and surfaces. We achieve this with off-the-shelf LRAs by multiplexing the actuation with short pulses of custom waveforms that are designed to enable sensing using the back-EMF voltage. We demonstrate the potential of this approach to enable expressive discrete buttons and vibrotactile interfaces and show how the approach could bring rich sensing opportunities to integrated haptics modules in mobile devices, increasing sensing capabilities with fewer components. Our technique is potentially compatible with many existing LRA drivers, as they already employ back-EMF sensing for autotuning of the vibration frequency.

Different off-the-shelf LRAs that work using this technique.

Back-EMF Principle in an LRA
Inside the LRA enclosure is a magnet attached to a small mass, both moving freely on a spring. The magnet moves in response to the excitation voltage introduced by the voice coil. The motion of the oscillating mass produces a counter-electromotive force, or back-EMF, which is a voltage proportional to the rate of change of magnetic flux. A greater oscillation speed creates a larger back-EMF voltage, while a stationary mass generates zero back-EMF voltage.
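In textbook form (standard voice-coil relations rather than equations from the paper), the sensed voltage follows Faraday's law and scales with the velocity of the mass:

```latex
% Faraday's law for the moving magnet/coil pair; k_e is the motor's back-EMF
% constant and \dot{x}(t) the velocity of the oscillating mass.
\varepsilon(t) \;=\; -\frac{d\Phi(t)}{dt} \;=\; k_e\,\dot{x}(t)
```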

Anatomy of the LRA.

Active Back-EMF for Sensing
Touching or making contact with the LRA during vibration changes the velocity of the interior mass, as energy is dissipated into the contact object. This works well with soft materials that deform under pressure, such as the human body. A finger, for example, absorbs different amounts of energy depending on the contact force as it flattens against the LRA. By driving the LRA with small amounts of energy, we can measure this phenomenon using the back-EMF voltage. Because leveraging the back-EMF behavior for sensing is an active process, the key insight that enabled this work was the design of a custom, off-resonance driver waveform that allows continuous sensing while minimizing vibrations, sound and power consumption.

Touch and pressure sensing on the LRA.

We measure back-EMF from the floating voltage between the two LRA leads, which requires disconnecting the motor driver briefly to avoid disturbances. While the driver is disconnected, the mass is still oscillating inside the LRA, producing an oscillating back-EMF voltage. Because commercial back-EMF sensing LRA drivers do not provide the raw data, we designed a custom circuit that is able to pick up and amplify small back-EMF voltage. We also generated custom drive pulses that minimize vibrations and energy consumption.
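The sensing cycle can be summarized as the skeleton below. The hardware-access functions and timing constants are hypothetical placeholders, since the real implementation lives in custom driver circuitry and firmware.

```python
import time

# Hypothetical hardware-access functions; real firmware would talk to the LRA
# driver and an ADC. Timing constants are illustrative, not from the paper.
def drive_pulse(waveform): ...
def disconnect_driver(): ...
def reconnect_driver(): ...
def sample_adc(duration_s, rate_hz): ...

def active_backemf_sense(pulse_waveform, settle_s=0.001, sample_s=0.005, rate_hz=20_000):
    """One sensing cycle: excite the LRA briefly, float the leads, record back-EMF."""
    drive_pulse(pulse_waveform)        # short, off-resonance pulse (low vibration/sound)
    disconnect_driver()                # leads must float so the driver doesn't mask back-EMF
    time.sleep(settle_s)               # let driver switching transients die down
    samples = sample_adc(sample_s, rate_hz)   # oscillating back-EMF of the free mass
    reconnect_driver()
    return samples                     # features of this decay curve encode touch/pressure
```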

Simplified schematic of the LRA driver and the back-EMF measurement circuit for active sensing.
After exciting the LRA with a short drive pulse, the back-EMF voltage fluctuates due to the continued oscillations of the mass on the spring (top, red line). The change in the back-EMF signal when subject to a finger press depends on the pressure applied (middle/bottom, green/blue lines).

Applications
The behavior of the LRAs used in mobile phones is the same, whether they are on a table, on a soft surface, or hand held. This may cause problems, as a vibrating phone could slide off a glass table or emit loud and unnecessary vibrating sounds. Ideally, the LRA on a phone would automatically adjust based on its environment. We demonstrate our approach for sensing using the LRA back-EMF technique by wiring directly to a Pixel 4’s LRA, and then classifying whether the phone is held in hand, placed on a soft surface (foam), or placed on a table.
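A sketch of this kind of surface classification is shown below, using synthetic damped-oscillation traces in place of measured back-EMF data and a simple logistic-regression classifier. The features and damping values are illustrative assumptions, not the classifier used in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def backemf_features(trace):
    """Peak amplitude, energy, and a crude decay ratio of a back-EMF trace."""
    trace = np.asarray(trace, dtype=float)
    half = len(trace) // 2
    decay = np.abs(trace[half:]).mean() / (np.abs(trace[:half]).mean() + 1e-9)
    return [np.abs(trace).max(), float((trace ** 2).sum()), float(decay)]

def synthetic_trace(damping, rng):
    """Stand-in for a measured trace: a damped oscillation plus noise. Higher
    damping mimics more energy being absorbed (e.g., a hand vs. a hard table)."""
    t = np.linspace(0, 0.02, 400)
    return np.exp(-damping * t) * np.sin(2 * np.pi * 170 * t) + rng.normal(0, 0.01, t.size)

rng = np.random.default_rng(0)
surfaces = {"table": 60.0, "foam": 150.0, "hand": 300.0}   # illustrative damping values
X, y = [], []
for label, damping in surfaces.items():
    for _ in range(30):
        X.append(backemf_features(synthetic_trace(damping, rng)))
        y.append(label)

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([backemf_features(synthetic_trace(290.0, rng))]))  # expect 'hand'
```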

Sensing phone surroundings.

We also present a prototype that demonstrates how LRAs could be used as combined input/output devices in portable electronics. We attached two LRAs, one on the left and one on the right side of a phone. The buttons provide tap, touch, and pressure sensing. They are also programmed to provide haptic feedback, once the touch is detected.

Pressure-sensitive side buttons.

There are a number of wearable tactile aid devices, such as sleeves, vests, and bracelets. To transmit tactile feedback to the skin with consistent force, the tactor has to apply the right pressure; it cannot be too loose or too tight. Currently, the typical way to do so is through manual adjustment, which can be inconsistent and lacks measurable feedback. We show how the LRA back-EMF technique can be used to continuously monitor the fit of a bracelet device and prompt the user if it’s too tight, too loose, or just right.

Fit sensing bracelet.

Evaluating an LRA as a Sensor
The LRA works well as a pressure sensor because it has a quadratic response to the force magnitude during touch. Our method works for all five off-the-shelf LRA types that we evaluated. Because the typical current consumption is only 4.27 mA, all-day sensing would only reduce the battery life of a Pixel 4 phone from 25 to 24 hours. The power consumption can be greatly reduced by using low-power amplifiers and employing active sensing only when needed, such as when the phone is active and interacting with the user.
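As a back-of-the-envelope check (assuming the Pixel 4's roughly 2,800 mAh battery, which is not stated in the post), the quoted current draw is consistent with about a one-hour reduction in battery life:

```latex
% 4.27 mA drawn continuously for a day, relative to an assumed ~2800 mAh battery:
4.27\,\text{mA} \times 24\,\text{h} \approx 102\,\text{mAh}
  \approx 3.7\%\ \text{of}\ 2800\,\text{mAh}
% which matches the roughly 4% runtime drop from 25 h to 24 h quoted above.
```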

Back-EMF voltage changes when pressure is applied with a finger.

The challenge with active sensing is to minimize vibrations, so they are not perceptible when touching and do not produce audible sound. We optimize the active sensing to produce only 2 dB of sound and 0.45 m/s² peak-to-peak acceleration, which is just barely perceptible by finger and is quiet, in contrast to the regular 8.49 m/s².

Future Work and Conclusion
To see the work presented here in action, please see the video below.

In the future, we plan to explore other sensing techniques; for example, measuring the current could be an alternative approach. Also, using machine learning could potentially improve the sensing and provide more accurate classification of the complex back-EMF patterns. Our method could be developed further to enable closed-loop feedback with the actuator and sensor, which would allow the actuator to provide the same force regardless of external conditions.

We believe that this work opens up new opportunities for leveraging existing ubiquitous hardware to provide rich interactions and closed-loop feedback haptic actuators.

Acknowledgments
This work was done by Artem Dementyev, Alex Olwal, and Richard Lyon. Thanks to Mathieu Le Goc and Thad Starner for feedback on the paper.

Rachel Malarich is planting a better future, tree by tree

Everyone has a tree story, Rachel Malarich says—and one of hers takes place on the limbs of a eucalyptus tree. Rachel and her cousins spent summers in central California climbing the 100-foot tall trees and hanging out between the waxy blue leaves—an experience she remembers as awe-inspiring. 

Now, as Los Angeles’ first-ever City Forest Officer, Rachel is shaping the tree stories that Angelenos will tell. “I want our communities to go to public spaces and feel that sense of awe,” she says. “That feeling that something was there before them, and it will be there after them…we have to bring that to our cities.”

Part of Rachel’s job is to help the City of Los Angeles reach an ambitious goal: to plant and maintain 90,000 trees by the end of 2021 and to keep planting trees at a rate of 20,000 per year after that. This goal is about more than planting trees, though: It’s about planting the seeds for social, economic and environmental equity. These trees, Rachel says, will help advance citywide sustainability and climate goals, beautify neighborhoods, improve air quality and create shade to combat rising street-level temperatures. 

To make sure every tree has the most impact, Rachel and the City of Los Angeles use Tree Canopy Lab, a tool they helped build with Google that uses AI and aerial imagery to understand current tree cover density, also known as “tree canopy,” right down to street-level data. Tree inventory data, which is typically collected through on-site assessments, helps city officials know where to invest resources for maintaining, preserving and planting trees. It also helps pinpoint where new trees should be planted. In the case of LA, there was a strong correlation between a lack of tree coverage and the city’s underserved communities. 

With Tree Canopy Lab, Rachel and her team overlay data, such as population density and land use data, to understand what’s happening within the 500 square miles of the city and understand where new trees will have the biggest impact on a community. It helps them answer questions like: Where are highly populated residential areas with low tree coverage? Which thoroughfares that people commute along every day have no shade? 

And it also helps Rachel do what she has focused her career on: creating community-led programs. After more than a decade of working at nonprofits, she’s learned that resilient communities are connected communities. 

“This data helps us go beyond assumptions and see where the actual need is,” Rachel says. “And it frees me up to focus on what I know best: listening to the people of LA, local policy and urban forestry.” 

After working with Google on Tree Canopy Lab, she’s found that data gives her a chance to connect with the public. She now has a tool that quickly pools together data and creates a visual to show community leaders what’s happening in specific neighborhoods, what the city is doing and why it’s important. She can also demonstrate ways communities can better manage resources they already have to achieve local goals. And that’s something she thinks every city can benefit from. 

“My entrance into urban forestry was through the lens of social justice and economic inequity. For me, it’s about improving the quality of life for Angelenos,” Rachel says. “I’m excited to work with others to create that impact on a bigger level, and build toward the potential for a better environment in the future.”

And in this case, building a better future starts with one well planned tree at a time.

Introducing Google News Initiative Conversations

This year, the way many of us work has changed dramatically. We’ve gone from lunch meetings and large networking conferences to meeting virtually from our makeshift home offices. The COVID-19 pandemic has certainly upended a lot of this, but that doesn’t mean sharing ideas is on hold, too. That’s especially true for the Google News Initiative team; our commitment to helping journalism thrive is still just as strong. 

That’s why we’ve launched Google News Initiative Conversations, a new video series in which we bring together industry experts and our partners from around the world to discuss the successes, challenges and opportunities facing the news industry. Since March 2018, the GNI has worked with more than 6,250 news partners in 118 countries, several of which are featured in the series.

Over the course of four episodes, we cover the themes of business sustainability; quality journalism; diversity, equity and inclusion; and a look ahead to 2021 from a global perspective. Take a look at what the series has to offer:

Sustaining the News Industry, featuring: 

Miki King, Chief Marketing Officer of the Washington Post
Gary Liu, CEO of the South China Morning Post
Tara Lajumoke, Managing Director of FT Strategies
Megan Brownlow and Simon Crerar talk about local journalism in Australia.

Quality Journalism, featuring: 

Claire Wardle, U.S. Director, First Draft
Surabhi Malik and Syed Nazakat of FactShala India

Diversity, Equity, and Inclusion, featuring: 

Soledad O’Brien, CEO of Soledad O’Brien Productions
Drew Christie, Chair of BCOMS – the Black Collective of Media in U.K. Sport
Bryan Pollard, Associate Director of Native American Journalists Association
Kalhan Rosenblatt, Youth and Internet Culture Reporter at NBC News
Tania Montalvo, General Editor at Animal Político, Mexico 
Zack Weiner, President of Overtime

Innovation and the Future of News, featuring: 

Brad Bender, VP of Product at Google, interviewed by broadcaster Tina Daheley
Charlie Beckett, Professor in the Dept of Media and Communication at LSE
Agnes Stenborn, Responsible Data and AI Specialist
Christina Elmer, Editorial R&D at Der Spiegel

It’s uncertain when we’ll get to gather together in person again, but until then, we’ll continue learning, collaborating and innovating as we work towards a better future for news.
