Halodoc uses AI to improve how doctors receive feedback

Due to Indonesia’s vast size and population, timely and reliable access to healthcare can sometimes be a challenge. Halodoc aims to change that with a mobile-first telemedicine platform that connects Indonesians to doctors and helps them arrange appointments, medicine deliveries and tests.

What’s distinctive about the Halodoc platform is that it draws on human-centered artificial intelligence: a promising new area of research that uses continuous human feedback to improve how AI systems work, and provides a better experience for the people who rely on those systems. 

With support from Google’s Late Stage Accelerator, a program that assists high-potential startups, we assembled a team of doctors, data scientists, engineers, product managers and researchers to determine how technology could support Indonesian doctors’ work. One particular approach the team identified was using AI to replicate the mentoring and feedback that junior doctors receive from more experienced colleagues in hospitals—a process that’s important to improving quality of care, but is hard to reproduce on a larger scale.  

We set out to create an easy way to provide feedback in virtual health, and worked with Google’s machine learning experts in the Late-Stage Accelerator to determine the best approach. With Google’s guidance, Halodoc’s engineers applied Natural Language Processing in Bahasa Indonesia to measure, rank, and provide insights that can inform doctors’ decisions across the country—using thousands of consultations to train their machine learning models. 

When doctors open the Halodoc app, they see information on how they performed based on their response time and quality index metrics, along with suggested actions on how they can improve their consultation quality.  They also have the option of receiving further feedback and coaching from more senior doctors if needed. 

Right now, more than five percent of Indonesians use Halodoc’s platform. As a result of applying AI principles to improve the quality of care that patients experience, our app ratings have increased from 4.5 to 4.8 stars in less than six months, while our overall doctor scores have improved by 64 percent.

Halodoc’s telemedicine app enables doctors to deliver personalized feedback with assistance from ML-enabled insights that improve patient care.

From here, with Google’s help, we hope to continue simplifying Indonesia’s healthcare infrastructure and advance the application of AI in healthcare globally.

Background Features in Google Meet, powered by Web ML

Posted by Tingbo Hou and Tyler Mullen, Software Engineers, Google Research

Video conferencing is becoming ever more critical in people’s work and personal lives. Improving that experience with privacy enhancements or fun visual touches can help center our focus on the meeting itself. As part of this goal, we recently announced ways to blur and replace your background in Google Meet, which use machine learning (ML) to better highlight participants regardless of their surroundings. Whereas other solutions require installing additional software, Meet’s features are powered by cutting-edge web ML technologies built with MediaPipe that work directly in your browser — no extra steps necessary. One key goal in developing these features was to provide real-time, in-browser performance on almost all modern devices, which we accomplished by combining efficient on-device ML models, WebGL-based rendering, and web-based ML inference via XNNPACK and TFLite.

Background blur and background replacement, powered by MediaPipe on the web.

Overview of Our Web ML Solution
The new features in Meet are developed with MediaPipe, Google’s open source framework for cross-platform customizable ML solutions for live and streaming media, which also powers ML solutions like on-device real-time hand, iris and body pose tracking.

A core need for any on-device solution is to achieve high performance. To accomplish this, MediaPipe’s web pipeline leverages WebAssembly, a low-level binary code format designed specifically for web browsers that improves speed for compute-heavy tasks. At runtime, the browser converts WebAssembly instructions into native machine code that executes much faster than traditional JavaScript code. In addition, Chrome 84 recently introduced support for WebAssembly SIMD, which processes multiple data points with each instruction, resulting in a performance boost of more than 2x.

Our solution first processes each video frame by segmenting a user from their background (more about our segmentation model later in the post) utilizing ML inference to compute a low resolution mask. Optionally, we further refine the mask to align it with the image boundaries. The mask is then used to render the video output via WebGL2, with the background blurred or replaced.

WebML Pipeline: All compute-heavy operations are implemented in C++/OpenGL and run within the browser via WebAssembly.

In the current version, model inference is executed on the client’s CPU for low power consumption and widest device coverage. To achieve real-time performance, we designed efficient ML models with inference accelerated by the XNNPACK library, the first inference engine specifically designed for the novel WebAssembly SIMD specification. Accelerated by XNNPACK and SIMD, the segmentation model can run in real-time on the web.
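
To make the per-frame flow concrete, here is a small Python sketch of running a TFLite segmentation model on a downsampled frame. This is an illustration only: the model file name, input layout, and normalization are assumptions, and in Meet the equivalent steps run in C++/WebAssembly via XNNPACK rather than in Python.

```python
import cv2
import numpy as np
import tensorflow as tf

# Hypothetical model file standing in for the (non-public) Meet model.
interpreter = tf.lite.Interpreter(model_path="segmentation_256x144.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def segment(frame_bgr):
    # Downsample to the model's input resolution and scale to [0, 1].
    h, w = int(inp["shape"][1]), int(inp["shape"][2])
    small = cv2.resize(frame_bgr, (w, h)).astype(np.float32) / 255.0
    interpreter.set_tensor(inp["index"], small[None, ...])
    interpreter.invoke()
    # Low-resolution foreground probability mask, one value per pixel.
    return interpreter.get_tensor(out["index"])[0, ..., 0]
```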

Enabled by MediaPipe’s flexible configuration, the background blur/replace solution adapts its processing based on device capability. On high-end devices it runs the full pipeline to deliver the highest visual quality, whereas on low-end devices it continues to perform at speed by switching to compute-light ML models and bypassing the mask refinement.

Segmentation Model
On-device ML models need to be ultra lightweight for fast inference, low power consumption, and small download size. For models running in the browser, the input resolution greatly affects the number of floating-point operations (FLOPs) necessary to process each frame, and therefore needs to be small as well. We downsample the image to a smaller size before feeding it to the model. Recovering a segmentation mask as fine as possible from a low-resolution image adds to the challenges of model design.

The overall segmentation network has a symmetric structure with respect to encoding and decoding, while the decoder blocks (light green) also share a symmetric layer structure with the encoder blocks (light blue). Specifically, channel-wise attention with global average pooling is applied in both encoder and decoder blocks, which is friendly to efficient CPU inference.

Model architecture with MobileNetV3 encoder (light blue), and a symmetric decoder (light green).

We use a modified MobileNetV3-small as the encoder, which has been tuned by network architecture search for the best performance with low resource requirements. To reduce the model size by 50%, we exported our model to TFLite using float16 quantization, resulting in a slight loss in weight precision but with no noticeable effect on quality. The resulting model has 193K parameters and is only 400KB in size.
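
The export scripts are not part of the post, but a float16-quantized TFLite export can be produced with the public converter API. A minimal sketch for a generic Keras model (the variable `model` here is an assumption, not the released segmentation network):

```python
import tensorflow as tf

# `model` is any tf.keras model standing in for the segmentation network.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16
tflite_bytes = converter.convert()
with open("segmentation_fp16.tflite", "wb") as f:
    f.write(tflite_bytes)
```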

Rendering Effects
Once segmentation is complete, we use OpenGL shaders for video processing and effect rendering, where the challenge is to render efficiently without introducing artifacts. In the refinement stage, we apply a joint bilateral filter to smooth the low resolution mask.

Rendering effects with artifacts reduced. Left: Joint bilateral filter smooths the segmentation mask. Middle: Separable filters remove halo artifacts in background blur. Right: Light wrapping in background replace.

The blur shader simulates a bokeh effect by adjusting the blur strength at each pixel proportionally to the segmentation mask values, similar to the circle-of-confusion (CoC) in optics. Pixels are weighted by their CoC radii, so that foreground pixels will not bleed into the background. We implemented separable filters for the weighted blur, instead of the popular Gaussian pyramid, as it removes halo artifacts surrounding the person. The blur is performed at a low resolution for efficiency, and blended with the input frame at the original resolution.
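
As a rough illustration of mask-weighted blur compositing, the sketch below uses a plain Gaussian blur on a downscaled frame and blends with the upsampled mask. It is deliberately simplified: it does not implement the separable CoC-weighted filters described above, so it does not fully prevent foreground pixels from bleeding into the background.

```python
import cv2
import numpy as np

def blur_background(frame, mask, sigma=12, scale=0.25):
    # frame: HxWx3 uint8; mask: low-resolution foreground probability map.
    # Upsample the low-resolution mask to frame size and soften its edges.
    m = cv2.resize(mask, (frame.shape[1], frame.shape[0]))
    m = cv2.GaussianBlur(m, (0, 0), 3)[..., None]
    # Blur a downscaled copy of the frame for efficiency, then upscale.
    small = cv2.resize(frame, None, fx=scale, fy=scale)
    blurred = cv2.GaussianBlur(small, (0, 0), sigma)
    blurred = cv2.resize(blurred, (frame.shape[1], frame.shape[0]))
    # Composite: keep the person sharp, replace the rest with the blur.
    return np.clip(m * frame + (1.0 - m) * blurred, 0, 255).astype(np.uint8)
```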

Background blur examples.

For background replacement, we adopt a compositing technique, known as light wrapping, for blending segmented persons and customized background images. Light wrapping helps soften segmentation edges by allowing background light to spill over onto foreground elements, making the compositing more immersive. It also helps minimize halo artifacts when there is a large contrast between the foreground and the replaced background.
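
A toy version of light wrapping is sketched below; the blend weights and blur radius are arbitrary choices for illustration, not Meet's shader parameters.

```python
import cv2
import numpy as np

def replace_background(fg, bg, mask, wrap_strength=0.4):
    # fg, bg: HxWx3 frames; mask: HxW foreground probability at frame resolution.
    m = mask.astype(np.float32)[..., None]
    comp = m * fg + (1.0 - m) * bg                      # plain alpha composite
    # Estimate how much background light "wraps" onto the person: blur the
    # background coverage and keep only the portion falling inside the mask.
    spill = cv2.GaussianBlur(1.0 - mask.astype(np.float32), (0, 0), 5)[..., None] * m
    bg_soft = cv2.GaussianBlur(bg, (0, 0), 5).astype(np.float32)
    out = comp * (1.0 - wrap_strength * spill) + bg_soft * (wrap_strength * spill)
    return np.clip(out, 0, 255).astype(np.uint8)
```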

Background replacement examples.

Performance
To optimize the experience for different devices, we provide model variants at multiple input sizes (i.e., 256×144 and 160×96 in the current release), automatically selecting the best according to available hardware resources.

We evaluated the speed of model inference and the end-to-end pipeline on two common devices: MacBook Pro 2018 with 2.2 GHz 6-Core Intel Core i7, and Acer Chromebook 11 with Intel Celeron N3060. For 720p input, the MacBook Pro can run the higher-quality model at 120 FPS and the end-to-end pipeline at 70 FPS, while the Chromebook runs inference at 62 FPS with the lower-quality model and 33 FPS end-to-end.

Model     FLOPs   Device               Model Inference    Pipeline
256×144   64M     MacBook Pro (2018)   8.3 ms (120 FPS)   14.3 ms (70 FPS)
160×96    27M     Acer Chromebook 11   16.1 ms (62 FPS)   30 ms (33 FPS)
Model inference speed and end-to-end pipeline on high-end (MacBook Pro) and low-end (Chromebook) laptops.

For quantitative evaluation of model accuracy, we adopt the popular metrics of intersection-over-union (IOU) and boundary F-measure. Both models achieve high quality, especially for having such a lightweight network:

Model     IOU       Boundary F-measure
256×144   93.58%    0.9024
160×96    90.79%    0.8542
Evaluation of model accuracy, measured by IOU and boundary F-score.

We also release the accompanying Model Card for our segmentation models, which details our fairness evaluations. Our evaluation data contains images from 17 geographical subregions of the globe, with annotations for skin tone and gender. Our analysis shows that the model is consistent in its performance across the various regions, skin-tones, and genders, with only small deviations in IOU metrics.

Conclusion
We introduced a new in-browser ML solution for blurring and replacing your background in Google Meet. With this, ML models and OpenGL shaders can run efficiently on the web. The developed features achieve real-time performance with low power consumption, even on low-power devices.

Acknowledgments
Special thanks to the people who worked on this project, in particular Sebastian Jansson, Rikard Lundmark, Stephan Reiter, Fabian Bergmark, Ben Wagner, Stefan Holmer, Dan Gunnarson, Stéphane Hulaud and to all our team members who worked on the technology with us: Siargey Pisarchyk, Karthik Raveendran, Chris McClanahan, Marat Dukhan, Frank Barchard, Ming Guang Yong, Chuo-Ling Chang, Michael Hays, Camillo Lugaresi, Gregory Karpiak, Siarhei Kazakou, Matsvei Zhdanovich, and Matthias Grundmann.

Experimenting with Automatic Video Creation From a Web Page

Posted by Peggy Chi, Senior Research Scientist, and Irfan Essa, Senior Staff Research Scientist, Google Research

At Google, we’re actively exploring how people can use creativity tools powered by machine learning and computational methods when producing multimedia content, from creating music and reframing videos, to drawing and more. One creative process in particular, video production, can especially benefit from such tools, as it requires a series of decisions about what content is best suited to a target audience, how to position the available assets within the field of view, and what temporal arrangement will yield the most compelling narrative. But what if one could leverage existing assets, such as a website, to get a jump-start on video creation? Businesses commonly host websites that contain rich visual representations of their services or products, all of which could be repurposed for other multimedia formats, such as videos, potentially enabling those without extensive resources to reach a broader audience.

In “Automatic Video Creation From a Web Page”, published at UIST 2020, we introduce URL2Video, a research prototype pipeline to automatically convert a web page into a short video, given temporal and visual constraints provided by the content owner. URL2Video extracts assets (text, images, or videos) and their design styles (including fonts, colors, graphical layouts, and hierarchy) from HTML sources and organizes the visual assets into a sequence of shots, while maintaining a look-and-feel similar to the source page. Given a user-specified aspect ratio and duration, it then renders the repurposed materials into a video that is ideal for product and service advertising.

URL2Video Overview
Assume a user provides a URL to a web page that illustrates their business. The URL2Video pipeline automatically selects key content from the page and decides the temporal and visual presentation of each asset, based on a set of heuristics derived from an interview study with designers who were familiar with web design and video ad creation. These designer-informed heuristics capture common video editing styles, including content hierarchy, constraining the amount of information in a shot and its time duration, providing consistent color and style for branding, and more. Using this information, the URL2Video pipeline parses a web page, analyzing the content and selecting visually salient text or images while preserving their design styles, which it organizes according to the video specifications provided by the user.

By extracting the structural content and design from the input web page, URL2Video makes automatic editing decisions to present key messages in a video. It considers the temporal (e.g., the duration in seconds) and spatial (e.g., the aspect ratio) constraints of the output video defined by users.

Webpage Analysis
Given a webpage URL, URL2Video extracts document object model (DOM) information and multimedia materials. For the purposes of our research prototype, we limited the domain to static web pages that contain salient assets and headings preserved in an HTML hierarchy that follows recent web design principles, which encourage the use of prominent elements, distinct sections, and an order of visual focus that guides readers in perceiving information. URL2Video identifies such visually-distinguishable elements as a candidate list of asset groups, each of which may contain a heading, a product image, detailed descriptions, and call-to-action buttons, and captures both the raw assets (text and multimedia files) and detailed design specifications (HTML tags, CSS styles, and rendered locations) for each element. It then ranks the asset groups by assigning each a priority score based on their visual appearance and annotations, including their HTML tags, rendered sizes, and ordering shown on the page. In this way, an asset group that occupies a larger area at the top of the page receives a higher score.
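
A much-simplified Python sketch of this analysis step is shown below. The tag weights and the decay with document order are made-up heuristics for illustration; the actual pipeline also uses rendered sizes and positions, which require a real browser renderer.

```python
from bs4 import BeautifulSoup

# Hypothetical prominence weights per HTML tag; not URL2Video's actual values.
TAG_WEIGHT = {"h1": 3.0, "h2": 2.0, "h3": 1.5, "img": 2.0, "button": 1.5, "p": 1.0}

def score_asset_groups(html):
    soup = BeautifulSoup(html, "html.parser")
    groups = []
    for order, el in enumerate(soup.find_all(["section", "div"])):
        # Only direct children, as a crude notion of an asset "group".
        assets = el.find_all(list(TAG_WEIGHT), recursive=False)
        if not assets:
            continue
        # Prominent tags raise the score; elements lower on the page decay it.
        score = sum(TAG_WEIGHT[a.name] for a in assets) / (1.0 + 0.05 * order)
        groups.append({"score": score,
                       "n_assets": len(assets),
                       "text": el.get_text(" ", strip=True)[:80]})
    return sorted(groups, key=lambda g: g["score"], reverse=True)
```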

Constraints-Based Asset Selection
We consider two goals when composing a video: (1) each video shot should provide concise information, and (2) the visual design should be consistent with the source page. Based on these goals and the video constraints provided by the user, including the intended video duration (in seconds) and aspect ratio (commonly 16:9, 4:3, 1:1, etc.), URL2Video automatically selects and orders the asset groups to optimize the total priority score. To make the content concise, it presents only dominant elements from a page, such as a headline and a few multimedia assets. It constrains the duration of each visual element for viewers to perceive the content. In this way, a short video highlights the most salient information from the top of the page, and a longer video contains more campaigns or products.
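
Continuing the sketch above, a toy selection pass might greedily keep the highest-priority groups until the requested duration budget is filled; the per-shot duration rule here is an assumption, not the paper's exact policy.

```python
def select_shots(groups, total_seconds, min_shot=2.0, max_shot=5.0):
    # `groups` is a list of dicts with "score" and "n_assets", sorted by score.
    shots, used = [], 0.0
    for g in groups:
        # Give content-heavier groups longer shots, within fixed bounds.
        duration = min(max_shot, max(min_shot, 0.5 * g["n_assets"]))
        if used + duration > total_seconds:
            continue
        shots.append({**g, "duration": duration})
        used += duration
    return shots
```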

Scene Composition & Video Rendering
Given an ordered list of assets based on the DOM hierarchy, URL2Video follows the design heuristics obtained from interview studies to make decisions about both the temporal and spatial arrangement to present the assets in individual shots. It transfers the graphical layout of elements into the video’s aspect ratio, and applies the style choices including fonts and colors. To make a video more dynamic and engaging, it adjusts the presentation timing of assets. Finally, it renders the content into a video in the MPEG-4 container format.

User Control
The interface to the research prototype allows the user to review the design attributes in each video shot extracted from the source page, reorder the materials, change the detailed design, such as colors and fonts, and adjust the constraints to generate a new video.

In URL2Video’s authoring interface (left), users specify the input URL to a source page, size of the target page view, and the output video parameters. URL2Video analyzes the web page and extracts major visual components. It composes a series of scenes and visualizes the key frames as a storyboard. These components are rendered into an output video that satisfies the input temporal and spatial constraints. Users can playback the video, examine the design attributes (bottom-right), and make adjustments to generate video variation, such as reordering the scenes (top-right).

URL2Video Use Cases
We demonstrate the performance of the end-to-end URL2Video pipeline on a variety of existing web pages. Below we highlight an example result where URL2Video converts a page that embeds multiple short video clips into a 12-second output video. Note how the pipeline makes automatic editing decisions on font and color choices, timing, and content ordering in a video captured from the source page.

URL2Video identifies key content from our Google Search introduction page (top), including headings and video assets. It converts them into a video by considering the presentation flow, the source design and the output constraints (a 12-second landscape video; bottom).

To evaluate the automatically-generated videos, we conducted a user study with designers at Google. Our results show that URL2Video effectively extracted design elements from a web page and supported designers by bootstrapping the video creation process.

Next steps
While this current research focuses on the visual presentation, we are developing new techniques that support the audio track and a voiceover in video editing. All in all, we envision a future where creators focus on making high-level decisions and an ML model interactively suggests detailed temporal and graphical edits for a final video creation on multiple platforms.

Acknowledgments
We greatly thank our paper co-authors, Zheng Sun (Research) and Katrina Panovich (YouTube). We would also like to thank our colleagues who contributed to URL2Video, (in alphabetical order of last name) Jordan Canedy, Brian Curless, Nathan Frey, Madison Le, Alireza Mahdian, Justin Parra, Emily Ryan, Mogan Shieh, Sandor Szego, and Weilong Yang. We are grateful to receive the support from our leadership, Tomas Izo, Rahul Sukthankar, and Jay Yagnik.

Estimating the Impact of Training Data with Reinforcement Learning

Posted by Jinsung Yoon and Sercan O. Arik, Research Scientists, Cloud AI Team, Google Research

Recent work suggests that not all data samples are equally useful for training, particularly for deep neural networks (DNNs). Indeed, if a dataset contains low-quality or incorrectly labeled data, one can often improve performance by removing a significant portion of training samples. Moreover, in cases where there is a mismatch between the train and test datasets (e.g., due to difference in train and test location or time), one can also achieve higher performance by carefully restricting samples in the training set to those most relevant for the test scenario. Because of the ubiquity of these scenarios, accurately quantifying the values of training samples has great potential for improving model performance on real-world datasets.

Top: Examples of low-quality samples (noisy/crowd-sourced); Bottom: Examples of a train and test mismatch.

In addition to improving model performance, assigning a quality value to individual data can also enable new use cases. It can be used to suggest better practices for data collection, e.g., what kinds of additional data would benefit the most, and can be used to construct large-scale training datasets more efficiently, e.g., by web searching using the labels as keywords and filtering out less valuable data.

In “Data Valuation Using Deep Reinforcement Learning”, accepted at ICML 2020, we address the challenge of quantifying the value of training data using a novel approach based on meta-learning. Our method integrates data valuation into the training procedure of a predictor model that learns to recognize samples that are more valuable for the given task, improving both predictor and data valuation performance. We have also launched four AI Hub Notebooks that exemplify the use cases of DVRL and are designed to be conveniently adapted to other tasks and datasets, such as domain adaptation, corrupted sample discovery and robust learning, transfer learning on image data, and data valuation.

Quantifying the Value of Data
Not all data are equal for a given ML model — some have greater relevance for the task at hand or are more rich in informative content than others. So how does one evaluate the value of a single datum? At the granularity of a full dataset, it is straightforward; one can simply train a model on the entire dataset and use its performance on a test set as its value. However, estimating the value of a single datum is far more difficult, especially for complex models that rely on large-scale datasets, because it is computationally infeasible to re-train and re-evaluate a model on all possible subsets.

To tackle this, researchers have explored permutation-based methods (e.g., influence functions), and game theory-based methods (e.g., data Shapley). However, even the best current methods are far from being computationally feasible for large datasets and complex models, and their data valuation performance is limited. Concurrently, meta learning-based adaptive weight assignment approaches have been developed to estimate the weight values using a meta-objective. But rather than prioritizing learning from high value data samples, their data value mapping is typically based on gradient descent learning or other heuristic approaches that alter the conventional predictor model training dynamics, which can result in performance changes that are unrelated to the value of individual data points.

Data Valuation Using Reinforcement Learning (DVRL)
To infer the data values, we propose a data value estimator (DVE) that estimates data values and selects the most valuable samples to train the predictor model. This selection operation is fundamentally non-differentiable and thus conventional gradient descent-based methods cannot be used. Instead, we propose to use reinforcement learning (RL) such that the supervision of the DVE is based on a reward that quantifies the predictor performance on a small (but clean) validation set. The reward guides the optimization of the policy towards the action of optimal data valuation, given the state and input samples. Here, we treat the predictor model learning and evaluation framework as the environment, a novel application scenario of RL-assisted machine learning.

Training with Data Value Estimation using Reinforcement Learning (DVRL). When training the data value estimator with an accuracy reward, the most valuable samples (denoted with green dots) are used more and more, whereas the least valuable samples (red dots) are used less frequently.
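
To make the loop concrete, below is a small, self-contained sketch of the DVRL idea (not the paper's architecture): a logistic data value estimator is updated with a REINFORCE-style gradient, using the predictor's validation accuracy minus a moving baseline as the reward. The synthetic data, feature set, and learning rates are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, d=5, noise=0.0):
    x = rng.normal(size=(n, d))
    y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)
    flipped = rng.random(n) < noise            # corrupt a fraction of the labels
    y[flipped] = 1 - y[flipped]
    return x, y, flipped

x_tr, y_tr, noisy = make_data(1000, noise=0.3)
x_va, y_va, _ = make_data(200)

def feats(x, y):                               # estimator input: features + label + bias
    return np.hstack([x, y[:, None], np.ones((len(x), 1))])

w = np.zeros(x_tr.shape[1] + 2)                # data value estimator parameters
baseline = 0.5                                 # moving baseline for the reward

for step in range(300):
    idx = rng.choice(len(x_tr), size=256, replace=False)
    f = feats(x_tr[idx], y_tr[idx])
    p = 1.0 / (1.0 + np.exp(-f @ w))           # per-sample data value in (0, 1)
    sel = rng.random(len(p)) < p               # sample a selection mask
    if sel.sum() < 10 or len(np.unique(y_tr[idx][sel])) < 2:
        continue
    predictor = LogisticRegression(max_iter=200).fit(x_tr[idx][sel], y_tr[idx][sel])
    acc = predictor.score(x_va, y_va)          # clean validation set drives the reward
    reward, baseline = acc - baseline, 0.9 * baseline + 0.1 * acc
    grad = f.T @ (sel.astype(float) - p)       # REINFORCE gradient of the mask log-prob
    w += 0.01 * reward * grad

values = 1.0 / (1.0 + np.exp(-feats(x_tr, y_tr) @ w))
print("mean value, clean labels:", values[~noisy].mean(),
      "| noisy labels:", values[noisy].mean())
```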

Results
We evaluate the data value estimation quality of DVRL on multiple types of datasets and use cases.

    • Model performance after removing high/low value samples
      Removing low value samples from the training dataset can improve the predictor model performance, especially in the cases where the training dataset contains corrupted samples. On the other hand, removing high value samples, especially if the dataset is small, decreases the performance significantly. Overall, the performance after removing high/low value samples is a strong indicator for the quality of data valuation.
      Accuracy with the removal of most and least valuable samples, where 20% of the labels are noisy by design. By removing such noisy labels as the least valuable samples, a high-quality data valuation method achieves better accuracy. We demonstrate that DVRL outperforms other methods significantly from this perspective.

      DVRL shows the fastest performance degradation after removing the most important samples and the slowest performance degradation after removing the least important samples in most cases, underlining the superiority of DVRL in identifying noisy labels compared to competing methods (Leave-One-Out and Data Shapley).

    • Robust learning with noisy labels
      We consider how reliably DVRL can learn with noisy data in an end-to-end way, without removing the low-value samples. Ideally, noisy samples should get low data values as DVRL converges and a high performance model would be returned.
      Robust learning with noisy labels. Test accuracy for ResNet-32 and WideResNet-28-10 on CIFAR-10 and CIFAR-100 datasets with 40% of uniform random noise on labels. DVRL outperforms other popular methods that are based on meta-learning.

      We show state-of-the-art results with DVRL in minimizing the impact of noisy labels. These also demonstrate that DVRL can scale to complex models and large-scale datasets.

    • Domain adaptation
      We consider the scenario where the training dataset comes from a substantially different distribution from the validation and testing datasets. Data valuation is expected to be beneficial for this task by selecting the samples from the training dataset that best match the distribution of the validation dataset. We focus on three cases: (1) a training set based on image search results (low-quality web-scraped) applied to the task of predicting skin lesion classification using HAM 10000 data (high-quality medical); (2) an MNIST training set for a digit recognition task on USPS data (different visual domain); (3) e-mail spam data to detect spam applied to an SMS dataset (different task). DVRL yields significant improvements for domain adaptation, by jointly optimizing the data valuator and corresponding predictor model.

Conclusions
We propose a novel meta learning framework for data valuation, which determines how likely each training sample is to be used in training the predictor model. Unlike previous works, our method integrates data valuation into the training procedure of the predictor model, allowing the predictor and DVE to improve each other’s performance. We model this data value estimation task using a DNN trained through RL with a reward obtained from a small validation set that represents the target task performance. In a computationally-efficient way, DVRL can provide high quality ranking of training data that is useful for domain adaptation, corrupted sample discovery and robust learning. We show that DVRL significantly outperforms alternative methods on diverse types of tasks and datasets.

Acknowledgements
We gratefully acknowledge the contributions of Tomas Pfister.

Exploring AI for radiotherapy planning with Mayo Clinic

More than 18 million new cancer cases are diagnosed globally each year, and radiotherapy is one of the most common cancer treatments—used to treat over half of cancers in the United States. But planning for a course of radiotherapy treatment is often a time-consuming and manual process for clinicians. The most labor-intensive step in planning is a technique called “contouring,” which involves segmenting both the areas of cancer and nearby healthy tissues that are susceptible to radiation damage during treatment. Clinicians have to painstakingly draw lines around sensitive organs on scans—a time-intensive process that can take up to seven hours for a single patient.

Technology has the potential to augment the work of doctors and other care providers, like the specialists who plan radiotherapy treatment. We’re collaborating with Mayo Clinic on research to develop an AI system that can support physicians, help reduce treatment planning time and improve the efficiency of radiotherapy. In this research partnership, Mayo Clinic and Google Health will work to develop an algorithm to assist clinicians in contouring healthy tissue and organs from tumors, and conduct research to better understand how this technology could be deployed effectively in clinical practice. 

Mayo Clinic is an international center of excellence for cancer treatment with world-renowned radiation oncologists. Google researchers have studied how AI can potentially be used to augment other areas of healthcare—from mammographies to the early deployment of an AI system that detects diabetic retinopathy using eye scans. 

In a previous collaboration with University College London Hospitals, Google researchers demonstrated how an AI system could analyze and segment medical scans of patients with head and neck cancer— similar to how expert clinicians would. Our research with Mayo Clinic will also focus on head and neck cancers, which are particularly challenging areas to contour, given the many delicate structures that sit close together. 

In this first phase of research with Mayo Clinic, we hope to develop and validate a model as well as study how an AI system could be deployed in practice. The technology will not be used in a clinical setting and algorithms will be developed using only de-identified data. 

While cancer rates continue to rise, the shortage of radiotherapy experts continues to grow as well. Waiting for a radiotherapy treatment plan can be an agonizing experience for cancer patients, and we hope this research will eventually support a faster planning process and potentially help patients to access treatment sooner.

How Eurovision inspired a research intern's project

Research happens at Google every day, on many different embedded teams throughout the company. For example, Amit Moryossef developed a machine learning model for sign language detection while interning this year with our Language team in Zurich. With our 2021 Research Internship applications opening this month, Amit chatted with us about what his experience has been like.

How did you end up pursuing research around sign language processing?

After finishing college, I started a master’s degree in computer science at Bar-Ilan University. While I was there, I was introduced to deep learning, and to doing research. I worked on natural language processing, specifically looking at text generation and gender bias in machine translation. I planned for those years to be my final years in an academic setting, and then I’d go into the workforce.

Everything changed, surprisingly, after I watched the 2019 Eurovision Song Contest. They had sign-language interpretations of the songs. I realized how much of the world is not built to be accessible to the Deaf and hard of hearing communities, and this led to a bit of a shift in my plans.

Today I’m doing a PhD in computer science, working on sign language processing with the hope of making the world more accessible. This is also the topic of research I worked on at Google during my internship.

Why did you apply for an internship at Google? 

Google always seemed to me like a great place to work — a place that would have all of the resources I could ever need, both computationally and personally. I applied to Google with the honest belief that this is the best place for me to do research on what I am passionate about, and make that research available to everyone.

How did the ongoing pandemic affect your internship?

In March, I was still in denial that this would affect me, and I was hoping the internship would go as planned. In April, I received the message saying the internship would move to a virtual model which was initially disappointing on a personal level, but made sense as the world was going deeper into lockdown.

The remote nature of the internship introduced new challenges. Having a supportive manager and caring recruiter were some of the key factors for me in dealing with some of these challenges successfully—helping me get assistance with unfamiliar tools, fostering relationships with new colleagues and helping me to create and maintain a work-life balance.

What project was your internship focused on? 

My internship project was about sign language detection for video conferencing applications. The task is simple to state: detect when someone uses sign language on a video call, and set them as the current “speaker” of that call, just as a person using their voice would be. This work goes hand in hand with my PhD research—making the world more accessible to people who use sign language.

Maayan Gazuli, an Israeli Sign Language interpreter, demonstrates the sign language detection system.

What was the outcome of your internship? 

We designed the sign-language detection model and built an application that runs it on-device and works with all video-conferencing applications. This means we empower signers to use whichever video-conferencing application they would like, and our system should work just as well.

We published and presented a long paper in the SLRTP workshop, as well as an academic demo and a Google AI blog post. You can try our experimental demo right now! By default, the demo acts as a sign language detector. The training code and models as well as the web demo source code is available on GitHub.

What impact has this internship experience had on your research?

I learned how to better communicate and work with folks who were previously unaware of my research and how to operate within a large organization (compared to academia).

My experience showed me the practical application of my research, and that it is possible to change the world for the better.

Partnerships for advanced weather and climate prediction

When I was a child, growing up on an almond farm in central California, the day always began by turning on the radio and listening to the forecasters talking about temperature, precipitation, and something called “evapotranspiration rate.” I didn’t know what all those terms meant at the time, but I could see how my father made decisions based on what he heard, like when to water, or when to harvest.

Now, when I chat with my colleagues around the world on video conference, they’re making daily decisions based on the weather around them, just like my father did on the farm. Some decisions are routine and others are dramatic, including decisions about what to wear for a walk outside, or how to prepare a family for extreme events like hurricanes and wildfires.

At Google, we’ve been using AI research to develop new methods for understanding and predicting the weather, including hyperlocal precipitation forecasting to support precise personal decision making, flood forecasting in India and Bangladesh, and computational methods that can help improve the accuracy of forecasting technology.

We’re also partnering with institutions that supply forecasts and technology. This month, we began working with the National Oceanic and Atmospheric Administration (NOAA) Satellite and Information Service (NESDIS) to explore the benefits of artificial intelligence (AI) and machine learning (ML) for enhancing NOAA’s use of satellite and environmental data.

Together, NESDIS and Google will use AI and ML to amplify NOAA’s environmental monitoring, weather forecasting and climate research using Google Cloud infrastructure. By working directly with NOAA’s forecast scientists, we’ll be able to utilize the vast amount of satellite and other environmental data that NOAA collects to enhance prediction for extreme weather events, such as hurricanes and tornadoes.

Related, in August, the U.S. National Science Foundation (NSF) announced the AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES) led by Amy McGovern at the University of Oklahoma, with Google as a founding member. This Institute includes seven academic institutions, four private-sector partners, as well as U.S. government and federally-funded labs. AI2ES assembles researchers from the atmospheric and ocean sciences and risk communication to develop trustworthy AI technology to address concerns in weather, climate, and coastal hazards prediction. The team will create educational pathways to develop a more diverse AI and environmental science workforce.

AI2ES logo, an illustration of people and various weather conditions

Now, when I look at the fundamental scope and depth of these partnerships in the atmospheric sciences, I know my father would approve that the work is meaningful and relevant. And then he’d tell me to get back to work.

Rethinking Attention with Performers

Posted by Krzysztof Choromanski and Lucy Colwell, Research Scientists, Google Research

Transformer models have achieved state-of-the-art results across a diverse range of domains, including natural language, conversation, images, and even music. The core block of every Transformer architecture is the attention module, which computes similarity scores for all pairs of positions in an input sequence. This, however, scales poorly with the length of the input sequence, requiring quadratic computation time to produce all similarity scores, as well as quadratic memory size to construct a matrix to store these scores.

For applications where long-range attention is needed, several fast and more space-efficient proxies have been proposed such as memory caching techniques, but a far more common way is to rely on sparse attention. Sparse attention reduces computation time and the memory requirements of the attention mechanism by computing a limited selection of similarity scores from a sequence rather than all possible pairs, resulting in a sparse matrix rather than a full matrix. These sparse entries may be manually proposed, found via optimization methods, learned, or even randomized, as demonstrated by such methods as Sparse Transformers, Longformers, Routing Transformers, Reformers, and Big Bird. Since sparse matrices can also be represented by graphs and edges, sparsification methods are also motivated by the graph neural network literature, with specific relationships to attention outlined in Graph Attention Networks. Such sparsity-based architectures usually require additional layers to implicitly produce a full attention mechanism.

Standard sparsification techniques. Left: Example of a sparsity pattern, where tokens attend only to other nearby tokens. Right: In Graph Attention Networks, tokens attend only to their neighbors in the graph, which should have higher relevance than other nodes. See Efficient Transformers: A Survey for a comprehensive categorization of various methods.

Unfortunately, sparse attention methods can still suffer from a number of limitations. (1) They require efficient sparse-matrix multiplication operations, which are not available on all accelerators; (2) they usually do not provide rigorous theoretical guarantees for their representation power; (3) they are optimized primarily for Transformer models and generative pre-training; and (4) they usually stack more attention layers to compensate for sparse representations, making them difficult to use with other pre-trained models, thus requiring retraining and significant energy consumption. In addition to these shortcomings, sparse attention mechanisms are often still not sufficient to address the full range of problems to which regular attention methods are applied, such as Pointer Networks. There are also some operations that cannot be sparsified, such as the commonly used softmax operation, which normalizes similarity scores in the attention mechanism and is used heavily in industry-scale recommender systems.

To resolve these issues, we introduce the Performer, a Transformer architecture with attention mechanisms that scale linearly, thus enabling faster training while allowing the model to process longer lengths, as required for certain image datasets such as ImageNet64 and text datasets such as PG-19. The Performer uses an efficient (linear) generalized attention framework, which allows a broad class of attention mechanisms based on different similarity measures (kernels). The framework is implemented by our novel Fast Attention Via Positive Orthogonal Random Features (FAVOR+) algorithm, which provides scalable low-variance and unbiased estimation of attention mechanisms that can be expressed by random feature map decompositions (in particular, regular softmax-attention). We obtain strong accuracy guarantees for this method while preserving linear space and time complexity, which can also be applied to standalone softmax operations.

Generalized Attention
In the original attention mechanism, the query and key inputs, corresponding respectively to rows and columns of a matrix, are multiplied together and passed through a softmax operation to form an attention matrix, which stores the similarity scores. Note that in this method, one cannot decompose the query-key product back into its original query and key components after passing it into the nonlinear softmax operation. However, it is possible to decompose the attention matrix back to a product of random nonlinear functions of the original queries and keys, otherwise known as random features, which allows one to encode the similarity information in a more efficient manner.

LHS: The standard attention matrix, which contains all similarity scores for every pair of entries, formed by a softmax operation on the query and keys, denoted by q and k. RHS: The standard attention matrix can be approximated via lower-rank randomized matrices Q′ and K′ with rows encoding potentially randomized nonlinear functions of the original queries/keys. For the regular softmax-attention, the transformation is very compact and involves an exponential function as well as random Gaussian projections.

Regular softmax-attention can be seen as a special case with these nonlinear functions defined by exponential functions and Gaussian projections. Note that we can also reason inversely, by implementing more general nonlinear functions first, implicitly defining other types of similarity measures, or kernels, on the query-key product. We frame this as generalized attention, based on earlier work in kernel methods. Although for most kernels, closed-form formulae do not exist, our mechanism can still be applied since it does not rely on them.

To the best of our knowledge, we are the first to show that any attention matrix can be effectively approximated in downstream Transformer-applications using random features. The novel mechanism enabling this is the use of positive random features, i.e., positive-valued nonlinear functions of the original queries and keys, which prove to be crucial for avoiding instabilities during training and provide more accurate approximation of the regular softmax attention mechanism.
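
A minimal numpy sketch of the idea follows, using plain i.i.d. Gaussian projections; the full FAVOR+ mechanism additionally orthogonalizes the random features to reduce variance.

```python
import numpy as np

def positive_features(x, W):
    # x: (L, d) queries or keys; W: (m, d), rows drawn i.i.d. from N(0, I_d).
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), so phi(q)·phi(k) is an
    # unbiased, always-positive estimate of exp(q·k).
    proj = x @ W.T
    return np.exp(proj - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(W.shape[0])

rng = np.random.default_rng(0)
L, d, m = 6, 16, 1024
W = rng.normal(size=(m, d))
# Split the usual 1/sqrt(d) softmax scaling evenly between queries and keys.
q = rng.normal(size=(L, d)) / d**0.25
k = rng.normal(size=(L, d)) / d**0.25
approx = positive_features(q, W) @ positive_features(k, W).T
exact = np.exp(q @ k.T)
print(np.max(np.abs(approx - exact) / exact))   # small relative error
```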

Towards FAVOR: Fast Attention via Matrix Associativity
The decomposition described above allows one to store the implicit attention matrix with linear, rather than quadratic, memory complexity. One can also obtain a linear time attention mechanism using this decomposition. While the original attention mechanism multiplies the stored attention matrix with the value input to obtain the final result, after decomposing the attention matrix, one can rearrange matrix multiplications to approximate the result of the regular attention mechanism, without explicitly constructing the quadratic-sized attention matrix. This ultimately leads to FAVOR+.

Left: Standard attention module computation, where the final desired result is computed by performing a matrix multiplication with the attention matrix A and value tensor V. Right: By decoupling matrices Q′ and K′ used in lower rank decomposition of A and conducting matrix multiplications in the order indicated by dashed-boxes, we obtain a linear attention mechanism, never explicitly constructing A or its approximation.
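
In code, the rearrangement amounts to multiplying K′ᵀ with V (and with a vector of ones for the normalizer) before touching Q′. A short sketch, taking feature-mapped matrices such as those from the snippet above:

```python
import numpy as np

def linear_attention(Qp, Kp, V):
    # Qp, Kp: (L, m) feature-mapped queries/keys; V: (L, d_v) values.
    kv = Kp.T @ V                           # (m, d_v): key/value summary
    z = Kp.T @ np.ones(Kp.shape[0])         # (m,): normalization statistics
    # O(L * m * d_v) time and memory; the L x L matrix is never formed.
    return (Qp @ kv) / (Qp @ z)[:, None]
```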

The above analysis is relevant for so-called bidirectional attention, i.e., non-causal attention where there is no notion of past and future. For unidirectional (causal) attention, where tokens do not attend to other tokens appearing later in the input sequence, we slightly modify the approach to use prefix-sum computations, which only store running totals of matrix computations rather than storing an explicit lower-triangular regular attention matrix.

Left: Standard unidirectional attention requires masking the attention matrix to obtain its lower-triangular part. Right: Unbiased approximation on the LHS can be obtained via a prefix-sum mechanism, where the prefix-sum of the outer-products of random feature maps for keys and value vectors is built on the fly and left-multiplied by query random feature vector to obtain the new row in the resulting matrix.
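
A corresponding sketch of the causal case, where running prefix sums stand in for the lower-triangular attention matrix:

```python
import numpy as np

def causal_linear_attention(Qp, Kp, V):
    # Qp, Kp: (L, m) feature-mapped queries/keys; V: (L, d_v) values.
    L, m = Qp.shape
    S = np.zeros((m, V.shape[1]))            # running sum of outer(k'_i, v_i)
    z = np.zeros(m)                          # running sum of k'_i
    out = np.empty((L, V.shape[1]))
    for i in range(L):                       # position i attends only to j <= i
        S += np.outer(Kp[i], V[i])
        z += Kp[i]
        out[i] = (Qp[i] @ S) / (Qp[i] @ z)
    return out
```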

Properties
We first benchmark the space- and time-complexity of the Performer and show that the attention speedups and memory reductions are empirically nearly optimal, i.e., very close to simply not using an attention mechanism at all in the model.

Bidirectional timing for the regular Transformer model in log-log plot with time (T) and length (L). Lines end at the limit of GPU memory. The black line (X) denotes the maximum possible memory compression and speedups when using a “dummy” attention block, which essentially bypasses attention calculations and demonstrates the maximum possible efficiency of the model. The Performer model is nearly able to reach this optimal performance in the attention component.

We further show that the Performer, using our unbiased softmax approximation, is backwards compatible with pretrained Transformer models after a bit of fine-tuning, which could potentially lower energy costs by improving inference speed, without having to fully retrain pre-existing models.

Using the One Billion Word Benchmark (LM1B) dataset, we transferred the original pre-trained Transformer weights to the Performer model, which produces an initial non-zero 0.07 accuracy (dotted orange line). Once fine-tuned however, the Performer quickly recovers accuracy in a small fraction of the original number of gradient steps.

Example Application: Protein Modeling
Proteins are large molecules with complex 3D structures and specific functions that are essential to life. Like words, proteins are specified as linear sequences where each character is one of 20 amino acid building blocks. Applying Transformers to large unlabeled corpora of protein sequences (e.g. UniRef) yields models that can be used to make accurate predictions about the folded, functional macromolecule. Performer-ReLU (which uses ReLU-based attention, an instance of generalized attention that is different from softmax) performs strongly at modeling protein sequence data, while Performer-Softmax matches the performance of the Transformer, as predicted by our theoretical results.

Performance at modeling protein sequences. Train = Dashed, Validation = Solid, Unidirectional = (U), Bidirectional = (B). We use the 36-layer model parameters from ProGen (2019) for all runs, each using a 16×16 TPU-v2. Batch sizes were maximized for each run, given the corresponding compute constraints.

Below we visualize a protein Performer model, trained using the ReLU-based approximate attention mechanism. Using the Performer to estimate similarity between amino acids recovers similar structure to well-known substitution matrices obtained by analyzing evolutionary substitution patterns across carefully curated sequence alignments. More generally, we find local and global attention patterns consistent with Transformer models trained on protein data. The dense attention approximation of the Performer has the potential to capture global interactions across multiple protein sequences. As a proof of concept, we train models on long concatenated protein sequences, which overloads the memory of a regular Transformer model, but not the Performer due to its space efficiency.

Left: Amino acid similarity matrix estimated from attention weights. The model recognizes highly similar amino acid pairs such as (D,E) and (F,Y), despite only having access to protein sequences without prior information about biochemistry. Center: Attention matrices from 4 layers (rows) and 3 selected heads (columns) for the BPT1_BOVIN protein, showing local and global attention patterns.
Performance on sequences up to length 8192 obtained by concatenating individual protein sequences. To fit into TPU memory, the Transformer’s size (number of layers and embedding dimensions) was reduced.

Conclusion
Our work contributes to the recent efforts on non-sparsity based methods and kernel-based interpretations of Transformers. Our method is interoperable with other techniques like reversible layers and we have even integrated FAVOR with the Reformer’s code. We provide the links for the paper, Performer code, and the Protein Language Modeling code. We believe that our research opens up a brand new way of thinking about attention, Transformer architectures, and even kernel methods.

Acknowledgements
This work was performed by the core Performers designers Krzysztof Choromanski (Google Brain Team, Tech and Research Lead), Valerii Likhosherstov (University of Cambridge) and Xingyou Song (Google Brain Team), with contributions from David Dohan, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, and Adrian Weller. We give special thanks to the Applied Science Team for jointly leading the research effort on applying efficient Transformer architectures to protein sequence data.

We additionally wish to thank Joshua Meier, John Platt, and Tom Weingarten for many fruitful discussions on biological data and useful comments on this draft, along with Yi Tay and Mostafa Dehghani for discussions on comparing baselines. We further thank Nikita Kitaev and Wojciech Gajewski for multiple discussions on the Reformer, and Aurko Roy and Ashish Vaswani for multiple discussions on the Routing Transformer.

Announcing the Recipients of the 2020 Award for Inclusion Research

Posted by Negar Saei, Program Manager, Google Research

At Google, it is our ongoing goal to support faculty who are conducting innovative research that will have positive societal impact. As part of that goal, earlier this year we launched the Award for Inclusion Research program, a global program that supports academic research in computing and technology addressing the needs of underrepresented populations. The Award for Inclusion Research program allows faculty and Google researchers an opportunity to partner on their research initiatives and build new and constructive long-term relationships.

We received 100+ applications from over 100 universities, globally, and today we are excited to announce the 16 proposals chosen for funding, focused on an array of topics around diversity and inclusion, algorithmic bias, education innovation, health tools, accessibility, gender bias, AI for social good, security, and social justice. The proposals include 25 principal investigators who focus on making the community stronger through their research efforts.

Congratulations to this year’s recipients:

Human Centred Technology Design for Social Justice in Africa
Anicia Peters (University of Namibia) and Shaimaa Lazem (City for Scientific Research and Technological Applications, Egypt)

Modern NLP for Regional and Dialectal Language Variants
Antonios Anastasopoulos (George Mason University)

Culturally Relevant Collaborative Health Tracking Tools for Motivating Heart-Healthy Behaviors Among African Americans
Aqueasha Martin-Hammond (Indiana University – Purdue University Indianapolis) and Tanjala S. Purnell (Johns Hopkins University)

Characterizing Energy Equity in the United States
Destenie Nock and Constantine Samaras (Carnegie Mellon University)

Developing a Dialogue System for a Culturally-Responsive Social Programmable Robot
Erin Walker (University of Pittsburgh) and Leshell Hatley (Coppin State University)

Eliminating Gender Bias in NLP Beyond English
Hinrich Schuetze (LMU Munich)

The Ability-Based Design Mobile Toolkit: Enabling Accessible Mobile Interactions through Advanced Sensing and Modeling
Jacob O. Wobbrock (University of Washington)

Mutual aid and community engagement: Community-based mechanisms against algorithmic bias
Jasmine McNealy (University of Florida)

Empowering Syrian Girls through Culturally Sensitive Mobile Technology and Media Literacy
Karen Elizabeth Fisher (University of Washington) and Yacine Ghamri-Doudane (University of La Rochelle)

Broadening participation in data science through examining the health, social, and economic impacts of gentrification
Latifa Jackson (Howard University) and Hasan Jackson (Howard University)

Understanding How Peer and Near Peer Mentors co-Facilitating the Active Learning Process of Introductory Data Structures Within an Immersive Summer Experience Effected Rising Sophomore Computer Science Student Persistence and Preparedness for Careers in Silicon Valley
Legand Burge (Howard University) and Marlon Mejias (University of North Carolina at Charlotte)

Who is Most Likely to Advocate for this Case? A Machine Learning Approach
Maria De-Arteaga (University of Texas at Austin)

Contextual Rendering of Equations for Visually Impaired Persons
Meenakshi Balakrishnan (Indian Institute of Technology Delhi, India) and Volker Sorge (University of Birmingham)

Measuring the Cultural Competence of Computing Students and Faculty Nationwide to Improve Diversity, Equity, and Inclusion
Nicki Washington (Duke University)

Designing and Building Collaborative Tools for Mixed-Ability Programming Teams
Steve Oney (University of Michigan)

Iterative Design of a Black Studies Research Computing Initiative through `Flipped Research’
Timothy Sherwood and Sharon Tettegah (University of California, Santa Barbara)

Duplex is getting smarter and making life a little easier

In 2018, we introduced Duplex, our AI technology that uses natural conversation to get things done. Since then, we’ve been exploring how conversational technology can both be easy to interact with and help people save more time.

Today, during our Search On event, we shared an update on how Duplex and Google Assistant are helping people in their everyday lives. From providing more accurate business information in products like Google Maps, to booking appointments and reservations on your behalf, to waiting on hold for you, we’re continuing to bring Duplex to new places to make life a little easier.

Keeping local businesses information fresh 

This pandemic has shown us how critical up-to-date local information is, both for people trying to find services nearby and for businesses looking for ways to serve their customers. Whether you’re looking to grab dinner from your favorite restaurant or stop by your neighborhood florist, chances are you’ll check their hours of operation online first, and maybe find out if they offer things like dine-in or curbside pickup. 

To help people find accurate information about local businesses online, Duplex conversational technology is now calling businesses to automatically update business listings on Search and Maps with modified store hours and details like takeout or no-contact delivery. We began using Duplex to automatically update business information and add it to Search and Maps at scale in the U.S. last year. That means business owners don’t have to worry about manually updating these details, and potential customers get access to the most accurate information. When the pandemic started, we expanded business updates to eight countries, and have since made over 3 million updates to businesses like pharmacies, restaurants and grocery stores that have been seen over 20 billion times in Maps and Search. 

A personal assistant to save you time

From restaurant reservations to salon appointments, Duplex powers Google Assistant to help people save time, having completed more than a million bookings since its launch. So whenever you’re ready to dine out again, you can try asking your Assistant to book you a table at your favorite restaurant and let Duplex get it done. 

With Duplex on the web, Google Assistant can also complete tasks on the mobile web that would otherwise take up to 20 steps to complete, like booking a rental car or buying movie tickets. And we’re currently piloting the same experience with things like shopping and food ordering for a faster checkout.

Another way conversational AI helps people save time is with Call Screen, which lets Google Assistant answer unknown calls on Android phones to avoid spam calls. Every month, Call Screen helps save more than 2 million minutes on the phone. And now with Hold for Me, Duplex is powering Google Assistant to wait on hold for you and let you know when someone is on the line. 

More natural conversations

We still have a way to go towards having truly natural-feeling conversations with machines, so it’s exciting to see the great progress across the industry in neural speech recognition and synthesis, and in our own new language understanding models. For Duplex, these and many other advancements translate into significant improvements in quality. In fact, 99 percent of calls made by Duplex today are entirely automated. 

Our ability to interact with technology as naturally as we interact with each other remains a long-standing promise. As Duplex continues to take steps in this direction, we remain committed to developing our conversational technology in a responsible way, upholding the standards outlined in our AI principles and with transparency. For example, we always disclose that you’re speaking with an automated system when making a call. We’re excited by how far we’ve come, and more importantly, by how many people and businesses this technology can help in ways big and small.
