Learning from deep learning: a case study of feature discovery and validation in pathology

Learning from deep learning: a case study of feature discovery and validation in pathology

When a patient is diagnosed with cancer, one of the most important steps is examination of the tumor under a microscope by pathologists to determine the cancer stage and to characterize the tumor. This information is central to understanding clinical prognosis (i.e., likely patient outcomes) and for determining the most appropriate treatment, such as undergoing surgery alone versus surgery plus chemotherapy. Developing machine learning (ML) tools in pathology to assist with the microscopic review represents a compelling research area with many potential applications.

Previous studies have shown that ML can accurately identify and classify tumors in pathology images and can even predict patient prognosis using known pathology features, such as the degree to which gland appearances deviate from normal. While these efforts focus on using ML to detect or quantify known features, alternative approaches offer the potential to identify novel features. The discovery of new features could in turn further improve cancer prognostication and treatment decisions for patients by extracting information that isn’t yet considered in current workflows.

Today, we’d like to share progress we’ve made over the past few years towards identifying novel features for colorectal cancer in collaboration with teams at the Medical University of Graz in Austria and the University of Milano-Bicocca (UNIMIB) in Italy. Below, we will cover several stages of the work: (1) training a model to predict prognosis from pathology images without specifying the features to use, so that it can learn what features are important; (2) probing that prognostic model using explainability techniques; and (3) identifying a novel feature and validating its association with patient prognosis. We describe this feature and evaluate its use by pathologists in our recently published paper, “Pathologist validation of a machine-learned feature for colon cancer risk stratification”. To our knowledge, this is the first demonstration that medical experts can learn new prognostic features from machine learning, a promising start for the future of this “learning from deep learning” paradigm.

Training a prognostic model to learn what features are important

One potential approach to identifying novel features is to train ML models to directly predict patient outcomes using only the images and the paired outcome data. This is in contrast to training models to predict “intermediate” human-annotated labels for known pathologic features and then using those features to predict outcomes.

Initial work by our team showed the feasibility of training models to directly predict prognosis for a variety of cancer types using the publicly available TCGA dataset. It was especially exciting to see that for some cancer types, the model’s predictions were prognostic after controlling for available pathologic and clinical features. Together with collaborators from the Medical University of Graz and the Biobank Graz, we subsequently extended this work using a large de-identified colorectal cancer cohort. Interpreting these model predictions became an intriguing next step, but common interpretability techniques were challenging to apply in this context and did not provide clear insights.

Interpreting the model-learned features

To probe the features used by the prognostic model, we used a second model (trained to identify image similarity) to cluster cropped patches of the large pathology images. We then used the prognostic model to compute the average ML-predicted risk score for each cluster.

One cluster stood out for its high average risk score (associated with poor prognosis) and its distinct visual appearance. Pathologists described the images as involving high grade tumor (i.e., least-resembling normal tissue) in close proximity to adipose (fat) tissue, leading us to dub this cluster the “tumor adipose feature” (TAF); see next figure for detailed examples of this feature. Further analysis showed that the relative quantity of TAF was itself highly and independently prognostic.

A prognostic ML model was developed to predict patient survival directly from unannotated giga-pixel pathology images. A second image similarity model was used to cluster cropped patches of pathology images. The prognostic model was used to compute the average model-predicted risk score for each cluster. One cluster, dubbed the “tumor adipose feature” (TAF) stood out in terms of its high average risk score (associated with poor survival) and distinct visual appearance. Pathologists learned to identify TAF and pathologist scoring for TAF was shown to be prognostic.
 
Left: H&E pathology slide with an overlaid heatmap indicating locations of the tumor adipose feature (TAF). Regions highlighted in red/orange are considered to be more likely TAF by the image similarity model, compared to regions highlighted in green/blue or regions not highlighted at all. Right: Representative collection of TAF patches across multiple cases.

Validating that the model-learned feature can be used by pathologists

These studies provided a compelling example of the potential for ML models to predict patient outcomes and a methodological approach for obtaining insights into model predictions. However, there remained the intriguing questions of whether pathologists could learn and score the feature identified by the model while maintaining demonstrable prognostic value.

In our most recent paper, we collaborated with pathologists from the UNIMIB to investigate these questions. Using example images of TAF from the previous publication to learn and understand this feature of interest, UNIMIB pathologists developed scoring guidelines for TAF. If TAF was not seen, the case was scored as “absent”, and if TAF was observed, then “unifocal”, “multifocal”, and “widespread” categories were used to indicate the relative quantity. Our study showed that pathologists could reproducibly identify the ML-derived TAF and that their scoring for TAF provided statistically significant prognostic value on an independent retrospective dataset. To our knowledge, this is the first demonstration of pathologists learning to identify and score a specific pathology feature originally identified by an ML-based approach.

Putting things in context: learning from deep learning as a paradigm

Our work is an example of people “learning from deep learning”. In traditional ML, models learn from hand-engineered features informed by existing domain knowledge. More recently, in the deep learning era, a combination of large-scale model architectures, compute, and datasets has enabled learning directly from raw data, but this is often at the expense of human interpretability. Our work couples the use of deep learning to predict patient outcomes with interpretability methods, to extract new knowledge that could be applied by pathologists. We see this process as a natural next step in the evolution of applying ML to problems in medicine and science, moving from the use of ML to distill existing human knowledge to people using ML as a tool for knowledge discovery.

Traditional ML focused on engineering features from raw data using existing human knowledge. Deep learning enables models to learn features directly from raw data at the expense of human interpretability. Coupling deep learning with interpretability methods provides an avenue for expanding the frontiers of scientific knowledge by learning from deep learning.

Acknowledgements

This work would not have been possible without the efforts of coauthors Vincenzo L’Imperio, Markus Plass, Heimo Muller, Nicolò’ Tamini, Luca Gianotti, Nicola Zucchini, Robert Reihs, Greg S. Corrado, Dale R. Webster, Lily H. Peng, Po-Hsuan Cameron Chen, Marialuisa Lavitrano, David F. Steiner, Kurt Zatloukal, Fabio Pagni. We also appreciate the support from Verily Life Sciences and the Google Health Pathology teams – in particular Timo Kohlberger, Yunnan Cai, Hongwu Wang, Kunal Nagpal, Craig Mermel, Trissia Brown, Isabelle Flament-Auvigne, and Angela Lin. We also appreciate manuscript feedback from Akinori Mitani, Rory Sayres, and Michael Howell, and illustration help from Abi Jones. This work would also not have been possible without the support of Christian Guelly, Andreas Holzinger, Robert Reihs, Farah Nader, the Biobank Graz, the efforts of the slide digitization team at the Medical University Graz, the participation of the pathologists who reviewed and annotated cases during model development, and the technicians of the UNIMIB team.

Read More

PaLM-E: An embodied multimodal language model

PaLM-E: An embodied multimodal language model

Recent years have seen tremendous advances across machine learning domains, from models that can explain jokes or answer visual questions in a variety of languages to those that can produce images based on text descriptions. Such innovations have been possible due to the increase in availability of large scale datasets along with novel advances that enable the training of models on these data. While scaling of robotics models has seen some success, it is outpaced by other domains due to a lack of datasets available on a scale comparable to large text corpora or image datasets.

Today we introduce PaLM-E, a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. We began with PaLM, a powerful large language model, and “embodied” it (the “E” in PaLM-E), by complementing it with sensor data from the robotic agent. This is the key difference from prior efforts to bring large language models to robotics — rather than relying on only textual input, with PaLM-E we train the language model to directly ingest raw streams of robot sensor data. The resulting model not only enables highly effective robot learning, but is also a state-of-the-art general-purpose visual-language model, while maintaining excellent language-only task capabilities.

<!–

PaLM-E is a generalist model competent with robotics, vision, and language tasks. It can control robots, answer visual questions, and write text – and quantitatively excels at all three relative to state-of-the-art models.

–>

An embodied  language model, and also a visual-language generalist

On the one hand, PaLM-E was primarily developed to be a model for robotics, and it solves a variety of tasks on multiple types of robots and for multiple modalities (images, robot states, and neural scene representations). At the same time, PaLM-E is a generally-capable vision-and-language model. It can perform visual tasks, such as describing images, detecting objects, or classifying scenes, and is also proficient at language tasks, like quoting poetry, solving math equations or generating code.

PaLM-E combines our most recent large language model, PaLM, together with one of our most advanced vision models, ViT-22B. The largest instantiation of this approach, built on PaLM-540B, is called PaLM-E-562B and sets a new state of the art on the visual-language OK-VQA benchmark, without task-specific fine-tuning, and while retaining essentially the same general language performance as PaLM-540B.

How does PaLM-E work?

Technically, PaLM-E works by injecting observations into a pre-trained language model. This is realized by transforming sensor data, e.g., images, into a representation through a procedure that is comparable to how words of natural language are processed by a language model.

Language models rely on a mechanism to represent text mathematically in a way that neural networks can process. This is achieved by first splitting the text into so-called tokens that encode (sub)words, each of which is associated with a high-dimensional vector of numbers, the token embedding. The language model is then able to apply mathematical operations (e.g., matrix multiplication) on the resulting sequence of vectors to predict the next, most likely word token. By feeding the newly predicted word back to the input, the language model can iteratively generate a longer and longer text.

The inputs to PaLM-E are text and other modalities — images, robot states, scene embeddings, etc. — in an arbitrary order, which we call “multimodal sentences”. For example, an input might look like, “What happened between <img_1> and <img_2>?”, where <img_1> and <img_2> are two images. The output is text generated auto-regressively by PaLM-E, which could be an answer to a question, or a sequence of decisions in text form.

PaLM-E model architecture, showing how PaLM-E ingests different modalities (states and/or images) and addresses tasks through multimodal language modeling.

The idea of PaLM-E is to train encoders that convert a variety of inputs into the same space as the natural word token embeddings. These continuous inputs are mapped into something that resembles “words” (although they do not necessarily form discrete sets). Since both the word and image embeddings now have the same dimensionality, they can be fed into the language model.

We initialize PaLM-E for training with pre-trained models for both the language (PaLM) and vision components (Vision Transformer, a.k.a. ViT). All parameters of the model can be updated during training.

Transferring knowledge from large-scale training to robots

PaLM-E offers a new paradigm for training a generalist model, which is achieved by framing robot tasks and vision-language tasks together through a common representation: taking images and text as input, and outputting text. A key result is that PaLM-E attains significant positive knowledge transfer from both the vision and language domains, improving the effectiveness of robot learning.

Positive transfer of knowledge from general vision-language tasks results in more effective robot learning, shown for three different robot embodiments and domains.

Results show that PaLM-E can address a large set of robotics, vision and language tasks simultaneously without performance degradation compared to training individual models on individual tasks. Further, the visual-language data actually significantly improves the performance of the robot tasks. This transfer enables PaLM-E to learn robotics tasks efficiently in terms of the number of examples it requires to solve a task.

Results

We evaluate PaLM-E on three robotic environments, two of which involve real robots, as well as general vision-language tasks such as visual question answering (VQA), image captioning, and general language tasks. When PaLM-E is tasked with making decisions on a robot, we pair it with a low-level language-to-action policy to translate text into low-level robot actions.

In the first example below, a person asks a mobile robot to bring a bag of chips to them. To successfully complete the task, PaLM-E produces a plan to find the drawer and open it and then responds to changes in the world by updating its plan as it executes the task. In the second example, the robot is asked to grab a green block. Even though the block has not been seen by that robot, PaLM-E still generates a step-by-step plan that generalizes beyond the training data of that robot.

  
PaLM-E controls a mobile robot operating in a kitchen environment. Left: The task is to get a chip bag. PaLM-E shows robustness against adversarial disturbances, such as putting the chip bag back into the drawer. Right: The final steps of executing a plan to retrieve a previously unseen block (green star). This capability is facilitated by transfer learning from the vision and language models.

In the second environment below, the same PaLM-E model solves very long-horizon, precise tasks, such as “sort the blocks by colors into corners,” on a different type of robot. It directly looks at the images and produces a sequence of shorter textually-represented actions — e.g., “Push the blue cube to the bottom right corner,” “Push the blue triangle there too.” — long-horizon tasks that were out of scope for autonomous completion, even in our own most recent models. We also demonstrate the ability to generalize to new tasks not seen during training time (zero-shot generalization), such as pushing red blocks to the coffee cup.

  
PaLM-E controlling a tabletop robot to successfully complete long-horizon tasks.

The third robot environment is inspired by the field of task and motion planning (TAMP), which studies combinatorially challenging planning tasks (rearranging objects) that confront the robot with a very high number of possible action sequences. We show that with a modest amount of training data from an expert TAMP planner, PaLM-E is not only able to also solve these tasks, but it also leverages visual and language knowledge transfer in order to more effectively do so.

  
PaLM-E produces plans for a task and motion planning environment.

As a visual-language generalist, PaLM-E is a competitive model, even compared with the best vision-language-only models, including Flamingo and PaLI. In particular, PaLM-E-562B achieves the highest number ever reported on the challenging OK-VQA dataset, which requires not only visual understanding but also external knowledge of the world. Further, this result is reached with a generalist model, without fine-tuning specifically on only that task.

PaLM-E exhibits capabilities like visual chain-of-thought reasoning in which the model breaks down its answering process in smaller steps, an ability that has so far only been demonstrated in the language-only domain. The model also demonstrates the ability to perform inference on multiple images although being trained on only single-image prompts. The image of the New York Knicks and Boston Celtics is under the terms CC-by-2.0 and was posted to Flickr by kowarski. The image of Kobe Bryant is in the Public Domain. The other images were taken by us.

Conclusion

PaLM-E pushes the boundaries of how generally-capable models can be trained to simultaneously address vision, language and robotics while also being capable of transferring knowledge from vision and language to the robotics domain. There are additional topics investigated in further detail in the paper, such as how to leverage neural scene representations with PaLM-E and also the extent to which PaLM-E, with greater model scale, experiences less catastrophic forgetting of its language capabilities.

PaLM-E not only provides a path towards building more capable robots that benefit from other data sources, but might also be a key enabler to other broader applications using multimodal learning, including the ability to unify tasks that have so far seemed separate.

Acknowledgements

This work was done in collaboration across several teams at Google, including the Robotics at Google team and the Brain team, and with TU Berlin. Co-authors: Igor Mordatch, Andy Zeng, Aakanksha Chowdhery, Klaus Greff, Mehdi S. M. Sajjadi, Daniel Duckworth, Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Fei Xia, Brian Ichter, Karol Hausman, Tianhe Yu, Quan Vuong, Yevgen Chebotar, Wenlong Huang, Pierre Sermanet, Sergey Levine, Vincent Vanhoucke, and Marc Toussiant. Danny is a PhD student advised by Marc Toussaint at TU Berlin. We also would like to thank several other colleagues for their advice and help, including Xi Chen, Etienne Pot, Sebastian Goodman, Maria Attarian, Ted Xiao, Keerthana Gopalakrishnan, Kehang Han, Henryk Michalewski, Neil Houlsby, Basil Mustafa, Justin Gilmer, Yonghui Wu, Erica Moreira, Victor Gomes, Tom Duerig, Mario Lucic, Henning Meyer, and Kendra Byrne.

Read More

The BirdCLEF 2023 Challenge: Pushing the frontiers of biodiversity monitoring

The BirdCLEF 2023 Challenge: Pushing the frontiers of biodiversity monitoring

Worldwide bird populations are declining at an alarming rate, with approximately 48% of existing bird species known or suspected to be experiencing population declines. For instance, the U.S. and Canada have reported 29% fewer birds since 1970.

Effective monitoring of bird populations is essential for the development of solutions that promote conservation. Monitoring allows researchers to better understand the severity of the problem for specific bird populations and evaluate whether existing interventions are working. To scale monitoring, bird researchers have started analyzing ecosystems remotely using bird sound recordings instead of physically in-person via passive acoustic monitoring. Researchers can gather thousands of hours of audio with remote recording devices, and then use machine learning (ML) techniques to process the data. While this is an exciting development, existing ML models struggle with tropical ecosystem audio data due to higher bird species diversity and overlapping bird sounds.

Annotated audio data is needed to understand model quality in the real world. However, creating high-quality annotated datasets — especially for areas with high biodiversity — can be expensive and tedious, often requiring tens of hours of expert analyst time to annotate a single hour of audio. Furthermore, existing annotated datasets are rare and cover only a small geographic region, such as Sapsucker Woods or the Peruvian rainforest. Thousands of unique ecosystems in the world still need to be analyzed.

In an effort to tackle this problem, over the past 3 years, we’ve hosted ML competitions on Kaggle in partnership with specialized organizations focused on high-impact ecologies. In each competition, participants are challenged with building ML models that can take sounds from an ecology-specific dataset and accurately identify bird species by sound. The best entries can train reliable classifiers with limited training data. Last year’s competition focused on Hawaiian bird species, which are some of the most endangered in the world.

The 2023 BirdCLEF ML competition

This year we partnered with The Cornell Lab of Ornithology’s K. Lisa Yang Center for Conservation Bioacoustics and NATURAL STATE to host the 2023 BirdCLEF ML competition focused on Kenyan birds. The total prize pool is $50,000, the entry deadline is May 17, 2023, and the final submission deadline is May 24, 2023. See the competition website for detailed information on the dataset to be used, timelines, and rules.

Kenya is home to over 1,000 species of birds, covering a wide range of ecosystems, from the savannahs of the Maasai Mara to the Kakamega rainforest, and even alpine regions on Kilimanjaro and Mount Kenya. Tracking this vast number of species with ML can be challenging, especially with minimal training data available for many species.

NATURAL STATE is working in pilot areas around Northern Mount Kenya to test the effect of various management regimes and states of degradation on bird biodiversity in rangeland systems. By using the ML algorithms developed within the scope of this competition, NATURAL STATE will be able to demonstrate the efficacy of this approach in measuring the success and cost-effectiveness of restoration projects. In addition, the ability to cost-effectively monitor the impact of restoration efforts on biodiversity will allow NATURAL STATE to test and build some of the first biodiversity-focused financial mechanisms to channel much-needed investment into the restoration and protection of this landscape upon which so many people depend. These tools are necessary to scale this cost-effectively beyond the project area and achieve their vision of restoring and protecting the planet at scale.

In previous competitions, we used metrics like the F1 score, which requires choosing specific detection thresholds for the models. This requires significant effort, and makes it difficult to assess the underlying model quality: A bad thresholding strategy on a good model may underperform. This year we are using a threshold-free model quality metric: class mean average precision. This metric treats each bird species output as a separate binary classifier to compute an average AUC score for each, and then averages these scores. Switching to an uncalibrated metric should increase the focus on core model quality by removing the need to choose a specific detection threshold.

How to get started

This will be the first Kaggle competition where participants can use the recently launched Kaggle Models platform that provides access to over 2,300 public, pre-trained models, including most of the TensorFlow Hub models. This new resource will have deep integrations with the rest of Kaggle, including Kaggle notebook, datasets, and competitions.

If you are interested in participating in this competition, a great place to get started quickly is to use our recently open-sourced Bird Vocalization Classifier model that is available on Kaggle Models. This global bird embedding and classification model provides output logits for more than 10k bird species and also creates embedding vectors that can be used for other tasks. Follow the steps shown in the figure below to use the Bird Vocalization Classifier model on Kaggle.

To try the model on Kaggle, navigate to the model here. 1) Click “New Notebook”; 2) click on the “Copy Code” button to copy the example lines of code needed to load the model; 3) click on the “Add Model” button to add this model as a data source to your notebook; and 4) paste the example code in the editor to load the model.

Alternatively, the competition starter notebook includes the model and extra code to more easily generate a competition submission.

We invite the research community to consider participating in the BirdCLEF competition. As a result of this effort, we hope that it will be easier for researchers and conservation practitioners to survey bird population trends and build effective conservation strategies.

Acknowledgements

Compiling these extensive datasets was a major undertaking, and we are very thankful to the many domain experts who helped to collect and manually annotate the data for this competition. Specifically, we would like to thank (institutions and individual contributors in alphabetic order): Julie Cattiau and Tom Denton on the Brain team, Maximilian Eibl and Stefan Kahl at Chemnitz University of Technology, Stefan Kahl and Holger Klinck from the K. Lisa Yang Center for Conservation Bioacoustics at the Cornell Lab of Ornithology, Alexis Joly and Henning Müller at LifeCLEF, Jonathan Baillie from NATURAL STATE, Hendrik Reers, Alain Jacot and Francis Cherutich from OekoFor GbR, and Willem-Pier Vellinga from xeno-canto. We would also like to thank Ian Davies from the Cornell Lab of Ornithology for allowing us to use the hero image in this post.

Read More

Announcing the ICDAR 2023 Competition on Hierarchical Text Detection and Recognition

Announcing the ICDAR 2023 Competition on Hierarchical Text Detection and Recognition

The last few decades have witnessed the rapid development of Optical Character Recognition (OCR) technology, which has evolved from an academic benchmark task used in early breakthroughs of deep learning research to tangible products available in consumer devices and to third party developers for daily use. These OCR products digitize and democratize the valuable information that is stored in paper or image-based sources (e.g., books, magazines, newspapers, forms, street signs, restaurant menus) so that they can be indexed, searched, translated, and further processed by state-of-the-art natural language processing techniques.

Research in scene text detection and recognition (or scene text spotting) has been the major driver of this rapid development through adapting OCR to natural images that have more complex backgrounds than document images. These research efforts, however, focus on the detection and recognition of each individual word in images, without understanding how these words compose sentences and articles.

Layout analysis is another relevant line of research that takes a document image and extracts its structure, i.e., title, paragraphs, headings, figures, tables and captions. These layout analysis efforts are parallel to OCR and have been largely developed as independent techniques that are typically evaluated only on document images. As such, the synergy between OCR and layout analysis remains largely under-explored. We believe that OCR and layout analysis are mutually complementary tasks that enable machine learning to interpret text in images and, when combined, could improve the accuracy and efficiency of both tasks.

With this in mind, we announce the Competition on Hierarchical Text Detection and Recognition (the HierText Challenge), hosted as part of the 17th annual International Conference on Document Analysis and Recognition (ICDAR 2023). The competition is hosted on the Robust Reading Competition website, and represents the first major effort to unify OCR and layout analysis. In this competition, we invite researchers from around the world to build systems that can produce hierarchical annotations of text in images using words clustered into lines and paragraphs. We hope this competition will have a significant and long-term impact on image-based text understanding with the goal to consolidate the research efforts across OCR and layout analysis, and create new signals for downstream information processing tasks.

The concept of hierarchical text representation.

Constructing a hierarchical text dataset

In this competition, we use the HierText dataset that we published at CVPR 2022 with our paper “Towards End-to-End Unified Scene Text Detection and Layout Analysis”. It’s the first real-image dataset that provides hierarchical annotations of text, containing word, line, and paragraph level annotations. Here, “words” are defined as sequences of textual characters not interrupted by spaces. “Lines” are then interpreted as “space“-separated clusters of “words” that are logically connected in one direction, and aligned in spatial proximity. Finally, “paragraphs” are composed of “lines” that share the same semantic topic and are geometrically coherent.

To build this dataset, we first annotated images from the Open Images dataset using the Google Cloud Platform (GCP) Text Detection API. We filtered through these annotated images, keeping only images rich in text content and layout structure. Then, we worked with our third-party partners to manually correct all transcriptions and to label words, lines and paragraph composition. As a result, we obtained 11,639 transcribed images, split into three subsets: (1) a train set with 8,281 images, (2) a validation set with 1,724 images, and (3) a test set with 1,634 images. As detailed in the paper, we also checked the overlap between our dataset, TextOCR, and Intel OCR (both of which also extracted annotated images from Open Images), making sure that the test images in the HierText dataset were not also included in the TextOCR or Intel OCR training and validation splits and vice versa. Below, we visualize examples using the HierText dataset and demonstrate the concept of hierarchical text by shading each text entity with different colors. We can see that HierText has a diversity of image domain, text layout, and high text density.

Samples from the HierText dataset. Left: Illustration of each word entity. Middle: Illustration of line clustering. Right: Illustration paragraph clustering.

Dataset with highest density of text

In addition to the novel hierarchical representation, HierText represents a new domain of text images. We note that HierText is currently the most dense publicly available OCR dataset. Below we summarize the characteristics of HierText in comparison with other OCR datasets. HierText identifies 103.8 words per image on average, which is more than 3x the density of TextOCR and 25x more dense than ICDAR-2015. This high density poses unique challenges for detection and recognition, and as a consequence HierText is used as one of the primary datasets for OCR research at Google.

Dataset       Training split       Validation split       Testing split       Words per image      
ICDAR-2015       1,000       0       500       4.4      
TextOCR       21,778       3,124       3,232       32.1      
Intel OCR       19,1059       16,731       0       10.0      
HierText       8,281       1,724       1,634       103.8

Comparing several OCR datasets to the HierText dataset.

Spatial distribution

We also find that text in the HierText dataset has a much more even spatial distribution than other OCR datasets, including TextOCR, Intel OCR, IC19 MLT, COCO-Text and IC19 LSVT. These previous datasets tend to have well-composed images, where text is placed in the middle of the images, and are thus easier to identify. On the contrary, text entities in HierText are broadly distributed across the images. It’s proof that our images are from more diverse domains. This characteristic makes HierText uniquely challenging among public OCR datasets.

Spatial distribution of text instances in different datasets.

The HierText challenge

The HierText Challenge represents a novel task and with unique challenges for OCR models. We invite researchers to participate in this challenge and join us in ICDAR 2023 this year in San Jose, CA. We hope this competition will spark research community interest in OCR models with rich information representations that are useful for novel down-stream tasks.

Acknowledgements

The core contributors to this project are Shangbang Long, Siyang Qin, Dmitry Panteleev, Alessandro Bissacco, Yasuhisa Fujii and Michalis Raptis. Ashok Popat and Jake Walker provided valuable advice. We also thank Dimosthenis Karatzas and Sergi Robles from Autonomous University of Barcelona for helping us set up the competition website.

Read More

Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages

Universal Speech Model (USM): State-of-the-art speech AI for 100+ languages

Last November, we announced the 1,000 Languages Initiative, an ambitious commitment to build a machine learning (ML) model that would support the world’s one thousand most-spoken languages, bringing greater inclusion to billions of people around the globe. However, some of these languages are spoken by fewer than twenty million people, so a core challenge is how to support languages for which there are relatively few speakers or limited available data.

Today, we are excited to share more about the Universal Speech Model (USM), a critical first step towards supporting 1,000 languages. USM is a family of state-of-the-art speech models with 2B parameters trained on 12 million hours of speech and 28 billion sentences of text, spanning 300+ languages. USM, which is for use in YouTube (e.g., for closed captions), can perform automatic speech recognition (ASR) not only on widely-spoken languages like English and Mandarin, but also on under-resourced languages like Amharic, Cebuano, Assamese, and Azerbaijani to name a few. In “Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages”, we demonstrate that utilizing a large unlabeled multilingual dataset to pre-train the encoder of the model and fine-tuning on a smaller set of labeled data enables us to recognize under-represented languages. Moreover, our model training process is effective at adapting to new languages and data.

A sample of the languages that USM supports.

Challenges in current ASR

To accomplish this ambitious goal, we need to address two significant challenges in ASR.

First, there is a lack of scalability with conventional supervised learning approaches. A fundamental challenge of scaling speech technologies to many languages is obtaining enough data to train high-quality models. With conventional approaches, audio data needs to be either manually labeled, which is time-consuming and costly, or collected from sources with pre-existing transcriptions, which are harder to find for languages that lack wide representation. In contrast, self-supervised learning can leverage audio-only data, which is available in much larger quantities across languages. This makes self-supervision a better approach to accomplish our goal of scaling across hundreds of languages.

Another challenge is that models must improve in a computationally efficient manner while we expand the language coverage and quality. This requires the learning algorithm to be flexible, efficient, and generalizable. More specifically, such an algorithm should be able to use large amounts of data from a variety of sources, enable model updates without requiring complete retraining, and generalize to new languages and use cases.

Our approach: Self-supervised learning with fine-tuning

USM uses the standard encoder-decoder architecture, where the decoder can be CTC, RNN-T, or LAS. For the encoder, USM uses the Conformer, or convolution-augmented transformer. The key component of the Conformer is the Conformer block, which consists of attention, feed-forward, and convolutional modules. It takes as input the log-mel spectrogram of the speech signal and performs a convolutional sub-sampling, after which a series of Conformer blocks and a projection layer are applied to obtain the final embeddings.

Our training pipeline starts with the first step of self-supervised learning on speech audio covering hundreds of languages. In the second optional step, the model’s quality and language coverage can be improved through an additional pre-training step with text data. The decision to incorporate the second step depends on whether text data is available. USM performs best with this second optional step. The last step of the training pipeline is to fine-tune on downstream tasks (e.g., ASR or automatic speech translation) with a small amount of supervised data.

For the first step, we use BEST-RQ, which has already demonstrated state-of-the-art results on multilingual tasks and has proven to be efficient when using very large amounts of unsupervised audio data.

In the second (optional) step, we used multi-objective supervised pre-training to incorporate knowledge from additional text data. The model introduces an additional encoder module to take text as input and additional layers to combine the output of the speech encoder and the text encoder, and trains the model jointly on unlabeled speech, labeled speech, and text data.

In the last stage, USM is fine-tuned on the downstream tasks. The overall training pipeline is illustrated below. With the knowledge acquired during pre-training, USM models achieve good quality with only a small amount of supervised data from the downstream tasks.

USM’s overall training pipeline.

Key results

Performance across multiple languages on YouTube Captions

Our encoder incorporates 300+ languages through pre-training. We demonstrate the effectiveness of the pre-trained encoder through fine-tuning on YouTube Caption’s multilingual speech data. The supervised YouTube data includes 73 languages and has on average less than three thousand hours of data per language. Despite limited supervised data, the model achieves less than 30% word error rate (WER; lower is better) on average across the 73 languages, a milestone we have never achieved before. For en-US, USM has a 6% relative lower WER compared to the current internal state-of-the-art model. Lastly, we compare with the recently released large model, Whisper (large-v2), which was trained with more than 400k hours of labeled data. For the comparison, we only use the 18 languages that Whisper can successfully decode with lower than 40% WER. Our model has, on average, a 32.7% relative lower WER compared to Whisper for these 18 languages.

USM supports all 73 languages in the YouTube Captions’ Test Set and outperforms Whisper on the languages it can support with lower than 40% WER. Lower WER is better.

Generalization to downstream ASR tasks

On publicly available datasets, our model shows lower WER compared to Whisper on CORAAL (African American Vernacular English), SpeechStew (en-US), and FLEURS (102 languages). Our model achieves lower WER with and without training on in-domain data. The comparison on FLEURS reports the subset of languages (62) that overlaps with the languages supported by the Whisper model. For FLEURS, USM without in-domain data has a 65.8% relative lower WER compared to Whisper and has a 67.8% relative lower WER with in-domain data.

Comparison of USM (with or without in-domain data) and Whisper results on ASR benchmarks. Lower WER is better.

Performance on automatic speech translation (AST)

For speech translation, we fine-tune USM on the CoVoST dataset. Our model, which includes text via the second stage of our pipeline, achieves state-of-the-art quality with limited supervised data. To assess the breadth of the model’s performance, we segment the languages from the CoVoST dataset into high, medium, and low based on resource availability and calculate the BLEU score (higher is better) for each segment. As shown below, USM outperforms Whisper for all segments.

CoVoST BLEU score. Higher BLEU is better.

Toward 1,000 languages

The development of USM is a critical effort towards realizing Google’s mission to organize the world’s information and make it universally accessible. We believe USM’s base model architecture and training pipeline comprise a foundation on which we can build to expand speech modeling to the next 1,000 languages.

Learn More

Check out our paper here. Researchers can request access to the USM API here.

Acknowledgements

We thank all the co-authors for contributing to the project and paper, including Andrew Rosenberg, Ankur Bapna, Bhuvana Ramabhadran, Bo Li, Chung-Cheng Chiu, Daniel Park, Françoise Beaufays, Hagen Soltau, Gary Wang, Ginger Perng, James Qin, Jason Riesa, Johan Schalkwyk, Ke Hu, Nanxin Chen, Parisa Haghani, Pedro Moreno Mengibar, Rohit Prabhavalkar, Tara Sainath, Trevor Strohman, Vera Axelrod, Wei Han, Yonghui Wu, Yongqiang Wang, Yu Zhang, Zhehuai Chen, and Zhong Meng.

We also thank Alexis Conneau, Min Ma, Shikhar Bharadwaj, Sid Dalmia, Jiahui Yu, Jian Cheng, Paul Rubenstein, Ye Jia, Justin Snyder, Vincent Tsang, Yuanzhong Xu, Tao Wang for useful discussions.

We appreciate valuable feedback and support from Eli Collins, Jeff Dean, Sissie Hsiao, Zoubin Ghahramani. Special thanks to Austin Tarango, Lara Tumeh, Amna Latif, and Jason Porta for their guidance around Responsible AI practices. We thank Elizabeth Adkison, James Cokerille for help with naming the model, Tom Small for the animated graphic, Abhishek Bapna for editorial support, and Erica Moreira for resource management . We thank Anusha Ramesh for feedback, guidance, and assistance with the publication strategy, and Calum Barnes and Salem Haykal for their valuable partnership.

Read More

Performer-MPC: Navigation via real-time, on-robot transformers

Performer-MPC: Navigation via real-time, on-robot transformers

Despite decades of research, we don’t see many mobile robots roaming our homes, offices, and streets. Real-world robot navigation in human-centric environments remains an unsolved problem. These challenging situations require safe and efficient navigation through tight spaces, such as squeezing between coffee tables and couches, maneuvering in tight corners, doorways, untidy rooms, and more. An equally critical requirement is to navigate in a manner that complies with unwritten social norms around people, for example, yielding at blind corners or staying at a comfortable distance. Google Research is committed to examining how advances in ML may enable us to overcome these obstacles.

In particular, Transformers models have achieved stunning advances across various data modalities in real-world machine learning (ML) problems. For example, multimodal architectures have enabled robots to leverage Transformer-based language models for high-level planning. Recent work that uses Transformers to encode robotic policies opens an exciting opportunity to use those architectures for real-world navigation. However, the on-robot deployment of massive Transformer-based controllers can be challenging due to the strict latency constraints for safety-critical mobile robots. The quadratic space and time complexity of the attention mechanism with respect to the input length is often prohibitively expensive, forcing researchers to trim Transformer-stacks at the cost of expressiveness.

As part of our ongoing exploration of ML advances for robotic products we partnered across Robotics at Google and Everyday Robots to present “Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation” at the Conference on Robot Learning (CoRL 2022). Here, we introduce Performer-MPC, an end-to-end learnable robotic system that combines (1) a JAX-based differentiable model predictive controller (MPC) that back-propagates gradients to its cost function parameters, (2) Transformer-based encodings of the context (e.g., occupancy grids for navigation tasks) that represent the MPC cost function and adapt the MPC to complex social scenarios without hand-coded rules, and (3) Performer architectures: scalable low-rank implicit-attention Transformers with linear space and time complexity attention modules for efficient on-robot deployment (providing 8ms on-robot latency). We demonstrate that Performer-MPC can generalize across different environments to help robots navigate tight spaces while demonstrating socially acceptable behaviors.

Performer-MPC

Performer-MPC aims to blend classic MPCs with ML via their learnable cost functions. Thus Performer-MPCs can be thought of as an instantiation of the inverse reinforcement learning algorithms, where the cost function is inferred by learning from expert demonstrations. Critically, the learnable component of the cost function is parameterized by latent embeddings produced by the Performer-Transformer. The linear inference provided by Performers is a gateway to on-robot deployment in real time.

In practice, the occupancy grid provided by fusing the robot’s sensors serves as an input to the Vision Performer model. This model never explicitly materializes the attention matrix, but rather leverages its low-rank decomposition for efficient linear computation of the attention module, resulting in scalable attention. Then, the embedding of the particular fixed input-patch token from the last layer of the model parameterizes the quadratic, learnable part of the MPC model’s cost function. That part is added to the regular hand-engineered cost (distance from the obstacles, penalty-terms for sudden velocity changes, etc.). The system is trained end-to-end via imitation learning to mimic expert demonstrations.

Performer-MPC overview. The final latent embedding of the patch highlighted in red is used to construct context dependent learnable cost. The backpropagation (red arrows) is through the parameters of the Transformer. Performer provides scalable attention module computation via low-rank approximate decomposition of the regular attention matrix (matrices Query’ and Key’) and by changing the order of matrix multiplications (indicated by the black brackets).

Real-world robot navigation

Although, in principle, Performer-MPC can be applied in various robotic settings, we evaluate its performance on navigation in confined spaces with the potential presence of people. We deployed Performer-MPC on a differential wheeled robot that has a 3D LiDAR camera in the front and depth sensors mounted on its head. Our robot-deployable 8ms-latency Performer-MPC has 8.3M Performer parameters. The actual time of a single Performer run is about 1ms and we use the fastest Performer-ReLU variant.

We compare Performer-MPC with two baselines, a regular MPC policy (RMPC) without the learned cost components, and an Explicit Policy (EP) that predicts a reference and goal state using the same Performer architecture, but without being coupled to the MPC structure. We evaluate Performer-MPC in a simulation and in three real world scenarios. For each scenario, the learned policies (EP and Performer-MPC) are trained with scenario-specific demonstrations.

Experiment Scenarios: (a) Learning to avoid local minima during doorway traversal, (b) maneuvering through highly constrained spaces, (c) enabling socially compliant behaviors for blind corner, and (d) pedestrian obstruction interactions.

Our policies are trained through behavior cloning with a few hours of human-controlled robot navigation data in the real world. For more data collection details, see the paper. We visualize the planning results of Performer-MPC (green) and RMPC (red) along with expert demonstrations (gray) in the top half and the train and test curves in the bottom half of the following two figures. To measure the distance between the robot trajectory and the expert trajectory, we use Hausdorff distance.

Top: Visualization of test examples in the doorway traversal (left) and highly constrained obstacle course (right). Performer-MPC trajectories aiming at the goal are always closer to the expert demonstrations compared to the RMPC trajectories. Bottom: Train and test curves, where the vertical axis represents Hausdorff distance and horizontal axis represents training steps.
Top: Visualization of test examples in the blind corner (left) and pedestrian obstruction (right) scenarios. Performer-MPC trajectories aiming at the goal are always closer to the expert demonstrations compared to the RMPC trajectories. Bottom: Train and test curves, where the vertical axis represents Hausdorff distance and horizontal axis represents training steps.

Learning to avoid local minima

We evaluate Performer-MPC in a simulated doorway traversal scenario in which 100 start and goal pairs are randomly sampled from opposing sides of the wall. A planner, guided by a greedy cost function, often leads the robot to a local minimum (i.e., getting stuck at the closest point to the goal on the other side of the wall). Performer-MPC learns a cost function that steers the robot to pass the doorway, even if it must veer away from the goal and travel further. Performer-MPC shows a success rate of 86% compared to RMPC’s 24%.

Comparison of the Performer-MPC with Regular MPC on the doorway passing task.

Learning highly constrained maneuvers

Next, we test Performer-MPC in a challenging real-world scenario, where the robot must perform sharp, near-collision maneuvers in a cluttered home or office setting. A global planner provides coarse way points (a skeleton navigation path) that the robot follows. Each policy is run ten times and we report a success rate (SR) and an average completion percentage (CP) with variance (VAR) of navigating the obstacle course, where the robot is able to traverse without failure (collisions or getting stuck). Performer-MPC outperforms both RMPC and EP in SR and CP.

An obstacle course with policy trajectories and failure locations (indicated by crosses) for RMPC, EP, and Performer-MPC.
An Everyday Robots helper robot maneuvering through highly constrained spaces using Regular MPC, Explicit Policy, and Performer-MPC.

Learning to navigate in spaces with people

Going beyond static obstacles, we apply Performer-MPC to social robot navigation, where robots must navigate in a socially-acceptable manner for which cost functions are difficult to design. We consider two scenarios: (1) blind corners, where robots should avoid the inner side of a hallway corner in case a person suddenly appears, and (2) pedestrian obstruction, where a person unexpectedly impedes the robot’s prescribed path.

Performer-MPC deployed on an Everyday Robots helper robot. Left: Regular MPC efficiently cuts blind corners, forcing the person to move back. Right: Performer-MPC avoids cutting blind corners, enabling safe and socially acceptable navigation around people.
Comparison with an Everyday Robots helper robot using Regular MPC, Explicit Policy, and Performer-MPC in unseen blind corners.
Comparison with an Everyday Robots helper robot using Regular MPC, Explicit Policy, and Performer-MPC in unseen pedestrian obstruction scenarios.

Conclusion

We introduce Performer-MPC, an end-to-end learnable robotic system that combines several mechanisms to enable real-world, robust, and adaptive robot navigation with real-time, on-robot transformers. This work shows that scalable Transformer-architectures play a critical role in designing expressive attention-based robotic controllers. We demonstrate that real-time millisecond-latency inference is feasible for policies leveraging Transformers with a few million parameters. Furthermore, we show that such policies enable robots to learn efficient and socially acceptable behaviors that can generalize well. We believe this opens an exciting new chapter on applying Transformers to real-world robotics and look forward to continuing our research with Everyday Robots helper robots.

Acknowledgements

Special thanks to Xuesu Xiao for co-leading this effort at Everyday Robots as a Visiting Researcher. This research was done by Xuesu Xiao, Tingnan Zhang, Krzysztof Choromanski, Edward Lee, Anthony Francis, Jake Varley, Stephen Tu, Sumeet Singh, Peng Xu, Fei Xia, Sven Mikael Persson, Dmitry Kalashnikov, Leila Takayama, Roy Frostig, Jie Tan, Carolina Parada and Vikas Sindhwani. Special thanks to Vincent Vanhoucke for his feedback on the manuscript.

Read More

Distributed differential privacy for federated learning

Distributed differential privacy for federated learning

Federated learning is a distributed way of training machine learning (ML) models where data is locally processed and only focused model updates and metrics that are intended for immediate aggregation are shared with a server that orchestrates training. This allows the training of models on locally available signals without exposing raw data to servers, increasing user privacy. In 2021, we announced that we are using federated learning to train Smart Text Selection models, an Android feature that helps users select and copy text easily by predicting what text they want to select and then automatically expanding the selection for them.

Since that launch, we have worked to improve the privacy guarantees of this technology by carefully combining secure aggregation (SecAgg) and a distributed version of differential privacy. In this post, we describe how we built and deployed the first federated learning system that provides formal privacy guarantees to all user data before it becomes visible to an honest-but-curious server, meaning a server that follows the protocol but could try to gain insights about users from data it receives. The Smart Text Selection models trained with this system have reduced memorization by more than two-fold, as measured by standard empirical testing methods.

Scaling secure aggregation

Data minimization is an important privacy principle behind federated learning. It refers to focused data collection, early aggregation, and minimal data retention required during training. While every device participating in a federated learning round computes a model update, the orchestrating server is only interested in their average. Therefore, in a world that optimizes for data minimization, the server would learn nothing about individual updates and only receive an aggregate model update. This is precisely what the SecAgg protocol achieves, under rigorous cryptographic guarantees.

Important to this work, two recent advancements have improved the efficiency and scalability of SecAgg at Google:

  • An improved cryptographic protocol: Until recently, a significant bottleneck in SecAgg was client computation, as the work required on each device scaled linearly with the total number of clients (N) participating in the round. In the new protocol, client computation now scales logarithmically in N. This, along with similar gains in server costs, results in a protocol able to handle larger rounds. Having more users participate in each round improves privacy, both empirically and formally.
  • Optimized client orchestration: SecAgg is an interactive protocol, where participating devices progress together. An important feature of the protocol is that it is robust to some devices dropping out. If a client does not send a response in a predefined time window, then the protocol can continue without that client’s contribution. We have deployed statistical methods to effectively auto-tune such a time window in an adaptive way, resulting in improved protocol throughput.

The above improvements made it easier and faster to train Smart Text Selection with stronger data minimization guarantees.

Aggregating everything via secure aggregation

A typical federated training system not only involves aggregating model updates but also metrics that describe the performance of the local training. These are important for understanding model behavior and debugging potential training issues. In federated training for Smart Text Selection, all model updates and metrics are aggregated via SecAgg. This behavior is statically asserted using TensorFlow Federated, and locally enforced in Android’s Private Compute Core secure environment. As a result, this enhances privacy even more for users training Smart Text Selection, because unaggregated model updates and metrics are not visible to any part of the server infrastructure.

Differential privacy

SecAgg helps minimize data exposure, but it does not necessarily produce aggregates that guarantee against revealing anything unique to an individual. This is where differential privacy (DP) comes in. DP is a mathematical framework that sets a limit on an individual’s influence on the outcome of a computation, such as the parameters of a ML model. This is accomplished by bounding the contribution of any individual user and adding noise during the training process to produce a probability distribution over output models. DP comes with a parameter (ε) that quantifies how much the distribution could change when adding or removing the training examples of any individual user (the smaller the better).

Recently, we announced a new method of federated training that enforces formal and meaningfully strong DP guarantees in a centralized manner, where a trusted server controls the training process. This protects against external attackers who may attempt to analyze the model. However, this approach still relies on trust in the central server. To provide even greater privacy protections, we have created a system that uses distributed differential privacy (DDP) to enforce DP in a distributed manner, integrated within the SecAgg protocol.

Distributed differential privacy

DDP is a technology that offers DP guarantees with respect to an honest-but-curious server coordinating training. It works by having each participating device clip and noise its update locally, and then aggregating these noisy clipped updates through the new SecAgg protocol described above. As a result, the server only sees the noisy sum of the clipped updates.

However, the combination of local noise addition and use of SecAgg presents significant challenges in practice:

  • An improved discretization method: One challenge is properly representing model parameters as integers in SecAgg’s finite group with integer modular arithmetic, which can inflate the norm of the discretized model and require more noise for the same privacy level. For example, randomized rounding to the nearest integers could inflate the user’s contribution by a factor equal to the number of model parameters. We addressed this by scaling the model parameters, applying a random rotation, and rounding to nearest integers. We also developed an approach for auto-tuning the discretization scale during training. This led to an even more efficient and accurate integration between DP and SecAgg.
  • Optimized discrete noise addition: Another challenge is devising a scheme for choosing an arbitrary number of bits per model parameter without sacrificing end-to-end privacy guarantees, which depend on how the model updates are clipped and noised. To address this, we added integer noise in the discretized domain and analyzed the DP properties of sums of integer noise vectors using the distributed discrete Gaussian and distributed Skellam mechanisms.
An overview of federated learning with distributed differential privacy.

We tested our DDP solution on a variety of benchmark datasets and in production and validated that we can match the accuracy to central DP with a SecAgg finite group of size 12 bits per model parameter. This meant that we were able to achieve added privacy advantages while also reducing memory and communication bandwidth. To demonstrate this, we applied this technology to train and launch Smart Text Selection models. This was done with an appropriate amount of noise chosen to maintain model quality. All Smart Text Selection models trained with federated learning now come with DDP guarantees that apply to both the model updates and metrics seen by the server during training. We have also open sourced the implementation in TensorFlow Federated.

Empirical privacy testing

While DDP adds formal privacy guarantees to Smart Text Selection, those formal guarantees are relatively weak (a finite but large ε, in the hundreds). However, any finite ε is an improvement over a model with no formal privacy guarantee for several reasons: 1) A finite ε moves the model into a regime where further privacy improvements can be quantified; and 2) even large ε’s can indicate a substantial decrease in the ability to reconstruct training data from the trained model. To get a more concrete understanding of the empirical privacy advantages, we performed thorough analyses by applying the Secret Sharer framework to Smart Text Selection models. Secret Sharer is a model auditing technique that can be used to measure the degree to which models unintentionally memorize their training data.

To perform Secret Sharer analyses for Smart Text Selection, we set up control experiments which collect gradients using SecAgg. The treatment experiments use distributed differential privacy aggregators with different amounts of noise.

We found that even low amounts of noise reduce memorization meaningfully, more than doubling the Secret Sharer rank metric for relevant canaries compared to the baseline. This means that even though the DP ε is large, we empirically verified that these amounts of noise already help reduce memorization for this model. However, to further improve on this and to get stronger formal guarantees, we aim to use even larger noise multipliers in the future.

Next steps

We developed and deployed the first federated learning and distributed differential privacy system that comes with formal DP guarantees with respect to an honest-but-curious server. While offering substantial additional protections, a fully malicious server might still be able to get around the DDP guarantees either by manipulating the public key exchange of SecAgg or by injecting a sufficient number of “fake” malicious clients that don’t add the prescribed noise into the aggregation pool. We are excited to address these challenges by continuing to strengthen the DP guarantee and its scope.

Acknowledgements

The authors would like to thank Adria Gascon for significant impact on the blog post itself, as well as the people who helped develop these ideas and bring them to practice: Ken Liu, Jakub Konečný, Brendan McMahan, Naman Agarwal, Thomas Steinke, Christopher Choquette, Adria Gascon, James Bell, Zheng Xu, Asela Gunawardana, Kallista Bonawitz, Mariana Raykova, Stanislav Chiknavaryan, Tancrède Lepoint, Shanshan Wu, Yu Xiao, Zachary Charles, Chunxiang Zheng, Daniel Ramage, Galen Andrew, Hugo Song, Chang Li, Sofia Neata, Ananda Theertha Suresh, Timon Van Overveldt, Zachary Garrett, Wennan Zhu, and Lukas Zilka. We’d also like to thank Tom Small for creating the animated figure.

Read More

Teaching old labels new tricks in heterogeneous graphs

Teaching old labels new tricks in heterogeneous graphs

Industrial applications of machine learning are commonly composed of various items that have differing data modalities or feature distributions. Heterogeneous graphs (HGs) offer a unified view of these multimodal data systems by defining multiple types of nodes (for each data type) and edges (for the relation between data items). For instance, e-commerce networks might have [user, product, review] nodes or video platforms might have [channel, user, video, comment] nodes. Heterogeneous graph neural networks (HGNNs) learn node embeddings summarizing each node’s relationships into a vector. However, in real world HGs, there is often a label imbalance issue between different node types. This means that label-scarce node types cannot exploit HGNNs, which hampers the broader applicability of HGNNs.

In “Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks”, presented at NeurIPS 2022, we propose a model called a Knowledge Transfer Network (KTN), which transfers knowledge from label-abundant node types to zero-labeled node types using the rich relational information given in a HG. We describe how we pre-train a HGNN model without the need for fine-tuning. KTNs outperform state-of-the-art transfer learning baselines by up to 140% on zero-shot learning tasks, and can be used to improve many existing HGNN models on these tasks by 24% (or more).

KTNs transform labels from one type of information (squares) through a graph to another type (stars).

What is a heterogeneous graph?

A HG is composed of multiple node and edge types. The figure below shows an e-commerce network presented as a HG. In e-commerce, “users” purchase “products” and write “reviews”. A HG presents this ecosystem using three node types [user, product, review] and three edge types [user-buy-product, user-write-review, review-on-product]. Individual products, users, and reviews are then presented as nodes and their relationships as edges in the HG with the corresponding node and edge types.

E-commerce heterogeneous graph.

In addition to all connectivity information, HGs are commonly given with input node attributes that summarize each node’s information. Input node attributes could have different modalities across different node types. For instance, images of products could be given as input node attributes for the product nodes, while text can be given as input attributes to review nodes. Node labels (e.g., the category of each product or the category that most interests each user) are what we want to predict on each node.

HGNNs and label scarcity issues

HGNNs compute node embeddings that summarize each node’s local structures (including the node and its neighbor’s information). These node embeddings are utilized by a classifier to predict each node’s label. To train a HGNN model and a classifier to predict labels for a specific node type, we require a good amount of labels for the type.

A common issue in industrial applications of deep learning is label scarcity, and with their diverse node types, HGNNs are even more likely to face this challenge. For instance, publicly available content node types (e.g., product nodes) are abundantly labeled, whereas labels for user or account nodes may not be available due to privacy restrictions. This means that in most standard training settings, HGNN models can only learn to make good inferences for a few label-abundant node types and can usually not make any inferences for any remaining node types (given the absence of any labels for them).

Transfer learning on heterogeneous graphs

Zero-shot transfer learning is a technique used to improve the performance of a model on a target domain with no labels by using the knowledge learned by the model from another related source domain with adequately labeled data. To apply transfer learning to solve this label scarcity issue for certain node types in HGs, the target domain would be the zero-labeled node types. Then what would be the source domain? Previous work commonly sets the source domain as the same type of nodes located in a different HG, assuming those nodes are abundantly labeled. This graph-to-graph transfer learning approach pre-trains a HGNN model on the external HG and then runs the model on the original (label-scarce) HG.

However, these approaches are not applicable in many real-world scenarios for three reasons. First, any external HG that could be used in a graph-to-graph transfer learning setting would almost surely be proprietary, thus, likely unavailable. Second, even if practitioners could obtain access to an external HG, it is unlikely the distribution of that source HG would match their target HG well enough to apply transfer learning. Finally, node types suffering from label scarcity are likely to suffer the same issue on other HGs (e.g., privacy issues on user nodes).

Our approach: Transfer learning between node types within a heterogeneous graph

Here, we shed light on a more practical source domain, other node types with abundant labels located on the same HG. Instead of using extra HGs, we transfer knowledge within a single HG (assumed to be fully owned by the practitioners) across different types of nodes. More specifically, we pre-train a HGNN model and a classifier on a label-abundant (source) node type, then reuse the models on the zero-labeled (target) node types located in the same HG without additional fine-tuning. The one requirement is that the source and target node types share the same label set (e.g., in the e-commerce HG, product nodes have a label set describing product categories, and user nodes share the same label set describing their favorite shopping categories).

Why is it challenging?

Unfortunately, we cannot directly reuse the pre-trained HGNN and classifier on the target node type. One crucial characteristic of HGNN architectures is that they are composed of modules specialized to each node type to fully learn the multiplicity of HGs. HGNNs use distinct sets of modules to compute embeddings for each node type. In the figure below, blue- and red-colored modules are used to compute node embeddings for the source and target node types, respectively.

HGNNs are composed of modules specialized to each node type and use distinct sets of modules to compute embeddings of different node types. More details can be found in the paper.

While pre-training HGNNs on the source node type, source-specific modules in the HGNNs are well trained, however target-specific modules are under-trained as they have only a small amount of gradients flowing into them. This is shown below, where we see that the L2 norm of gradients for target node types (i.e., Mtt) are much lower than for source types (i.e., Mss). In this case a HGNN model outputs poor node embeddings for the target node type, which results in poor task performance.

In HGNNs, target type-specific modules receive zero or only a small amount of gradients during pre-training on the source node type, leading to poor performance on the target node type.

KTN: Trainable cross-type transfer learning for HGNNs

Our work focuses on transforming the (poor) target node embeddings computed by a pre-trained HGNN model to follow the distribution of the source node embeddings. Then the classifier, pre-trained on the source node type, can be reused for the target node type. How can we map the target node embeddings to the source domain? To answer this question, we investigate how HGNNs compute node embeddings to learn the relationship between source and target distributions.

HGNNs aggregate connected node embeddings to augment a target node’s embeddings in each layer. In other words, the node embeddings for both source and target node types are updated using the same input — the previous layer’s node embeddings of any connected node types. This means that they can be represented by each other. We prove this relationship theoretically and find there is a mapping matrix (defined by HGNN parameters) from the target domain to the source domain (more details in Theorem 1 in the paper). Based on this theorem, we introduce an auxiliary neural network, which we refer to as a Knowledge Transfer Network (KTN), that receives the target node embeddings and then transforms them by multiplying them with a (trainable) mapping matrix. We then define a regularizer that is minimized along with the performance loss in the pre-training phase to train the KTN. At test time, we map the target embeddings computed from the pre-trained HGNN to the source domain using the trained KTN for classification.

In HGNNs, the final node embeddings of both source and target types are computed from different mathematical functions (f(): source, g(): target) which use the same input — the previous layer’s node embeddings.

Experimental results

To examine the effectiveness of KTNs, we ran 18 different zero-shot transfer learning tasks on two public heterogeneous graphs, Open Academic Graph and Pubmed. We compare KTN with eight state-of-the-art transfer learning methods (DAN, JAN, DANN, CDAN, CDAN-E, WDGRL, LP, EP). Shown below, KTN consistently outperforms all baselines on all tasks, beating transfer learning baselines by up to 140% (as measured by Normalized Discounted Cumulative Gain, a ranking metric).

Zero-shot transfer learning on Open Academic Graph (OAG-CS) and Pubmed datasets. The colors represent different categories of transfer learning baselines against which the results are compared. Yellow: Use statistical properties (e.g., mean, variance) of distributions. Green: Use adversarial models to transfer knowledge. Orange: Transfer knowledge directly via graph structure using label propagation.

Most importantly, KTN can be applied to almost all HGNN models that have node and edge type-specific parameters and improve their zero-shot performance on target domains. As shown below, KTN improves accuracy on zero-labeled node types across six different HGNN models(R-GCN, HAN, HGT, MAGNN, MPNN, H-MPNN) by up to 190%.

KTN can be applied to six different HGNN models and improve their zero-shot performance on target domains.

Takeaways

Various ecosystems in industry can be presented as heterogeneous graphs. HGNNs summarize heterogeneous graph information into effective representations. However, label scarcity issues on certain types of nodes prevent the wider application of HGNNs. In this post, we introduced KTN, the first cross-type transfer learning method designed for HGNNs. With KTN, we can fully exploit the richness of heterogeneous graphs via HGNNs regardless of label scarcity. See the paper for more details.

Acknowledgements

This paper is joint work with our co-authors John Palowitch (Google Research), Dustin Zelle (Google Research), Ziniu Hu (Intern, Google Research), and Russ Salakhutdinov (CMU). We thank Tom Small for creating the animated figure in this blog post.

Read More