Using AI to help find answers to common skin conditions

Artificial intelligence (AI) has the potential to help clinicians care for patients and treat disease — from improving the screening process for breast cancer to helping detect tuberculosis more efficiently. When we combine these advances in AI with other technologies, like smartphone cameras, we can unlock new ways for people to stay better informed about their health, too.  

Today at  I/O, we shared a preview of an AI-powered dermatology assist tool that helps you understand what’s going on with issues related to your body’s largest organ: your skin, hair and nails. Using many of the same techniques that detect diabetic eye disease or lung cancer in CT scans, this tool gets you closer to identifying dermatologic issues — like a rash on your arm that’s bugging you — using your phone’s camera. 

How our AI-powered dermatology tool works 

Each year we see almost ten billion Google Searches related to skin, nail and hair issues. Two billion people worldwide suffer from dermatologic issues, but there’s a global shortage of specialists. While many people’s first step involves going to a Google Search bar, it can be difficult to describe what you’re seeing on your skin through words alone.

Our AI-powered dermatology assist tool is a web-based application that we hope to launch as a pilot later this year, to make it easier to figure out what might be going on with your skin. Once you launch the tool, simply use your phone’s camera to take three images of the skin, hair or nail concern from different angles. You’ll then be asked questions about your skin type, how long you’ve had the issue and other symptoms that help the tool narrow down the possibilities. The AI model analyzes this information and draws from its knowledge of 288 conditions to give you a list of possible matching conditions that you can then research further.

For each matching condition, the tool will show dermatologist-reviewed information and answers to commonly asked questions, along with similar matching images from the web. The tool is not intended to provide a diagnosis nor be a substitute for medical advice as many conditions require clinician review, in-person examination, or additional testing like a biopsy. Rather we hope it gives you access to authoritative information so you can make a more informed decision about your next step.

Image of a phone showing you each step of using the AI-powered dermatology assist tool.

Based on the photos and information you provide, our AI-powered dermatology assist tool will offer suggested conditions. This product has been CE marked as a Class I medical device in the EU. It is not available in the United States.

Developing an AI model that assesses issues for all skin types 

Our tool is the culmination of over three years of machine learning research and product development. To date, we’ve published several peer-reviewed papers that validate our AI model and more are in the works. 

Our landmark study, featured in Nature Medicine, debuted our deep learning approach to assessing skin diseases and showed that our AI system can achieve accuracy that is on par with U.S. board-certified dermatologists. Our most recent paper in JAMA Network Open demonstrated how non-specialist doctors can use AI-based tools to improve their ability to interpret skin conditions

To make sure we’re building for everyone, our model accounts for factors like age, sex, race and skin types — from pale skin that does not tan to brown skin that rarely burns. We developed and fine-tuned our model with de-identified data encompassing around 65,000 images and case data of diagnosed skin conditions, millions of curated skin concern images and thousands of examples of healthy skin — all across different demographics. 

Recently, the AI model that powers our tool successfully passed clinical validation, and the tool has been CE marked as a Class I medical device in the EU.¹ In the coming months, we plan to build on this work so more people can use this tool to answer questions about common skin issues. If you’re interested in this tool, sign up here to be notified (subject to availability in your region).

¹This tool has not been evaluated by the U.S. FDA for safety or efficacy. It is not available in the United States.

Read More

Tackling tuberculosis screening with AI

Today we’re sharing new AI research that aims to improve screening for one of the top causes of death worldwide: tuberculosis (TB). TB infects 10 million people per year and disproportionately affects people in low-to-middle-income countries. Diagnosing TB early is difficult because its symptoms can mimic those of common respiratory diseases.

Cost-effective screening, specifically chest X-rays, has been identified as one way to improve the screening process. However, experts aren’t always available to interpret results. That’s why the World Health Organization (WHO) recently recommended the use of computer-aided detection (CAD) for screening and triaging.

To help catch the disease early and work toward eventually eradicating it, Google researchers developed an AI-based tool that builds on our existing work in medical imaging to identify potential TB patients for follow-up testing. 

A deep learning system to detect active pulmonary tuberculosis  

In a new study released this week, we found that the right deep learning system can be used to accurately identify patients who are likely to have active TB based on their chest X-ray. By using this screening tool as a preliminary step before ordering a more expensive diagnostic test, our study showed that effective AI-powered screening could save up to 80% of the cost per positive TB case detected. 

Our AI-based tool was able to accurately detect active pulmonary TB cases with false-negative and false-positive detection rates that were similar to 14 radiologists. This accuracy was maintained even when examining patients who were HIV-positive, a population that is at higher risk of developing TB and is challenging to screen because their chest X-rays may differ from typical TB cases.

To make sure the model worked for patients from a wide range of races and ethnicities, we used de-identified data from nine countries to train the model and tested it on cases from five countries. These findings build on our previousresearch that showed AI can detect common issues like collapsed lungs, nodules or fractures in chest X-rays. 

Applying these findings in the real world

The AI system produces a number between 0 and 1 that indicates the risk of TB. For the system to be useful in a real-world setting, there needs to be agreement about what risk level indicates that patients should be recommended for additional testing. Calibrating this threshold can be time-consuming and expensive because administrators can only come to this number after running the system on hundreds of patients, testing these patients, and analyzing the results. 

Based on the performance of our model, our research suggests that any clinic could start from this default threshold and be confident that the model will perform similarly to radiologists, making it easier to deploy this technology. From there, clinics can adjust the threshold based on local needs and resources. For example, regions with fewer resources may use a higher cut-off point to reduce the number of follow-up tests needed. 

The path to eradicating tuberculosis

The WHO’s “The End TB Strategy” lays out the global efforts that are underway to dramatically reduce the incidence of tuberculosis in the coming decade. Because TB can remain pervasive in communities, even if a relatively low number of people have it at a given time, more and earlier screenings are critical to reducing its prevalence. 

We’ll keep contributing to these efforts — especially when it comes to research and development. Later this year, we plan to expand this work through two separate research studies with our partners, Apollo Hospitals in India and the Centre for Infectious Disease Research in Zambia (CIDRZ). 

Read More

A smoother ride and a more detailed Map thanks to AI

AI is a critical part of what makes Google Maps so helpful. With it, we’re able to map roads over 10 times faster than we could five years ago, and we can bring maps filled with useful information to virtually every corner of the world. Today, we’re giving you a behind-the-scenes look at how AI makes two of the features we announced at I/O possible.

Teaching Maps to identify and forecast when people are hitting the brakes

Let’s start with our routing update that helps you avoid situations that cause you to slam on the brakes, such as confusing lane changes or freeway exits. We use AI and navigation information to identify hard-braking events — moments that cause drivers to decelerate sharply and are known indicators of car crash likelihood — and then suggest alternate routes when available. We believe these updates have the potential to eliminate over 100 million hard-braking events in routes driven with Google Maps each year. But how exactly do we find when and where these moments are likely to occur?

That’s where AI comes in. To do this, we train our machine learning models on two sets of data. The first set of information comes from phones using Google Maps. Mobile phone sensors can determine deceleration along a route, but this data is highly prone to false alarms because your phone can move independently of your car. This is what makes it hard for our systems to decipher you tossing your phone into the cupholder or accidentally dropping it on the floor from an actual hard-braking moment. To combat this, we also use information from routes driven with Google Maps when it’s projected on a car’s display, like Android Auto. This represents a relatively small subset of data, but it’s highly accurate because Maps is now tethered to a stable spot — your car display. Training our models on both sets of data makes it possible to spot actual deceleration moments from fake ones, making detection across all trips more accurate. 

Understanding spots along a route that are likely to cause hard-braking is just one part of the equation. We’re also working to identify other contextual factors that lead to hard-braking events, like construction or visibility conditions. For example, if there’s a sudden increase in hard-braking events along a route during a certain time of day when people are likely to be driving toward the glare of the sun, our system could detect those events and offer alternate routes. These details inform future routing so we can suggest safer, smoother routes.

Using AI to go beyond driving

When you’re walking or biking or taking public transit, AI is also there helping you move along safely and easily. Last August we launched detailed street maps which show accurate road widths, along with details about where the sidewalks, crosswalks and pedestrian islands are in an area so people can better understand its layout and how to navigate it. Today, we announced that detailed street maps will expand to 50 more cities by the end of 2021. While this sounds straightforward, a lot is going on under the hood — especially with AI — to make this possible! 

A GIF that shows a before and after comparison of detailed streets maps built from satellite imagery

A before and after comparison of detailed streets maps built from satellite imagery

Imagine that you’re taking a stroll down a typical San Francisco street. As you approach the intersection, you’ll notice that the crosswalk uses a “zebra” pattern — vertical stripes that show you where to walk. But if you were in another city, say London, then parallel dotted lines would define the crosswalks. To account for these differences and accurately display them on the map, our systems need to know what crosswalks look like — not just in one city but across the entire world. It gets even trickier since urban design can change at the country, state, and even city level.

  • A street-level picture of crosswalks in San Francisco

    Crosswalks in San Francisco

  • A street-level image of crosswalks in London

    Crosswalks in London

  • A street-level image of crosswalks in Tokyo

    Crosswalks in Tokyo

  • A street-level image of crosswalks in Madrid

    Crosswalks in Madrid

  • A street-level image of crosswalks in Zurich

    Crosswalks in Zurich

To expand globally and account for local differences, we needed to completely revamp our mapmaking process. Traditionally, we’ve approached mapmaking like baking a cake — one layer at a time. We trained machine learning models to identify and classify features one by one across our index of millions of Street View, satellite and aerial images — starting first with roads, then addresses, buildings and so on. 

But detailed street maps require significantly more granularity and precision than a normal map. To map these dense urban features correctly, we’ve updated our models to identify all objects in a scene at once. This requires a ton of AI smarts. The model has to understand not only what the objects are, but the relationships between them — like where exactly a street ends and a sidewalk begins. With these new full-scene models, we’re able to detect and classify broad sets of features at a time without sacrificing accuracy, allowing us to map a single city faster than ever before. 

An image of Google Maps’ single-feature AI models

Single-feature AI model that classifies buildings.

An image of Google Maps’ full-scene AI models

Full-scene AI models that capture multiple categories of objects at once.

Once we have a model trained on a particular city, we can then expand it to other cities with similar urban designs. For example, the sidewalks, curbs, and traffic lights look similar in Atlanta and Ho Chi Minh City — despite being over 9,000 miles away. And the same model works in Madrid as it does in Dallas, something that may be hard to believe at first glance. With our new advanced machine learning techniques combined with our collection of high-definition imagery, we’re on track to bring a level of detail to the map at scale like never before.

AI will continue to play an important role as we build the most helpful map for people around the globe. For more behind-the-scenes looks at the technology that powers Google Maps, check out the rest of our Maps 101 blog series.

More from this Series

Maps 101

Google Maps helps you navigate, explore, and get things done every single day. In this series, we’ll take a look under the hood at how Google Maps uses technology to build helpful products—from using flocks of sheep and laser beams to gather high-definition imagery to predicting traffic jams that haven’t even happened yet.

View more from Maps 101

Read More

Unveiling our new Quantum AI campus

Within the decade, Google aims to build a useful, error-corrected quantum computer. This will accelerate solutions for some of the world’s most pressing problems, like sustainable energy and reduced emissions to feed the world’s growing population, and unlocking new scientific discoveries, like more helpful AI.

To begin our journey, today we’re unveiling our new Quantum AI campus in Santa Barbara, California. This campus includes our first quantum data center, our quantum hardware research laboratories, and our own quantum processor chip fabrication facilities. Here, our team is working to build an error-corrected quantum computer for the world.

This is a drone's perspective of entering the new Quantum AI campus

Our new Quantum AI campus in Santa Barbara, CA will include our first quantum data center, new research laboratories, and quantum processor fabrication facilities.

Google began using machine learning 20 years ago (for spell checking in Search), and led the deep learning revolution 10 years ago (advancing neural nets, the leading approach to modern AI). These advances in AI and other technologies have enabled many of the incredible applications we’re seeing today. As we look 10 years into the future, many of the greatest global challenges, from climate change to handling the next pandemic, demand a new kind of computing.

To build better batteries (to lighten the load on the power grid), or to create fertilizer to feed the world without creating 2% of global carbon emissions (as nitrogen fixation does today), or to create more targeted medicines (to stop the next pandemic before it starts), we need to understand and design molecules better. That means simulating nature accurately. But you can’t simulate molecules very well using classical computers. As you get to even modestly sized molecules, you quickly run out of computing resources. Nature is quantum mechanical: The bonds and interactions among atoms behave probabilistically, with richer dynamics that exhaust the simple classical computing logic.

A look inside of our cryostats

The inside of our cryostats, like the ones found in the Quantum AI campus, are some of the coldest places in the universe, reaching temperatures around 10 milliKelvin

This is where quantum computers come in. Quantum computers use quantum bits, or “qubits,” which can be entangled in a complex superposition of states, naturally mirroring the complexity of molecules in the real world. With an error-corrected quantum computer, we’ll be able to simulate how molecules behave and interact, so we can test and invent new chemical processes and new materials before investing in costly real-life prototypes. These new computing capabilities will help to accelerate the discovery of better batteries, energy-efficient fertilizers, and targeted medicines, as well as improved optimization, new AI architectures, and more.

Depicting the journey to building an error-corrected quantum computer

Our journey to build an error-corrected quantum computer within the decade includes several scientific milestones, including building an error-corrected logical qubit.

To reach this goal, we’re on a journey to build 1,000,000 physical qubits that work in concert inside a room-sized error-corrected quantum computer. That’s a big leap from today’s modestly-sized systems of fewer than 100 qubits.

To get there, we must build the world’s first “quantum transistor” — two error-corrected “logical qubits” performing quantum operations together — and then figure out how to tile hundreds to thousands of them to form the error-corrected quantum computer. That will take years.

To get there, we need to show we can encode one logical qubit — with 1,000 physical qubits. Using quantum error-correction, these physical qubits work together to form a long-lived nearly perfect qubit — a forever qubit that maintains coherence until power is removed, ushering in the digital era of quantum computing. Again, we expect years of concerted development to achieve this goal.

And to get THERE(!), we need to show that the more physical qubits participate in error correction, the more you can cut down on errors in the first place — this is a crucial step given how error-prone physical qubits are. We’re doing that research right now on our Quantum AI campus.

Already we run quantum computers that can perform calculations beyond the reach of classical computers. To continue this journey towards a useful error-corrected quantum computer and provide humanity with a new tool tuned to the way nature works, we’re assembling an amazing team to invent the future of computing together right here, right now, at Google’s Quantum AI campus.

  • Picture of the Sycamore quantum processor

    The Sycamore quantum processor has 54 individually controllable qubits, and 88 tunable couplers. The couplers are used to enable fast quantum operations between qubits.

  • Current generation of cryostats

    The current generation of cryostats that hold our quantum processors are about the size of three household refrigerators.

  • Art in the Quantum AI campus

    Quantum computing could help us understand and simulate the natural world around us. The art in the Quantum AI campus is influenced by nature.

Read More

LaMDA: our breakthrough conversation technology

We’ve always had a soft spot for language at Google. Early on, we set out to translate the web. More recently, we’ve invented machine learning techniques that help us better grasp the intent of Search queries. Over time, our advances in these and other areas have made it easier and easier to organize and access the heaps of information conveyed by the written and spoken word.

But there’s always room for improvement. Language is remarkably nuanced and adaptable. It can be literal or figurative, flowery or plain, inventive or informational. That versatility makes language one of humanity’s greatest tools — and one of computer science’s most difficult puzzles. 

LaMDA, our latest research breakthrough, adds pieces to one of the most tantalizing sections of that puzzle: conversation.

Animation of a multi-turn conversation between a person and LaMDA. The transcript reads:  USER: Hi, I’d like to have a conversation demonstration for a blog post.  LaMDA: Hi. I’m a friendly and knowledgeable demonstration for your blog post. I can engage in interesting conversation about virtually anything. What would you like to chat about?  USER: if my washing machine and tv are both malfunctioning at the same time what could the reason be?  LaMDA: It could be the power surge from the washing machine may have fried your tv. If it’s a power surge problem it may be time to get a new washing machine. Or a new TV.  USER: yeah maybe. btw how many legs does a horse have, and could it fix my washing machine?  LaMDA: Horses have four legs. How could they help?  USER: you tell me  LaMDA: Oh, well they could certainly help you. By moving you to a new home where you could install a new washing machine.

While conversations tend to revolve around specific topics, their open-ended nature means they can start in one place and end up somewhere completely different. A chat with a friend about a TV show could evolve into a discussion about the country where the show was filmed before settling on a debate about that country’s best regional cuisine. 

That meandering quality can quickly stump modern conversational agents (commonly known as chatbots), which tend to follow narrow, pre-defined paths. But LaMDA — short for “Language Model for Dialogue Applications” — can engage in a free-flowing way about a seemingly endless number of topics, an ability we think could unlock more natural ways of interacting with technology and entirely new categories of helpful applications.

The long road to LaMDA

LaMDA’s conversational skills have been years in the making. Like many recent language models, including BERT and GPT-3, it’s built on Transformer, a neural network architecture that Google Research invented and open-sourced in 2017. That architecture produces a model that can be trained to read many words (a sentence or paragraph, for example), pay attention to how those words relate to one another and then predict what words it thinks will come next. 

But unlike most other language models, LaMDA was trained on dialogue. During its training, it picked up on several of the nuances that distinguish open-ended conversation from other forms of language. One of those nuances is sensibleness. Basically: Does the response to a given conversational context make sense? For instance, if someone says:

“I just started taking guitar lessons.”

You might expect another person to respond with something like: 

“How exciting! My mom has a vintage Martin that she loves to play.”

That response makes sense, given the initial statement. But sensibleness isn’t the only thing that makes a good response. After all, the phrase “that’s nice” is a sensible response to nearly any statement, much in the way “I don’t know” is a sensible response to most questions. Satisfying responses also tend to be specific, by relating clearly to the context of the conversation. In the example above, the response is sensible and specific.

LaMDA builds on earlier Google research, published in 2020, that showed Transformer-based language models trained on dialogue could learn to talk about virtually anything. Since then, we’ve also found that, once trained, LaMDA can be fine-tuned to significantly improve the sensibleness and specificity of its responses. 

Responsibility first

These early results are encouraging, and we look forward to sharing more soon, but sensibleness and specificity aren’t the only qualities we’re looking for in models like LaMDA. We’re also exploring dimensions like “interestingness,” by assessing whether responses are insightful, unexpected or witty. Being Google, we also care a lot about factuality (that is, whether LaMDA sticks to facts, something language models often struggle with), and are investigating ways to ensure LaMDA’s responses aren’t just compelling but correct.

But the most important question we ask ourselves when it comes to our technologies is whether they adhere to our AI Principles. Language might be one of humanity’s greatest tools, but like all tools it can be misused. Models trained on language can propagate that misuse — for instance, by internalizing biases, mirroring hateful speech, or replicating misleading information. And even when the language it’s trained on is carefully vetted, the model itself can still be put to ill use. 

Our highest priority, when creating technologies like LaMDA, is working to ensure we minimize such risks. We’re deeply familiar with issues involved with machine learning models, such as unfair bias, as we’ve been researching and developing these technologies for many years. That’s why we build and open-source resources that researchers can use to analyze models and the data on which they’re trained; why we’ve scrutinized LaMDA at every step of its development; and why we’ll continue to do so as we work to incorporate conversational abilities into more of our products.

Read More

Learning to Manipulate Deformable Objects

Posted by Daniel Seita, Research Intern and Andy Zeng, Research Scientist, Robotics at Google

While the robotics research community has driven recent advances that enable robots to grasp a wide range of rigid objects, less research has been devoted to developing algorithms that can handle deformable objects. One of the challenges in deformable object manipulation is that it is difficult to specify such an object’s configuration. For example, with a rigid cube, knowing the configuration of a fixed point relative to its center is sufficient to describe its arrangement in 3D space, but a single point on a piece of fabric can remain fixed while other parts shift. This makes it difficult for perception algorithms to describe the complete “state” of the fabric, especially under occlusions. In addition, even if one has a sufficiently descriptive state representation of a deformable object, its dynamics are complex. This makes it difficult to predict the future state of the deformable object after some action is applied to it, which is often needed for multi-step planning algorithms.

In “Learning to Rearrange Deformable Cables, Fabrics, and Bags with Goal-Conditioned Transporter Networks,” to appear at ICRA 2021, we release an open-source simulated benchmark, called DeformableRavens, with the goal of accelerating research into deformable object manipulation. DeformableRavens features 12 tasks that involve manipulating cables, fabrics, and bags and includes a set of model architectures for manipulating deformable objects towards desired goal configurations, specified with images. These architectures enable a robot to rearrange cables to match a target shape, to smooth a fabric to a target zone, and to insert an item in a bag. To our knowledge, this is the first simulator that includes a task in which a robot must use a bag to contain other items, which presents key challenges in enabling a robot to learn more complex relative spatial relations.

The DeformableRavens Benchmark
DeformableRavens expands our prior work on rearranging objects and includes a suite of 12 simulated tasks involving 1D, 2D, and 3D deformable structures. Each task contains a simulated UR5 arm with a mock gripper for pinch grasping, and is bundled with scripted demonstrators to autonomously collect data for imitation learning. Tasks randomize the starting state of the items within a distribution to test generality to different object configurations.

Examples of scripted demonstrators for manipulation of 1D (cable), 2D (fabric), and 3D (bag) deformable structures in our simulator, using PyBullet. These show three of the 12 tasks in DeformableRavens. Left: the task is to move the cable so it matches the underlying green target zone. Middle: the task is to wrap the cube with the fabric. Right: the task is to insert the item in the bag, then to lift and move the bag to the square target zone.

Specifying goal configurations for manipulation tasks can be particularly challenging with deformable objects. Given their complex dynamics and high-dimensional configuration spaces, goals cannot be as easily specified as a set of rigid object poses, and may involve complex relative spatial relations, such as “place the item inside the bag”. Hence, in addition to tasks defined by the distribution of scripted demonstrations, our benchmark also contains goal-conditioned tasks that are specified with goal images. For goal-conditioned tasks, a given starting configuration of objects must be paired with a separate image that shows the desired configuration of those same objects. A success for that particular case is then based on whether the robot is able to get the current configuration to be sufficiently close to the configuration conveyed in the goal image.

Goal-Conditioned Transporter Networks
To complement the goal-conditioned tasks in our simulated benchmark, we integrated goal-conditioning into our previously released Transporter Network architecture — an action-centric model architecture that works well on rigid object manipulation by rearranging deep features to infer spatial displacements from visual input. The architecture takes as input both an image of the current environment and a goal image with a desired final configuration of objects, computes deep visual features for both images, then combines the features using element-wise multiplication to condition pick and place correlations to manipulate both the rigid and deformable objects in the scene. A strength of the Transporter Network architecture is that it preserves the spatial structure of the visual images, which provides inductive biases that reformulate image-based goal conditioning into a simpler feature matching problem and improves the learning efficiency with convolutional networks.

An example task involving goal-conditioning is shown below. In order to place the green block into the yellow bag, the robot needs to learn spatial features that enable it to perform a multi-step sequence of actions to spread open the top opening of the yellow bag, before placing the block into it. After it places the block into the yellow bag, the demonstration ends in a success. If in the goal image the block were placed in the blue bag, then the demonstrator would need to put the block in the blue bag.

An example of a goal-conditioned task in DeformableRavens. Left: A frontal camera view of the UR5 robot and the bags, plus one item, in a desired goal configuration. Middle: The top-down orthographic image of this setup, which is size 160×320 and passed as the goal image to specify the task success criterion. Right: A video of the demonstration policy showing that the item goes into the yellow bag, instead of the blue one.

Results
Our results suggest that goal-conditioned Transporter Networks enable agents to manipulate deformable structures into flexibly specified configurations without test-time visual anchors for target locations. We also significantly extend prior results using Transporter Networks for manipulating deformable objects by testing on tasks with 2D and 3D deformables. Results additionally suggest that the proposed approach is more sample-efficient than alternative approaches that rely on using ground-truth pose and vertex position instead of images as input.

For example, the learned policies can effectively simulate bagging tasks, and one can also provide a goal image so that the robot must infer into which bag the item should be placed.

An example of policies trained using Transporter Networks applied in action on bagging tasks, where the objective is to first open the bag, then to put one (left) or two (right) items in the bag, then to insert the bag into the target zone. The left animation is zoomed in for clarity.
An example of the learned policy using Goal-Conditioned Transporter Networks. Left: The frontal camera view. Middle: The goal image that the Goal-Conditioned Transporter Network receives as input, which shows that the item should go in the red bag, instead of the blue distractor bag. Right: The learned policy putting the item in the red bag, instead of the distractor bag (colored yellow in this case).

We encourage other researchers to check out our open-source code to try the simulated environments and to build upon this work. For more details, please check out our paper.

Future Work
This work exposes several directions for future development, including the mitigation of observed failure modes. As shown below, one failure is when the robot pulls the bag upwards and causes the item to fall out. Another is when the robot places the item on the irregular exterior surface of the bag, which causes the item to fall off. Future algorithmic improvements might allow actions that operate at a higher frequency rate, so that the robot can react in real time to counteract such failures.

Examples of failure cases from the learned Transporter-based policies on bag manipulation tasks. Left: the robot inserts the cube into the opening of the bag, but the bag pulling action fails to enclose the cube. Right: the robot fails to insert the cube into the opening, and is unable to perform recovery actions to insert the cube in a better location.

Another area for advancement is to train Transporter Network-based models for deformable object manipulation using techniques that do not require expert demonstrations, such as example-based control or model-based reinforcement learning. Finally, the ongoing pandemic limited access to physical robots, so in future work we will explore the necessary ingredients to get a system working with physical bags, and to extend the system to work with different types of bags.

Acknowledgments
This research was conducted during Daniel Seita’s internship at Google’s NYC office in Summer 2020. We thank our collaborators Pete Florence, Jonathan Tompson, Erwin Coumans, Vikas Sindhwani, and Ken Goldberg.

Read More

ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision

Posted by Chao Jia and Yinfei Yang, Software Engineers, Google Research

Learning good visual and vision-language representations is critical to solving computer vision problems — image retrieval, image classification, video understanding — and can enable the development of tools and products that change people’s daily lives. For example, a good vision-language matching model can help users find the most relevant images given a text description or an image input and help tools such as Google Lens find more fine-grained information about an image.

To learn such representations, current state-of-the-art (SotA) visual and vision-language models rely heavily on curated training datasets that require expert knowledge and extensive labels. For vision applications, representations are mostly learned on large-scale datasets with explicit class labels, such as ImageNet, OpenImages, and JFT-300M. For vision-language applications, popular pre-training datasets, such as Conceptual Captions and Visual Genome Dense Captions, all require non-trivial data collection and cleaning steps, limiting the size of datasets and thus hindering the scale of the trained models. In contrast, natural language processing (NLP) models have achieved SotA performance on GLUE and SuperGLUE benchmarks by utilizing large-scale pre-training on raw text without human labels.

In “Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision“, to appear at ICML 2021, we propose bridging this gap with publicly available image alt-text data (written copy that appears in place of an image on a webpage if the image fails to load on a user’s screen) in order to train larger, state-of-the-art vision and vision-language models. To that end, we leverage a noisy dataset of over one billion image and alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. We show that the scale of our corpus can make up for noisy data and leads to SotA representation, and achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations also set new SotA results on Flickr30K and MS-COCO benchmarks, even when compared with more sophisticated cross-attention models, and enable zero-shot image classification and cross-modality search with complex text and text + image queries.

Creating the Dataset
Alt-texts usually provide a description of what the image is about, but the dataset is “noisy” because some text may be partly or wholly unrelated to its paired image.

Example image-text pairs randomly sampled from the training dataset of ALIGN. One clearly noisy text label is marked in italics.

In this work, we follow the methodology of constructing the Conceptual Captions dataset to get a version of raw English alt-text data (image and alt-text pairs). While the Conceptual Captions dataset was cleaned by heavy filtering and post-processing, this work scales up visual and vision-language representation learning by relaxing most of the cleaning steps in the original work. Instead, we only apply minimal frequency-based filtering. The result is a much larger but noisier dataset of 1.8B image-text pairs.

ALIGN: A Large-scale ImaGe and Noisy-Text Embedding
For the purpose of building larger and more powerful models easily, we employ a simple dual-encoder architecture that learns to align visual and language representations of the image and text pairs. Image and text encoders are learned via a contrastive loss (formulated as normalized softmax) that pushes the embeddings of matched image-text pairs together while pushing those of non-matched image-text pairs (within the same batch) apart. The large-scale dataset makes it possible for us to scale up the model size to be as large as EfficientNet-L2 (image encoder) and BERT-large (text encoder) trained from scratch. The learned representation can be used for downstream visual and vision-language tasks.

Figure of ImageNet credit to (Krizhevsky et al. 2012) and VTAB figure credit to (Zhai et al. 2019)

The resulting representation can be used for vision-only or vision-language task transfer. Without any fine-tuning, ALIGN powers cross-modal search – image-to-text search, text-to-image search, and even search with joint image+text queries, examples below.

Evaluating Retrieval and Representation
The learned ALIGN model with BERT-Large and EfficientNet-L2 as text and image encoder backbones achieves SotA performance on multiple image-text retrieval tasks (Flickr30K and MS-COCO) in both zero-shot and fine-tuned settings, as shown below.

Flickr30K (1K test set) R@1 MS-COCO (5K test set) R@1
Setting Model    image → text       text → image       image → text       text → image   
Zero-shot ImageBERT    70.7 54.3 44.0 32.3
UNITER 83.6 68.7
CLIP 88.0 68.7 58.4 37.8
ALIGN 88.6 75.7 58.6 45.6
Fine-tuned    GPO 88.7 76.1 68.1 52.7
UNITER 87.3 75.6 65.7 52.9
ERNIE-ViL 88.1 76.7
VILLA 87.9 76.3
Oscar 73.5 57.5
ALIGN 95.3 84.9 77.0 59.9
Image-text retrieval results (recall@1) on Flickr30K and MS-COCO datasets (both zero-shot and fine-tuned). ALIGN significantly outperforms existing methods including the cross-modality attention models that are too expensive for large-scale retrieval applications.

ALIGN is also a strong image representation model. Shown below, with frozen features, ALIGN slightly outperforms CLIP and achieves a SotA result of 85.5% top-1 accuracy on ImageNet. With fine-tuning, ALIGN achieves higher accuracy than most generalist models, such as BiT and ViT, and is only worse than Meta Pseudo Labels, which requires deeper interaction between ImageNet training and large-scale unlabeled data.

Model (backbone)    Acc@1 w/ frozen features       Acc@1       Acc@5   
WSL (ResNeXt-101 32x48d) 83.6 85.4 97.6
CLIP (ViT-L/14) 85.4
BiT (ResNet152 x 4) 87.54 98.46
NoisyStudent (EfficientNet-L2) 88.4 98.7
ViT (ViT-H/14) 88.55
Meta-Pseudo-Labels (EfficientNet-L2)    90.2 98.8
ALIGN (EfficientNet-L2) 85.5 88.64 98.67
ImageNet classification results comparison with supervised training (fine-tuning).

Zero-Shot Image Classification
Traditionally, image classification problems treat each class as independent IDs, and people have to train the classification layers with at least a few shots of labeled data per class. The class names are actually also natural language phrases, so we can naturally extend the image-text retrieval capability of ALIGN for image classification without any training data.

The pre-trained image and text encoder can directly be used in classifying an image into a set of classes by retrieving the nearest class name in the aligned embedding space. This approach does not require any training data for the defined class space.

On the ImageNet validation dataset, ALIGN achieves 76.4% top-1 zero-shot accuracy and shows great robustness in different variants of ImageNet with distribution shifts, similar to the concurrent work CLIP. We also use the same text prompt engineering and ensembling as in CLIP.

   ImageNet       ImageNet-R       ImageNet-A       ImageNet-V2   
CLIP 76.2 88.9 77.2 70.1
ALIGN    76.4 92.2 75.8 70.1
Top-1 accuracy of zero-shot classification on ImageNet and its variants.

Application in Image Search
To illustrate the quantitative results above, we build a simple image retrieval system with the embeddings trained by ALIGN and show the top 1 text-to-image retrieval results for a handful of text queries from a 160M image pool. ALIGN can retrieve precise images given detailed descriptions of a scene, or fine-grained or instance-level concepts like landmarks and artworks. These examples demonstrate that the ALIGN model can align images and texts with similar semantics, and that ALIGN can generalize to novel complex concepts.

Image retrieval with fine-grained text queries using ALIGN’s embeddings.

Multimodal (Image+Text) Query for Image Search
A surprising property of word vectors is that word analogies can often be solved with vector arithmetic. A common example, “king – man + woman = queen”. Such linear relationships between image and text embeddings also emerge in ALIGN.

Specifically, given a query image and a text string, we add their ALIGN embeddings together and use it to retrieve relevant images using cosine similarity, as shown below. These examples not only demonstrate the compositionality of ALIGN embeddings across vision and language domains, but also show the feasibility of searching with a multi-modal query. For instance, one could now look for the “Australia” or “Madagascar” equivalence of pandas, or turn a pair of black shoes into identically-looking beige shoes. Also, it is possible to remove objects/attributes from a scene by performing subtraction in the embedding space, shown below.

Image retrieval with image text queries. By adding or subtracting text query embedding, ALIGN retrieves relevant images.

Social Impact and Future Work
While this work shows promising results from a methodology perspective with a simple data collection method, additional analysis of the data and the resulting model is necessary before the responsible use of the model in practice. For instance, considerations should be made towards the potential for the use of harmful text data in alt-texts to reinforce such harms. With regard to fairness, data balancing efforts may be required to prevent reinforcing stereotypes from the web data. Additional testing and training around sensitive religious or cultural items should be taken to understand and mitigate the impact from possibly mislabeled data.

Further analysis should also be taken to ensure that the demographic distribution of humans and related cultural items, such as clothing, food, and art, do not cause skewed model performance. Analysis and balancing would be required if such models will be used in production.

Conclusion
We have presented a simple method of leveraging large-scale noisy image-text data to scale up visual and vision-language representation learning. The resulting model, ALIGN, is capable of cross-modal retrieval and significantly outperforms SotA models. In visual-only downstream tasks, ALIGN is also comparable to or outperforms SotA models trained with large-scale labeled data.

Acknowledgement
We would like to thank our co-authors in Google Research: Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yunhsuan Sung, Zhen Li, Tom Duerig. This work was also done with invaluable help from other colleagues from Google. We would like to thank Jan Dlabal and Zhe Li for continuous support in training infrastructure, Simon Kornblith for building the zero-shot & robustness model evaluation on ImageNet variants, Xiaohua Zhai for help on conducting VTAB evaluation, Mingxing Tan and Max Moroz for suggestions on EfficientNet training, Aleksei Timofeev for the early idea of multimodal query retrieval, Aaron Michelony and Kaushal Patel for their early work on data generation, and Sergey Ioffe, Jason Baldridge and Krishna Srinivasan for the insightful feedback and discussion.

Read More

Accelerating Eye Movement Research for Wellness and Accessibility

Posted by Nachiappan Valliappan, Senior Software Engineer and Kai Kohlhoff, Staff Research Scientist, Google Research

Eye movement has been studied widely across vision science, language, and usability since the 1970s. Beyond basic research, a better understanding of eye movement could be useful in a wide variety of applications, ranging across usability and user experience research, gaming, driving, and gaze-based interaction for accessibility to healthcare. However, progress has been limited because most prior research has focused on specialized hardware-based eye trackers that are expensive and do not easily scale.

In “Accelerating eye movement research via accurate and affordable smartphone eye tracking”, published in Nature Communications, and “Digital biomarker of mental fatigue”, published in npj Digital Medicine, we present accurate, smartphone-based, ML-powered eye tracking that has the potential to unlock new research into applications across the fields of vision, accessibility, healthcare, and wellness, while additionally providing orders-of-magnitude scaling across diverse populations in the world, all using the front-facing camera on a smartphone. We also discuss the potential use of this technology as a digital biomarker of mental fatigue, which can be useful for improved wellness.

Model Overview
The core of our gaze model was a multilayer feed-forward convolutional neural network (ConvNet) trained on the MIT GazeCapture dataset. A face detection algorithm selected the face region with associated eye corner landmarks, which were used to crop the images down to the eye region alone. These cropped frames were fed through two identical ConvNet towers with shared weights. Each convolutional layer was followed by an average pooling layer. Eye corner landmarks were combined with the output of the two towers through fully connected layers. Rectified Linear Units (ReLUs) were used for all layers except the final fully connected output layer (FC6), which had no activation.

Architecture of the unpersonalized gaze model. Eye regions, extracted from a front-facing camera image, serve as input into a convolutional neural network. Fully-connected (FC) layers combine the output with eye corner landmarks to infer gaze x– and y-locations on screen via a multi-regression output layer.

The unpersonalized gaze model accuracy was improved by fine-tuning and per-participant personalization. For the latter, a lightweight regression model was fitted to the model’s penultimate ReLU layer and participant-specific data.

Model Evaluation
To evaluate the model, we collected data from consenting study participants as they viewed dots that appeared at random locations on a blank screen. The model error was computed as the distance (in cm) between the stimulus location and model prediction. Results show that while the unpersonalized model has high error, personalization with ~30s of calibration data led to an over fourfold error reduction (from 1.92 to 0.46cm). At a viewing distance of 25-40 cm, this corresponds to 0.6-1° accuracy, a significant improvement over the 2.4-3° reported in previous work [1, 2].

Additional experiments show that the smartphone eye tracker model’s accuracy is comparable to state-of-the-art wearable eye trackers both when the phone is placed on a device stand, as well as when users hold the phone freely in their hand in a near frontal headpose. In contrast to specialized eye tracking hardware with multiple infrared cameras close to each eye, running our gaze model using a smartphone’s single front-facing RGB camera is significantly more cost effective (~100x cheaper) and scalable.

Using this smartphone technology, we were able to replicate key findings from prior eye movement research in neuroscience and psychology, including standard oculomotor tasks (to understand basic visual functioning in the brain) and natural image understanding. For example, in a simple prosaccade task, which tests a person’s ability to quickly move their eyes towards a stimulus that appears on the screen, we found that the average saccade latency (time to move the eyes) matches prior work for basic visual health (210ms versus 200-250ms). In controlled visual search tasks, we were able to replicate key findings, such as the effect of target saliency and clutter on eye movements.

Example gaze scanpaths show the effect of the target’s saliency (i.e., color contrast) on visual search performance. Fewer fixations are required to find a target (left) with high saliency (different from the distractors), while more fixations are required to find a target (right) with low saliency (similar to the distractors).

For complex stimuli, such as natural images, we found that the gaze distribution (computed by aggregating gaze positions across all participants) from our smartphone eye tracker are similar to those obtained from bulky, expensive eye trackers that used highly controlled settings, such as laboratory chin rest systems. While the smartphone-based gaze heatmaps have a broader distribution (i.e., they appear more “blurred”) than hardware-based eye trackers, they are highly correlated both at the pixel level (r = 0.74) and object level (r = 0.90). These results suggest that this technology could be used to scale gaze analysis for complex stimuli such as natural and medical images (e.g., radiologists viewing MRI/PET scans).

Similar gaze distribution from our smartphone approach vs. a more expensive (100x) eye tracker (from the OSIE dataset).

We found that smartphone gaze could also help detect difficulty with reading comprehension. Participants reading passages spent significantly more time looking within the relevant excerpts when they answered correctly. However, as comprehension difficulty increased, they spent more time looking at the irrelevant excerpts in the passage before finding the relevant excerpt that contained the answer. The fraction of gaze time spent on the relevant excerpt was a good predictor of comprehension, and strongly negatively correlated with comprehension difficulty (r = −0.72).

Digital Biomarker of Mental Fatigue
Gaze detection is an important tool to detect alertness and wellbeing, and is studied widely in medicine, sleep research, and mission-critical settings such as medical surgeries, aviation safety, etc. However, existing fatigue tests are subjective and often time-consuming. In our recent paper published in npj Digital Medicine, we demonstrated that smartphone gaze is significantly impaired with mental fatigue, and can be used to track the onset and progression of fatigue.

A simple model predicts mental fatigue reliably using just a few minutes of gaze data from participants performing a task. We validated these findings in two different experiments — using a language-independent object-tracking task and a language-dependent proofreading task. As shown below, in the object-tracking task, participants’ gaze initially follows the object’s circular trajectory, but under fatigue, their gaze shows high errors and deviations. Given the pervasiveness of phones, these results suggest that smartphone-based gaze could provide a scalable, digital biomarker of mental fatigue.

Example gaze scanpaths for a participant with no fatigue (left) versus with mental fatigue (right) as they track an object following a circular trajectory.
The corresponding progression of fatigue scores (ground truth) and model prediction as a function of time on task.

Beyond wellness, smartphone gaze could also provide a digital phenotype for screening or monitoring health conditions such as autism spectrum disorder, dyslexia, concussion and more. This could enable timely and early interventions, especially for countries with limited access to healthcare services.

Another area that could benefit tremendously is accessibility. People with conditions such as ALS, locked-in syndrome and stroke have impaired speech and motor ability. Smartphone gaze could provide a powerful way to make daily tasks easier by using gaze for interaction, as recently demonstrated with Look to Speak.

Ethical Considerations
Gaze research needs careful consideration, including being mindful of the correct use of such technology — applications should obtain explicit approval and fully informed consent from users for the specific task at hand. In our work, all data was collected for research purposes with users’ explicit approval and consent. In addition, users were allowed to opt out at any point and request their data to be deleted. We continue to research additional ways to ensure ML fairness and improve the accuracy and robustness of gaze technology across demographics, in a responsible, privacy-preserving way.

Conclusion
Our findings of accurate and affordable ML-powered smartphone eye tracking offer the potential for orders-of-magnitude scaling of eye movement research across disciplines (e.g., neuroscience, psychology and human-computer interaction). They unlock potential new applications for societal good, such as gaze-based interaction for accessibility, and smartphone-based screening and monitoring tools for wellness and healthcare.

Acknowledgements
This work involved collaborative efforts from a multidisciplinary team of software engineers, researchers, and cross-functional contributors. We’d like to thank all the co-authors of the papers, including our team members, Junfeng He, Na Dai, Pingmei Xu, Venky Ramachandran; interns, Ethan Steinberg, Kantwon Rogers, Li Guo, and Vincent Tseng; collaborators, Tanzeem Choudhury; and UXRs: Mina Shojaeizadeh, Preeti Talwai, and Ran Tao. We’d also like to thank Tomer Shekel, Gaurav Nemade, and Reena Lee for their contributions to this project, and Vidhya Navalpakkam for her technical leadership in initiating and overseeing this body of work.

Read More

Crisscrossed Captions: Semantic Similarity for Images and Text

Posted by Zarana Parekh, Software Engineer and Jason Baldridge, Staff Research Scientist, Google Research

The past decade has seen remarkable progress on automatic image captioning, a task in which a computer algorithm creates written descriptions for images. Much of the progress has come through the use of modern deep learning methods developed for both computer vision and natural language processing, combined with large scale datasets that pair images with descriptions created by people. In addition to supporting important practical applications, such as providing descriptions of images for visually impaired people, these datasets also enable investigations into important and exciting research questions about grounding language in visual inputs. For example, learning deep representations for a word like “car”, means using both linguistic and visual contexts.

Image captioning datasets that contain pairs of textual descriptions and their corresponding images, such as MS-COCO and Flickr30k, have been widely used to learn aligned image and text representations and to build captioning models. Unfortunately, these datasets have limited cross-modal associations: images are not paired with other images, captions are only paired with other captions of the same image (also called co-captions), there are image-caption pairs that match but are not labeled as a match, and there are no labels that indicate when an image-caption pair does not match. This undermines research into how inter-modality learning (connecting captions to images, for example) impacts intra-modality tasks (connecting captions to captions or images to images). This is important to address, especially because a fair amount of work on learning from images paired with text is motivated by arguments about how visual elements should inform and improve representations of language.

To address this evaluation gap, we present “Crisscrossed Captions: Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCO“, which was recently presented at EACL 2021. The Crisscrossed Captions (CxC) dataset extends the development and test splits of MS-COCO with semantic similarity ratings for image-text, text-text and image-image pairs. The rating criteria are based on Semantic Textual Similarity, an existing and widely-adopted measure of semantic relatedness between pairs of short texts, which we extend to include judgments about images as well. In all, CxC contains human-derived semantic similarity ratings for 267,095 pairs (derived from 1,335,475 independent judgments), a massive extension in scale and detail to the 50k original binary pairings in MS-COCO’s development and test splits. We have released CxC’s ratings, along with code to merge CxC with existing MS-COCO data. Anyone familiar with MS-COCO can thus easily enhance their experiments with CxC.

Crisscrossed Captions extends the MS-COCO evaluation sets by adding human-derived semantic similarity ratings for existing image-caption pairs and co-captions (solid lines), and it increases rating density by adding human ratings for new image-caption, caption-caption and image-image pairs (dashed lines).*

Creating the CxC Dataset
If a picture is worth a thousand words, it is likely because there are so many details and relationships between objects that are generally depicted in pictures. We can describe the texture of the fur on a dog, name the logo on the frisbee it is chasing, mention the expression on the face of the person who has just thrown the frisbee, or note the vibrant red on a large leaf in a tree above the person’s head, and so on.

The CxC dataset extends the MS-COCO evaluation splits with graded similarity associations within and across modalities. MS-COCO has five captions for each image, split into 410k training, 25k development, and 25k test captions (for 82k, 5k, 5k images, respectively). An ideal extension would rate every pair in the dataset (caption-caption, image-image, and image-caption), but this is infeasible as it would require obtaining human ratings for billions of pairs.

Given that randomly selected pairs of images and captions are likely to be dissimilar, we came up with a way to select items for human rating that would include at least some new pairs with high expected similarity. To reduce the dependence of the chosen pairs on the models used to find them, we introduce an indirect sampling scheme (depicted below) where we encode images and captions using different encoding methods and compute the similarity between pairs of same modality items, resulting in similarity matrices. Images are encoded using Graph-RISE embeddings, while captions are encoded using two methods — Universal Sentence Encoder (USE) and average bag-of-words (BoW) based on GloVe embeddings. Since each MS-COCO example has five co-captions, we average the co-caption encodings to create a single representation per example, ensuring all caption pairs can be mapped to image pairs (more below on how we select intermodality pairs).

The next step of the indirect sampling scheme is to use the computed similarities of images for a biased sampling of caption pairs for human rating (and vice versa). For example, we select two captions with high computed similarities from the text similarity matrix, then take each of their images, resulting in a new pair of images that are different in appearance but similar in what they depict based on their descriptions. For example, the captions “A dog looking bashfully to the side” and “A black dog lifts its head to the side to enjoy a breeze” would have a reasonably high model similarity, so the corresponding images of the two dogs in the figure below could be selected for image similarity rating. This step can also start with two images with high computed similarities to yield a new pair of captions. We now have indirectly sampled new intramodal pairs — at least some of which are highly similar — for which we obtain human ratings.

Top: Text similarity matrix (each cell corresponds to a similarity score) constructed using averaged co-caption encodings, so each text entry corresponds to a single image, resulting in a 5k x 5k matrix. Two different text encoding methods were used, but only one text similarity matrix has been shown for simplicity. Bottom: Image similarity matrix for each image in the dataset, resulting in a 5k x 5k matrix.
Top: Pairs of images are picked based on their computed caption similarity. Bottom: pairs of captions are picked based on the computed similarity of the images they describe.

Last, we then use these new intramodal pairs and their human ratings to select new intermodal pairs for human rating. We do this by using existing image-caption pairs to link between modalities. For example, if a caption pair example ij was rated by humans as highly similar, we pick the image from example i and caption from example j to obtain a new intermodal pair for human rating. And again, we use the intramodal pairs with the highest rated similarity for sampling because this includes at least some new pairs with high similarity. Finally, we also add human ratings for all existing intermodal pairs and a large sample of co-captions.

The following table shows examples of semantic image similarity (SIS) and semantic image-text similarity (SITS) pairs corresponding to each rating, with 5 being the most similar and 0 being completely dissimilar.

Examples for each human-derived similarity score (left: 5 to 0, 5 being very similar and 0 being completely dissimilar) of image pairs based on SIS (middle) and SITS (right) tasks. Note that these examples are for illustrative purposes and are not themselves in the CxC dataset.

Evaluation
MS-COCO supports three retrieval tasks:

  1. Given an image, find its matching captions out of all other captions in the evaluation set.
  2. Given a caption, find its corresponding image out of all other images in the evaluation set.
  3. Given a caption, find its other co-captions out of all other captions in the evaluation set.

MS-COCO’s pairs are incomplete because captions created for one image at times apply equally well to another, yet these associations are not captured in the dataset. CxC enhances these existing retrieval tasks with new positive pairs, and it also supports a new image-image retrieval task. With its graded similarity judgements, CxC also makes it possible to measure correlations between model and human rankings. Retrieval metrics in general focus only on positive pairs, while CxC’s correlation scores additionally account for the relative ordering of similarity and include low-scoring items (non-matches). Supporting these evaluations on a common set of images and captions makes them more valuable for understanding inter-modal learning compared to disjoint sets of caption-image, caption-caption, and image-image associations.

We ran a series of experiments to show the utility of CxC’s ratings. For this, we constructed three dual encoder (DE) models using BERT-base as the text encoder and EfficientNet-B4 as the image encoder:

  1. A text-text (DE_T2T) model that uses a shared text encoder for both sides.
  2. An image-text model (DE_I2T) that uses the aforementioned text and image encoders, and includes a layer above the text encoder to match the image encoder output.
  3. A multitask model (DE_I2T+T2T) trained on a weighted combination of text-text and image-text tasks.
CxC retrieval results — a comparison of our text-text (T2T), image-text (I2T) and multitask (I2T+T2T) dual encoder models on all the four retrieval tasks.

From the results on the retrieval tasks, we can see that DE_I2T+T2T (yellow bar) performs better than DE_I2T (red bar) on the image-text and text-image retrieval tasks. Thus, adding the intramodal (text-text) training task helped improve the intermodal (image-text, text-image) performance. As for the other two intramodal tasks (text-text and image-image), DE_I2T+T2T shows strong, balanced performance on both of them.

CxC correlation results for the same models shown above.

For the correlation tasks, DE_I2T performs the best on SIS and DE_I2T+T2T is the best overall. The correlation scores also show that DE_I2T performs well only on images: it has the highest SIS but has much worse STS. Adding the text-text loss to DE_I2T training (DE_I2T+T2T) produces more balanced overall performance.

The CxC dataset provides a much more complete set of relationships between and among images and captions than the raw MS-COCO image-caption pairs. The new ratings have been released and further details are in our paper. We hope to encourage the research community to push the state of the art on the tasks introduced by CxC with better models for jointly learning inter- and intra-modal representations.

Acknowledgments
The core team includes Daniel Cer, Yinfei Yang and Austin Waters. We thank Julia Hockenmaier for her inputs on CxC’s formulation, the Google Data Compute Team, especially Ashwin Kakarla and Mohd Majeed for their tooling and annotation support, Yuan Zhang, Eugene Ie for their comments on the initial versions of the paper and Daphne Luong for executive support for the data collection.

* All the images in the article have been taken from the Open Images dataset under the CC-by 4.0 license.

Read More

Woolaroo: a new tool for exploring indigenous languages

“Our dictionary doesn’t have a word for shoe” my Uncle Allan Lena said, so when kids ask him what to call it in Yugambeh, he’ll say “jinung gulli” – a foot thing.

Uncle Allan Lena is a frontline worker in the battle to reteach the Yugambeh Aboriginal language to the children of southeast Queensland, Australia, where it hasn’t been spoken fluently for decades and thus is – like many other languages around the world – in danger of disappearing.  

For the younger generation, even general language can be a challenge to understand, but it can be especially difficult to try to describe modern items using Indigenous languages like Yugambeh. For example in the Australian outdoors, it’s easy to teach children the words for trees and animals, but around the house it becomes harder. Traditional language didn’t have a word for a fridge – so we say waring bin – a cold place. The same with a telephone – we call it a gulgun biral – voice thrower.

However, today’s technology can help provide an educational and interactive way to promote language learning and preservation.  I’m particularly proud for Yugambeh to be the first Australian Aboriginal language to be featured on Woolaroo, a new Google Arts & Culture experiment using the Google Cloud Vision API

The team behind the Yugambeh Museum has been working for three decades to help gather local language and cultural stories. Given the importance of Aboriginal language to Australian culture we have the incentive to record the known but in particular new words our community members are using as the world evolves bringing us new technology we didn’t have before.

An info graphic with numbers on the Yugambeh language

Woolaroo is open source and allows language communities like ours to preserve and expand their language word lists and add audio recordings to help with pronunciation. Today it supports 10 global languages including Louisiana Creole, Calabrian Greek, Māori, Nawat, Tamazight, Sicilian, Yang Zhuang, Rapa Nui, Yiddish and Yugambeh. Any of these languages are an important aspect of a community’s cultural heritage. 

Crucial to Indigenous communities is that Woolaroo puts the power to add, edit and delete entries completely in their hands. So people can respond immediately to newly remembered words and phrases and add them directly.

So if you, your grandparents or people in your community speak any of these languages – even if just a few words –  you can help to expand the growing coverage of Woolaroo.

We hope people will enjoy learning and interacting with a new language and  learn about the diversity of communities and heritage we all share together. 

Explore more on the Google Arts & Culture app for iOS and Android and at g.co/woolaroo.

Read More