Deep Hierarchical Planning from Pixels

Research into how artificial agents can make decisions has evolved rapidly through advances in deep reinforcement learning. Compared to generative ML models like GPT-3 and Imagen, artificial agents can directly influence their environment through actions, such as moving a robot arm based on camera inputs or clicking a button in a web browser. While artificial agents have the potential to be increasingly helpful to people, current methods are held back by the need to receive detailed feedback in the form of frequently provided rewards to learn successful strategies. For example, despite large computational budgets, even powerful programs such as AlphaGo are limited to a few hundred moves until receiving their next reward.

In contrast, complex tasks like making a meal require decision making at all levels: planning the menu, navigating to the store, picking up groceries, following the recipe in the kitchen, and executing the fine motor skills needed at each step along the way, all based on high-dimensional sensory inputs. Hierarchical reinforcement learning (HRL) promises to automatically break down such complex tasks into manageable subgoals, enabling artificial agents to solve tasks more autonomously from fewer rewards, also known as sparse rewards. However, research progress on HRL has proven to be challenging; current methods rely on manually specified goal spaces or subtasks, and no general solution exists.

To spur progress on this research challenge and in collaboration with the University of California, Berkeley, we present the Director agent, which learns practical, general, and interpretable hierarchical behaviors from raw pixels. Director trains a manager policy to propose subgoals within the latent space of a learned world model and trains a worker policy to achieve these goals. Despite operating on latent representations, we can decode Director’s internal subgoals into images to inspect and interpret its decisions. We evaluate Director across several benchmarks, showing that it learns diverse hierarchical strategies and enables solving tasks with very sparse rewards where previous approaches fail, such as exploring 3D mazes with quadruped robots directly from first-person pixel inputs.

Director learns to solve complex long-horizon tasks by automatically breaking them down into subgoals. Each panel shows the environment interaction on the left and the decoded internal goals on the right.

How Director Works
Director learns a world model from pixels that enables efficient planning in a latent space. The world model maps images to model states and then predicts future model states given potential actions. From predicted trajectories of model states, Director optimizes two policies: The manager chooses a new goal every fixed number of steps, and the worker learns to achieve the goals through low-level actions. However, choosing goals directly in the high-dimensional continuous representation space of the world model would be a challenging control problem for the manager. Instead, we learn a goal autoencoder to compress the model states into smaller discrete codes. The manager then selects discrete codes and the goal autoencoder turns them into model states before passing them as goals to the worker.

Left: The goal autoencoder (blue) compresses the world model (green) state (st) into discrete codes (z). Right: The manager policy (orange) selects a code that the goal decoder (blue) turns into a feature space goal (g). The worker policy (red) learns to achieve the goal from future trajectories (s1, …, s4) predicted by the world model.
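
The control flow can be sketched as follows; the components and the re-planning interval here are illustrative stand-ins rather than the released implementation:

```python
# Minimal sketch of Director's control flow; all components are hypothetical stand-ins.
K = 8  # the manager proposes a new goal every K steps (illustrative value)

def act(world_model, manager, goal_decoder, worker, observations):
    """Yield a low-level action for each incoming observation."""
    state, goal = None, None
    for t, obs in enumerate(observations):
        state = world_model.encode(obs, prev_state=state)  # image -> latent model state
        if t % K == 0:
            code = manager.select_code(state)              # discrete code from the goal space
            goal = goal_decoder(code)                      # decode the code into a model-state goal
        yield worker.act(state, goal)                      # low-level action toward the goal
```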

All components of Director are optimized concurrently, so the manager learns to select goals that are achievable by the worker. The manager learns to select goals to maximize both the task reward and an exploration bonus, leading the agent to explore and steer towards remote parts of the environment. We found that preferring model states where the goal autoencoder incurs high prediction error is a simple and effective exploration bonus. Unlike prior methods, such as Feudal Networks, our worker receives no task reward and learns purely from maximizing the feature space similarity between the current model state and the goal. This means the worker has no knowledge of the task and instead concentrates all its capacity on achieving goals.
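
As a rough illustration (not the exact objectives from the paper), the two reward signals can be sketched as follows, treating the goal autoencoder as an object with encode and decode methods:

```python
import numpy as np

def worker_reward(state, goal):
    """Feature-space similarity between the current model state and the goal (cosine, as one choice)."""
    return float(np.dot(state, goal) / (np.linalg.norm(state) * np.linalg.norm(goal) + 1e-8))

def exploration_bonus(state, goal_autoencoder):
    """Larger where the goal autoencoder reconstructs the state poorly, i.e., in novel parts of the environment."""
    reconstruction = goal_autoencoder.decode(goal_autoencoder.encode(state))
    return float(np.mean((np.asarray(state) - np.asarray(reconstruction)) ** 2))

# The manager maximizes the task reward plus the exploration bonus; the worker maximizes worker_reward only.
```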

Benchmark Results
Whereas prior work in HRL often resorted to custom evaluation protocols — such as assuming diverse practice goals, access to the agents’ global position on a 2D map, or ground-truth distance rewards — Director operates in the end-to-end RL setting. To test the ability to explore and solve long-horizon tasks, we propose the Egocentric Ant Maze benchmark. This challenging suite of tasks requires finding and reaching goals in 3D mazes by controlling the joints of a quadruped robot, given only proprioceptive and first-person camera inputs. A sparse reward is provided only when the robot reaches the goal, so the agents have to autonomously explore in the absence of task rewards throughout most of their learning.

The Egocentric Ant Maze benchmark measures the ability of agents to explore in a temporally-abstract manner to find the sparse reward at the end of the maze.

We evaluate Director against two state-of-the-art algorithms that are also based on world models: Plan2Explore, which maximizes both task reward and an exploration bonus based on ensemble disagreement, and Dreamer, which simply maximizes the task reward. Both baselines learn non-hierarchical policies from imagined trajectories of the world model. We find that Plan2Explore results in noisy movements that flip the robot onto its back, preventing it from reaching the goal. Dreamer reaches the goal in the smallest maze but fails to explore the larger mazes. In these larger mazes, Director is the only method to find and reliably reach the goal.

To study the ability of agents to discover very sparse rewards in isolation and separately from the challenge of representation learning of 3D environments, we propose the Visual Pin Pad suite. In these tasks, the agent controls a black square, moving it around to step on differently colored pads. At the bottom of the screen, the history of previously activated pads is shown, removing the need for long-term memory. The task is to discover the correct sequence for activating all the pads, at which point the agent receives the sparse reward. Again, Director outperforms previous methods by a large margin.

The Visual Pin Pad benchmark allows researchers to evaluate agents under very sparse rewards and without confounding challenges such as perceiving 3D scenes or long-term memory.

In addition to solving tasks with sparse rewards, we study Director’s performance on a wide range of tasks common in the literature that typically require no long-term exploration. Our experiment includes 12 tasks that cover Atari games, Control Suite tasks, DMLab maze environments, and the research platform Crafter. We find that Director succeeds across all these tasks with the same hyperparameters, demonstrating the robustness of the hierarchy learning process. Additionally, providing the task reward to the worker enables Director to learn precise movements for the task, fully matching or exceeding the performance of the state-of-the-art Dreamer algorithm.

Director solves a wide range of standard tasks with dense rewards with the same hyperparameters, demonstrating the robustness of the hierarchy learning process.

Goal Visualizations
While Director uses latent model states as goals, the learned world model allows us to decode these goals into images for human interpretation. We visualize the internal goals of Director for multiple environments to gain insights into its decision making and find that Director learns diverse strategies for breaking down long-horizon tasks. For example, on the Walker and Humanoid tasks, the manager requests a forward leaning pose and shifting floor patterns, with the worker filling in the details of how the legs need to move. In the Egocentric Ant Maze, the manager steers the ant robot by requesting a sequence of different wall colors. In the 2D research platform Crafter, the manager requests resource collection and tools via the inventory display at the bottom of the screen, and in DMLab mazes, the manager encourages the worker via the teleport animation that occurs right after collecting the desired object.

Left: In Egocentric Ant Maze XL, the manager directs the worker through the maze by targeting walls of different colors. Right: In Visual Pin Pad Six, the manager specifies subgoals via the history display at the bottom and by highlighting different pads.
Left: In Walker, the manager requests a forward leaning pose with both feet off the ground and a shifting floor pattern, with the worker filling in the details of leg movement. Right: In the challenging Humanoid task, Director learns to stand up and walk reliably from pixels and without early episode terminations.
Left: In Crafter, the manager requests resource collection via the inventory display at the bottom of the screen. Right: In DMLab Goals Small, the manager requests the teleport animation that occurs when receiving a reward as a way to communicate the task to the worker.

Future Directions
We see Director as a step forward in HRL research and are preparing its code to be released in the future. Director is a practical, interpretable, and generally applicable algorithm that gives the research community an effective starting point for developing hierarchical artificial agents further, for example by allowing goals to correspond to only subsets of the full representation vectors, dynamically learning the duration of goals, and building hierarchical agents with three or more levels of temporal abstraction. We are optimistic that future algorithmic advances in HRL will unlock new levels of performance and autonomy in intelligent agents.


Enabling Creative Expression with Concept Activation Vectors

Advances in computer vision and natural language processing continue to unlock new ways of exploring billions of images available on public and searchable websites. Today’s visual search tools make it possible to search with your camera, voice, text, images, or multiple modalities at the same time. However, it remains difficult to input subjective concepts, such as visual tones or moods, into current systems. For this reason, we have been working collaboratively with artists, photographers, and image researchers to explore how machine learning (ML) might enable people to use expressive queries as a way of visually exploring datasets.

Today, we are introducing Mood Board Search, a new ML-powered research tool that uses mood boards as a query over image collections. This enables people to define and evoke visual concepts on their own terms. Mood Board Search can be useful for subjective queries, such as “peaceful”, or for words and individual images that may not be specific enough to produce useful results in a standard search, such as “abstract details in overlooked scenes” or “vibrant color palette that feels part memory, part dream”. We developed, and will continue to develop, this research tool in alignment with our AI Principles.

Search Using Mood Boards
With Mood Board Search, our goal is to design a flexible and approachable interface so people without ML expertise can train a computer to recognize a visual concept as they see it. The tool interface is inspired by mood boards, commonly used by people in creative fields to communicate the “feel” of an idea using collections of visual materials.

With Mood Board Search, users can train a computer to recognize visual concepts in image collections.

To get started, simply drag and drop a small number of images that represent the idea you want to convey. Mood Board Search returns the best results when the images share a consistent visual quality, so results are more likely to be relevant with mood boards that share visual similarities in color, pattern, texture, or composition.

It’s also possible to signal which images are more important to a visual concept by upweighting or downweighting images, or by adding images that are the opposite of the concept. Then, users can review and inspect search results to understand which part of an image best matches the visual concept. Focus mode does this by revealing a bounding box around part of the image, while AI crop cuts in directly, making it easier to draw attention to new compositions.

Supported interactions, like AI crop, allow users to see which part of an image best matches their visual concept.

Powered by Concept Activation Vectors (CAVs)
Mood Board Search takes advantage of pre-trained computer vision models, such as GoogLeNet and MobileNet, and a machine learning approach called Concept Activation Vectors (CAVs).

CAVs are a way for machines to represent images (what we understand) using numbers or directions in a neural net’s embedding space (which can be thought of as what machines understand). CAVs can be used as part of a technique, Testing with CAVs (TCAV), to quantify the degree to which a user-defined concept is important to a classification result; e.g., how sensitive a prediction of “zebra” is to the presence of stripes. This is a research approach we open-sourced in 2018, and the work has since been widely applied to medical applications and science to build ML applications that can provide better explanations for what machines see. You can learn more about embedding vectors in general in this Google AI blog post, and our approach to working with TCAVs in Been Kim’s Keynote at ICLR.

In Mood Board Search, we use CAVs to find a model’s sensitivity to a mood board created by the user. In other words, each mood board creates a CAV — a direction in embedding space — and the tool searches an image dataset, surfacing images that are the closest match to the CAV. However, the tool takes it one step further, by segmenting each image in the dataset in 15 different ways, to uncover as many relevant compositions as possible. This is the approach behind features like Focus mode and AI crop.
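
As an illustration of the idea (not the Mood Board Search implementation), the sketch below builds a simple stand-in for a CAV as a normalized mean-difference direction between mood-board embeddings and counter-example embeddings, then ranks a dataset by projection onto that direction:

```python
import numpy as np

def compute_cav(mood_board_embeddings, counter_example_embeddings):
    """A simple concept direction: mean mood-board embedding minus mean counter-example embedding."""
    direction = np.mean(mood_board_embeddings, axis=0) - np.mean(counter_example_embeddings, axis=0)
    return direction / np.linalg.norm(direction)

def rank_by_concept(dataset_embeddings, cav):
    """Return dataset indices sorted from best to worst match with the concept direction."""
    scores = np.asarray(dataset_embeddings) @ cav
    return np.argsort(-scores)
```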

Three artists created visual concepts to share their way of seeing, shown here in an experimental app by design invention studio, Nord Projects.

Because embedding vectors can be learned and re-used across models, tools like Mood Board Search can help us express our perspective to other people. Early collaborations with creative communities have shown value in being able to create and share subjective experiences with others, resulting in feelings of being able to “break out of visually-similar echo chambers” or “see the world through another person’s eyes”. Even misalignment between model and human understanding of a concept frequently resulted in unexpected and inspiring connections for collaborators. Taken together, these findings point towards new ways of designing collaborative ML systems that embrace personal and collective subjectivity.

Conclusions and Future Work
Today, we’re open-sourcing the code to Mood Board Search, including three visual concepts made by our collaborators, and a Mood Board Search Python Library for people to tap the power of CAVs directly into their own websites and apps. While these tools are early-stage prototypes, we believe this capability can have a wide range of applications, from exploring unorganized image collections to externalizing ways of seeing into collaborative and shareable artifacts. Already, an experimental app by design invention studio Nord Projects, made using Mood Board Search, investigates the opportunities for running CAVs in camera, in real-time. In future work, we plan to use Mood Board Search to learn about new forms of human-machine collaboration and expand ML models and inputs — like text and audio — to allow even deeper subjective discoveries, regardless of medium.

If you’re interested in a demo of this work for your team or organization, email us at cav-experiments-support@google.com.

Acknowledgments
This blog presents research by (in alphabetical order): Kira Awadalla, Been Kim, Eva Kozanecka, Alison Lentz, Alice Moloney, Emily Reif, and Oliver Siy, in collaboration with design invention studio Nord Projects. We thank our co-author, Eva Kozanecka, our artist collaborators, Alexander Etchells, Tom Hatton, Rachel Maggart, the Imaging team at The British Library for their participation in beta previews, and Blaise Agüera y Arcas, Jess Holbrook, Fernanda Viegas, and Martin Wattenberg for their support of this research project.


An update on our work in responsible innovation

Over the last year, we’ve seen artificial intelligence (AI) systems advance our work in areas like inclusive product development and support for small businesses and job seekers. We’ve also seen its potential to be helpful in addressing major global needs — like forecasting and planning humanitarian responses to natural disasters, addressing global environmental challenges, and delivering groundbreaking scientific research.

AI is exciting — both from a technical perspective and when considering its underlying social benefits. And yet, to fully realize AI’s potential, it must be developed responsibly, thoughtfully and in a way that gives deep consideration to core ethical questions. After all, the promise of great reward inherently involves risk — and we’re committed to ethically developing AI in a way that is socially beneficial.

Our AI Principles guide how we integrate AI research into Google’s products and services and engage with external partners. Internally, we implement the Principles, every day, through education programs, AI ethics reviews and technical tools. There are more than 200 Googlers across the company whose full-time roles are to operationalize responsible practices for developing AI.

We’re committed to sharing our lessons learned so others across the industry can learn, too (see our posts from 2018, 2019, 2020 and 2021, and our in-depth annual AI Principles Progress Updates).

Internal education

It’s important to craft principles, but putting them into practice requires both training and constant dialogue.

Since AI Principles training launched in late 2019, more than 32,000 employees across Google have engaged in it. Given our growing understanding of effective hybrid and remote learning, we continue to expand and modify the courses. For example, this year we adapted our popular four-part Tech Ethics self-study course to a one-part deep dive based on Googler feedback. Similarly, we launched the Responsible Innovation Challenge — taken by more than 13,000 employees — as a series of engaging online puzzles, quizzes and games to raise awareness of the AI Principles and measure employees’ retention of ethical concepts, such as avoiding unfair bias.

We also piloted a new Moral Imagination workshop, a two-day, live-video immersive set of activities for product teams to walk through the ethical implications of potential AI products. To date, 248 Googlers across 23 Google product and research teams have taken the workshop, resulting in deeper, ongoing AI ethics consultations on product development.

As we develop internal training, we’re committed to incorporating the input of both Googlers and outside experts. This year, when we launched a live workshop to educate our internal user experience and product teams on the concept of AI explainability, we first piloted the workshop with outside experts at the international Trust, Transparency and Control Labs summit in May.

We believe this approach complements programs like our internal AI Principles Ethics Fellows program, a six-month fellowship that this year involved Googlers from 17 different global offices. We also just launched a version of the fellowship program tailored for senior leaders.

Putting the Principles into practice

Our approach to responsible AI innovation starts early, before teams plan a new AI application. When a team starts to build a machine learning (ML) model, dataset or product feature, they can attend office hours with experts to ask questions and engage in analyses using responsible AI tools that Google develops, or seek adversarial proactive fairness (ProFair) testing. Pre-launch, a team then can request an AI Principles review.

AI Principles reviewers are in place to implement a structured assessment to identify, measure and analyze potential risk of harm. The risk rating focuses on the extent to which people and society may be impacted if solutions did not exist or were to fail. Reviewers also consider a growing body of lessons from thousands of previous AI Principles reviews conducted since 2019.

When reviewers find medium- to high-risk issues, such as product exclusion or a potential privacy or security concern, they work with the teams to address these issues. Reviews either result in an approval, approval with conditions or recommendations, or non-approval. New AI applications that might affect multiple product areas are escalated to the Advanced Technology Review Council — a group of senior research, product and business leaders who make the final decision.

To supplement the expertise of our internal AI Principles group members, we often incorporate trusted external advisors. For example, a team was incorporating AI to help build a near real-time dataset to enable reliable measurement of global land cover for environmental and social benefit. They submitted for AI Principles review and then collaborated with the review team to design several safeguards. The review team also worked with third-party experts at the World Resources Institute and BSR. Following the example of the European Commission’s Copernicus mission’s open data and services terms, the product team applied open data principles, making the ML model’s training and test data used to create the dataset, as well as the dataset itself, freely available under CC-BY-4.0, and the model available on Github under an Apache 2.0 license. We recently released a Codelab for developers to walk through the ethics review process and apply learnings to their own projects.

A video explaining Google's AI Principles Review process

Projects such as research methods for evaluating misinformation and datasets that need more diverse representation tend to receive conditions to proceed toward a launch. A recurring condition given to teams is to engage in ProFair testing with people from a diversity of backgrounds, often in partnership with our central Product Inclusion and Equity team. This year, the number of ProFair consultations increased by 100% year over year. A recurring approach is to create and release detailed documentation in the form of data cards and model cards for transparency and accountability. The number of AI Principles reviews with model or data card mitigations increased 68% in the last year.

As we’ve stated, we’ve embedded customized AI governance and review committees within certain product areas (like Cloud and Health). As a result, both the Health Ethics Committee and Cloud make decisions with specialized expertise, such as establishing policies for potentially winding down the Covid-19 Community Mobility Reports and the Covid-19 Forecaster, respectively, if situations arise that might cause the data quality to degrade. This year, we extended this specialized approach and created a dedicated consumer hardware AI Principles review process.

It’s important to note that product teams across Google engage in everyday responsible AI practices even if not in formal reviews. For example, YouTube is leveraging a more targeted mix of classifiers, keywords in additional languages, and information from regional analysts. This work is a result of collaboration with our researchers who focus on new tools for AI fairness. The Photos team participated in an Equitable AI Research Roundtable (EARR) with a group of external advisors on potential fairness considerations. And the Gboard team deployed a new, privacy-by-design approach to federated machine learning. These examples did not stem from AI Principles reviews, but reflect the adoption of the AI Principles across Google.

Tools and research

In early 2022, to offer easier access to our publications on responsible AI, we curated an external collection of more than 200 research papers focused on the topic. We continue to launch, refine and consolidate technical resources, including proactive tools like:

  • The Monk Skin Tone Scale, developed by Harvard University Sociology Professor Dr. Ellis Monk. The scale offers a spectrum of skin tones from all around the world for use in evaluating and addressing fairness considerations in AI.
  • The Know Your Data tool (KYD), which helps developers with tasks such as quickly identifying issues in fairness, and which has integrated the Monk Scale to help developers examine skin tone data for unfair bias.
  • The Language Interpretability Tool, or LIT, to help developers probe an ML model, now with a new method to better understand, test and debug its behaviors.
  • Counterfactual Logit Pairing, which helps ensure that a model’s prediction doesn’t change when sensitive attributes or identity terms referenced in an example are removed or replaced, now added to the TensorFlow Model Remediation Library (see the research paper for more).
  • And to help teams measure their progress against the AI Principles, we’re piloting an internal tool for assessing how ML models were developed in accordance with emerging smart practices, previous reviews, and our growing body of ethics, fairness, and human-rights work.

Many responsible AI tools developed by researchers are actively in use by product teams at Google. For example, Photos, Pixel and Image Search are leveraging the Monk Skin Tone Scale.

External engagement

Ensuring the responsible development and deployment of AI is an ongoing process. We believe it should be a collaborative one, too, so we remain deeply engaged with governments across Europe, the Middle East and Africa, Latin America, Asia Pacific, and the U.S. to advocate for AI regulation that supports innovation around the world for businesses of all sizes. We share our approach to responsible AI and recommendations, comments and responses to open requests for information. We also initiated and are leading an effort with the International Organization for Standardization (ISO/IEC PWI TS 17866) to share best practice guidance for the development of AI.

As these efforts look toward the future, responsible AI needs to be supported across industries today. So for current Google Cloud Partners and customers seeking best practices for responsible implementation and AI governance in their organizations, we added responsible AI prerequisites to the Google Cloud Partner Advantage ML Specialization, including a newly released training, “Applying AI Principles with Google Cloud.”

To help nurture the next generation of responsible AI practitioners, we launched a free introduction to AI and machine learning for K-12 students. And we continue to develop an external Responsible Innovation Fellowship program in the U.S. for students at historically Black colleges and universities.

Our approach to responsible innovation also means keeping an eye on emerging markets where AI is being developed. We launched a new AI research center in Bulgaria and expanded support for African entrepreneurs whose businesses use AI through our Startup Accelerator Africa.

The examples we’re sharing today are a sampling of our ongoing commitment to responsible innovation. They also reflect our ability to change and keep setting a high bar for trustworthy AI standards for our company. We remain dedicated to sharing helpful information on Google’s journey, as recommended practices for responsible AI continue to emerge and evolve.


MLGO: A Machine Learning Framework for Compiler Optimization

The question of how to compile faster and smaller code arose together with the birth of modern computers. Better code optimization can significantly reduce the operational cost of large datacenter applications. The size of compiled code matters the most to mobile and embedded systems or software deployed on secure boot partitions, where the compiled binary must fit in tight code size budgets. With advances in the field, the headroom has been heavily squeezed with increasingly complicated heuristics, impeding maintenance and further improvements.

Recent research has shown that machine learning (ML) can unlock more opportunities in compiler optimization by replacing complicated heuristics with ML policies. However, adopting ML in general-purpose, industry-strength compilers remains a challenge.

To address this, we introduce “MLGO: a Machine Learning Guided Compiler Optimizations Framework”, the first industrial-grade general framework for integrating ML techniques systematically in LLVM (an open-source industrial compiler infrastructure that is ubiquitous for building mission-critical, high-performance software). MLGO uses reinforcement learning (RL) to train neural networks to make decisions that can replace heuristics in LLVM. We describe two MLGO optimizations for LLVM: 1) reducing code size with inlining; and 2) improving code performance with register allocation (regalloc). Both optimizations are available in the LLVM repository, and have been deployed in production.

How Does MLGO Work? With Inlining-for-Size As a Case Study
Inlining helps reduce code size by making decisions that enable the removal of redundant code. In the example below, the caller function foo() calls the callee function bar(), which itself calls baz(). Inlining both callsites returns a simple foo() function that reduces the code size.

Inlining reduces code size by removing redundant code.

In real code, thousands of functions call each other and together comprise a call graph. During the inlining phase, the compiler traverses the call graph over all caller-callee pairs and decides whether or not to inline each pair. It is a sequential decision process, as previous inlining decisions alter the call graph, affecting later decisions and the final result. In the example above, the call graph foo() → bar() → baz() needs a “yes” decision on both edges to make the code size reduction happen.

Before MLGO, the inline / no-inline decision was made by a heuristic that, over time, became increasingly difficult to improve. MLGO substitutes the heuristic with an ML model. During the call graph traversal, the compiler seeks advice from a neural network on whether to inline a particular caller-callee pair by feeding in relevant features (i.e., inputs) from the graph, and executes the decisions sequentially until the whole call graph is traversed.

Illustration of MLGO during inlining. “#bbs”, “#users”, and “callsite height” are example caller-callee pair features.
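
To make the sequential nature of these decisions concrete, here is a schematic sketch in Python; the call graph and policy objects are hypothetical placeholders, not LLVM's actual interfaces:

```python
def inline_pass(call_graph, policy):
    """Walk caller-callee edges, asking the policy whether to inline each one."""
    worklist = list(call_graph.edges())
    decisions = []
    while worklist:
        edge = worklist.pop(0)
        features = call_graph.features(edge)     # e.g., "#bbs", "#users", "callsite height"
        if policy.should_inline(features):
            new_edges = call_graph.inline(edge)  # mutates the graph; may expose new callsites
            worklist.extend(new_edges)
            decisions.append((edge, "inline"))
        else:
            decisions.append((edge, "no-inline"))
    return decisions
```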

MLGO trains the decision network (policy) with RL using policy gradient and evolution strategies algorithms. While there is no ground truth about best decisions, online RL iterates between training and running compilation with the trained policy to collect data and improve the policy. In particular, given the current model under training, the compiler consults the model for inline / no-inline decision making during the inlining stage. After the compilation finishes, it produces a log of the sequential decision process (state, action, reward). The log is then passed to the trainer to update the model. This process repeats until we obtain a satisfactory model.

Compiler behavior during training. The compiler compiles the source code foo.cpp to an object file foo.o with a sequence of optimization passes, one of which is the inline pass.
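
A rough sketch of this training loop, with hypothetical compiler and trainer interfaces, might look like the following:

```python
def train_policy(modules, policy, trainer, compile_with_policy, iterations=100):
    """Alternate between compiling with the current policy and updating it from the logged decisions."""
    for _ in range(iterations):
        logs = []
        for module in modules:
            # The compiler consults `policy` at each inlining decision and records the
            # (state, action, reward) trajectory, e.g., reward = negative native code size.
            logs.append(compile_with_policy(module, policy))
        policy = trainer.update(policy, logs)  # policy-gradient / evolution-strategies step
    return policy
```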

The trained policy is then embedded into the compiler to provide inline / no-inline decisions during compilation. Unlike the training scenario, the policy does not produce a log. The TensorFlow model is embedded with XLA AOT, which converts the model into executable code. This avoids TensorFlow runtime dependency and overhead, minimizing the extra time and memory cost introduced by ML model inference at compilation time.

Compiler behavior in production.

We trained the inlining-for-size policy on a large internal software package containing 30k modules. The trained policy is generalizable when applied to compile other software and achieves a 3% ~ 7% size reduction. In addition to the generalizability across software, generalizability across time is also important — both the software and compiler are under active development so the trained policy needs to retain good performance for a reasonable time. We evaluated the model’s performance on the same set of software three months later and found only slight degradation.

Inlining-for-size policy size reduction percentages. The x-axis presents different software and the y-axis represents the percentage size reduction. “Training” is the software on which the model was trained and “Infra[1|2|3]” are different internal software packages.

The MLGO inlining-for-size training has been deployed on Fuchsia — a general purpose open source operating system designed to power a diverse ecosystem of hardware and software, where binary size is critical. Here, MLGO showed a 6.3% size reduction for C++ translation units.

Register-Allocation (for performance)
As a general framework, we used MLGO to improve the register allocation pass, which improves the code performance in LLVM. Register Allocation solves the problem of assigning physical registers to live ranges (i.e., variables).

As the code executes, different live ranges are completed at different times, freeing up registers for use by subsequent processing stages. In the example below, each “add” and “multiply” instruction requires all operands and the result to be in physical registers. The live range x is allocated to the green register and is completed before either of the live ranges in the blue or yellow registers. After x is completed, the green register becomes available and is assigned to live range t.

Register allocation example.

When it’s time to allocate live range q, there are no available registers, so the register allocation pass must decide which (if any) live range can be “evicted” from its register to make room for q. This is referred to as the “live range eviction” problem, and is the decision for which we train the model to replace original heuristics. In this particular example, it evicts z from the yellow register, and assigns it to q and the first half of z.

We now consider the unassigned second half of live range z. We have a conflict again, and this time the live range t is evicted and split, and the first half of t and the final part of z end up using the green register. The middle part of z corresponds to the instruction q = t * y, where z is not being used, so it is not assigned to any register; instead, its value is stored to the stack from the yellow register and later reloaded into the green register. The same happens to t. This adds extra load/store instructions to the code and degrades performance. The goal of the register allocation algorithm is to reduce such inefficiencies as much as possible. This is used as the reward to guide RL policy training.

Similar to the inlining-for-size policy, the register allocation (regalloc-for-performance) policy is trained on a large Google internal software package, and is generalizable across different software, with 0.3% ~ 1.5% improvements in queries per second (QPS) on a set of internal large-scale datacenter applications. The QPS improvement has persisted for months after its deployment, showing the model’s generalizability across the time horizon.

Conclusion and Future Work
We propose MLGO, a framework for integrating ML techniques systematically in an industrial compiler, LLVM. MLGO is a general framework that can be expanded to be: 1) deeper, e.g., adding more features, and applying better RL algorithms; and 2) broader, by applying it to more optimization heuristics beyond inlining and regalloc. We are enthusiastic about the possibilities MLGO can bring to the compiler optimization domain and look forward to its further adoption and to future contributions from the research community.

Try it Yourself
Check out the open-sourced end-to-end data collection and training solution on GitHub and a demo that uses policy gradient to train an inlining-for-size policy.

Acknowledgements
We’d like to thank MLGO’s contributors and collaborators Eugene Brevdo, Jacob Hegna, Gaurav Jain, David Li, Zinan Lin, Kshiteej Mahajan, Jack Morris, Girish Mururu, Jin Xin Ng, Robert Ormandi, Easwaran Raman, Ondrej Sykora, Maruf Zaber, Weiye Zhao. We would also like to thank Petr Hosek, Yuqian Li, Roland McGrath, Haowei Wu for trusting us and deploying MLGO in Fuchsia as MLGO’s very first customer; thank David Blaikie, Eric Christopher, Brooks Moses, Jordan Rupprecht for helping to deploy MLGO in Google internal large-scale datacenter applications; and thank Ed Chi, Tipp Moseley for their leadership support.


Identifying Disfluencies in Natural Speech

People don’t write in the same way that they speak. Written language is controlled and deliberate, whereas transcripts of spontaneous speech (like interviews) are hard to read because speech is disorganized and less fluent. One aspect that makes speech transcripts particularly difficult to read is disfluency, which includes self-corrections, repetitions, and filled pauses (e.g., words like “umm”, and “you know”). Following is an example of a spoken sentence with disfluencies from the LDC CALLHOME corpus:

But that’s it’s not, it’s not, it’s, uh, it’s a word play on what you just said.

It takes some time to understand this sentence — the listener must filter out the extraneous words and resolve all of the nots. Removing the disfluencies makes the sentence much easier to read and understand:

But it’s a word play on what you just said.

While people generally don’t even notice disfluencies in day-to-day conversation, early foundational work in computational linguistics demonstrated how common they are. In 1994, using the Switchboard corpus, Elizabeth Shriberg demonstrated that there is a 50% probability for a sentence of 10–13 words to include a disfluency and that the probability increases with sentence length.

The proportion of sentences from the Switchboard dataset with at least one disfluency plotted against sentence length measured in non-disfluent (i.e., fluent) tokens in the sentence. The longer a sentence gets, the more likely it is to contain a disfluency.

In “Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection”, we present research findings on how to “clean up” transcripts of spoken text. We create more readable transcripts and captions of human speech by finding and removing disfluencies in people’s speech. Using labeled data, we created machine learning (ML) algorithms that identify disfluencies in human speech. Once those are identified we can remove the extra words to make transcripts more readable. This also improves the performance of natural language processing (NLP) algorithms that work on transcripts of human speech. Our work puts special priority on ensuring that these models are able to run on mobile devices so that we can protect user privacy and preserve performance in scenarios with low connectivity.

Base Model Overview
At the core of our base model is a pre-trained BERTBASE encoder with 108.9 million parameters. We use the standard per-token classifier configuration, with a binary classification head being fed by the sequence encodings for each token.

Illustration of how tokens in text become numerical embeddings, which then lead to output labels.
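
As an illustration, this per-token classifier configuration can be sketched in PyTorch, with the open-source bert-base checkpoint standing in for the encoder:

```python
import torch
from torch import nn
from transformers import BertModel

class DisfluencyTagger(nn.Module):
    """BERT encoder with a binary classification head applied to every token encoding."""

    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = BertModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)  # disfluent vs. fluent

    def forward(self, input_ids, attention_mask):
        encodings = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.head(encodings.last_hidden_state)  # one pair of logits per token
```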

We refined the BERT encoder by continuing the pretraining on the comments from the Pushshift Reddit dataset from 2019. Reddit comments are not speech data, but are more informal and conversational than the wiki and book data. This trains the encoder to better understand informal language, but may run the risk of internalizing some of the biases inherent in the data. For our particular use case, however, the model only captures the syntax or overall form of the text, not its content, which avoids potential issues related to semantic-level biases in the data.

We fine-tune our model for disfluency classification on hand-labeled corpora, such as the Switchboard corpus mentioned above. Hyperparameters (batch size, learning rate, number of training epochs, etc.) were optimized using Vizier.

We also produce a range of “small” models for use on mobile devices using a knowledge distillation technique known as “self training”. Our best small model is based on the Small-vocab BERT variant with 3.1 million parameters. This smaller model achieves comparable results to our baseline at 1% the size (in MiB). You can read more about how we achieved this model miniaturization in our 2021 Interspeech paper.
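
Roughly, the self-training recipe can be pictured as below; the teacher, student, and fine_tune helpers are hypothetical placeholders rather than our actual pipeline:

```python
def self_train(teacher, student, unlabeled_sentences, fine_tune):
    """Distill the full-size teacher into a small student via pseudo-labels on unlabeled text."""
    pseudo_labeled = [(sentence, teacher.predict_token_labels(sentence))
                      for sentence in unlabeled_sentences]
    return fine_tune(student, pseudo_labeled)  # ordinary supervised fine-tuning on the pseudo-labels
```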

Streaming
Some of the latest use cases for automatic speech transcription include automated live captioning, such as produced by the Android “Live Captions” feature, which automatically transcribes spoken language in audio being played on the device. For disfluency removal to be of use in improving the readability of the captions in this setting, it must happen quickly and in a stable manner. That is, the model should not change its past predictions as it sees new words in the transcript.

We call this live token-by-token processing streaming. Accurate streaming is difficult because of temporal dependencies; most disfluencies are only recognizable later. For example, a repetition does not actually become a repetition until the second time the word or phrase is said.

To investigate whether our disfluency detection model is effective in streaming applications, we split the utterances in our training set into prefix segments, where only the first N tokens of the utterance were provided at training time, for all values of N up to the full length of the utterance. We evaluated the model simulating a stream of spoken text by feeding prefixes to the models and measuring the performance with several metrics that capture model accuracy, stability, and latency including streaming F1, time to detection (TTD), edit overhead (EO), and average wait time (AWT). We experimented with look-ahead windows of either one or two tokens, allowing the model to “peek” ahead at additional tokens for which the model is not required to produce a prediction. In essence, we’re asking the model to “wait” for one or two more tokens of evidence before making a decision.
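
For instance, the prefix segments themselves can be generated as in the short sketch below (illustrative only; the streaming metrics are defined in the paper):

```python
def prefix_segments(tokens):
    """All prefixes of an utterance, simulating tokens arriving one at a time."""
    return [tokens[:n] for n in range(1, len(tokens) + 1)]

# Example: prefix_segments(["it's", "uh", "a", "word", "play"]) yields 5 growing prefixes.
```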

While adding this fixed look-ahead did improve the stability and streaming F1 scores in many contexts, we found that in some cases the label was already clear even without looking ahead to the next token and the model did not necessarily benefit from waiting. Other times, waiting for just one extra token was sufficient. We hypothesized that the model itself could learn when it should wait for more context. Our solution was a modified model architecture that includes a “wait” classification head that decides when the model has seen enough evidence to trust the disfluency classification head.

Diagram showing how the model labels input tokens as they arrive. The BERT embedding layers feed into two separate classification heads, which are combined for the output.

We constructed a training loss function that is a weighted sum of three factors:

  1. The traditional cross-entropy loss for the disfluency classification head
  2. A cross-entropy term that only considers up to the first token with a “wait” classification
  3. A latency penalty that discourages the model from waiting too long to make a prediction
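
A sketch of this combined objective is shown below; the weights, the masking of the second term, and the form of the latency penalty are illustrative choices rather than the published formulation:

```python
import torch
import torch.nn.functional as F

def streaming_loss(disfl_logits, wait_logits, disfl_labels, wait_labels,
                   w_disfl=1.0, w_wait=1.0, w_latency=0.1):
    """Weighted sum of the three terms. Logits are (batch, time, 2); labels are (batch, time)."""
    # 1. Standard cross-entropy for the disfluency classification head over all tokens.
    disfl_ce = F.cross_entropy(disfl_logits.transpose(1, 2), disfl_labels)

    # 2. A cross-entropy term counted only up to the first token labeled "wait"
    #    (applied here to the wait head; assumes each sequence has at least one wait label).
    first_wait = (wait_labels == 1).float().argmax(dim=1)
    positions = torch.arange(wait_labels.size(1), device=wait_labels.device)
    mask = (positions.unsqueeze(0) <= first_wait.unsqueeze(1)).float()
    wait_ce_all = F.cross_entropy(wait_logits.transpose(1, 2), wait_labels, reduction="none")
    wait_ce = (wait_ce_all * mask).sum() / mask.sum()

    # 3. Latency penalty: the average probability of choosing "wait", discouraging long waits.
    latency = torch.softmax(wait_logits, dim=-1)[..., 1].mean()

    return w_disfl * disfl_ce + w_wait * wait_ce + w_latency * latency
```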

We evaluated this streaming model as well as the standard baseline with no look-ahead and with both 1- and 2-token look-ahead values:

Graph of the streaming F1 score versus the average wait time in tokens. Three data points indicate F1 scores above 0.82 across multiple wait times. The proposed streaming model achieves near top performance with much shorter wait times than the fixed look ahead models.

The streaming model achieved a better streaming F1 score than both a standard baseline with no look ahead and a model with a look ahead of 1. It performed nearly as well as the variant with fixed look ahead of 2, but with much less waiting. On average the model waited for only 0.21 tokens of context.

Internationalization
Our best outcomes so far have been with English transcripts. This is mostly due to resourcing issues: while there are a number of relatively large labeled conversational datasets that include disfluencies in English, other languages often have very few such datasets available. So, in order to make disfluency detection models available outside English, a method is needed to build models in a way that does not require finding and labeling hundreds of thousands of utterances in each target language. A promising solution is to leverage multi-language versions of BERT to transfer what a model has learned about English disfluencies to other languages in order to achieve similar performance with much less data. This is an area of active research, but we do have some promising results to outline here.

As a first effort to validate this approach, we added labels to about 10,000 lines of dialogue from the German CALLHOME dataset. We then started with the Geotrend English and German Bilingual BERT model (extracted from Multilingual BERT) and fine-tuned it with approximately 77,000 disfluency-labeled English Switchboard examples and 1.3 million examples of self-labeled transcripts from the Fisher Corpus. Then, we did further fine tuning with about 7,500 in-house–labeled examples from the German CALLHOME dataset.

Diagram illustrating the flow of labeled data and self-trained output in our best multilingual training setup. By training on both English and German data we are able to improve performance via transfer learning.

Our results indicate that fine-tuning on a large English corpus can produce acceptable precision using zero-shot transfer to similar languages like German, but at least a modest amount of German labels were needed to improve recall from less than 60% to greater than 80%. Two-stage fine-tuning of an English-German bilingual model produced the highest precision and overall F1 score.

Approach Precision Recall F1
German BERTBASE model fine-tuned on 7,300 human-labeled German CALLHOME examples 89.1% 81.3% 85.0
Same as above but with additional 7,500 self-labeled German CALLHOME examples 91.5% 83.3% 87.2
English/German Bilingual BERTBASE model fine-tuned on English Switchboard+Fisher, evaluated on German CALLHOME (zero-shot language transfer) 87.2% 59.1% 70.4
Same as above but subsequently fine-tuned with 14,800 German CALLHOME (human- and self-labeled) examples 95.5% 82.6% 88.6

Conclusion
Cleaning up disfluencies from transcripts can improve not just their readability for people, but also the performance of other models that consume transcripts. We demonstrate effective methods for identifying disfluencies and expand our disfluency model to resource-constrained environments, new languages, and more interactive use cases.

Acknowledgements
Thank you to Vicky Zayats, Johann Rocholl, Angelica Chen, Noah Murad, Dirk Padfield, and Preeti Mohan for writing the code, running the experiments, and composing the papers discussed here. We also thank our technical product manager Aaron Schneider, Bobby Tran from the Cerebra Data Ops team, and Chetan Gupta from Speech Data Ops for their support obtaining additional data labels.


Minerva: Solving Quantitative Reasoning Problems with Language Models

Language models have demonstrated remarkable performance on a variety of natural language tasks — indeed, a general lesson from many works, including BERT, GPT-3, Gopher, and PaLM, has been that neural networks trained on diverse data at large scale in an unsupervised way can perform well on a variety of tasks.

Quantitative reasoning is one area in which language models still fall far short of human-level performance. Solving mathematical and scientific questions requires a combination of skills, including correctly parsing a question with natural language and mathematical notation, recalling relevant formulas and constants, and generating step-by-step solutions involving numerical calculations and symbolic manipulation. Due to these challenges, it is often believed that solving quantitative reasoning problems using machine learning will require significant advancements in model architecture and training techniques, granting models access to external tools such as Python interpreters, or possibly a more profound paradigm shift.

In “Solving Quantitative Reasoning Problems With Language Models” (to be released soon on the arXiv), we present Minerva, a language model capable of solving mathematical and scientific questions using step-by-step reasoning. We show that by focusing on collecting training data that is relevant for quantitative reasoning problems, training models at scale, and employing best-in-class inference techniques, we achieve significant performance gains on a variety of difficult quantitative reasoning tasks. Minerva solves such problems by generating solutions that include numerical calculations and symbolic manipulation without relying on external tools such as a calculator. The model parses and answers mathematical questions using a mix of natural language and mathematical notation. Minerva combines several techniques, including few-shot prompting, chain of thought or scratchpad prompting, and majority voting, to achieve state-of-the-art performance on STEM reasoning tasks. You can explore Minerva’s output with our interactive sample explorer!

Solving a multi-step problem: A question from the MATH dataset and Minerva’s solution. The model writes down a line equation, simplifies it, substitutes a variable, and solves for y.

A Model Built for Multi-step Quantitative Reasoning
To promote quantitative reasoning, Minerva builds on the Pathways Language Model (PaLM), with further training on a 118GB dataset of scientific papers from the arXiv preprint server and web pages that contain mathematical expressions using LaTeX, MathJax, or other mathematical typesetting formats. Standard text cleaning procedures often remove symbols and formatting that are essential to the semantic meaning of mathematical expressions. By maintaining this information in the training data, the model learns to converse using standard mathematical notation.

Example questions from the Joint Entrance Examination Main Math 2020 exam taken each year by almost 2M Indian high-school students who intend to study engineering and similar fields (left), and the National Math Exam in Poland (May 2022) taken by approximately 270K high-school students every year (right).
A dataset for quantitative reasoning: Careful data processing preserves mathematical information, allowing the model to learn mathematics at a higher level.

Minerva also incorporates recent prompting and evaluation techniques to better solve mathematical questions. These include chain of thought or scratchpad prompting — where Minerva is prompted with several step-by-step solutions to existing questions before being presented with a new question — and majority voting. Like most language models, Minerva assigns probabilities to different possible outputs. When answering a question, rather than taking the single solution Minerva scores as most likely, multiple solutions are generated by sampling stochastically from all possible outputs. These solutions are different (e.g., the steps are not identical), but often arrive at the same final answer. Minerva uses majority voting on these sampled solutions, taking the most common result as the conclusive final answer.

Majority voting: Minerva generates multiple solutions to each question and chooses the most common answer as the solution, improving performance significantly.
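
The voting step itself is simple; here is a minimal sketch, assuming a separate helper that extracts the final answer from each generated solution:

```python
from collections import Counter

def majority_vote(sampled_solutions, extract_final_answer):
    """Pick the most common final answer among independently sampled solutions."""
    answers = [extract_final_answer(solution) for solution in sampled_solutions]
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```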

Evaluation on STEM Benchmarks
To test Minerva’s quantitative reasoning abilities, we evaluated the model on STEM benchmarks ranging in difficulty from grade school level problems to graduate level coursework.

  • MATH: High school math competition level problems
  • MMLU-STEM: A subset of the Massive Multitask Language Understanding benchmark focused on STEM, covering topics such as engineering, chemistry, math, and physics at high school and college level.
  • GSM8k: Grade school level math problems involving basic arithmetic operations that should all be solvable by a talented middle school student.

We also evaluated Minerva on OCWCourses, a collection of college and graduate level problems covering a variety of STEM topics such as solid state chemistry, astronomy, differential equations, and special relativity that we collected from MIT OpenCourseWare.

In all cases, Minerva obtains state-of-the-art results, sometimes by a wide margin.

Evaluation results on MATH and MMLU-STEM, which include high school and college level questions covering a range of STEM topics.
Model                          MATH    MMLU-STEM    OCWCourses    GSM8k
Minerva                        50.3%   75%          30.8%         78.5%
Published state of the art     6.9%    55%          –             74.4%
Minerva 540B significantly improves state-of-the-art performance on STEM evaluation datasets.

What Minerva Gets Wrong
Minerva still makes its fair share of mistakes. To better identify areas where the model can be improved, we analyzed a sample of questions the model gets wrong, and found that most mistakes are easily interpretable. About half are calculation mistakes, and the other half are reasoning errors, where the solution steps do not follow a logical chain of thought.

It is also possible for the model to arrive at a correct final answer but with faulty reasoning. We call such cases “false positives”, as they erroneously count toward a model’s overall performance score. In our analysis, we find that the rate of false positives is relatively low (Minerva 62B produces less than 8% false positives on MATH).

Below are a couple of example mistakes the model makes.

Calculation mistake: The model incorrectly cancels the square root on both sides of the equation.
Reasoning mistake: The model computes the number of free throws at the fourth practice, but then uses this number as the final answer for the first practice.

Limitations
Our approach to quantitative reasoning is not grounded in formal mathematics. Minerva parses questions and generates answers using a mix of natural language and LaTeX mathematical expressions, with no explicit underlying mathematical structure. This approach has an important limitation, in that the model’s answers cannot be automatically verified. Even when the final answer is known and can be verified, the model can arrive at a correct final answer using incorrect reasoning steps, which cannot be automatically detected. This limitation is not present in formal methods for theorem proving (e.g., see Coq, Isabelle, HOL, Lean, Metamath, and Mizar). On the other hand, an advantage of the informal approach is that it can be applied to a highly diverse set of problems which may not lend themselves to formalization.

Future Directions
While machine learning models have become impressive tools in many scientific disciplines, they are often narrowly scoped to solve specific tasks. We hope that general models capable of solving quantitative reasoning problems will help push the frontiers of science and education. Models capable of quantitative reasoning have many potential applications, including serving as useful aids for researchers, and enabling new learning opportunities for students. We present Minerva as a small step in this direction. To see more samples from Minerva, such as the one below, please visit the interactive sample explorer!

Solving a problem using calculus and trigonometry: A question from the MATH dataset asking for the speed of a particle in circular motion. Minerva finds a correct step-by-step solution. In the process, Minerva computes a time derivative and applies a trigonometric identity.

Acknowledgements
Minerva was a collaborative effort that spanned multiple teams in Google Research. We would like to thank our coauthors Aitor Lewkowycz, Ambrose Slone, Anders Andreassen, Behnam Neyshabur, Cem Anil, David Dohan, Henryk Michalewski, Imanol Schlag, Theo Gutman-Solo, Vedant Misra, Vinay Ramasesh, and Yuhuai Wu, as well as our collaborators Erik Zelikman and Yasaman Razeghi. Minerva builds upon the work of many others at Google, and we would like to thank the PaLM team, the T5X team, the Flaxformer team, and the JAX team for their efforts. We thank Tom Small for designing the animation in this post. We would also like to especially thank Vedant Misra for developing the Minerva sample explorer.

Mahima Pushkarna is making data easier to understand

Five years ago, information designer Mahima Pushkarna joined Google to make data easier to understand. As a senior interaction designer on the People + AI Research (PAIR) team, she designed Data Cards to help everyone better understand the contexts of the data they are using. The Data Cards Playbook puts Google’s AI Principles into practice by providing opportunities for feedback, relevant explanations and appeal.

Recently, Mahima’s paper on Data Cards (co-written with Googlers Andrew Zaldivar and Oddur Kjartansson) was accepted to the ACM Conference on Fairness, Accountability and Transparency (ACM FAccT). Let’s catch up with her and find out more about what brought her to Google.

How did your background lead you to the work you’re doing now?

I’ve always been fascinated by conjuring up solutions to things. The kind of questions that I’ve found meaningful are those that are never truly solved, or never have one correct answer. (The kind of questions that exasperate us!) Those have been the problems I am always drawn towards.

Early in my career, I realized the power in visualizing data, but spreadsheets were intimidating. I wondered how design could make communicating complexity easier. So I found myself in grad school in Boston studying information design and data visualization. I focused on how people experience data and how it mediates our relationships to each other and our contexts.

I joined Google Brain as the first visual designer in a full-time capacity, though I had no background in artificial intelligence or machine learning — this was the deep end of the pool. This opened up the space to explore human-AI interaction, and make AI more accessible to a broader class of developers. At PAIR, my work focuses on making information experiences more meaningful for developers, researchers and others who build AI technologies.

What’s it like to have a unique background as a designer on a technical AI research team?

When you’re an engineer and immersed in building technology, it’s easy to assume everyone has a similar experience to your own — especially when you’re surrounded by peers who share your expertise. The actual user experience is very personal and varies drastically across users and contexts. That particular clarity is what designers bring to the table.

I’ve been able to engage my engineering and research colleagues with simple, people-centered questions right in the very beginning. How are people using an AI tool? What are they learning from it? Who else might be involved in the conversation? Do they have the proficiency we assume they have?

How did you begin designing Data Cards?

This project started when I was working on another visualization toolkit, Facets, to communicate the skews and imbalances within datasets to help machine learning practitioners make informed decisions. At the time, transparency was a moving target. Andrew, Tulsee Doshi and I started to proactively think about fairness in data, and saw a huge gap in the documentation of human decisions that dot a dataset’s lifecycle.

This “invisible” information shapes how we use data and the outcomes of models trained on them. For example, a model trained on a dataset that captures age in just two or three buckets will have very different outcomes compared to a dataset with ten buckets. The goal of Data Cards is to make both visible and invisible information about datasets available and simple to understand, so people from a variety of backgrounds can knowledgeably make decisions.

As we cover in our FAccT paper, Andrew and Oddur and I arrived at two insights. The first is that identifying what we don’t know about data is just as important as articulating what we do know. In capturing these nuances, it is possible to narrow those knowledge gaps before even collecting data. The second thing that surprised us was the sheer number of people involved in a dataset’s life cycle, and how fragile knowledge is. Context is easily lost in translation both between and within teams, across documents, emails, people and time.

Data Cards stand on the shoulders of giants, like Data Sheets (Gebru et al.) and Model Cards (Mitchell et al.). We’ve been immensely lucky to have had the support of many original authors of these seminal papers that have paved our path to FAccT.

How do you hope the paper is used across the tech industry?

Imagine a world in which finding verifiable information about the motivations of a dataset’s creators or performance of a model is as easy as learning about the ethical beliefs of a celebrity or the rating of a movie. Our vision for Data Cards is that they become a cultural mainstay — invisible, but their absence would be missed by ML practitioners.

In this paper, we introduce frameworks that other teams can use in their work. Alongside that, we’ve open-sourced the Data Cards Playbook, so we’re trying to lower the barrier to access in every way possible.

Reducing gender-based harms in AI with Sunipa Dev

Natural language processing (NLP) is a form of artificial intelligence that teaches computer programs how to take in, interpret, and produce language from large data sets. For example, grammar checkers use NLP to come up with grammar suggestions that help people write grammatically correct phrases. But as Google’s AI Principles note, it’s sometimes necessary to have human intervention to identify risks of unfair bias.

Sunipa Dev is a research scientist at Google who focuses on Responsible AI. Some of her work focuses specifically on ways to evaluate unfair bias in NLP outcomes, reducing harms for people with queer and non-binary identities. Sunipa’s work was recently featured at a workshop at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) in Seoul, Korea.

In our interview, she emphasizes that her work is achievable only through forging collaborative partnerships between researchers, engineers, and AI practitioners with everyday users and communities.

What inspired you to take on this career path?

While working on my PhD at the University of Utah, I explored research questions such as, “How do we evaluate NLP tech if they contain biases?” As language models evolved, our questions about potential harms did, too. During my postdoc work at UCLA, we ran a study to evaluate challenges in various language models by surveying respondents who identified as non-binary and had some experience with AI. With a focus on gender bias, our respondents helped us understand that experiences with language technologies cannot be understood in isolation. Rather, we must consider how these technologies intersect with systemic discrimination, erasure, and marginalization. For example, the harm of misgendering by a language technology can be compounded for trans, non-binary, and gender-diverse individuals who are already fighting against society to defend their identities. And when it’s in your personal space, like on your devices while emailing or texting, these small jabs can build up to larger psychological damage.

What is your current role at Google?

I am currently a Research Scientist on the Responsible AI – Human Centered Technology team. In my current role, I am working to build a better understanding of how to avoid unfair bias in AI language models across different cultures and geographies, aligned with Google’s AI Principles.

This is a challenge because language changes, and so do cultures and regional laws as we move from one place to another. This can all impact how people express themselves, what identities they choose and how they experience discrimination on a daily basis. Gender bias can manifest in entirely different ways in different parts of the world. In some of my ongoing work that focuses on a non-Western point of view, we are working with social scientists and NGOs in India while engaging with local communities. We are using the voices of many people who are living in a specific region and asking, “What are the biases prevalent in their society?”

What is gender bias in NLP?

Written text and training data for language technologies can lack representation or misrepresent different gender identities; this can reflect social biases. As a result, some NLP technologies can reinforce gender stereotypes and slurs, erase people’s gender identities, or have reduced quality of service for marginalized communities. What drives me in my work is my goal to make language technologies more inclusive and usable.

Why does this matter for AI?

Gender can be such an integral part of someone’s identity, and having that wrongly assumed by an AI system can be triggering, unfair, and harmful. We need to work towards systems and societies that do not encode unfair biases and harmful stereotypes in order to break out of the cycle of perpetuating harms of stereotyping, misgendering, and erasure.

How can people who are not researchers, engineers or AI practitioners engage in this work?

A very direct way is for people to report potential harms as bugs within products they use. People can also participate in open discussions in workshops, panels and town halls. These are all helpful ways to build inclusive AI.

I want to emphasize, however, that the onus can’t only be on the user. It’s also on the side of the researcher, engineer and AI practitioner. The goal is to create a continuous feedback loop between humans and machines, with real people stepping in to ensure the creation of more responsible AI. As AI practitioners, we need to work with the people we’re trying to serve and have users collaborate with us to tell us what we need to do better.

Quantum Advantage in Learning from Experiments

In efforts to learn about the quantum world, scientists face a big obstacle: their classical experience of the world. Whenever a quantum system is measured, the act of this measurement destroys the “quantumness” of the state. For example, if the quantum state is in a superposition of two locations, where it can seem to be in two places at the same time, once it is measured, it will randomly appear either “here” or “there”, but not both. We only ever see the classical shadows cast by this strange quantum world.

A growing number of experiments are implementing machine learning (ML) algorithms to aid in analyzing data, but these have the same limitations as the people they aim to help: They can’t directly access and learn from quantum information. But what if there were a quantum machine learning algorithm that could directly interact with this quantum data?

In “Quantum Advantage in Learning from Experiments”, a collaboration with researchers at Caltech, Harvard, Berkeley, and Microsoft published in Science, we show that a quantum learning agent can perform exponentially better than a classical learning agent at many tasks. Using Google’s quantum computer, Sycamore, we demonstrate the tremendous advantage that a quantum machine learning (QML) algorithm has over the best possible classical algorithm. Unlike previous quantum advantage demonstrations, no advances in classical computing power could overcome this gap. This is the first demonstration of a provable exponential advantage in learning about quantum systems that is robust even on today’s noisy hardware.

Quantum Speedup
QML combines the best of both quantum computing and the lesser-known field of quantum sensing.

Quantum computers will likely offer exponential improvements over classical systems for certain problems, but to realize their potential, researchers first need to scale up the number of qubits and to improve quantum error correction. What’s more, the exponential speed-up over classical algorithms promised by quantum computers relies on a big, unproven assumption about so-called “complexity classes” of problems — namely, that the class of problems that can be solved on a quantum computer is larger than the class that can be solved on a classical computer. It seems like a reasonable assumption, and yet, no one has proven it. Until it’s proven, every claim of quantum advantage will come with an asterisk: that it can do better than any known classical algorithm.

Quantum sensors, on the other hand, are already being used for some high-precision measurements and offer modest (and proven) advantages over classical sensors. Some quantum sensors work by exploiting quantum correlations between particles to extract more information about a system than would otherwise be possible. For example, scientists can use a collection of N atoms to measure aspects of the atoms’ environment, like the surrounding magnetic fields. Typically, the sensitivity with which the atoms can measure the field scales with the square root of N. But if one uses quantum entanglement to create a complex web of correlations between the atoms, the scaling improves to be proportional to N. As with most quantum sensing protocols, though, this quadratic speed-up over classical sensors is the best one can ever do.
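Stated a bit more explicitly (this is the standard quantum-metrology result the paragraph alludes to, not anything specific to this work), the uncertainty of the estimated field after probing with N atoms scales as:

```latex
% Standard quantum limit (unentangled atoms) vs. entanglement-enhanced scaling
\Delta B_{\text{unentangled}} \;\propto\; \frac{1}{\sqrt{N}},
\qquad
\Delta B_{\text{entangled}} \;\propto\; \frac{1}{N}
```

so the entangled probe’s sensitivity improves linearly in N rather than with its square root.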

Enter QML, a technology that straddles the line between quantum computers and quantum sensors. QML algorithms make computations that are aided by quantum data. Instead of measuring the quantum state, a quantum computer can store quantum data and implement a QML algorithm to process the data without collapsing it. And when this data is limited, a QML algorithm can squeeze exponentially more information out of each piece it receives when considering particular tasks.

Comparison of a classical machine learning algorithm and a quantum machine learning algorithm. The classical machine learning algorithm measures a quantum system, then performs classical computations on the classical data it acquires to learn about the system. The quantum machine learning algorithm, on the other hand, interacts with the quantum states produced by the system, giving it a quantum advantage over the classical approach.

To see how a QML algorithm works, it’s useful to contrast it with a standard quantum experiment. If a scientist wants to learn about a quantum system, they might send in a quantum probe, such as an atom or other quantum object whose state is sensitive to the system of interest, let it interact with the system, and then measure the probe. They can then design new experiments or make predictions based on the outcome of the measurements. A classical machine learning (CML) algorithm can automate this process with an ML model, but the operating principle is the same — it’s a classical device processing classical information.

A QML algorithm instead uses an artificial “quantum learner.” After the quantum learner sends in a probe to interact with the system, it can choose to store the quantum state rather than measure it. Herein lies the power of QML. It can collect multiple copies of these quantum probes, then entangle them to learn more about the system faster.

Suppose, for example, the system of interest produces a quantum superposition state probabilistically by sampling from some distribution of possible states. Each state is composed of n quantum bits, or qubits, where each is a superposition of “0” and “1” — all learners are allowed to know the generic form of the state, but must learn its details.

In a standard experiment, where only classical data is accessible, every measurement provides a snapshot of the distribution of quantum states, but since it’s only a sample, it is necessary to measure many copies of the state to reconstruct it. In fact, it will take on the order of 2^n copies.

A QML agent is more clever. By saving a copy of the n-qubit state, then entangling it with the next copy that comes along, it can learn about the global quantum state more quickly, giving a better idea of what the state looks like sooner.
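As a toy illustration of the idea (a minimal numpy sketch, not the Sycamore experiment or the paper’s algorithm), the snippet below prepares two copies of the same single-qubit state, measures them jointly in the Bell basis, and reads the overlap between the copies off the singlet-outcome probability — information that a single-copy measurement only reveals statistically over many repetitions:

```python
# Minimal numpy sketch: a Bell-basis measurement on two stored copies of a state.
import numpy as np

rng = np.random.default_rng(0)

def random_qubit_state():
    """A random single-qubit pure state |psi> as a length-2 complex vector."""
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)

# Columns are the Bell states |Phi+>, |Phi->, |Psi+>, |Psi-> written in the
# computational basis |00>, |01>, |10>, |11>.
bell_basis = np.array([
    [1,  1,  0,  0],
    [0,  0,  1,  1],
    [0,  0,  1, -1],
    [1, -1,  0,  0],
], dtype=complex) / np.sqrt(2)

psi = random_qubit_state()
joint = np.kron(psi, psi)                    # two identical copies, stored jointly
amplitudes = bell_basis.conj().T @ joint     # project the pair onto the Bell basis
probs = np.abs(amplitudes) ** 2

# The antisymmetric singlet |Psi-> occurs with probability (1 - Tr(rho1 rho2)) / 2,
# which is exactly 0 for two identical pure copies.
p_singlet = probs[3]
print("Bell outcome probabilities:", np.round(probs, 3))
print("Estimated overlap Tr(rho1 rho2):", round(float(1 - 2 * p_singlet), 3))
```

The point of the sketch is only that entangling the two copies before measuring exposes a quantity (the overlap) in a single joint measurement setting, rather than having to be pieced together from many independent single-copy measurements.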

Basic schematic of the QML algorithm. Two copies of a quantum state are saved, then a “Bell measurement” is performed, where each pair is entangled and their correlations measured.


The classical reconstruction is like trying to find an image hiding in a sea of noisy pixels — it could take a very long time to average-out all the noise to know what the image is representing. The quantum reconstruction, on the other hand, uses quantum mechanics to isolate the true image faster by looking for correlations between two different images at once.

Results
To better understand the power of QML, we first looked at three different learning tasks and theoretically proved that in each case, the quantum learning agent would do exponentially better than the classical learning agent. Each task was related to the example given above:

  1. Learning about incompatible observables of the quantum state — i.e., observables that cannot be simultaneously known to arbitrary precision due to the Heisenberg uncertainty principle, like position and momentum. But we showed that this limit can be overcome by entangling multiple copies of a state.
  2. Learning about the dominant components of the quantum state. When noise is present, it can disturb the quantum state. But typically the “principal component” — the part of the superposition with the highest probability — is robust to this noise, so we can still glean information about the original state by finding this dominant part.
  3. Learning about a physical process that acts on a quantum system or probe. Sometimes the state itself is not the object of interest, but a physical process that evolves this state is. We can learn about various fields and interactions by analyzing the evolution of a state over time.

In addition to the theoretical work, we ran some proof-of-principle experiments on the Sycamore quantum processor. We started by implementing a QML algorithm to perform the first task. We fed an unknown quantum mixed state to the algorithm, then asked which of two observables of the state was larger. After training the neural network with simulation data, we found that the quantum learning agent needed exponentially fewer experiments to reach a prediction accuracy of 70% — equating to 10,000 times fewer measurements when the system size was 20 qubits. The total number of qubits used was 40 since two copies were stored at once.

Experimental comparison of QML vs. CML algorithms for predicting a quantum state’s observables. While the number of experiments needed to achieve 70% accuracy with a CML algorithm (“C” above) grows exponentially with the size of the quantum state n, the number of experiments the QML algorithm (“Q”) needs is only linear in n. The dashed line labeled “Rigorous LB (C)” represents the theoretical lower bound (LB) — the best possible performance — of a classical machine learning algorithm.


In a second experiment, relating to task 3 above, we had the algorithm learn about the symmetry of an operator that evolves the quantum state of its qubits. In particular, if a quantum state undergoes evolution that is either totally random or random but also time-reversal symmetric, it can be difficult for a classical learner to tell the difference. In this task, the QML algorithm separates the operators into two distinct categories, representing the two symmetry classes, while the CML algorithm fails outright. The QML algorithm was completely unsupervised, so this gives us hope that the approach could be used to discover new phenomena without needing to know the right answer beforehand.

Experimental comparison of QML vs. CML algorithms for predicting the symmetry class of an operator. While QML successfully separates the two symmetry classes, the CML fails to accomplish the task.

Conclusion
This experimental work represents the first demonstrated exponential advantage in quantum machine learning. And, unlike a computational advantage, this type of learning advantage cannot be overcome by unlimited classical computing resources when the number of samples from the quantum state is limited.

So far, the technique has only been used in a contrived, “proof-of-principle” experiment, where the quantum state is deliberately produced and the researchers pretend not to know what it is. To use these techniques to make quantum-enhanced measurements in a real experiment, we’ll first need to work on current quantum sensor technology and methods to faithfully transfer quantum states to a quantum computer. But the fact that today’s quantum computers can already process this information to squeeze out an exponential advantage in learning bodes well for the future of quantum machine learning.

Acknowledgements
We would like to thank our Quantum Science Communicator Katherine McCormick for writing this blog post. Images reprinted with permission from Huang et al., Science, Vol 376:1182 (2022).

Mapping Urban Trees Across North America with the Auto Arborist Dataset

Over four billion people live in cities around the globe, and while most people interact daily with others — at the grocery store, on public transit, at work — they may take for granted their frequent interactions with the diverse plants and animals that comprise fragile urban ecosystems. Trees in cities, collectively known as urban forests, provide critical benefits for public health and wellbeing and will prove integral to urban climate adaptation. They filter air and water, capture stormwater runoff, sequester atmospheric carbon dioxide, and limit erosion and drought. Shade from urban trees reduces energy-intensive cooling costs and mitigates urban heat islands. In the US alone, urban forests cover 127M acres and produce ecosystem services valued at $18 billion. But as the climate changes, these ecosystems are increasingly under threat.

Census data is typically not comprehensive, covering a subset of public trees and not including those in parks.

Urban forest monitoring — measuring the size, health, and species distribution of trees in cities over time — allows researchers and policymakers to (1) quantify ecosystem services, including air quality improvement, carbon sequestration, and benefits to public health; (2) track damage from extreme weather events; and (3) target planting to improve robustness to climate change, disease and infestation.

However, many cities lack even basic data about the location and species of their trees. Collecting such data via a tree census is costly (a recent Los Angeles census cost $2 million and took 18 months) and thus is typically conducted only by cities with substantial resources. Further, lack of access to urban greenery is a key aspect of urban social inequality, including socioeconomic and racial inequality. Urban forest monitoring enables the quantification of this inequality and the pursuit of its improvement, a key aspect of the environmental justice movement. But machine learning could dramatically lower tree census costs using a combination of street-level and aerial imagery. Such an automated system could democratize access to urban forest monitoring, especially for under-resourced cities that are already disproportionately affected by climate change. While there have been prior efforts to develop automated urban tree species recognition from aerial or street-level imagery, a major limitation has been a lack of large-scale labeled datasets.

Today we introduce the Auto Arborist Dataset, a multiview urban tree classification dataset that, at ~2.6 million trees and >320 genera, is two orders of magnitude larger than those in prior work. To build the dataset, we pulled from public tree censuses from 23 North American cities (shown below) and merged these records with Street View and overhead RGB imagery. As the first urban forest dataset to cover multiple cities, it lets us analyze in detail how tree classification models generalize with respect to geographic distribution shifts, which is crucial to building systems that scale. We are releasing all 2.6M tree records publicly, along with aerial and ground-level imagery for 1M trees.

The 23 cities in the dataset are spread across North America, and are categorized into West, Central, and East regions to enable analysis of spatial and hierarchical generalization.
The number of tree records and genera in the dataset, per city and per region. The holdout city (which is never seen during training in any capacity) for each region is in bold.

The Auto Arborist Dataset
To curate Auto Arborist, we started from existing tree censuses, which many cities provide online. For each tree census considered, we verified that the data contained GPS locations and genus/species labels and was available for public use. We then parsed these data into a common format, fixing common data entry errors (such as flipped latitude/longitude) and mapping ground-truth genus names (and their common misspellings or alternate names) to a unified taxonomy. We chose to focus on genus prediction (instead of species-level prediction) as our primary task, both to avoid the taxonomic complexity arising from hybrids and subspecies and because there is more universal consensus on genus names than on species names.
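A hedged sketch of this kind of normalization is below. The field names, alias table, and bounding-box heuristic are illustrative assumptions for the sake of example, not the released curation pipeline:

```python
# Toy normalization of raw tree-census rows into a common format.
import csv

# Map common misspellings / alternate names onto a unified genus taxonomy.
GENUS_ALIASES = {
    "platanus": "Platanus",
    "london plane": "Platanus",
    "psuedotsuga": "Pseudotsuga",   # frequent misspelling
    "douglas fir": "Pseudotsuga",
}

LAT_RANGE = (15.0, 75.0)   # rough latitude range for North America

def looks_like_latitude(x: float) -> bool:
    return LAT_RANGE[0] <= x <= LAT_RANGE[1]

def normalize_record(row: dict) -> dict | None:
    """Parse one raw census row into a common format, or return None."""
    lat, lon = float(row["latitude"]), float(row["longitude"])
    # Fix flipped latitude/longitude, a common data-entry error.
    if not looks_like_latitude(lat) and looks_like_latitude(lon):
        lat, lon = lon, lat
    genus = GENUS_ALIASES.get(row["genus"].strip().lower())
    if genus is None:
        return None   # unmapped label: drop rather than guess
    return {"lat": lat, "lon": lon, "genus": genus, "city": row["city"]}

def load_census(path: str) -> list[dict]:
    with open(path, newline="") as f:
        return [r for r in map(normalize_record, csv.DictReader(f)) if r is not None]
```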

Next, using the provided geolocation for each tree, we queried an RGB aerial image centered on the tree and all street-level images taken within 2-10 meters of it. Finally, we filtered these images to (1) maximize the chance that the tree of interest is visible in each image and (2) preserve user privacy. This latter concern involved a number of steps, including the removal of images that included people (as determined by semantic segmentation) and manual blurring, among others.
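For instance, the 2-10 meter window could be enforced with a simple great-circle distance check like the one below (a sketch; the thresholds and function names are illustrative, not the actual pipeline):

```python
# Illustrative distance filter for pairing street-level images with labeled trees.
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in meters."""
    r = 6_371_000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def keep_street_image(camera_lat, camera_lon, tree_lat, tree_lon,
                      min_m=2.0, max_m=10.0):
    """Keep an image only if the camera sat within the 2-10 m window around the tree."""
    d = haversine_m(camera_lat, camera_lon, tree_lat, tree_lon)
    return min_m <= d <= max_m
```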

Selected Street View imagery from the Auto Arborist dataset. Green boxes represent tree detections (using a model trained on Open Images) and blue dots represent projected GPS location of the labeled tree.

One of the most important challenges for urban forest monitoring is to do well in cities that were not part of the training set. Vision models must contend with distribution shifts, where the training distribution differs from the test distribution of a new city. Genus distributions vary geographically (e.g., there are more Douglas fir in western Canada than in California) and can also vary based on city size (LA is much larger than Santa Monica and contains many more genera). Another challenge is the long-tailed, fine-grained nature of tree genera, which can be difficult to disambiguate even for human experts, with many genera being quite rare.

The long-tailed distribution across Auto Arborist categories. Most examples come from a few frequent categories, and many categories have far fewer examples. We characterize each genus as frequent, common, or rare based on the number of training examples. Note that the test data is split spatially from the training data within each city, so not all rare genera are seen in the test set.

Finally, there are a number of ways in which tree images can be noisy. For one, there is temporal variation in deciduous trees (for example, when aerial imagery includes leaves but street-level images are bare). Moreover, public tree censuses are not always up to date: sometimes trees have died (and are no longer visible) in the time since the census was taken. In addition, aerial data quality can be poor (imagery can be missing or obscured, e.g., by clouds).

Our curation process sought to minimize these issues by (1) only keeping images with sufficient tree pixels, as determined by a semantic segmentation model, (2) only keeping reasonably recent images, and (3) only keeping images where the tree position was sufficiently close to the street-level camera. We also considered optimizing for trees seen in spring and summer, but decided that seasonal variation could be a useful cue — we thus also released the date of each image to enable the community to explore the effects of seasonal variability.
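As a concrete (and again hypothetical) version of check (1), one could threshold the fraction of pixels that a segmentation model labels as tree; the class id and cutoff below are assumptions for illustration, not the values used to build the dataset:

```python
# Sketch of a "sufficient tree pixels" filter over a semantic segmentation mask.
import numpy as np

def has_enough_tree_pixels(seg_mask: np.ndarray,
                           tree_class: int = 1,
                           min_fraction: float = 0.05) -> bool:
    """seg_mask: [H, W] array of per-pixel class ids from a segmentation model."""
    return float((seg_mask == tree_class).mean()) >= min_fraction
```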

Benchmark and Evaluation
To evaluate the dataset, we designed a benchmark to measure domain generalization and performance in the long tail of the distribution. We generated training and test splits at three levels. First, we split within each city (based on latitude or longitude) to see how well a model generalizes within the city it was trained on. Second, we aggregate city-level training sets into three regions — West, Central, and East — holding out one city from each region. Finally, we merge the training sets across the three regions. For each of these splits, we report both accuracy and class-averaged recall for frequent, common, and rare genera on the corresponding held-out test sets.
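To make the metric concrete, here is a minimal sketch of class-averaged recall reported per frequency bucket; the frequent/common/rare cutoffs are illustrative assumptions, not the paper’s exact thresholds:

```python
# Per-genus recall, averaged within frequent / common / rare buckets.
from collections import Counter, defaultdict

def bucket_for(train_count: int, frequent: int = 1000, common: int = 100) -> str:
    """Assign a genus to a frequency bucket based on its training-set count."""
    if train_count >= frequent:
        return "frequent"
    if train_count >= common:
        return "common"
    return "rare"

def class_averaged_recall(y_true, y_pred, train_counts):
    """Mean per-genus recall, reported separately for each frequency bucket."""
    correct, total = Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    per_bucket = defaultdict(list)
    for genus, n in total.items():
        per_bucket[bucket_for(train_counts.get(genus, 0))].append(correct[genus] / n)
    return {bucket: sum(r) / len(r) for bucket, r in per_bucket.items()}

# Example:
# class_averaged_recall(["Acer", "Quercus"], ["Acer", "Acer"], {"Acer": 5000, "Quercus": 20})
# -> {"frequent": 1.0, "rare": 0.0}
```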

Using these metrics, we establish a performance baseline using standard modern convolutional models (ResNet). Our results demonstrate the benefits of a large-scale, geospatially distributed dataset such as Auto Arborist. First, we see that more training data helps — training on the entire dataset is better than training on a region, which is better than training on a single city.

The performance on each city’s test set when training on itself, on the region, and on the full training set.

Second, training on similar cities helps (and thus, having more coverage of cities helps). For example, if focusing on Seattle, then it is better to train on trees in Vancouver than Pittsburgh.

Cross-set performance, looking at the pairwise combination of train and test sets for each city. Note the block-diagonal structure, which highlights regional structure in the dataset.

Third, more data modalities and views help. The best performing models combine inputs from multiple Street View angles and overhead views. There remains much room for improvement, however, and this is where we believe the larger community of researchers can help.
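One simple way to combine these views (a hedged PyTorch sketch, not the released baseline) is late fusion: embed each street-level angle and the aerial crop with a shared ResNet backbone, average the embeddings, and classify the genus from the pooled feature:

```python
# Sketch of multi-view late fusion with a shared ResNet-50 backbone.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiViewGenusClassifier(nn.Module):
    def __init__(self, num_genera: int):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        self.backbone = backbone
        self.head = nn.Linear(2048, num_genera)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: [batch, num_views, 3, H, W] — several Street View angles plus an aerial crop
        b, v, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * v, c, h, w)).reshape(b, v, -1)
        return self.head(feats.mean(dim=1))  # average over views, then classify the genus
```

Sharing one backbone across views keeps the parameter count fixed as more views are added, which is one plausible reason late fusion is a common starting point for multiview problems like this.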

Get Involved
By releasing the Auto Arborist Dataset, we take a step closer to the goal of affordable urban forest monitoring, enabling the computer vision community to tackle the problem at scale for the first time. In the future, we hope to expand coverage to more North American cities (particularly in the southern US and Mexico) and even worldwide. Further, we are excited to extend the dataset to the more fine-grained species level and to investigate more nuanced monitoring, including tracking tree health and growth over time, and studying the effects of environmental factors on urban forests.

For more details, see our CVPR 2022 paper. This dataset is part of Google’s broader efforts to empower cities with data about urban forests, through the Environmental Insights Explorer Tree Canopy Lab and is available on our GitHub repo. If you represent a city that is interested in being included in the dataset please email auto-arborist+managers@googlegroups.com.

Acknowledgements
We would like to thank our co-authors Guanhang Wu, Trevor Edwards, Filip Pavetic, Bo Majewski, Shreyasee Mukherjee, Stanley Chan, John Morgan, Vivek Rathod, and Chris Bauer. We also thank Ruth Alcantara, Tanya Birch, and Dan Morris from Google AI for Nature and Society, John Quintero, Stafford Marquardt, Xiaoqi Yin, Puneet Lall, and Matt Manolides from Google Geo, Karan Gill, Tom Duerig, Abhijit Kundu, David Ross, Vighnesh Birodkar from Google Research (Perception team), and Pietro Perona for their support. This work was supported in part by the Resnick Sustainability Institute and was undertaken while Sara Beery was a Student Researcher at Google.
