Improving simulations of clouds and their effects on climate

Today’s climate models successfully capture broad global warming trends. However, because of uncertainties about processes that are small in scale yet globally important, such as clouds and ocean turbulence, these models’ predictions of upcoming climate changes are not very accurate in detail. For example, predictions of the time by which the global mean surface temperature of Earth will have warmed 2℃, relative to preindustrial times, vary by 40–50 years (a full human generation) among today’s models. As a result, we do not have the accurate and geographically granular predictions we need to plan resilient infrastructure, adapt supply chains to climate disruption, and assess the risks of climate-related hazards to vulnerable communities.

In large part this is because clouds dominate errors and uncertainties in climate predictions for the coming decades [1, 2, 3]. Clouds reflect sunlight and exert a greenhouse effect, making them crucial for regulating Earth’s energy balance and mediating the response of the climate system to changes in greenhouse gas concentrations. However, they are too small in scale to be directly resolvable in today’s climate models. Current climate models resolve motions at scales of tens to a hundred kilometers, with a few pushing toward the kilometer-scale. However, the turbulent air motions that sustain, for example, the low clouds that cover large swaths of tropical oceans have scales of meters to tens of meters. Because of this wide difference in scale, climate models use empirical parameterizations of clouds, rather than simulating them directly, which result in large errors and uncertainties.

While clouds cannot be directly resolved in global climate models, their turbulent dynamics can be simulated in limited areas by using high-resolution large eddy simulations (LES). However, the high computational cost of simulating clouds with LES has inhibited broad and systematic numerical experimentation, and it has held back the generation of large datasets for training parameterization schemes to represent clouds in coarser-resolution global climate models.

In “Accelerating Large-Eddy Simulations of Clouds with Tensor Processing Units”, published in Journal of Advances in Modeling Earth Systems (JAMES), and in collaboration with a Climate Modeling Alliance (CliMA) lead who is a visiting researcher at Google, we demonstrate that Tensor Processing Units (TPUs) — application-specific integrated circuits that were originally developed for machine learning (ML) applications — can be effectively used to perform LES of clouds. We show that TPUs, in conjunction with tailored software implementations, can be used to simulate particularly computationally challenging marine stratocumulus clouds in the conditions observed during the Dynamics and Chemistry of Marine Stratocumulus (DYCOMS) field study. This successful TPU-based LES code reveals the utility of TPUs, with their large computational resources and tight interconnects, for cloud simulations.

Climate model accuracy for critical metrics, like precipitation or the energy balance at the top of the atmosphere, has improved roughly 10% per decade in the last 20 years. Our goal is for this research to enable a 50% reduction in climate model errors by improving their representation of clouds.

Large-eddy simulations on TPUs

In this work, we focus on stratocumulus clouds, which cover ~20% of the tropical oceans and are the most prevalent cloud type on Earth. Current climate models are not yet able to reproduce stratocumulus cloud behavior correctly, which has been one of the largest sources of errors in these models. Our work will provide a much more accurate ground truth for large-scale climate models.

Our simulations of clouds on TPUs exhibit unprecedented computational throughput and scaling, making it possible, for example, to simulate stratocumulus clouds with 10× speedup over real-time evolution across areas up to about 35 × 54 km². Such domain sizes are close to the cross-sectional area of typical global climate model grid boxes. Our results open up new avenues for computational experiments, and for substantially enlarging the sample of LES available to train parameterizations of clouds for global climate models.

Rendering of the cloud evolution from a simulation of a 285 × 285 × 2 km³ stratocumulus cloud sheet. This is the largest cloud sheet of its kind ever simulated. Left: An oblique view of the cloud field with the camera cruising. Right: Top view of the cloud field with the camera gradually pulled away.

The LES code is written in TensorFlow, an open-source software platform developed by Google for ML applications. The code takes advantage of TensorFlow’s graph computation and Accelerated Linear Algebra (XLA) optimizations, which enable the full exploitation of TPU hardware, including the high-speed, low-latency inter-chip interconnects (ICI) that helped us achieve this unprecedented performance. At the same time, the TensorFlow code makes it easy to incorporate ML components directly within the physics-based fluid solver.
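
To give a flavor of how such a solver maps onto TensorFlow and XLA, the sketch below expresses a toy finite-difference diffusion step as an XLA-compiled TensorFlow function. It is a minimal illustration under our own assumptions (a periodic 2D grid, made-up names and constants), not the published LES code.

import tensorflow as tf

@tf.function(jit_compile=True)  # XLA compilation, as used for TPU execution
def diffuse_step(field, nu, dt, dx):
  # One explicit diffusion step on a doubly periodic 2D grid.
  laplacian = (
      tf.roll(field, 1, axis=0) + tf.roll(field, -1, axis=0) +
      tf.roll(field, 1, axis=1) + tf.roll(field, -1, axis=1) -
      4.0 * field) / (dx * dx)
  return field + dt * nu * laplacian

field = tf.random.normal([256, 256])
for _ in range(10):
  field = diffuse_step(field, nu=1e-3, dt=0.1, dx=1.0)

Because the update is expressed as pure tensor algebra inside a compiled function, XLA can fuse it into a small number of device kernels, and learned ML components can be dropped into the same graph.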

We validated the code by simulating canonical test cases for atmospheric flow solvers, such as a buoyant bubble that rises in neutral stratification, and a negatively buoyant bubble that sinks and impinges on the surface. These test cases show that the TPU-based code faithfully simulates the flows, with increasingly fine turbulent details emerging as the resolution increases. The validation tests culminate in simulations of the conditions during the DYCOMS field campaign. The TPU-based code reliably reproduces the cloud fields and turbulence characteristics observed by aircraft during the campaign — a feat that is notoriously difficult to achieve for LES because of the rapid changes in temperature and other thermodynamic properties at the top of the stratocumulus decks.

One of the test cases used to validate our TPU Cloud simulator. The fine structures from the density current generated by the negatively buoyant bubble impinging on the surface are much better resolved with a high-resolution grid (10 m, bottom row) compared to a low-resolution grid (200 m, top row).

Outlook

With this foundation established, our next goal is to substantially enlarge existing databases of high-resolution cloud simulations that researchers building climate models can use to develop better cloud parameterizations — whether these are for physics-based models, ML models, or hybrids of the two. This requires additional physical processes beyond those described in the paper; for example, radiative transfer needs to be integrated into the code. We aim to generate data across a variety of cloud types, e.g., thunderstorm clouds.

Rendering of a thunderstorm simulation using the same simulator as the stratocumulus simulation work. Rainfall can also be observed near the ground.

This work illustrates how advances in hardware for ML can be surprisingly effective when repurposed in other research areas — in this case, climate modeling. These simulations provide detailed training data for processes such as in-cloud turbulence, which are not directly observable, yet are crucially important for climate modeling and prediction.

Acknowledgements

We would like to thank the co-authors of the paper: Sheide Chammas, Qing Wang, Matthias Ihme, and John Anderson. We’d also like to thank Carla Bromberg, Rob Carver, Fei Sha, and Tyler Russell for their insights and contributions to the work.

Open sourcing Project Guideline: A platform for computer vision accessibility technology

Two years ago we announced Project Guideline, a collaboration between Google Research and Guiding Eyes for the Blind that enabled people with visual impairments (e.g., blindness and low-vision) to walk, jog, and run independently. Using only a Google Pixel phone and headphones, Project Guideline leverages on-device machine learning (ML) to navigate users along outdoor paths marked with a painted line. The technology has been tested all over the world and even demonstrated during the opening ceremony at the Tokyo 2020 Paralympic Games.

Since the original announcement, we set out to improve Project Guideline by embedding new features, such as obstacle detection and advanced path planning, to safely and reliably navigate users through more complex scenarios (such as sharp turns and nearby pedestrians). The early version featured a simple frame-by-frame image segmentation that detected the position of the path line relative to the image frame. This was sufficient for orienting the user to the line, but provided limited information about the surrounding environment. Improving the navigation signals, such as alerts for obstacles and upcoming turns, required a much better understanding and mapping of the users’ environment. To solve these challenges, we built a platform that can be utilized for a variety of spatially-aware applications in the accessibility space and beyond.

Today, we announce the open source release of Project Guideline, making it available for anyone to use to improve upon and build new accessibility experiences. The release includes source code for the core platform, an Android application, pre-trained ML models, and a 3D simulation framework.

System design

The primary use case is an Android application; however, we wanted to be able to run, test, and debug the core logic in a variety of environments in a reproducible way. This led us to design and build the system using C++ for close integration with MediaPipe and other core libraries, while still being able to integrate with Android using the Android NDK.

Under the hood, Project Guideline uses ARCore to estimate the position and orientation of the user as they navigate the course. A segmentation model, built on the DeepLabV3+ framework, processes each camera frame to generate a binary mask of the guideline (see the previous blog post for more details). Points on the segmented guideline are then projected from image-space coordinates onto a world-space ground plane using the camera pose and lens parameters (intrinsics) provided by ARCore. Since each frame contributes a different view of the line, the world-space points are aggregated over multiple frames to build a virtual mapping of the real-world guideline. The system performs piecewise curve approximation of the guideline world-space coordinates to build a spatio-temporally consistent trajectory. This allows refinement of the estimated line as the user progresses along the path.
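
As a rough illustration of that projection step, the hypothetical sketch below casts a segmented pixel through a pinhole camera and intersects the ray with a flat ground plane. The conventions, function names, and the example pose are simplifying assumptions for illustration, not Project Guideline's actual implementation.

import numpy as np

def pixel_to_ground(u, v, fx, fy, cx, cy, R_wc, t_wc, ground_y=0.0):
  # Cast a ray through pixel (u, v) of a pinhole camera and intersect it with
  # the horizontal plane y = ground_y. R_wc (3x3) and t_wc (3,) map camera
  # coordinates to world coordinates. Returns the world-space hit point, or
  # None if the ray misses the ground.
  ray_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])   # camera-space ray
  ray_world = R_wc @ ray_cam                                # rotate into the world frame
  origin = np.asarray(t_wc, dtype=float)                    # camera position in world
  if abs(ray_world[1]) < 1e-6:                              # ray parallel to the ground
    return None
  s = (ground_y - origin[1]) / ray_world[1]                 # ray parameter at the plane
  if s <= 0:                                                # intersection behind the camera
    return None
  return origin + s * ray_world

# Toy example: camera 1.5 m above the ground, optical axis pitched toward it.
pitch = np.deg2rad(30.0)
R_wc = np.array([[1.0, 0.0, 0.0],
                 [0.0, np.cos(pitch), -np.sin(pitch)],
                 [0.0, np.sin(pitch), np.cos(pitch)]])
hit = pixel_to_ground(320, 240, fx=500, fy=500, cx=320, cy=240,
                      R_wc=R_wc, t_wc=np.array([0.0, 1.5, 0.0]))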

Project Guideline builds a 2D map of the guideline, aggregating detected points in each frame (red) to build a stateful representation (blue) as the runner progresses along the path.

A control system dynamically selects a target point on the line some distance ahead based on the user’s current position, velocity, and direction. An audio feedback signal is then given to the user to adjust their heading to coincide with the upcoming line segment. By using the runner’s velocity vector instead of camera orientation to compute the navigation signal, we eliminate noise caused by irregular camera movements common during running. We can even navigate the user back to the line while it’s out of camera view, for example if the user overshot a turn. This is possible because ARCore continues to track the pose of the camera, which can be compared to the stateful line map inferred from previous camera images.
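
The sketch below illustrates the general idea of that control loop under assumed inputs: a 2D world-space line map, the runner's position and velocity, a fixed look-ahead distance, and a simple mapping from heading error to a stereo pan value. It is a toy stand-in, not the production controller.

import numpy as np

def navigation_signal(line_points, position, velocity, lookahead_m=5.0):
  # line_points: (N, 2) world-space points along the mapped guideline.
  # position, velocity: (2,) arrays describing the runner's current state.
  # Returns (pan, heading_error): pan in [-1, 1], positive when the target
  # lies to the runner's left, and the signed heading error in radians.
  dists = np.linalg.norm(line_points - position, axis=1)
  i_near = int(np.argmin(dists))                        # closest point on the line
  seg_len = np.linalg.norm(np.diff(line_points[i_near:], axis=0), axis=1)
  arclen = np.concatenate([[0.0], np.cumsum(seg_len)])  # distance along the line
  i_target = i_near + int(np.searchsorted(arclen, lookahead_m))
  target = line_points[min(i_target, len(line_points) - 1)]

  dx, dy = target - position                            # direction to the target point
  desired = np.arctan2(dy, dx)
  actual = np.arctan2(velocity[1], velocity[0])         # heading from the velocity vector
  heading_error = (desired - actual + np.pi) % (2 * np.pi) - np.pi
  pan = float(np.clip(heading_error / (np.pi / 2), -1.0, 1.0))
  return pan, heading_error

# Toy example: line along the x-axis, runner slightly to its left and drifting.
line = np.stack([np.linspace(0, 20, 41), np.zeros(41)], axis=1)
pan, err = navigation_signal(line, position=np.array([2.0, 0.5]),
                             velocity=np.array([3.0, 0.3]))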

Project Guideline also includes obstacle detection and avoidance features. An ML model is used to estimate depth from single images. To train this monocular depth model, we used SANPO, a large dataset of outdoor imagery from urban, park, and suburban environments that was curated in-house. The model is capable of detecting the depth of various obstacles, including people, vehicles, posts, and more. The depth maps are converted into 3D point clouds, similar to the line segmentation process, and used to detect the presence of obstacles along the user’s path and then alert the user through an audio signal.
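
A hedged sketch of that back-project-and-check pattern follows, assuming pinhole intrinsics and a simple corridor test in front of the camera; the thresholds and names are illustrative, not the actual obstacle-detection pipeline.

import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
  # Back-project an (H, W) depth map in meters into an (H*W, 3) camera-space cloud.
  h, w = depth.shape
  u, v = np.meshgrid(np.arange(w), np.arange(h))
  x = (u - cx) / fx * depth
  y = (v - cy) / fy * depth
  return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def obstacle_ahead(points, max_dist=4.0, half_width=0.75, min_points=50):
  # Flag an obstacle if enough points fall inside a simple corridor in front of
  # the camera: closer than max_dist and within half_width of the optical axis.
  in_corridor = ((points[:, 2] > 0.3) & (points[:, 2] < max_dist) &
                 (np.abs(points[:, 0]) < half_width))
  return int(in_corridor.sum()) >= min_points

# Toy example with a synthetic depth map: a flat wall 2.5 m ahead.
depth = np.full((120, 160), 2.5)
cloud = depth_to_points(depth, fx=100.0, fy=100.0, cx=80.0, cy=60.0)
print(obstacle_ahead(cloud))  # True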

Using a monocular depth ML model, Project Guideline constructs a 3D point cloud of the environment to detect and alert the user of potential obstacles along the path.

A low-latency audio system based on the AAudio API was implemented to provide the navigational sounds and cues to the user. Several sound packs are available in Project Guideline, including a spatial sound implementation using the Resonance Audio API. The sound packs were developed by a team of sound researchers and engineers at Google who designed and tested many different sound models. The sounds use a combination of panning, pitch, and spatialization to guide the user along the line. For example, a user veering to the right may hear a beeping sound in the left ear to indicate the line is to the left, with increasing frequency for a larger course correction. If the user veers further, a high-pitched warning sound may be heard to indicate the edge of the path is approaching. In addition, a clear “stop” audio cue is always available in the event the user veers too far from the line, an anomaly is detected, or the system fails to provide a navigational signal.

Project Guideline has been built specifically for Google Pixel phones with the Google Tensor chip. The Google Tensor chip enables the optimized ML models to run on-device with higher performance and lower power consumption. This is critical for providing real-time navigation instructions to the user with minimal delay. On a Pixel 8 there is a 28x latency improvement when running the depth model on the Tensor Processing Unit (TPU) instead of CPU, and 9x improvement compared to GPU.

Testing and simulation

Project Guideline includes a simulator that enables rapid testing and prototyping of the system in a virtual environment. Everything from the ML models to the audio feedback system runs natively within the simulator, giving the full Project Guideline experience without needing all the hardware and physical environment set up.

Screenshot of Project Guideline simulator.

Future direction

To carry the technology forward, WearWorks has become an early adopter and teamed up with Project Guideline to integrate their patented haptic navigation experience, utilizing haptic feedback in addition to sound to guide runners. WearWorks has been developing haptics for over 8 years, and previously empowered the first blind marathon runner to complete the NYC Marathon without sighted assistance. We hope that integrations like these will lead to new innovations and make the world a more accessible place.

The Project Guideline team is also working towards removing the painted line completely, using the latest advancements in mobile ML technology, such as the ARCore Scene Semantics API, which can identify sidewalks, buildings, and other objects in outdoor scenes. We invite the accessibility community to build upon and improve this technology while exploring new use cases in other fields.

Acknowledgements

Many people were involved in the development of Project Guideline and the technologies behind it. We’d like to thank Project Guideline team members: Dror Avalon, Phil Bayer, Ryan Burke, Lori Dooley, Song Chun Fan, Matt Hall, Amélie Jean-aimée, Dave Hawkey, Amit Pitaru, Alvin Shi, Mikhail Sirotenko, Sagar Waghmare, John Watkinson, Kimberly Wilber, Matthew Willson, Xuan Yang, Mark Zarich, Steven Clark, Jim Coursey, Josh Ellis, Tom Hoddes, Dick Lyon, Chris Mitchell, Satoru Arao, Yoojin Chung, Joe Fry, Kazuto Furuichi, Ikumi Kobayashi, Kathy Maruyama, Minh Nguyen, Alto Okamura, Yosuke Suzuki, and Bryan Tanaka. Thanks to ARCore contributors: Ryan DuToit, Abhishek Kar, and Eric Turner. Thanks to Alec Go, Jing Li, Liviu Panait, Stefano Pellegrini, Abdullah Rashwan, Lu Wang, Qifei Wang, and Fan Yang for providing ML platform support. We’d also like to thank Hartwig Adam, Tomas Izo, Rahul Sukthankar, Blaise Aguera y Arcas, and Huisheng Wang for their leadership support. Special thanks to our partners Guiding Eyes for the Blind and Achilles International.

Emerging practices for Society-Centered AI

The first of Google’s AI Principles is to “Be socially beneficial.” As AI practitioners, we’re inspired by the transformative potential of AI technologies to benefit society and our shared environment at a scale and swiftness that wasn’t possible before. From helping address the climate crisis to helping transform healthcare, to making the digital world more accessible, our goal is to apply AI responsibly to be helpful to more people around the globe. Achieving global scale requires researchers and communities to think ahead — and act — collectively across the AI ecosystem.

We call this approach Society-Centered AI. It is both an extension and an expansion of Human-Centered AI, focusing on the aggregate needs of society that are still informed by the needs of individual users, specifically within the context of the larger, shared human experience. Recent AI advances offer unprecedented, societal-level capabilities, and we can now methodically address those needs — if we apply collective, multi-disciplinary AI research to society-level, shared challenges, from forecasting hunger to predicting diseases to improving productivity.

The opportunity for AI to benefit society increases each day. We took a look at our work in these areas and at the research projects we have supported. Recently, Google announced that 70 professors were selected for the 2023 Award for Inclusion Research Program, which supports academic research that addresses the needs of historically marginalized groups globally. Through evaluation of this work, we identified a few emerging practices for Society-Centered AI:

  • Understand society’s needs
    Listening to communities and partners is crucial to understanding major issues deeply and identifying priority challenges to address. As an emerging general purpose technology, AI has the potential to address major global societal issues that can significantly impact people’s lives (e.g., educating workers, improving healthcare, and improving productivity). We have found the key to impact is to be centered on society’s needs. For this, we focus our efforts on goals society has agreed should be prioritized, such as the United Nations’ 17 Sustainable Development Goals, a set of interconnected goals jointly developed by more than 190 countries to address global challenges.
  • Collective efforts to address those needs
    Collective efforts bring stakeholders (e.g., local and academic communities, NGOs, private-public collaborations) into a joint process of design, development, implementation, and evaluation of AI technologies as they are being developed and deployed to address societal needs.
  • Measuring success by how well the effort addresses society’s needs
    It is important and challenging to measure how well AI solutions address society’s needs. In each of our cases, we identified primary and secondary indicators of impact that we optimized through our collaborations with stakeholders.

Why is Society-Centered AI important?

The case examples described below show how the Society-Centered AI approach has led to impact across topics, such as accessibility, health, and climate.

Understanding the needs of individuals with non-standard speech

There are millions of people with non-standard speech (e.g., impaired articulation, dysarthria, dysphonia) in the United States alone. In 2019, Google Research launched Project Euphonia, a methodology that allows individual users with non-standard speech to train personalized speech recognition models. Our success began with the impact we had on each individual who is now able to use voice dictation on their mobile device.

Euphonia started with a Society-Centered AI approach, including collective efforts with the non-profit organizations ALS Therapy Development Institute and ALS Residence Initiative to understand the needs of individuals with amyotrophic lateral sclerosis (ALS) and their ability to use automatic speech recognition systems. Later, we developed the world’s largest corpus of non-standard speech recordings, which enabled us to train a Universal Speech Model that recognizes disordered speech 37% better, as measured by word error rate (WER) on real conversations. This also led to the 2022 collaboration between the University of Illinois Urbana-Champaign, Alphabet, Apple, Meta, Microsoft, and Amazon to begin the Speech Accessibility Project, an ongoing initiative to create a publicly available dataset of disordered speech samples to improve products and make speech recognition more inclusive of diverse speech patterns. Other technologies that use AI to help remove barriers of modality and language include Live Transcribe, Live Caption, and Read Aloud.

Focusing on society’s health needs

Access to timely maternal health information can save lives globally: every two minutes a woman dies during pregnancy or childbirth and 1 in 26 children die before reaching age five. In rural India, educating expectant and new mothers about key health issues pertaining to pregnancy and infancy required scalable, low-cost technology solutions. Together with ARMMAN, Google Research supported a program that uses mobile messaging and machine learning (ML) algorithms to predict when women might benefit from receiving interventions (i.e., targeted preventative care information) and encourages them to engage with the mMitra free voice call program. Within a year, the mMitra program has shown a 17% increase in infants who tripled their birth weight and a 36% increase in women understanding the importance of taking iron tablets during pregnancy. More than 175,000 mothers have been reached so far through this automated solution, which public health workers use to improve the quality of information delivery.

These efforts have been successful in improving health due to the close collective partnership among the community and those building the AI technology. We have adopted this same approach via collaborations with caregivers to address a variety of medical needs. Some examples include: the use of the Automated Retinal Disease Assessment (ARDA) to help screen for diabetic retinopathy in 250,000 patients in clinics around the world; our partnership with iCAD to bring our mammography AI models to clinical settings to aid in breast cancer detection; and the development of Med-PaLM 2, a medical large language model that is now being tested with Cloud partners to help doctors provide better patient care.

Compounding impact from sustained efforts for crisis response

Google Research’s flood prediction efforts began in 2018 with flood forecasting in India and expanded to Bangladesh to help combat the catastrophic damage from yearly floods. The initial efforts began with partnerships with India’s Central Water Commission, local governments and communities. The implementation of these efforts used SOS Alerts on Search and Maps, and, more recently, broadly expanded access via Flood Hub. Continued collaborations and advancing an AI-based global flood forecasting model allowed us to expand this capability to over 80 countries across Africa, the Asia-Pacific region, Europe, and South, Central, and North America. We also partnered with networks of community volunteers to further amplify flood alerts. By working with governments and communities to measure the impact of these efforts on society, we refined our approach and algorithms each year.

We were able to leverage those methodologies and some of the underlying technology, such as SOS Alerts, from flood forecasting to similar societal needs, such as wildfire forecasting and heat alerts. Our continued engagements with organizations led to the support of additional efforts, such as the World Meteorological Organization’s (WMO) Early Warnings For All Initiative. The continued engagement with communities has allowed us to learn about our users’ needs on a societal level over time, expand our efforts, and compound the societal reach and impact of our efforts.

Further supporting Society-Centered AI research

We recently funded 18 university research proposals exemplifying a Society-Centered AI approach, a new track within the Google Award for Inclusion Research Program. These researchers are taking the Society-Centered AI methodology and helping create beneficial applications across the world. Examples of some of the projects funded include:

  • AI-Driven Monitoring of Attitude Polarization in Conflict-Affected Countries for Inclusive Peace Process and Women’s Empowerment: This project’s goal is to create LLM-powered tools that can be used to monitor peace in online conversations in developing nations. The initial target communities are where peace is in flux and the effort will put a particular emphasis on mitigating polarization that impacts women and promoting harmony.
  • AI-Assisted Distributed Collaborative Indoor Pollution Meters: A Case Study, Requirement Analysis, and Low-Cost Healthy Home Solution for Indian Communities: This project examines the use of low-cost pollution monitors combined with an AI-assisted methodology to identify recommendations for communities to improve air quality and at-home health. The initial target communities are highly impacted by pollution, and the joint work with them includes developing ways to measure improvement in outcomes in the local community.
  • Collaborative Development of AI Solutions for Scaling Up Adolescent Access to Sexual and Reproductive Health Education and Services in Uganda: This project’s goal is to create LLM-powered tools to provide personalized coaching and learning for users’ needs on topics of sexual and reproductive health education in low-income settings in Sub-Saharan Africa. The local societal need is significant, with an estimated 25% rate of teenage pregnancy, and the project aims to address the needs with a collective development process for the AI solution.

Future direction

Focusing on society’s needs, working via multidisciplinary collective research, and measuring the impact on society help lead to AI solutions that are relevant, long-lasting, empowering, and beneficial. See the AI for the Global Goals to learn more about potential Society-Centered AI research problems. Our efforts with non-profits in these areas are complementary to the research that we are doing and encouraging. We believe that further initiatives using Society-Centered AI will help the collective research community solve problems and positively impact society at large.

Acknowledgements

Many thanks to the many individuals who have worked on these projects at Google including Shruti Sheth, Reena Jana, Amy Chung-Yu Chou, Elizabeth Adkison, Sophie Allweis, Dan Altman, Eve Andersson, Ayelet Benjamini, Julie Cattiau, Yuval Carny, Richard Cave, Katherine Chou, Greg Corrado, Carlos De Segovia, Remi Denton, Dotan Emanuel, Ashley Gardner, Oren Gilon, Taylor Goddu, Brigitte Hoyer Gosselink, Jordan Green, Alon Harris, Avinatan Hassidim, Rus Heywood, Sunny Jansen, Pan-Pan Jiang, Anton Kast, Marilyn Ladewig, Ronit Levavi Morad, Bob MacDonald, Alicia Martin, Shakir Mohamed, Philip Nelson, Moriah Royz, Katie Seaver, Joel Shor, Milind Tambe, Aparna Taneja, Divy Thakkar, Jimmy Tobin, Katrin Tomanek, Blake Walsh, Gal Weiss, Kasumi Widner, Lihong Xi, and teams.

Responsible AI at Google Research: Adversarial testing for generative AI safety

The Responsible AI and Human-Centered Technology (RAI-HCT) team within Google Research is committed to advancing the theory and practice of responsible human-centered AI through a lens of culturally-aware research, to meet the needs of billions of users today, and blaze the path forward for a better AI future. The BRAIDS (Building Responsible AI Data and Solutions) team within RAI-HCT aims to simplify the adoption of RAI practices through the utilization of scalable tools, high-quality data, streamlined processes, and novel research with a current emphasis on addressing the unique challenges posed by generative AI (GenAI).

GenAI models have enabled unprecedented capabilities leading to a rapid surge of innovative applications. Google actively leverages GenAI to enhance its products’ utility and to improve lives. While enormously beneficial, GenAI also presents risks for disinformation, bias, and security. In 2018, Google pioneered the AI Principles, emphasizing beneficial use and prevention of harm. Since then, Google has focused on effectively implementing our principles in Responsible AI practices through 1) a comprehensive risk assessment framework, 2) internal governance structures, 3) education, empowering Googlers to integrate AI Principles into their work, and 4) the development of processes and tools that identify, measure, and analyze ethical risks throughout the lifecycle of AI-powered products. The BRAIDS team focuses on the last area, creating tools and techniques for identification of ethical and safety risks in GenAI products that enable teams within Google to apply appropriate mitigations.

What makes GenAI challenging to build responsibly?

The unprecedented capabilities of GenAI models have been accompanied by a new spectrum of potential failures, underscoring the urgency for a comprehensive and systematic RAI approach to understanding and mitigating potential safety concerns before the model is made broadly available. One key technique used to understand potential risks is adversarial testing, in which models are systematically evaluated to learn how they behave when provided with malicious or inadvertently harmful inputs across a range of scenarios. To that end, our research has focused on three directions:

  1. Scaled adversarial data generation
    Given the diverse user communities, use cases, and behaviors, it is difficult to comprehensively identify critical safety issues prior to launching a product or service. Scaled adversarial data generation with humans-in-the-loop addresses this need by creating test sets that contain a wide range of diverse and potentially unsafe model inputs that stress the model capabilities under adverse circumstances. Our unique focus in BRAIDS lies in identifying societal harms to the diverse user communities impacted by our models.
  2. Automated test set evaluation and community engagement
    Automated test set evaluation helps scale the testing process so that many thousands of model responses can be quickly evaluated to learn how the model behaves across a wide range of potentially harmful scenarios. Beyond testing with adversarial test sets, community engagement is a key component of our approach to identify “unknown unknowns” and to seed the data generation process.
  3. Rater diversity
    Safety evaluations rely on human judgment, which is shaped by community and culture and is not easily automated. To address this, we prioritize research on rater diversity.

Scaled adversarial data generation

High-quality, comprehensive data underpins many key programs across Google. Initially reliant on manual data generation, we’ve made significant strides to automate the adversarial data generation process. A centralized data repository with use-case and policy-aligned prompts is available to jump-start the generation of new adversarial tests. We have also developed multiple synthetic data generation tools based on large language models (LLMs) that prioritize the generation of data sets that reflect diverse societal contexts and that integrate data quality metrics for improved dataset quality and diversity.

Our data quality metrics include:

  • Analysis of language styles, including query length, query similarity, and diversity of language styles.
  • Measurement across a wide range of societal and multicultural dimensions, leveraging datasets such as SeeGULL, SPICE, and the Societal Context Repository.
  • Measurement of alignment with Google’s generative AI policies and intended use cases.
  • Analysis of adversariality to ensure that we examine both explicit (the input is clearly designed to produce an unsafe output) and implicit (where the input is innocuous but the output is harmful) queries.

One of our approaches to scaled data generation is exemplified in our paper on AI-Assisted Red Teaming (AART). AART generates evaluation datasets with high diversity (e.g., sensitive and harmful concepts specific to a wide range of cultural and geographic regions), steered by AI-assisted recipes to define, scope and prioritize diversity within an application context. Compared to some state-of-the-art tools, AART shows promising results in terms of concept coverage and data quality. Separately, we are also working with MLCommons to contribute to public benchmarks for AI Safety.

Adversarial testing and community insights

Evaluating model output with adversarial test sets allows us to identify critical safety issues prior to deployment. Our initial evaluations relied exclusively on human ratings, which resulted in slow turnaround times and inconsistencies due to a lack of standardized safety definitions and policies. We have improved the quality of evaluations by introducing policy-aligned rater guidelines to improve human rater accuracy, and are researching additional improvements to better reflect the perspectives of diverse communities. Additionally, automated test set evaluation using LLM-based auto-raters enables efficiency and scaling, while allowing us to direct complex or ambiguous cases to humans for expert rating.
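
As a rough sketch of that triage pattern, the snippet below keeps confident auto-ratings and escalates the rest to human raters. The Rating type, the confidence threshold, and the toy auto-rater are hypothetical stand-ins, not Google's internal tooling.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Rating:
  label: str         # e.g., "policy_violating" or "safe"
  confidence: float  # auto-rater confidence in [0, 1]

def triage_responses(
    responses: List[str],
    auto_rater: Callable[[str], Rating],
    confidence_threshold: float = 0.8,
) -> Tuple[List[Tuple[str, Rating]], List[str]]:
  # Keep confident auto-ratings; send everything else to human expert review.
  auto_rated, needs_human = [], []
  for response in responses:
    rating = auto_rater(response)
    if rating.confidence >= confidence_threshold:
      auto_rated.append((response, rating))
    else:
      needs_human.append(response)   # complex or ambiguous cases
  return auto_rated, needs_human

# Toy stand-in for an LLM-based auto-rater.
def toy_auto_rater(response: str) -> Rating:
  return Rating(label="safe", confidence=0.9 if len(response) < 200 else 0.5)

auto_rated, needs_human = triage_responses(
    ["short model response", "a much longer, more nuanced response..." * 10],
    toy_auto_rater)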

Beyond testing with adversarial test sets, gathering community insights is vital for continuously discovering “unknown unknowns”. To provide high quality human input that is required to seed the scaled processes, we partner with groups such as the Equitable AI Research Round Table (EARR), and with our internal ethics and analysis teams to ensure that we are representing the diverse communities who use our models. The Adversarial Nibbler Challenge engages external users to understand potential harms of unsafe, biased or violent outputs to end users at scale. Our continuous commitment to community engagement includes gathering feedback from diverse communities and collaborating with the research community, for example during The ART of Safety workshop at the Asia-Pacific Chapter of the Association for Computational Linguistics Conference (IJCNLP-AACL 2023) to address adversarial testing challenges for GenAI.

Rater diversity in safety evaluation

Understanding and mitigating GenAI safety risks is both a technical and social challenge. Safety perceptions are intrinsically subjective and influenced by a wide range of intersecting factors. Our in-depth study on demographic influences on safety perceptions explored the intersectional effects of rater demographics (e.g., race/ethnicity, gender, age) and content characteristics (e.g., degree of harm) on safety assessments of GenAI outputs. Traditional approaches largely ignore inherent subjectivity and the systematic disagreements among raters, which can mask important cultural differences. Our disagreement analysis framework surfaced a variety of disagreement patterns between raters from diverse backgrounds, including disagreements with “ground truth” expert ratings. This paves the way to new approaches for assessing the quality of human annotation and model evaluations beyond the simplistic use of gold labels. Our NeurIPS 2023 publication introduces the DICES (Diversity In Conversational AI Evaluation for Safety) dataset that facilitates nuanced safety evaluation of LLMs and accounts for variance, ambiguity, and diversity in various cultural contexts.

Summary

GenAI has resulted in a technology transformation, opening possibilities for rapid development and customization even without coding. However, it also comes with a risk of generating harmful outputs. Our proactive adversarial testing program identifies and mitigates GenAI risks to ensure inclusive model behavior. Adversarial testing and red teaming are essential components of a safety strategy, and they must be conducted comprehensively. The rapid pace of innovation demands that we constantly challenge ourselves to find “unknown unknowns” in cooperation with our internal partners, diverse user communities, and other industry experts.

Scaling multimodal understanding to long videos

When building machine learning models for real-life applications, we need to consider inputs from multiple modalities in order to capture various aspects of the world around us. For example, audio, video, and text all provide varied and complementary information about a visual input. However, building multimodal models is challenging due to the heterogeneity of the modalities. Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text. Furthermore, the volume of data in video and audio signals is much larger than that in text, so when combining them in multimodal models, video and audio often cannot be fully consumed and need to be disproportionately compressed. This problem is exacerbated for longer video inputs.

In “Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities”, we introduce a multimodal autoregressive model (Mirasol3B) for learning across audio, video, and text modalities. The main idea is to decouple the multimodal modeling into separate focused autoregressive models, processing the inputs according to the characteristics of the modalities. Our model consists of an autoregressive component for the time-synchronized modalities (audio and video) and a separate autoregressive component for modalities that are not necessarily time-aligned but are still sequential, e.g., text inputs, such as a title or description. Additionally, the time-aligned modalities are partitioned in time where local features can be jointly learned. In this way, audio-video inputs are modeled in time and are allocated comparatively more parameters than prior works. With this approach, we can effortlessly handle much longer videos (e.g., 128-512 frames) compared to other multimodal models. At 3B parameters, Mirasol3B is compact compared to prior Flamingo (80B) and PaLI-X (55B) models. Finally, Mirasol3B outperforms the state-of-the-art approaches on video question answering (video QA), long video QA, and audio-video-text benchmarks.

The Mirasol3B architecture consists of an autoregressive model for the time-aligned modalities (audio and video), which are partitioned in chunks, and a separate autoregressive model for the unaligned context modalities (e.g., text). Joint feature learning is conducted by the Combiner, which learns compact but sufficiently informative features, allowing the processing of long video/audio inputs.

Coordinating time-aligned and contextual modalities

Video, audio and text are diverse modalities with distinct characteristics. For example, video is a spatio-temporal visual signal with 30–100 frames per second, but due to the large volume of data, typically only 32–64 frames per video are consumed by current models. Audio is a one-dimensional temporal signal obtained at much higher frequency than video (e.g., at 16 Hz), whereas text inputs that apply to the whole video are typically sequences of 200–300 words and serve as context for the audio-video inputs. To that end, we propose a model consisting of an autoregressive component that fuses and jointly learns the time-aligned signals, which occur at high frequencies and are roughly synchronized, and another autoregressive component for processing non-aligned signals. Learning between the components for the time-aligned and contextual modalities is coordinated via cross-attention mechanisms that allow the two to exchange information while learning in a sequence without having to synchronize them in time.

Time-aligned autoregressive modeling of video and audio

Long videos can convey rich information and activities happening in a sequence. However, present models approach video modeling by extracting all the information at once, without sufficient temporal information. To address this, we apply an autoregressive modeling strategy where we condition jointly learned video and audio representations for one time interval on feature representations from previous time intervals. This preserves temporal information.

The video is first partitioned into smaller video chunks. Each chunk itself can be 4–64 frames. The features corresponding to each chunk are then processed by a learning module, called the Combiner (described below), which generates a joint audio and video feature representation at the current step — this step extracts and compacts the most important information per chunk. Next, we process this joint feature representation with an autoregressive Transformer, which applies attention to the previous feature representation and generates the joint feature representation for the next step. Consequently, the model learns how to represent not only each individual chunk, but also how the chunks relate temporally.
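
The toy sketch below mirrors that per-chunk flow, with placeholder linear maps standing in for the learned Combiner and the autoregressive Transformer; the shapes, names, and random weights are illustrative assumptions rather than the Mirasol3B implementation.

import numpy as np

rng = np.random.default_rng(0)
num_chunks, chunk_feat_dim, latent_dim = 8, 512, 32

# Placeholder "learned" weights.
W_combine = rng.normal(size=(chunk_feat_dim, latent_dim)) * 0.02
W_ar = rng.normal(size=(2 * latent_dim, latent_dim)) * 0.02

def combiner(chunk_features):
  # Stand-in for the Combiner: compress one chunk's joint audio/video features
  # into a compact latent vector.
  return np.tanh(chunk_features @ W_combine)

def autoregressive_step(prev_state, chunk_latent):
  # Stand-in for the autoregressive Transformer: condition the current chunk's
  # representation on the state carried over from previous chunks.
  return np.tanh(np.concatenate([prev_state, chunk_latent]) @ W_ar)

# Per-chunk joint audio/video features (already flattened for this sketch).
chunk_features = rng.normal(size=(num_chunks, chunk_feat_dim))

state = np.zeros(latent_dim)
states = []
for t in range(num_chunks):
  latent = combiner(chunk_features[t])        # compact representation of chunk t
  state = autoregressive_step(state, latent)  # carries temporal context forward
  states.append(state)                        # one temporally aware feature per chunk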

We use an autoregressive modeling of the audio and video inputs, partitioning them in time and learning joint feature representations, which are then autoregressively learned in sequence.

Modeling long videos with a modality combiner

To combine the signals from the video and audio information in each video chunk, we propose a learning module called the Combiner. Video and audio signals are aligned by taking the audio inputs that correspond to a specific video timeframe. We then process video and audio inputs spatio-temporally, extracting information particularly relevant to changes in the inputs (for videos we use sparse video tubes, and for audio we apply the spectrogram representation, both of which are processed by a Vision Transformer). We concatenate and input these features to the Combiner, which is designed to learn a new feature representation capturing both these inputs. To address the challenge of the large volume of data in video and audio signals, another goal of the Combiner is to reduce the dimensionality of the joint video/audio inputs, which is done by selecting a smaller number of output features to be produced. The Combiner can be implemented simply as a causal Transformer, which processes the inputs in the direction of time, i.e., using only inputs of the prior steps or the current one. Alternatively, the Combiner can have a learnable memory, described below.

Combiner styles

A simple version of the Combiner adapts a Transformer architecture. More specifically, all audio and video features from the current chunk (and optionally prior chunks) are input to a Transformer and projected to a lower dimensionality, i.e., a smaller number of features are selected as the output “combined” features. While Transformers are not typically used in this context, we find this approach effective for reducing the dimensionality of the input features by selecting the last m outputs of the Transformer, where m is the desired number of output features (shown below). Alternatively, the Combiner can have a memory component. For example, we use the Token Turing Machine (TTM), which supports a differentiable memory unit, accumulating and compressing features from all previous timesteps. Using a fixed memory allows the model to work with a more compact set of features at every step, rather than process all the features from previous steps, which reduces computation.
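
As a toy rendering of the simpler Combiner style, the snippet below runs one causal self-attention layer over a chunk's concatenated audio and video tokens and keeps only the last m outputs as the combined features. The dimensions and random weights are arbitrary illustrations, not the trained model.

import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, m = 48, 64, 8   # input audio/video tokens, model width, outputs kept

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.05 for _ in range(3))
x = rng.normal(size=(n_tokens, d))        # concatenated audio and video features

queries, keys, values = x @ Wq, x @ Wk, x @ Wv
scores = queries @ keys.T / np.sqrt(d)
mask = np.triu(np.full((n_tokens, n_tokens), -1e9), k=1)  # block attention to future tokens
scores = scores + mask
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn = attn / attn.sum(axis=-1, keepdims=True)
out = attn @ values

combined = out[-m:]    # keep only the last m outputs as the "combined" features
print(combined.shape)  # (8, 64)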

We use a simple Transformer-based Combiner (left) and a Memory Combiner (right), based on the Token Turing Machine (TTM), which uses memory to compress previous history of features.

Results

We evaluate our approach on several benchmarks, MSRVTT-QA, ActivityNet-QA and NeXT-QA, for the video QA task, where a text-based question about a video is issued and the model needs to answer. This evaluates the ability of the model to understand both the text-based question and video content, and to form an answer, focusing on only relevant information. Of these benchmarks, the latter two target long video inputs and feature more complex questions.

We also evaluate our approach in the more challenging open-ended text generation setting, wherein the model generates the answers in an unconstrained fashion as free form text, requiring an exact match to the ground truth answer. While this stricter evaluation counts synonyms as incorrect, it may better reflect a model’s ability to generalize.

Our results indicate improved performance over state-of-the-art approaches for most benchmarks, including all with open-ended generation evaluation — notable considering our model is only 3B parameters, considerably smaller than prior approaches, e.g., Flamingo 80B. We used only video and text inputs to be comparable to other work. Importantly, our model can process 512 frames without needing to increase the model parameters, which is crucial for handling longer videos. Finally, with the TTM Combiner, we see better or comparable performance while reducing compute by 18%.

Results on the MSRVTT-QA (video QA) dataset.
Results on NeXT-QA benchmark, which features long videos for the video QA task.

Results on audio-video benchmarks

Results on the popular audio-video datasets VGG-Sound and EPIC-SOUNDS are shown below. Since these benchmarks are classification-only, we treat them as an open-ended text generative setting where our model produces the text of the desired class; e.g., for the class ID corresponding to the “playing drums” activity, we expect the model to generate the text “playing drums”. In some cases our approach outperforms the prior state of the art by large margins, even though our model outputs the results in the generative open-ended setting.

Results on the VGG-Sound (audio-video QA) dataset.
Results on the EPIC-SOUNDS (audio-video QA) dataset.

Benefits of autoregressive modeling

We conduct an ablation study comparing our approach to a set of baselines that use the same input information but with standard methods (i.e., without autoregression and the Combiner). We also compare the effects of pre-training. Because standard methods are ill-suited for processing longer video, this experiment is conducted for 32 frames and four chunks only, across all settings for fair comparison. We see that Mirasol3B’s improvements are still valid for relatively short videos.

Ablation experiments comparing the main components of our model. Using the Combiner, the autoregressive modeling, and pre-training all improve performance.

Conclusion

We present a multimodal autoregressive model that addresses the challenges associated with the heterogeneity of multimodal data by coordinating the learning between time-aligned and time-unaligned modalities. Time-aligned modalities are further processed autoregressively in time with a Combiner, controlling the sequence length and producing powerful representations. We demonstrate that a relatively small model can successfully represent long video and effectively combine with other modalities. We outperform the state-of-the-art approaches (including some much bigger models) on video- and audio-video question answering.

Acknowledgements

This research is co-authored by AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael Ryoo, Victor Gomes, and Anelia Angelova. We thank Claire Cui, Tania Bedrax-Weiss, Abhijit Ogale, Yunhsuan Sung, Ching-Chung Chang, Marvin Ritter, Kristina Toutanova, Ming-Wei Chang, Ashish Thapliyal, Xiyang Luo, Weicheng Kuo, Aren Jansen, Bryan Seybold, Ibrahim Alabdulmohsin, Jialin Wu, Luke Friedman, Trevor Walker, Keerthana Gopalakrishnan, Jason Baldridge, Radu Soricut, Mojtaba Seyedhosseini, Alexander D’Amour, Oliver Wang, Paul Natsev, Tom Duerig, Younghui Wu, Slav Petrov, Zoubin Ghahramani for their help and support. We also thank Tom Small for preparing the animation.

Enabling large-scale health studies for the research community

As consumer technologies like fitness trackers and mobile phones become more widely used for health-related data collection, so does the opportunity to leverage these data pathways to study and advance our understanding of medical conditions. We have previously touched upon how our work explores the use of this technology within the context of chronic diseases, in particular multiple sclerosis (MS). This effort leverages the FDA MyStudies platform, an open-source platform used to create clinical study apps, which makes it easier for anyone to run their own studies and collect good-quality healthcare data in a trusted and safe way.

Today, we describe the setup that we developed by expanding the FDA MyStudies platform and demonstrate how it can be used to set up a digital health study. We also present our exploratory research study created through this platform, called MS Signals, which consists of a symptom tracking app for MS patients. The goal for this app is twofold: 1) to confirm that the enhancements we made to the FDA MyStudies platform result in a more streamlined study creation experience; and 2) to understand how new data collection mechanisms can be used to revolutionize patients’ chronic disease management and tracking. We have open sourced our extension to the FDA MyStudies platform under the Apache 2.0 license to provide a resource for the community to build their own studies.

Extending the FDA MyStudies platform

The original FDA MyStudies platform allowed people to configure their own study apps, manage participants, and create separate iOS and Android apps. To simplify the study creation process and ensure increased study engagement, we made a number of improvements. The main ones include: cross-platform (iOS and Android) app generation through the use of Flutter, an open source framework by Google for building multi-platform applications from a single codebase; a simplified setup, so that users can prototype their study quickly (under a day in most cases); and, most importantly, an emphasis on accessibility so that diverse patients’ voices are heard. The accessibility enhancements include changes to the underlying features of the platform and to the particular study design of the MS Signals study app.

Multi-platform support with rapid prototyping

We chose Flutter because a single codebase generates both iOS and Android apps, reducing the work required to support multiple platforms. Flutter also provides hot reloading, which allows developers to build and preview features quickly. The app’s design system takes advantage of this feature to provide a central point from which the branding and theme of the app can be changed to match the tone of a new study and previewed instantly. The demo environment in the app also uses this feature to allow developers to mock and preview questionnaires locally on their machines. In our experience this has been a huge time-saver when A/B testing the UX and the format and wording of questions live with clinicians.

System accessibility enhancements

To improve the accessibility of the platform for more users, we made several usability enhancements:

  1. Light & dark theme support
  2. Bold text & variable font-sizes
  3. High-contrast mode
  4. Improving user awareness of accessibility settings

Extended exposure to bright light themes can strain the eyes, so supporting dark theme features was necessary to make it easier to use the study app frequently. Some small or light text elements are illegible to users with vision impairments, so we added 1) bold text with support for larger font sizes and 2) high-contrast color schemes. To ensure that accessibility settings are easy to find, we added a one-time introductory screen, presented on the app’s first launch, that takes users directly to their system accessibility settings.

Study accessibility enhancements

To make the study itself easier to interact with and reduce cognitive overload, we made the following changes:

  1. Clarified the onboarding process
  2. Improved design for questionnaires

First, we clarified the onboarding process by presenting users with a list of required steps when they first open the app, in order to reduce confusion and participant drop-off.

The original questionnaire design in the app presented each question in a card format, which utilizes part of the screen for shadows and depth effects of the card. In many situations this is a pleasant aesthetic, but in apps where accessibility is a priority, these visual elements restrict the space available on the screen. Thus, when more accessible, larger font sizes are used, there are more frequent word breaks, which reduces readability. We fixed this simply by removing the card design elements and instead using the entire screen, allowing for better visuals with larger font sizes.

The MS Signals prototype study

To test the usability of these changes, we used our redesigned platform to create a prototype study app called MS Signals, which uses surveys to gather information about a participant’s MS-related symptoms.

MS Signals app screenshots.

MS Signals app design

As a first step, before entering any study information, participants are asked to complete an eligibility and study comprehension questionnaire to ensure that they have read through the potentially lengthy terms of study participation. This might include, for example, questions like “In what country is the study available?” or “Can you withdraw from the study?” A section like this is common in most health studies, and it tends to be the first drop-off point for participants.

To minimize study drop-off at this early stage, we kept the eligibility test brief and reflected correct answers for the comprehension test back to the participants. This helps minimize the number of times a user may need to go through the initial eligibility questionnaire and ensures that the important aspects of the study protocol are made clear to them.

After successful enrollment, participants are taken to the main app view, which consists of three pages:

  • Activities:

    This page lists the questionnaires available to the participant and is where the majority of their time is spent. The questionnaires vary in frequency — some are one-time surveys created to gather medical history, while others are repeated daily, weekly or monthly, depending on the symptom or area they are exploring. For the one-time survey we provide a counter above each question to signal to users how far they have come and how many questions are left, similar to the questionnaire during the eligibility and comprehension step.
  • Dashboard:
    To ensure that participants get something back in return for the information they enter during a study, the Dashboard area presents a summary of their responses in graph or pie chart form. Participants could potentially show this data to their care provider as a summary of their condition over the last 6 months, an improvement over the traditional pen and paper methods that many employ today.
  • Resources:
    A set of useful links, help articles and common questions related to MS.

Questionnaire design

Since needing to frequently input data can lead to cognitive overload, participant drop-off, and poor data quality, we reduced the burden in two ways:

  1. We break down large questionnaires into smaller ones, resulting in 6 daily surveys, containing 3–5 questions each, where each question is multiple choice and related to a single symptom. This way we cover a total of 20 major symptoms, and present them in a similar way to how a clinician would ask these questions in an in-clinic setting.
  2. We ensure previously entered information is readily available in the app, along with the time of the entry.

In designing the survey content, we collaborated closely with experienced clinicians and researchers to finalize the wording and layout. While studies in this field typically use the Likert scale to gather symptom information, we defined a more intuitive verbose scale to provide a better experience for participants tracking their disease and for the clinicians or researchers viewing the disease history. For example, in the case of vision issues, rather than asking participants to rate their symptoms on a scale from 1 to 10, we instead present a multiple choice question where we detail common vision problems that they may be experiencing.

This verbose scale helps patients track their symptoms more accurately by including context that helps them more clearly define their symptoms. This approach also allows researchers to answer questions that go beyond symptom correlation. For example, for vision issues, data collected using the verbose scale would reveal to researchers whether nystagmus is more prominent in patients with MS compared to double vision.
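
For illustration, the two question formats might be expressed as survey configuration entries like the hypothetical ones below; this is not the actual MS Signals schema, and the option wording is only a plausible example.

# Hypothetical survey configuration entries (illustrative only).
likert_question = {
    "id": "vision_likert",
    "text": "Rate your vision problems today (1 = none, 10 = severe).",
    "type": "scale",
    "range": [1, 10],
}

verbose_question = {
    "id": "vision_verbose",
    "text": "Which vision problems did you experience today?",
    "type": "multiple_choice",
    "options": [
        "No vision problems",
        "Blurry or dimmed vision in one eye",
        "Double vision",
        "Involuntary eye movements (nystagmus)",
        "Pain when moving the eyes",
    ],
}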

Side-by-side comparison with a Likert scale on the left, and a Verbose scale on the right.

Focusing on accessibility

Mobile-based studies can often present additional challenges for participants with chronic conditions: the text can be hard to read, the color contrast could make it difficult to see certain bits of information, or it may be challenging to scroll through pages. This may result in participant drop off, which, in turn, could yield a biased dataset if the people who are experiencing more advanced forms of a disease are unable to provide data.

In order to prevent such issues, we include the following accessibility features:

  • Throughout, we employ colorblind-accessible color schemes. This includes improving the contrast between crucial text and important additional information, which might otherwise be presented in a smaller font and a faded text color.
  • We reduced the amount of movement required to access crucial controls by placing all buttons close to the bottom of the page and ensuring that pop-ups are controllable from the bottom part of the screen.

To test the accessibility of MS Signals, we collaborated with the National MS Society to recruit participants for a user experience study. For this, a call for participation was sent out by the Society to their members, and 9 respondents were asked to test out the various app flows. The majority indicated that they would like a better way than their current method to track their symptom data, that they considered MS Signals to be a unique and valuable tool that would enhance the accuracy of their symptom tracking, and that they would want to share the dashboard view with their healthcare providers.

Next steps

We want to encourage everyone to make use of the open source platform to start setting up and running their own studies. We are working on creating a set of standard study templates, which would incorporate what we learned above, and we hope to release those soon. For any issues, comments, or questions, please check out our resource page.
