Google AI – Page 37

Scaling multimodal understanding to long videos

November 14, 2023

by Google AI Google AI

Posted by Isaac Noble, Software Engineer, Google Research, and Anelia Angelova, Research Scientist, Google DeepMind

When building machine learning models for real-life applications, we need to consider inputs from multiple modalities in order to capture various aspects of the world around us. For example, audio, video, and text all provide varied and complementary information about a visual input. However, building multimodal models is challenging due to the heterogeneity of the modalities. Some of the modalities might be well synchronized in time (e.g., audio, video) but not aligned with text. Furthermore, the large volume of data in video and audio signals is much larger than that in text, so when combining them in multimodal models, video and audio often cannot be fully consumed and need to be disproportionately compressed. This problem is exacerbated for longer video inputs.

In “Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities”, we introduce a multimodal autoregressive model (Mirasol3B) for learning across audio, video, and text modalities. The main idea is to decouple the multimodal modeling into separate focused autoregressive models, processing the inputs according to the characteristics of the modalities. Our model consists of an autoregressive component for the time-synchronized modalities (audio and video) and a separate autoregressive component for modalities that are not necessarily time-aligned but are still sequential, e.g., text inputs, such as a title or description. Additionally, the time-aligned modalities are partitioned in time where local features can be jointly learned. In this way, audio-video inputs are modeled in time and are allocated comparatively more parameters than prior works. With this approach, we can effortlessly handle much longer videos (e.g., 128-512 frames) compared to other multimodal models. At 3B parameters, Mirasol3B is compact compared to prior Flamingo (80B) and PaLI-X (55B) models. Finally, Mirasol3B outperforms the state-of-the-art approaches on video question answering (video QA), long video QA, and audio-video-text benchmarks.

The Mirasol3B architecture consists of an autoregressive model for the time-aligned modalities (audio and video), which are partitioned in chunks, and a separate autoregressive model for the unaligned context modalities (e.g., text). Joint feature learning is conducted by the Combiner, which learns compact but sufficiently informative features, allowing the processing of long video/audio inputs.

Coordinating time-aligned and contextual modalities

Video, audio and text are diverse modalities with distinct characteristics. For example, video is a spatio-temporal visual signal with 30–100 frames per second, but due to the large volume of data, typically only 32–64 frames per video are consumed by current models. Audio is a one-dimensional temporal signal obtained at much higher frequency than video (e.g., at 16 Hz), whereas text inputs that apply to the whole video, are typically 200–300 word-sequence and serve as a context to the audio-video inputs. To that end, we propose a model consisting of an autoregressive component that fuses and jointly learns the time-aligned signals, which occur at high frequencies and are roughly synchronized, and another autoregressive component for processing non-aligned signals. Learning between the components for the time-aligned and contextual modalities is coordinated via cross-attention mechanisms that allow the two to exchange information while learning in a sequence without having to synchronize them in time.

Time-aligned autoregressive modeling of video and audio

Long videos can convey rich information and activities happening in a sequence. However, present models approach video modeling by extracting all the information at once, without sufficient temporal information. To address this, we apply an autoregressive modeling strategy where we condition jointly learned video and audio representations for one time interval on feature representations from previous time intervals. This preserves temporal information.

The video is first partitioned into smaller video chunks. Each chunk itself can be 4–64 frames. The features corresponding to each chunk are then processed by a learning module, called the Combiner (described below), which generates a joint audio and video feature representation at the current step — this step extracts and compacts the most important information per chunk. Next, we process this joint feature representation with an autoregressive Transformer, which applies attention to the previous feature representation and generates the joint feature representation for the next step. Consequently, the model learns how to represent not only each individual chunk, but also how the chunks relate temporally.

We use an autoregressive modeling of the audio and video inputs, partitioning them in time and learning joint feature representations, which are then autoregressively learned in sequence.

Modeling long videos with a modality combiner

To combine the signals from the video and audio information in each video chunk, we propose a learning module called the Combiner. Video and audio signals are aligned by taking the audio inputs that correspond to a specific video timeframe. We then process video and audio inputs spatio-temporally, extracting information particularly relevant to changes in the inputs (for videos we use sparse video tubes, and for audio we apply the spectrogram representation, both of which are processed by a Vision Transformer). We concatenate and input these features to the Combiner, which is designed to learn a new feature representation capturing both these inputs. To address the challenge of the large volume of data in video and audio signals, another goal of the Combiner is to reduce the dimensionality of the joint video/audio inputs, which is done by selecting a smaller number of output features to be produced. The Combiner can be implemented simply as a causal Transformer, which processes the inputs in the direction of time, i.e., using only inputs of the prior steps or the current one. Alternatively, the Combiner can have a learnable memory, described below.

Combiner styles

A simple version of the Combiner adapts a Transformer architecture. More specifically, all audio and video features from the current chunk (and optionally prior chunks) are input to a Transformer and projected to a lower dimensionality, i.e., a smaller number of features are selected as the output “combined” features. While Transformers are not typically used in this context, we find it effective for reducing the dimensionality of the input features, by selecting the last m outputs of the Transformer, if m is the desired output dimension (shown below). Alternatively, the Combiner can have a memory component. For example, we use the Token Turing Machine (TTM), which supports a differentiable memory unit, accumulating and compressing features from all previous timesteps. Using a fixed memory allows the model to work with a more compact set of features at every step, rather than process all the features from previous steps, which reduces computation.

We use a simple Transformer-based Combiner (left) and a Memory Combiner (right), based on the Token Turing Machine (TTM), which uses memory to compress previous history of features.

Results

We evaluate our approach on several benchmarks, MSRVTT-QA, ActivityNet-QA and NeXT-QA, for the video QA task, where a text-based question about a video is issued and the model needs to answer. This evaluates the ability of the model to understand both the text-based question and video content, and to form an answer, focusing on only relevant information. Of these benchmarks, the latter two target long video inputs and feature more complex questions.

We also evaluate our approach in the more challenging open-ended text generation setting, wherein the model generates the answers in an unconstrained fashion as free form text, requiring an exact match to the ground truth answer. While this stricter evaluation counts synonyms as incorrect, it may better reflect a model’s ability to generalize.

Our results indicate improved performance over state-of-the-art approaches for most benchmarks, including all with open-ended generation evaluation — notable considering our model is only 3B parameters, considerably smaller than prior approaches, e.g., Flamingo 80B. We used only video and text inputs to be comparable to other work. Importantly, our model can process 512 frames without needing to increase the model parameters, which is crucial for handling longer videos. Finally with the TTM Combiner, we see both better or comparable performance while reducing compute by 18%.

Results on the MSRVTT-QA (video QA) dataset.

Results on NeXT-QA benchmark, which features long videos for the video QA task.

Results on audio-video benchmarks

Results on the popular audio-video datasets VGG-Sound and EPIC-SOUNDS are shown below. Since these benchmarks are classification-only, we treat them as an open-ended text generative setting where our model produces the text of the desired class; e.g., for the class ID corresponding to the “playing drums” activity, we expect the model to generate the text “playing drums”. In some cases our approach outperforms the prior state of the art by large margins, even though our model outputs the results in the generative open-ended setting.

Results on the VGG-Sound (audio-video QA) dataset.

Results on the EPIC-SOUNDS (audio-video QA) dataset.

Benefits of autoregressive modeling

We conduct an ablation study comparing our approach to a set of baselines that use the same input information but with standard methods (i.e., without autoregression and the Combiner). We also compare the effects of pre-training. Because standard methods are ill-suited for processing longer video, this experiment is conducted for 32 frames and four chunks only, across all settings for fair comparison. We see that Mirasol3B’s improvements are still valid for relatively short videos.

Ablation experiments comparing the main components of our model. Using the Combiner, the autoregressive modeling, and pre-training all improve performance.

Conclusion

We present a multimodal autoregressive model that addresses the challenges associated with the heterogeneity of multimodal data by coordinating the learning between time-aligned and time-unaligned modalities. Time-aligned modalities are further processed autoregressively in time with a Combiner, controlling the sequence length and producing powerful representations. We demonstrate that a relatively small model can successfully represent long video and effectively combine with other modalities. We outperform the state-of-the-art approaches (including some much bigger models) on video- and audio-video question answering.

Acknowledgements

This research is co-authored by AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael Ryoo, Victor Gomes, and Anelia Angelova. We thank Claire Cui, Tania Bedrax-Weiss, Abhijit Ogale, Yunhsuan Sung, Ching-Chung Chang, Marvin Ritter, Kristina Toutanova, Ming-Wei Chang, Ashish Thapliyal, Xiyang Luo, Weicheng Kuo, Aren Jansen, Bryan Seybold, Ibrahim Alabdulmohsin, Jialin Wu, Luke Friedman, Trevor Walker, Keerthana Gopalakrishnan, Jason Baldridge, Radu Soricut, Mojtaba Seyedhosseini, Alexander D’Amour, Oliver Wang, Paul Natsev, Tom Duerig, Younghui Wu, Slav Petrov, Zoubin Ghahramani for their help and support. We also thank Tom Small for preparing the animation.

An opportunity agenda for AI

November 14, 2023

by Google AI

Today we’re sharing an AI Opportunity Agenda to provide concrete policy recommendations to help AI benefit as many people as possible.Read More

Taking legal action to protect users of AI and small businesses

November 13, 2023

by Google AI

Today we’re taking action to protect users of Google’s Bard AI as well as against fraudsters who sought to weaponize copyright law for profitRead More

Enabling large-scale health studies for the research community

November 10, 2023

by Google AI Google AI

Posted by Chintan Ghate, Software Engineer, and Diana Mincu, Research Engineer, Google Research

As consumer technologies like fitness trackers and mobile phones become more widely used for health-related data collection, so does the opportunity to leverage these data pathways to study and advance our understanding of medical conditions. We have previously touched upon how our work explores the use of this technology within the context of chronic diseases, in particular multiple sclerosis (MS). This effort leverages the FDA MyStudies platform, an open-source platform used to create clinical study apps, that makes it easier for anyone to run their own studies and collect good quality healthcare data, in a trusted and safe way.

Today, we describe the setup that we developed by expanding the FDA MyStudies platform and demonstrate how it can be used to set up a digital health study. We also present our exploratory research study created through this platform, called MS Signals, which consists of a symptom tracking app for MS patients. The goal for this app is twofold: 1) to ensure that the enhancements to the FDA MyStudies platform made for a more streamlined study creation experience; and 2) to understand how new data collection mechanisms can be used to revolutionize patients’ chronic disease management and tracking. We have open sourced our extension to the FDA MyStudies platform under the Apache 2.0 license to provide a resource for the community to build their own studies.

Extending the FDA MyStudies platform

The original FDA MyStudies platform allowed people to configure their own study apps, manage participants, and create separate iOS and Android apps. To simplify the study creation process and ensure increased study engagement, we made a number of accessibility changes. Some of the main improvements include: cross-platform (iOS and Android) app generation through the use of Flutter, an open source framework by Google for building multi-platform applications from a single codebase; a simplified setup, so that users can prototype their study quickly (under a day in most cases); and, most importantly, an emphasis on accessibility so that diverse patient’s voices are heard. The accessibility enhancements include changes to the underlying features of the platform and to the particular study design of the MS Signals study app.

Multi-platform support with rapid prototyping

We decided on the use of Flutter as it would be a single point that would generate both iOS and Android apps in one go, reducing the work required to support multiple platforms. Flutter also provides hot-reloading, which allows developers to build & preview features quickly. The design-system in the app takes advantage of this feature to provide a central point from which the branding & theme of the app can be changed to match the tone of a new study and previewed instantly. The demo environment in the app also utilizes this feature to allow developers to mock and preview questionnaires locally on their machines. In our experience this has been a huge time-saver in A/B testing the UX and the format and wording of questions live with clinicians.

System accessibility enhancements

To improve the accessibility of the platform for more users, we made several usability enhancements:

Light & dark theme support
Bold text & variable font-sizes
High-contrast mode
Improving user awareness of accessibility settings

Extended exposure to bright light themes can strain the eyes, so supporting dark theme features was necessary to make it easier to use the study app frequently. Some small or light text-elements are illegible to users with vision impairments, so we added 1) bold-text and support for larger font-sizes and 2) high-contrast color-schemes. To ensure that accessibility settings are easy to find, we placed an introductory one-time screen that was presented during the app’s first launch, which would directly take users to their system accessibility settings.

Study accessibility enhancements

To make the study itself easier to interact with and reduce cognitive overload, we made the following changes:

Clarified the onboarding process
Improved design for questionnaires

First, we clarified the on-boarding process by presenting users with a list of required steps when they first open the app in order to reduce confusion and participant drop-off.

The original questionnaire design in the app presented each question in a card format, which utilizes part of the screen for shadows and depth effects of the card. In many situations, this is a pleasant aesthetic, but in apps where accessibility is priority, these visual elements restrict the space available on the screen. Thus, when more accessible, larger font-sizes are used there are more frequent word breaks, which reduces readability. We fixed this simply by removing the card design elements and instead using the entire screen, allowing for better visuals with larger font-sizes.

The MS Signals prototype study

To test the usability of these changes, we used our redesigned platform to create a prototype study app called MS Signals, which uses surveys to gather information about a participant’s MS-related symptoms.

MS Signals app screenshots.

<!–

MS Signals app screenshots.

–>

MS Studies app design

As a first step, before entering any study information, participants are asked to complete an eligibility and study comprehension questionnaire to ensure that they have read through the potentially lengthy terms of study participation. This might include, for example, questions like “In what country is the study available?” or “Can you withdraw from the study?” A section like this is common in most health studies, and it tends to be the first drop-off point for participants.

To minimize study drop-off at this early stage, we kept the eligibility test brief and reflected correct answers for the comprehension test back to the participants. This helps minimize the number of times a user may need to go through the initial eligibility questionnaire and ensures that the important aspects of the study protocol are made clear to them.

After successful enrollment, participants are taken to the main app view, which consists of three pages:

Activities:

This page lists the questionnaires available to the participant and is where the majority of their time is spent. The questionnaires vary in frequency — some are one-time surveys created to gather medical history, while others are repeated daily, weekly or monthly, depending on the symptom or area they are exploring. For the one-time survey we provide a counter above each question to signal to users how far they have come and how many questions are left, similar to the questionnaire during the eligibility and comprehension step.
Dashboard:
To ensure that participants get something back in return for the information they enter during a study, the Dashboard area presents a summary of their responses in graph or pie chart form. Participants could potentially show this data to their care provider as a summary of their condition over the last 6 months, an improvement over the traditional pen and paper methods that many employ today.
Resources:
A set of useful links, help articles and common questions related to MS.

Questionnaire design

Since needing to frequently input data can lead to cognitive overload, participant drop off, and bad data quality, we reduced the burden in two ways:

We break down large questionnaires into smaller ones, resulting in 6 daily surveys, containing 3–5 questions each, where each question is multiple choice and related to a single symptom. This way we cover a total of 20 major symptoms, and present them in a similar way to how a clinician would ask these questions in an in-clinic setting.
We ensure previously entered information is readily available in the app, along with the time of the entry.

In designing the survey content, we collaborated closely with experienced clinicians and researchers to finalize the wording and layout. While studies in this field typically use the Likert scale to gather symptom information, we defined a more intuitive verbose scale to provide better experience for participants tracking their disease and the clinicians or researchers viewing the disease history. For example, in the case of vision issues, rather than asking participants to rate their symptoms on a scale from 1 to 10, we instead present a multiple choice question where we detail common vision problems that they may be experiencing.

This verbose scale helps patients track their symptoms more accurately by including context that helps them more clearly define their symptoms. This approach also allows researchers to answer questions that go beyond symptom correlation. For example, for vision issues, data collected using the verbose scale would reveal to researchers whether nystagmus is more prominent in patients with MS compared to double vision.

Side-by-side comparison with a Likert scale on the left, and a Verbose scale on the right.

<!–

Side by side comparison with a Likert scale on the left, and a Verbose scale on the right.

–>

Focusing on accessibility

Mobile-based studies can often present additional challenges for participants with chronic conditions: the text can be hard to read, the color contrast could make it difficult to see certain bits of information, or it may be challenging to scroll through pages. This may result in participant drop off, which, in turn, could yield a biased dataset if the people who are experiencing more advanced forms of a disease are unable to provide data.

In order to prevent such issues, we include the following accessibility features:

Throughout, we employ color blind accessible color schemes. This includes improving the contrast between crucial text and important additional information, which might otherwise be presented in a smaller font and a faded text color.
We reduced the amount of movement required to access crucial controls by placing all buttons close to the bottom of the page and ensuring that pop-ups are controllable from the bottom part of the screen.

To test the accessibility of MS Signals, we collaborated with the National MS Society to recruit participants for a user experience study. For this, a call for participation was sent out by the Society to their members, and 9 respondents were asked to test out the various app flows. The majority indicated that they would like a better way than their current method to track their symptom data, that they considered MS Signals to be a unique and valuable tool that would enhance the accuracy of their symptom tracking, and that they would want to share the dashboard view with their healthcare providers.

Next steps

We want to encourage everyone to make use of the open source platform to start setting up and running their own studies. We are working on creating a set of standard study templates, which would incorporate what we learned from above, and we hope to release those soon. For any issues, comments or questions please check out our resource page.

How we taught Google Translate to recognize homonyms

November 10, 2023

by Google AI

How Google Translate’s neural model taught it to understand bass from bass.Read More

Responsible AI at Google Research: Context in AI Research (CAIR)

November 9, 2023

by Google AI Google AI

Posted by Katherine Heller, Research Scientist, Google Research, on behalf of the CAIR Team

Artificial intelligence (AI) and related machine learning (ML) technologies are increasingly influential in the world around us, making it imperative that we consider the potential impacts on society and individuals in all aspects of the technology that we create. To these ends, the Context in AI Research (CAIR) team develops novel AI methods in the context of the entire AI pipeline: from data to end-user feedback. The pipeline for building an AI system typically starts with data collection, followed by designing a model to run on that data, deployment of the model in the real world, and lastly, compiling and incorporation of human feedback. Originating in the health space, and now expanded to additional areas, the work of the CAIR team impacts every aspect of this pipeline. While specializing in model building, we have a particular focus on building systems with responsibility in mind, including fairness, robustness, transparency, and inclusion.

Data

The CAIR team focuses on understanding the data on which ML systems are built. Improving the standards for the transparency of ML datasets is instrumental in our work. First, we employ documentation frameworks to elucidate dataset and model characteristics as guidance in the development of data and model documentation techniques — Datasheets for Datasets and Model Cards for Model Reporting.

For example, health datasets are highly sensitive and yet can have high impact. For this reason, we developed Healthsheets, a health-contextualized adaptation of a Datasheet. Our motivation for developing a health-specific sheet lies in the limitations of existing regulatory frameworks for AI and health. Recent research suggests that data privacy regulation and standards (e.g., HIPAA, GDPR, California Consumer Privacy Act) do not ensure ethical collection, documentation, and use of data. Healthsheets aim to fill this gap in ethical dataset analysis. The development of Healthsheets was done in collaboration with many stakeholders in relevant job roles, including clinical, legal and regulatory, bioethics, privacy, and product.

Further, we studied how Datasheets and Healthsheets could serve as diagnostic tools that surface the limitations and strengths of datasets. Our aim was to start a conversation in the community and tailor Healthsheets to dynamic healthcare scenarios over time.

To facilitate this effort, we joined the STANDING Together initiative, a consortium that aims to develop international, consensus-based standards for documentation of diversity and representation within health datasets and to provide guidance on how to mitigate risk of bias translating to harm and health inequalities. Being part of this international, interdisciplinary partnership that spans academic, clinical, regulatory, policy, industry, patient, and charitable organizations worldwide enables us to engage in the conversation about responsibility in AI for healthcare internationally. Over 250 stakeholders from across 32 countries have contributed to refining the standards.

Healthsheets and STANDING Together: towards health data documentation and standards.

Model

When ML systems are deployed in the real world, they may fail to behave in expected ways, making poor predictions in new contexts. Such failures can occur for a myriad of reasons and can carry negative consequences, especially within the context of healthcare. Our work aims to identify situations where unexpected model behavior may be discovered, before it becomes a substantial problem, and to mitigate the unexpected and undesired consequences.

Much of the CAIR team’s modeling work focuses on identifying and mitigating when models are underspecified. We show that models that perform well on held-out data drawn from a training domain are not equally robust or fair under distribution shift because the models vary in the extent to which they rely on spurious correlations. This poses a risk to users and practitioners because it can be difficult to anticipate model instability using standard model evaluation practices. We have demonstrated that this concern arises in several domains, including computer vision, natural language processing, medical imaging, and prediction from electronic health records.

We have also shown how to use knowledge of causal mechanisms to diagnose and mitigate fairness and robustness issues in new contexts. Knowledge of causal structure allows practitioners to anticipate the generalizability of fairness properties under distribution shift in real-world medical settings. Further, investigating the capability for specific causal pathways, or “shortcuts”, to introduce bias in ML systems, we demonstrate how to identify cases where shortcut learning leads to predictions in ML systems that are unintentionally dependent on sensitive attributes (e.g., age, sex, race). We have shown how to use causal directed acyclic graphs to adapt ML systems to changing environments under complex forms of distribution shift. Our team is currently investigating how a causal interpretation of different forms of bias, including selection bias, label bias, and measurement error, motivates the design of techniques to mitigate bias during model development and evaluation.

Shortcut Learning: For some models, age may act as a shortcut in classification when using medical images.

The CAIR team focuses on developing methodology to build more inclusive models broadly. For example, we also have work on the design of participatory systems, which allows individuals to choose whether to disclose sensitive attributes, such as race, when an ML system makes predictions. We hope that our methodological research positively impacts the societal understanding of inclusivity in AI method development.

Deployment

The CAIR team aims to build technology that improves the lives of all people through the use of mobile device technology. We aim to reduce suffering from health conditions, address systemic inequality, and enable transparent device-based data collection. As consumer technology, such as fitness trackers and mobile phones, become central in data collection for health, we explored the use of these technologies within the context of chronic disease, in particular, for multiple sclerosis (MS). We developed new data collection mechanisms and predictions that we hope will eventually revolutionize patient’s chronic disease management, clinical trials, medical reversals and drug development.

First, we extended the open-source FDA MyStudies platform, which is used to create clinical study apps, to make it easier for anyone to run their own studies and collect good quality data, in a trusted and safe way. Our improvements include zero-config setups, so that researchers can prototype their study in a day, cross-platform app generation through the use of Flutter and, most importantly, an emphasis on accessibility so that all patient’s voices are heard. We are excited to announce this work has now been open sourced as an extension to the original FDA-Mystudies platform. You can start setting up your own studies today!

To test this platform, we built a prototype app, which we call MS Signals, that uses surveys to interface with patients in a novel consumer setting. We collaborated with the National MS Society to recruit participants for a user experience study for the app, with the goal of reducing dropout rates and improving the platform further.

MS Signals app screenshots. Left: Study welcome screen. Right: Questionnaire.

Once data is collected, researchers could potentially use it to drive the frontier of ML research in MS. In a separate study, we established a research collaboration with the Duke Department of Neurology and demonstrated that ML models can accurately predict the incidence of high-severity symptoms within three months using continuously collected data from mobile apps. Results suggest that the trained models can be used by clinicians to evaluate the symptom trajectory of MS participants, which may inform decision making for administering interventions.

The CAIR team has been involved in the deployment of many other systems, for both internal and external use. For example, we have also partnered with Learning Ally to build a book recommendation system for children with learning disabilities, such as dyslexia. We hope that our work positively impacts future product development.

Human feedback

As ML models become ubiquitous throughout the developed world, it can be far too easy to leave voices in less developed countries behind. A priority of the CAIR team is to bridge this gap, develop deep relationships with communities, and work together to address ML-related concerns through community-driven approaches.

One of the ways we are doing this is through working with grassroots organizations for ML, such as Sisonkebiotik, an open and inclusive community of researchers, practitioners and enthusiasts at the intersection of ML and healthcare working together to build capacity and drive forward research initiatives in Africa. We worked in collaboration with the Sisonkebiotik community to detail limitations of historical top-down approaches for global health, and suggested complementary health-based methods, specifically those of grassroots participatory communities (GPCs). We jointly created a framework for ML and global health, laying out a practical roadmap towards setting up, growing and maintaining GPCs, based on common values across various GPCs such as Masakhane, Sisonkebiotik and Ro’ya.

We are engaging with open initiatives to better understand the role, perceptions and use cases of AI for health in non-western countries through human feedback, with an initial focus in Africa. Together with Ghana NLP, we have worked to detail the need to better understand algorithmic fairness and bias in health in non-western contexts. We recently launched a study to expand on this work using human feedback.

Biases along the ML pipeline and their associations with African-contextualized axes of disparities.

The CAIR team is committed to creating opportunities to hear more perspectives in AI development. We partnered with Sisonkebiotik to co-organize the Data Science for Health Workshop at Deep Learning Indaba 2023 in Ghana. Everyone’s voice is crucial to developing a better future using AI technology.

Acknowledgements

We would like to thank Negar Rostamzadeh, Stephen Pfohl, Subhrajit Roy, Diana Mincu, Chintan Ghate, Mercy Asiedu, Emily Salkey, Alexander D’Amour, Jessica Schrouff, Chirag Nagpal, Eltayeb Ahmed, Lev Proleev, Natalie Harris, Mohammad Havaei, Ben Hutchinson, Andrew Smart, Awa Dieng, Mahima Pushkarna, Sanmi Koyejo, Kerrie Kauer, Do Hee Park, Lee Hartsell, Jennifer Graves, Berk Ustun, Hailey Joren, Timnit Gebru and Margaret Mitchell for their contributions and influence, as well as our many friends and collaborators at Learning Ally, National MS Society, Duke University Hospital, STANDING Together, Sisonkebiotik, and Masakhane.

Overcoming leakage on error-corrected quantum processors

November 9, 2023

by Google AI Google AI

Posted by Kevin Miao and Matt McEwen, Research Scientists, Quantum AI Team

The qubits that make up Google quantum devices are delicate and noisy, so it’s necessary to incorporate error correction procedures that identify and account for qubit errors on the way to building a useful quantum computer. Two of the most prevalent error mechanisms are bit-flip errors (where the energy state of the qubit changes) and phase-flip errors (where the phase of the encoded quantum information changes). Quantum error correction (QEC) promises to address and mitigate these two prominent errors. However, there is an assortment of other error mechanisms that challenges the effectiveness of QEC.

While we want qubits to behave as ideal two-level systems with no loss mechanisms, this is not the case in reality. We use the lowest two energy levels of our qubit (which form the computational basis) to carry out computations. These two levels correspond to the absence (computational ground state) or presence (computational excited state) of an excitation in the qubit, and are labeled |0⟩ (“ket zero”) and |1⟩ (“ket one”), respectively. However, our qubits also host many higher levels called leakage states, which can become occupied. Following the convention of labeling the level by indicating how many excitations are in the qubit, we specify them as |2⟩, |3⟩, |4⟩, and so on.

In “Overcoming leakage in quantum error correction”, published in Nature Physics, we identify when and how our qubits leak energy to higher states, and show that the leaked states can corrupt nearby qubits through our two-qubit gates. We then identify and implement a strategy that can remove leakage and convert it to an error that QEC can efficiently fix. Finally, we show that these operations lead to notably improved performance and stability of the QEC process. This last result is particularly critical, since additional operations take time, usually leading to more errors.

Working with imperfect qubits

Our quantum processors are built from superconducting qubits called transmons. Unlike an ideal qubit, which only has two computational levels — a computational ground state and a computational excited state — transmon qubits have many additional states with higher energy than the computational excited state. These higher leakage states are useful for particular operations that generate entanglement, a necessary resource in quantum algorithms, and also keep transmons from becoming too non-linear and difficult to operate. However, the transmon can also be inadvertently excited into these leakage states through a variety of processes, including imperfections in the control pulses we apply to perform operations or from the small amount of stray heat leftover in our cryogenic refrigerator. These processes are collectively referred to as leakage, which describes the transition of the qubit from computational states to leakage states.

Consider a particular two-qubit operation that is used extensively in our QEC experiments: the CZ gate. This gate operates on two qubits, and when both qubits are in their |1⟩ level, an interaction causes the two individual excitations to briefly “bunch” together in one of the qubits to form |2⟩, while the other qubit becomes |0⟩, before returning to the original configuration where each qubit is in |1⟩. This bunching underlies the entangling power of the CZ gate. However, with a small probability, the gate can encounter an error and the excitations do not return to their original configuration, causing the operation to leave a qubit in |2⟩, a leakage state. When we execute hundreds or more of these CZ gates, this small leakage error probability accumulates.

Transmon qubits support many leakage states (|2⟩, |3⟩, |4⟩, …) beyond the computational basis (|0⟩ and |1⟩). While we typically only use the computational basis to represent quantum information, sometimes the qubit enters these leakage states, and disrupts the normal operation of our qubits.

A single leakage event is especially damaging to normal qubit operation because it induces many individual errors. When one qubit starts in a leaked state, the CZ gate no longer correctly entangles the qubits, preventing the algorithm from executing correctly. Not only that, but CZ gates applied to one qubit in leaked states can cause the other qubit to leak as well, spreading leakage through the device. Our work includes extensive characterization of how leakage is caused and how it interacts with the various operations we use in our quantum processor.

Once the qubit enters a leakage state, it can remain in that state for many operations before relaxing back to the computational states. This means that a single leakage event interferes with many operations on that qubit, creating operational errors that are bunched together in time (time-correlated errors). The ability for leakage to spread between the different qubits in our device through the CZ gates means we also concurrently see bunches of errors on neighboring qubits (space-correlated errors). The fact that leakage induces patterns of space- and time-correlated errors makes it especially hard to diagnose and correct from the perspective of QEC algorithms.

The effect of leakage in QEC

We aim to mitigate qubit errors by implementing surface code QEC, a set of operations applied to a collection of imperfect physical qubits to form a logical qubit, which has properties much closer to an ideal qubit. In a nutshell, we use a set of qubits called data qubits to hold the quantum information, while another set of measure qubits check up on the data qubits, reporting on whether they have suffered any errors, without destroying the delicate quantum state of the data qubits. One of the key underlying assumptions of QEC is that errors occur independently for each operation, but leakage can persist over many operations and cause a correlated pattern of multiple errors. The performance of our QEC strategies is significantly limited when leakage causes this assumption to be violated.

Once leakage manifests in our surface code transmon grid, it persists for a long time relative to a single surface code QEC cycle. To make matters worse, leakage on one qubit can cause its neighbors to leak as well.

Our previous work has shown that we can remove leakage from measure qubits using an operation called multi-level reset (MLR). This is possible because once we perform a measurement on measure qubits, they no longer hold any important quantum information. At this point, we can interact the qubit with a very lossy frequency band, causing whichever state the qubit was in (including leakage states) to decay to the computational ground state |0⟩. If we picture a Jenga tower representing the excitations in the qubit, we tumble the entire stack over. Removing just one brick, however, is much more challenging. Likewise, MLR doesn’t work with data qubits because they always hold important quantum information, so we need a new leakage removal approach that minimally disturbs the computational basis states.

Gently removing leakage

We introduce a new quantum operation called data qubit leakage removal (DQLR), which targets leakage states in a data qubit and converts them into computational states in the data qubit and a neighboring measure qubit. DQLR consists of a two-qubit gate (dubbed Leakage iSWAP — an iSWAP operation with leakage states) inspired by and similar to our CZ gate, followed by a rapid reset of the measure qubit to further remove errors. The Leakage iSWAP gate is very efficient and greatly benefits from our extensive characterization and calibration of CZ gates within the surface code experiment.

Recall that a CZ gate takes two single excitations on two different qubits and briefly brings them to one qubit, before returning them to their respective qubits. A Leakage iSWAP gate operates similarly, but almost in reverse, so that it takes a single qubit with two excitations (otherwise known as |2⟩) and splits them into |1⟩ on two qubits. The Leakage iSWAP gate (and for that matter, the CZ gate) is particularly effective because it does not operate on the qubits if there are fewer than two excitations present. We are precisely removing the |2⟩ Jenga brick without toppling the entire tower.

By carefully measuring the population of leakage states on our transmon grid, we find that DQLR can reduce average leakage state populations over all qubits to about 0.1%, compared to nearly 1% without it. Importantly, we no longer observe a gradual rise in the amount of leakage on the data qubits, which was always present to some extent prior to using DQLR.

This outcome, however, is only half of the puzzle. As mentioned earlier, an operation such as MLR could be used to effectively remove leakage on the data qubits, but it would also completely erase the stored quantum state. We also need to demonstrate that DQLR is compatible with the preservation of a logical quantum state.

The second half of the puzzle comes from executing the QEC experiment with this operation interleaved at the end of each QEC cycle, and observing the logical performance. Here, we use a metric called detection probability to gauge how well we are executing QEC. In the presence of leakage, time- and space-correlated errors will cause a gradual rise in detection probabilities as more and more qubits enter and stay in leakage states. This is most evident when we perform no reset at all, which rapidly leads to a transmon grid plagued by leakage, and it becomes inoperable for the purposes of QEC.

The prior state-of-the-art in our QEC experiments was to use MLR on the measure qubits to remove leakage. While this kept leakage population on the measure qubits (green circles) sufficiently low, data qubit leakage population (green squares) would grow and saturate to a few percent. With DQLR, leakage population on both the measure (blue circles) and data qubits (blue squares) remain acceptably low and stable.

With MLR, the large reduction in leakage population on the measure qubits drastically decreases detection probabilities and mitigates a considerable degree of the gradual rise. This reduction in detection probability happens even though we spend more time dedicated to the MLR gate, when other errors can potentially occur. Put another way, the correlated errors that leakage causes on the grid can be much more damaging than the uncorrelated errors from the qubits waiting idle, and it is well worth it for us to trade the former for the latter.

When only using MLR, we observed a small but persistent residual rise in detection probabilities. We ascribed this residual increase in detection probability to leakage accumulating on the data qubits, and found that it disappeared when we implemented DQLR. And again, the observation that the detection probabilities end up lower compared to only using MLR indicates that our added operation has removed a damaging error mechanism while minimally introducing uncorrelated errors.

Leakage manifests during surface code operation as increased errors (shown as error detection probabilities) over the number of cycles. With DQLR, we no longer see a notable rise in detection probability over more surface code cycles.

Prospects for QEC scale-up

Given these promising results, we are eager to implement DQLR in future QEC experiments, where we expect error mechanisms outside of leakage to be greatly improved, and sensitivity to leakage to be enhanced as we work with larger and larger transmon grids. In particular, our simulations indicate that scale-up of our surface code will almost certainly require a large reduction in leakage generation rates, or an active leakage removal technique over all qubits, such as DQLR.

Having laid the groundwork by understanding where leakage is generated, capturing the dynamics of leakage after it presents itself in a transmon grid, and showing that we have an effective mitigation strategy in DQLR, we believe that leakage and its associated errors no longer pose an existential threat to the prospects of executing a surface code QEC protocol on a large grid of transmon qubits. With one fewer challenge standing in the way of demonstrating working QEC, the pathway to a useful quantum computer has never been more promising.

Acknowledgements

This work would not have been possible without the contributions of the entire Google Quantum AI Team.

Generative AI in Search expands to more than 120 new countries and territories

November 8, 2023

by Google AI

Generative AI in Search, or Search Generative Experience (SGE), is expanding around the world, and adding four new languages.Read More

Boosting sustainable solutions from Sweden

November 8, 2023

by Google AI

Today, we’re announcing the Swedish recipients of Google.org Impact Challenge: Tech for Social Good – receiving technical support and 3 million Euros in funding for char…Read More

Alternating updates for efficient transformers

November 7, 2023

by Google AI Google AI

Posted by Xin Wang, Software Engineer, and Nishanth Dikkala, Research Scientist, Google Research

Contemporary deep learning models have been remarkably successful in many domains, ranging from natural language to computer vision. Transformer neural networks (transformers) are a popular deep learning architecture that today comprise the foundation for most tasks in natural language processing and also are starting to extend to applications in other domains, such as computer vision, robotics, and autonomous driving. Moreover, they form the backbone of all the current state-of-the-art language models.

Increasing scale in Transformer networks has led to improved performance and the emergence of behavior not present in smaller networks. However, this increase in scale often comes with prohibitive increases in compute cost and inference latency. A natural question is whether we can reap the benefits of larger models without incurring the computational burden.

In “Alternating Updates for Efficient Transformers”, accepted as a Spotlight at NeurIPS 2023, we introduce AltUp, a method to take advantage of increased token representation without increasing the computation cost. AltUp is easy to implement, widely applicable to any transformer architecture, and requires minimal hyperparameter tuning. For instance, using a variant of AltUp on a 770M parameter T5-Large model, the addition of ~100 parameters yields a model with a significantly better quality.

Background

To understand how we can achieve this, we dig into how transformers work. First, they partition the input into a sequence of tokens. Each token is then mapped to an embedding vector (via the means of an embedding table) called the token embedding. We call the dimension of this vector the token representation dimension. The Transformer then operates on this sequence of token embeddings by applying a series of computation modules (called layers) using its network parameters. The number of parameters in each transformer layer is a function of the layer’s width, which is determined by the token representation dimension.

To achieve benefits of scale without incurring the compute burden, prior works such as sparse mixture-of-experts (Sparse MoE) models (e.g., Switch Transformer, Expert Choice, V-MoE) have predominantly focused on efficiently scaling up the network parameters (in the self-attention and feedforward layers) by conditionally activating a subset based on the input. This allows us to scale up network size without significantly increasing compute per input. However, there is a research gap on scaling up the token representation dimension itself by conditionally activating parts of the token representation vector.

Recent works (for example, scaling laws and infinite-width networks) have empirically and theoretically established that a wider token representation helps in learning more complicated functions. This phenomenon is also evident in modern architectures of increasing capability. For instance, the representation dimension grows from 512 (small) to 768 (base) and 1024 (corresponding to models with 770M, 3B, and 11B parameters respectively) in T5 models, and from 4096 (8B) to 8192 (64B) and 18432 (540B) in PaLM models. A widened representation dimension also significantly improves performance for dual encoder retrieval models. However, naïvely widening the representation vector requires one to increase the model dimension accordingly, which quadratically¹ increases the amount of computation in the feedforward computation.

Method

AltUp works by partitioning a widened representation vector into equal sized blocks, processing only a single block at each layer, and using an efficient prediction-correction mechanism to infer the outputs of the other blocks (shown below on the right). This allows AltUp to simultaneously keep the model dimension, hence the computation cost, roughly constant and take advantage of using an increased token dimension. The increased token dimension allows the model to pack more information into each token’s embedding. By keeping the width of each transformer layer constant, AltUp avoids incurring the quadratic increase in computation cost that would otherwise be present with a naïve expansion of the representation.

An illustration of widening the token representation without (left) and with AltUp (right). This widening causes a near-quadratic increase in computation in a vanilla transformer due to the increased layer width. In contrast, Alternating Updates keeps the layer width constant and efficiently computes the output by operating on a sub-block of the representation at each layer.

More specifically, the input to each layer is two or more blocks, one of which is passed into the 1x width transformer layer (see figure below). We refer to this block as the “activated” block. This computation results in the exact output for the activated block. In parallel, we invoke a lightweight predictor that computes a weighted combination of all the input blocks. The predicted values, along with the computed value of the activated block, are passed on to a lightweight corrector that updates the predictions based on the observed values. This correction mechanism enables the inactivated blocks to be updated as a function of the activated one. Both the prediction and correction steps only involve a limited number of vector additions and multiplications and hence are much faster than a regular transformer layer. We note that this procedure can be generalized to an arbitrary number of blocks.

The predictor and corrector computations: The predictor mixes sub-blocks with trainable scalar coefficients; the corrector returns a weighted average of the predictor output and the transformer output. The predictor and corrector perform scalar-vector multiplications and incur negligible computation cost compared to the transformer. The predictor outputs a linear mixing of blocks with scalar mixing coefficients p_{i, j} , and the corrector combines predictor output and transformer output with weights g_i.

At a higher level, AltUp is similar to sparse MoE in that it is a method to add capacity to a model in the form of conditionally accessed (external) parameters. In sparse MoE, the additional parameters take the form of feed forward network (FFN) experts and the conditionality is with respect to the input. In AltUp, the external parameters come from the widened embedding table and the conditionality takes the form of alternating block-wise activation of the representation vector, as in the figure above. Hence, AltUp has the same underpinning as sparse MoE models.

An advantage of AltUp over sparse MoE is that it does not necessitate sharding since the number of additional parameters introduced is a factor² of the embedding table size, which typically makes up a small fraction of the overall model size. Moreover, since AltUp focuses on conditionally activating parts of a wider token representation, it can be applied synergistically with orthogonal techniques like MoE to obtain complementary performance gains.

Evaluation

AltUp was evaluated on T5 models on various benchmark language tasks. Models augmented with AltUp are uniformly faster than the extrapolated dense models at the same accuracy. For example, we observe that a T5 Large model augmented with AltUp leads to a 27%, 39%, 87%, and 29% speedup on GLUE, SuperGLUE, SQuAD, and Trivia-QA benchmarks, respectively.

Evaluations of AltUp on T5 models of various sizes and popular benchmarks. AltUp consistently leads to sizable speedups relative to baselines at the same accuracy. Latency is measured on TPUv3 with 8 cores. Speedup is defined as the change in latency divided by the AltUp latency (B = T5 Base, L = T5 Large, XL = T5 XL models).

AltUp’s relative performance improves as we apply it to larger models — compare the relative speedup of T5 Base + AltUp to that of T5 Large + AltUp. This demonstrates the scalability of AltUp and its improved performance on even larger models. Overall, AltUp consistently leads to models with better predictive performance than the corresponding baseline models with the same speed on all evaluated model sizes and benchmarks.

Extensions: Recycled AltUp

The AltUp formulation adds an insignificant amount of per-layer computation, however, it does require using a wider embedding table. In certain scenarios where the vocabulary size (i.e., the number of distinct tokens the tokenizer can produce) is very large, this may lead to a non-trivial amount of added computation for the initial embedding lookup and the final linear + softmax operation. A very large vocabulary may also lead to an undesirable amount of added embedding parameters. To address this, Recycled-AltUp is an extension of AltUp that avoids these computational and parameter costs by keeping the embedding table’s width the same.

Illustration of the Architecture for Recycled-AltUp with K = 2.

In Recycled-AltUp, instead of widening the initial token embeddings, we replicate the embeddings K times to form a wider token representation. Hence, Recycled-AltUp adds virtually no additional parameters relative to the baseline transformer, while benefiting from a wider token representation.

Recycled-AltUp on T5-B/L/XL compared to baselines. Recycled-AltUp leads to strict improvements in pre-training performance without incurring any perceptible slowdown.

We also evaluate the lightweight extension of AltUp, Recycled-AltUp, with K = 2 on T5 base, large, and XL models and compare its pre-trained accuracy and speed to those of baselines. Since Recycled-AltUp does not require an expansion in the embedding table dimension, the models augmented with it have virtually the same number of trainable parameters as the baseline models. We again observe consistent improvements compared to the dense baselines.

Why does AltUp work?

AltUp increases a model’s capacity by adding and efficiently leveraging auxiliary parameters to the embedding table, and maintaining the higher dimensional representation across the layers. We believe that a key ingredient in this computation lies in AltUp’s prediction mechanism that performs an ensemble of the different blocks. This weighted combination enables continuous message passing to the entire vector despite activating only sub-blocks of it in each layer. Recycled-AltUp, on the other hand, does not add any additional parameters to the token embeddings. However, it still confers the benefit of simulating computation in a higher dimensional representation space since a higher dimensional representation vector is maintained when moving from one transformer layer to another. We conjecture that this aids the training by augmenting the flow of information through the network. An interesting research direction is to explore whether the benefits of Recycled-AltUp can be explained entirely by more favorable training dynamics.

Acknowledgements

We thank our collaborators Cenk Baykal, Dylan Cutler, and Rina Panigrahy at Google Research, and Nikhil Ghosh at University of California, Berkeley (work done during research internship at Google).

¹This is because the feedforward layers of a Transformer are typically scaled quadratically with the model dimension. ^↩

²This factor depends on the user-specified expansion factor, but is typically 1, i.e., we double the embedding table dimension. ^↩

Coordinating time-aligned and contextual modalities

Time-aligned autoregressive modeling of video and audio

Modeling long videos with a modality combiner

Combiner styles

Results

Results on audio-video benchmarks

Benefits of autoregressive modeling

Conclusion

Acknowledgements

Extending the FDA MyStudies platform

Multi-platform support with rapid prototyping

System accessibility enhancements

Study accessibility enhancements

The MS Signals prototype study

MS Studies app design

Questionnaire design

Focusing on accessibility

Next steps

Data

Model

Deployment

Human feedback

Acknowledgements

Working with imperfect qubits

The effect of leakage in QEC

Gently removing leakage

Prospects for QEC scale-up

Acknowledgements

Background

Method

Evaluation

Extensions: Recycled AltUp

Why does AltUp work?

Acknowledgements

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.