Developing industrial use cases for physical simulation on future error-corrected quantum computers


If you’ve paid attention to the quantum computing space, you’ve heard the claim that in the future, quantum computers will solve certain problems exponentially more efficiently than classical computers can. They have the potential to transform many industries, from pharmaceuticals to energy.

For the most part, these claims have rested on arguments about the asymptotic scaling of algorithms as the problem size approaches infinity, but this tells us very little about the practical performance of quantum computers for finite-sized problems. We want to be more concrete: Exactly which problems are quantum computers more suited to tackle than their classical counterparts, and exactly what quantum algorithms could we run to solve these problems? Once we’ve designed an algorithm, we can go beyond analysis based on asymptotic scaling — we can determine the actual resources required to compile and run the algorithm on a quantum computer, and how that compares to a classical computation.

Over the last few years, Google Quantum AI has collaborated with industry and academic partners to assess the prospects for quantum simulation to revolutionize specific technologies and perform concrete analyses of the resource requirements. In 2022, we developed quantum algorithms to analyze the chemistry of an important enzyme family called cytochrome P450. Then, in our paper released this fall, we demonstrated how to use a quantum computer to study sustainable alternatives to cobalt for use in lithium-ion batteries. And most recently, as we report in a preprint titled “Quantum computation of stopping power for inertial fusion target design,” we’ve found a new application in modeling the properties of materials in inertial confinement fusion experiments, such as those at the National Ignition Facility (NIF) at Lawrence Livermore National Laboratory, which recently made headlines for a breakthrough in nuclear fusion.

Below, we describe these three industrially relevant applications for simulations with quantum computers. While running the algorithms will require an error-corrected quantum computer, which is still years away, working on this now will ensure that we are ready with efficient quantum algorithms when such a quantum computer is built. Already, our work has reduced the cost of compiling and running the algorithms significantly, as we have reported in the past. Our work is essential for demonstrating the potential of quantum computing, but it also provides our hardware team with target specifications for the number of qubits and time needed to run useful quantum algorithms in the future.

Application 1: The CYP450 mechanism

The pharmaceutical industry is often touted as a field ripe for discovery using quantum computers. But concrete examples of such potential applications are few and far between. Working with collaborators at the pharmaceutical company Boehringer Ingelheim, our partners at the startup QSimulate, and academic colleagues at Columbia University, we explored one example in the 2022 PNAS article, “Reliably assessing the electronic structure of cytochrome P450 on today’s classical computers and tomorrow’s quantum computers”.

Cytochrome P450 is an enzyme family naturally found in humans that helps us metabolize drugs. It excels at its job: more than 70% of all drug metabolism is performed by enzymes of the P450 family. The enzymes work by oxidizing the drug — a process that depends on complex correlations between electrons. The details of the interactions are too complicated for scientists to know a priori how effective the enzyme will be on a particular drug.

In the paper, we showed how a quantum computer could approach this problem. The CYP450 metabolic process is a complex chain of reactions with many intermediate changes in the electronic structure of the enzymes throughout. We first use state-of-the-art classical methods to determine the resources required to simulate this problem on a classical computer. Then we imagine implementing a phase-estimation algorithm — which is needed to compute the ground-state energies of the relevant electronic configurations throughout the reaction chain — on a surface-code error-corrected quantum computer.
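
To make this concrete, below is a minimal classical (NumPy) sketch of textbook quantum phase estimation applied to a toy two-level Hamiltonian: prepare (an approximation of) the ground state, apply controlled time evolution, and read the ground-state energy out of the phase register. The matrix, evolution time, and register size are hypothetical stand-ins; the algorithms in the paper target much larger electronic-structure Hamiltonians and would run on error-corrected hardware rather than being simulated classically like this.

```python
import numpy as np
from scipy.linalg import expm

# Toy 2x2 "electronic" Hamiltonian (Hartree) standing in for a real active-space model.
H = np.array([[-1.05, 0.39],
              [ 0.39, -0.65]])
evals, evecs = np.linalg.eigh(H)
E0_exact, ground = evals[0], evecs[:, 0]

t = 1.0                      # evolution time, chosen so that |E0| * t < 2*pi (no phase wrap)
U = expm(-1j * H * t)        # time-evolution operator exp(-i H t)
n_anc = 10                   # ancilla (phase-readout) qubits
N = 2 ** n_anc

# After Hadamards and controlled-U^k with the system register in the ground state,
# the ancilla amplitudes are exp(2*pi*i*phi*k) / sqrt(N), where exp(-i*E0*t) = exp(2*pi*i*phi).
anc = np.array([ground.conj() @ np.linalg.matrix_power(U, k) @ ground
                for k in range(N)]) / np.sqrt(N)

# Inverse QFT on the ancillas, then measure. NumPy's forward FFT applies exp(-2*pi*i*k*m/N),
# which matches the inverse-QFT amplitude map up to normalization.
probs = np.abs(np.fft.fft(anc) / np.sqrt(N)) ** 2
m = int(np.argmax(probs))    # most likely measurement outcome encodes phi ~= m / N
E0_est = -2 * np.pi * (m / N) / t
print(f"exact E0 = {E0_exact:.4f} Ha, phase-estimation readout = {E0_est:.4f} Ha")
```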

With a quantum computer, we could follow the chain of changing electronic structure with greater accuracy and fewer resources. In fact, we find that the higher accuracy offered by a quantum computer is needed to correctly resolve the chemistry in this system, so not only will a quantum computer be better, it will be necessary. And as the system size gets bigger, i.e., the more quantum energy levels we include in the simulation, the more the quantum computer wins over the classical computer. Ultimately, we show that a few million physical qubits would be required to reach quantum advantage for this problem.

Left: Example of an electron orbital (red and blue) of a CYP enzyme. More than 60 such orbitals are required to model the CYP system. Right: Comparison of actual runtime (CPU) of various classical techniques (blue) to hypothetical runtime (QPU) of a quantum algorithm (green). The lower slope of the quantum algorithm demonstrates the favorable asymptotic scaling over classical methods. Already at about 20-30 orbitals, we see a crossover to the regime where a quantum algorithm would be more efficient than classical methods.

Application 2: Lithium-ion batteries

Lithium-ion batteries rely on the electrochemical potential difference between two lithium-containing materials. One material used today for the cathodes of Li-ion batteries is LiCoO2. Unfortunately, it has drawbacks from a manufacturing perspective. Cobalt mining is expensive, destructive to the environment, and often relies on unsafe or abusive labor practices. Consequently, many in the field are interested in alternatives to cobalt for lithium-ion cathodes.

In the 1990s, researchers discovered that nickel could replace cobalt to form LiNiO2 (called “lithium nickel oxide” or “LNO”) for cathodes. While pure LNO was found to be unstable in production, many cathode materials used in the automotive industry today use a high fraction of nickel and hence resemble LNO. Despite its industrial relevance, however, not all of the chemical properties of LNO are understood — even the properties of its ground state remain a subject of debate.

In our recent paper, “Fault tolerant quantum simulation of materials using Bloch orbitals,” we worked with the chemical company BASF, the molecular modeling startup QSimulate, and collaborators at Macquarie University in Australia to develop techniques for quantum simulations of systems with periodic, regularly spaced atomic structure, such as LNO. We then applied these techniques to design algorithms that compare the relative energies of a few different candidate structures of LNO. With classical computers, high-accuracy simulations of the quantum wavefunction are considered too expensive to perform. In our work, we found that a quantum computer would need tens of millions of physical qubits to calculate the energies of each of the four candidate ground-state LNO structures. This is out of reach of the first error-corrected quantum computers, but we expect this number to come down with future algorithmic improvements.
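
As a small illustration of the “periodic” ingredient, the sketch below builds a Gamma-centered k-point mesh in the reciprocal lattice of a periodic cell, one of the standard bookkeeping steps in any Bloch-orbital calculation. The cubic lattice and mesh size are hypothetical placeholders, unrelated to the LNO structures in the paper, and this is not code from the paper itself.

```python
import itertools
import numpy as np

def gamma_centered_kpoints(lattice: np.ndarray, mesh: tuple) -> np.ndarray:
    """lattice: (3, 3) array whose rows are real-space lattice vectors (angstrom).
    Returns (Nk, 3) Cartesian k-points (1/angstrom) on a Gamma-centered mesh."""
    recip = 2.0 * np.pi * np.linalg.inv(lattice).T        # reciprocal lattice vectors (rows)
    frac = np.array([[i / mesh[0], j / mesh[1], k / mesh[2]]
                     for i, j, k in itertools.product(*map(range, mesh))])
    return frac @ recip

lattice = 4.0 * np.eye(3)                                 # hypothetical 4-angstrom cubic cell
kpts = gamma_centered_kpoints(lattice, (2, 2, 2))
print(kpts.shape)                                         # (8, 3)
```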

Four candidate structures of LNO. In the paper, we consider the resources required to compare the energies of these structures in order to find the ground state of LNO.

Application 3: Fusion reactor dynamics

In our third and most recent example, we collaborated with theorists at Sandia National Laboratories and our Macquarie University collaborators to put our hypothetical quantum computer to the task of simulating the dynamics of charged particles in the extreme conditions typical of inertial confinement fusion (ICF) experiments, like those at the National Ignition Facility. In those experiments, high-intensity lasers are focused into a metallic cavity (hohlraum) that holds a target capsule consisting of an ablator surrounding deuterium–tritium fuel. When the lasers heat the inside of the hohlraum, its walls radiate x-rays that compress the capsule, heating the deuterium and tritium inside to tens of millions of Kelvin. This allows the nuclei in the fuel to overcome their mutual electrostatic repulsion and start fusing into helium nuclei, also called alpha particles.

Simulations of these experiments are computationally demanding and rely on models of material properties that are themselves uncertain. Even testing these models, using methods similar to those in quantum chemistry, is extremely computationally expensive; in some cases, such test calculations have consumed more than 100 million CPU hours. One of the most expensive and least accurate aspects of the simulation is the dynamics of the plasma prior to the sustained fusion stage (tens of millions of Kelvin), when parts of the capsule and fuel are at a comparatively balmy 100,000 Kelvin. In this “warm dense matter” regime, quantum correlations play a larger role in the behavior of the system than in the “hot dense matter” regime where sustained fusion takes place.

In our new preprint, “Quantum computation of stopping power for inertial fusion target design”, we present a quantum algorithm to compute the so-called “stopping power” of the warm dense matter in a nuclear fusion experiment. The stopping power is the rate at which a high energy alpha particle slows down due to Coulomb interactions with the surrounding plasma. Understanding the stopping power of the system is vital for optimizing the efficiency of the reactor. As the alpha particle is slowed by the plasma around it, it transfers its energy to the plasma, heating it up. This self-heating process is the mechanism by which fusion reactions sustain the burning plasma. Detailed modeling of this process will help inform future reactor designs.
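
As a rough illustration of how a stopping-power number is extracted once an energy trace is available (see the figure below), the sketch generates a synthetic energy-versus-path-length trace and fits its average slope. The toy trace and all numbers are made up; in the paper, the energy transfer comes from a quantum simulation of the projectile interacting with the warm dense matter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 50.0, 500)              # projectile path length (arbitrary units)
true_dEdx = -0.8                             # "true" stopping power used to build the toy data
E = 1000.0 + true_dEdx * x                   # secular energy loss ...
E += 2.0 * np.sin(3.0 * x) + rng.normal(0.0, 1.0, x.size)   # ... plus oscillations and noise

# Stopping power = the average slope of the projectile energy along its path.
slope, intercept = np.polyfit(x, E, deg=1)
print(f"estimated dE/dx = {slope:.3f} (toy ground truth {true_dEdx})")
```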

We estimate that the quantum algorithm needed to calculate the stopping power would require resources somewhere between those of the P450 application and the battery application. But since this is the first case study of first-principles dynamics (or of any application at finite temperature), such estimates are just a starting point, and we again expect to find algorithmic improvements that bring this cost down in the future. Even with this uncertainty, the quantum approach compares favorably with the classical alternative: the only tractable classical approaches for these simulations are mean-field methods, which incur unknown systematic errors when describing the physics of these systems.

Left: A projectile (red) passing through a medium (blue) with initial velocity v_proj. Right: To calculate the stopping power, we monitor the energy transfer between the projectile and the medium (blue solid line) and determine its average slope (red dashed line).

Discussion and conclusion

The examples described above are just three of a large and growing body of concrete applications for a future error-corrected quantum computer in simulating physical systems. This line of research helps us understand the classes of problems that will most benefit from the power of quantum computing. In particular, the last example is distinct from the other two in that it simulates a dynamical system. In contrast to the other problems, which focus on finding the lowest-energy, static ground state of a quantum system, quantum dynamics is concerned with how a quantum system changes over time. Since quantum computers are inherently dynamic — the qubit states evolve and change as each operation is performed — they are particularly well suited to solving these kinds of problems. Together with collaborators at Columbia, Harvard, Sandia National Laboratories, and Macquarie University in Australia, we recently published a paper in Nature Communications demonstrating that quantum algorithms for simulating electron dynamics can be more efficient than even approximate, “mean-field” classical calculations, while simultaneously offering much higher accuracy.

Developing and improving algorithms today prepares us to take full advantage of them when an error-corrected quantum computer is eventually realized. Just as in the classical computing case, we expect improvements at every level of the quantum computing stack to further lower the resource requirements. But this first step helps separate hyperbole from genuine applications amenable to quantum computational speedups.

Acknowledgements


We would like to thank Katie McCormick, our Quantum Science Communicator, for helping to write this blog post.


SANPO: A Scene understanding, Accessibility, Navigation, Pathfinding, & Obstacle avoidance dataset


As most people navigate their everyday world, they process visual input from the environment using an eye-level perspective. Unlike robots and self-driving cars, people don’t have any “out-of-body” sensors to help guide them. Instead, a person’s sensory input is completely “egocentric”, or “from the self.” This also applies to new technologies that understand the world around us from a human-like perspective, e.g., robots navigating through unknown buildings, AR glasses that highlight objects, or assistive technology to help people run independently.

In computer vision, scene understanding is the subfield that studies how visible objects relate to the scene’s 3D structure and layout by focusing on the spatial, functional, and semantic relationships between objects and their environment. For example, autonomous drivers must understand the 3D structure of the road, sidewalks, and surrounding buildings while identifying and recognizing street signs and stop lights, a task made easier with 3D data from a special laser scanner mounted on the top of the car rather than 2D images from the driver’s perspective. Robots navigating a park must understand where the path is and what obstacles might interfere, which is simplified with a map of their surroundings and GPS positioning data. Finally, AR glasses that help users find their way need to understand where the user is and what they are looking at.

The computer vision community typically studies scene understanding tasks in contexts like self-driving, where many other sensors (GPS, wheel positioning, maps, etc.) beyond egocentric imagery are available. Yet most datasets in this space do not focus exclusively on egocentric data, so they are less applicable to human-centered navigation tasks. While there are plenty of self-driving focused scene understanding datasets, they have limited generalization to egocentric human scene understanding. A comprehensive human egocentric dataset would help build systems for related applications and serve as a challenging benchmark for the scene understanding community.

To that end, we present the Scene understanding, Accessibility, Navigation, Pathfinding, Obstacle avoidance dataset, or SANPO (also the Japanese word for “brisk stroll”), a multi-attribute video dataset for outdoor human egocentric scene understanding. The dataset consists of real-world data and synthetic data, which we call SANPO-Real and SANPO-Synthetic, respectively. It supports a wide variety of dense prediction tasks, is challenging for current models, and includes real and synthetic data with depth maps and video panoptic masks in which each pixel is assigned a semantic class label (and, for some semantic classes, a semantic instance ID that uniquely identifies that object in the scene). The real dataset covers diverse environments and includes videos from two stereo cameras to support multi-view methods, comprising 11.4 hours captured at 15 frames per second (FPS) with dense annotations. Researchers can download and use SANPO here.

3D scene of a real session built using the provided annotations (segmentation, depth and camera positions). The top center video shows the depth map, and the top right shows the RGB or semantic annotations.

SANPO-Real

SANPO-Real is a multiview video dataset containing 701 sessions recorded with two stereo cameras: a head-mounted ZED Mini and a chest-mounted ZED 2i. That’s four RGB streams per session at 15 FPS. 597 sessions are recorded at a resolution of 2208×1242 pixels, and the remainder are recorded at 1920×1080 pixels. Each session is approximately 30 seconds long, and the recorded videos are rectified using ZED software and saved in a lossless format. Each session has high-level attribute annotations, camera pose trajectories, dense depth maps from CREStereo, and sparse depth maps provided by the ZED SDK. A subset of sessions have temporally consistent panoptic segmentation annotations of each instance.
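
For orientation, here is a hedged sketch of what one SANPO-Real session record contains, based on the description above. The field names, types, and shapes are illustrative placeholders, not the schema of the released files.

```python
from dataclasses import dataclass
from typing import Dict, Optional
import numpy as np

@dataclass
class SanpoRealSession:
    session_id: str
    # Four rectified RGB streams at 15 FPS: left/right lenses of the head-mounted ZED Mini
    # and the chest-mounted ZED 2i, at 2208x1242 or 1920x1080 pixels, ~30 s per session.
    rgb_frames: Dict[str, np.ndarray]            # stream name -> (T, H, W, 3) uint8
    camera_poses: np.ndarray                     # (T, 4, 4) camera pose trajectory
    dense_depth: np.ndarray                      # (T, H, W) CREStereo depth, meters
    sparse_depth: np.ndarray                     # (T, H, W) ZED SDK depth, meters (with holes)
    attributes: Dict[str, str]                   # high-level session attributes (environment, etc.)
    panoptic_masks: Optional[np.ndarray] = None  # (T_annotated, H, W, 2): class ID + instance ID
```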

The SANPO data collection system for collecting real-world data. Right: (i) a backpack with ZED 2i and ZED Mini cameras for data collection (bottom), (ii) the inside of the backpack showing the ZED box and battery pack mounted on a 3D printed container (middle), and (iii) an Android app showing the live feed from the ZED cameras (top). Left: The chest-mounted ZED 2i has a stereo baseline of 12cm with a 2.1mm focal length, and the head-mounted ZED Mini has a baseline of 6.3cm with a 2.1mm focal length.

Temporally consistent panoptic segmentation annotation protocol

SANPO includes thirty different class labels, including various surfaces (road, sidewalk, curb, etc.), fences (guard rails, walls, gates), obstacles (poles, bike racks, trees), and creatures (pedestrians, riders, animals). Gathering high-quality annotations for these classes is an enormous challenge. To provide temporally consistent panoptic segmentation annotations, we divide each video into 30-second sub-videos and annotate every fifth frame (90 frames per sub-video) using a cascaded annotation protocol. At each stage, we ask annotators to draw borders around five mutually exclusive labels at a time. We send the same image to different annotators through as many stages as it takes for all labels to be assigned, with annotations from previous stages frozen and shown to the annotator. We use AOT, a machine learning model that reduces annotation effort by giving annotators automatic masks to start from, propagated from previously annotated frames. AOT also infers segmentation annotations for intermediate frames using the manually annotated preceding and following frames. Overall, this approach reduces annotation time, improves boundary precision, and ensures temporally consistent annotations for up to 30 seconds.
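
The arithmetic behind the protocol is simple: a 30-second sub-video at 15 FPS has 450 frames, every fifth of which is manually annotated, with the rest handled by AOT propagation. The sketch below just makes that bookkeeping explicit.

```python
# Purely illustrative frame bookkeeping for one sub-video of the annotation protocol.
FPS = 15
SUB_VIDEO_SECONDS = 30
STRIDE = 5

total_frames = FPS * SUB_VIDEO_SECONDS                   # 450 frames per sub-video
manual_frames = list(range(0, total_frames, STRIDE))     # frames sent to human annotators
propagated_frames = [f for f in range(total_frames) if f % STRIDE != 0]  # AOT-propagated
print(len(manual_frames), len(propagated_frames))        # 90 360
```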

Temporally consistent panoptic segmentation annotations. The segmentation mask’s title indicates whether it was manually annotated or AOT propagated.

SANPO-Synthetic

Real-world data has imperfect ground truth labels due to hardware, algorithms, and human mistakes, whereas synthetic data has near-perfect ground truth and can be customized. We partnered with Parallel Domain, a company specializing in lifelike synthetic data generation, to create SANPO-Synthetic, a high-quality synthetic dataset to supplement SANPO-Real. Parallel Domain is skilled at creating handcrafted synthetic environments and data for machine learning applications. Thanks to their work, SANPO-Synthetic matches real-world capture conditions with camera parameters, placement, and scenery.

3D scene of a synthetic session built using the provided annotations (segmentation, depth and odometry). The top center video shows the depth map, and the top right shows the RGB or semantic annotations.

SANPO-Synthetic is a high-quality video dataset, handcrafted to match real-world scenarios. It contains 1961 sessions recorded using virtualized ZED cameras, evenly split between chest-mounted and head-mounted positions and calibrations. These videos are monocular, recorded from the left lens only. The sessions vary in length and FPS (5, 14.28, and 33.33) to offer a mix of temporal resolution and length tradeoffs, and are saved in a lossless format. All sessions have precise camera pose trajectories, dense, pixel-accurate depth maps, and temporally consistent panoptic segmentation masks.

SANPO-Synthetic data has pixel-perfect annotations, even for small and distant instances. This helps develop challenging datasets that mimic the complexity of real-world scenes. SANPO-Synthetic and SANPO-Real are also drop-in replacements for each other, so researchers can study domain transfer tasks or use synthetic data during training with few domain-specific assumptions.

An even sampling of real and synthetic scenes.

Statistics

Semantic classes

We designed our SANPO taxonomy: i) with human egocentric navigation in mind, ii) with the goal of being reasonably easy to annotate, and iii) to be as close as possible to the existing segmentation taxonomies. Though built with human egocentric navigation in mind, it can be easily mapped or extended to other human egocentric scene understanding applications. Both SANPO-Real and SANPO-Synthetic feature a wide variety of objects one would expect in egocentric obstacle detection data, such as roads, buildings, fences, and trees. SANPO-Synthetic includes a broad distribution of hand-modeled objects, while SANPO-Real features more “long-tailed” classes that appear infrequently in images, such as gates, bus stops, or animals.

Distribution of images across the classes in the SANPO taxonomy.

Instance masks

SANPO-Synthetic and a portion of SANPO-Real are also annotated with panoptic instance masks, which assign each pixel to a class and instance ID. Because it is largely human-labeled, SANPO-Real has a large number of frames with generally fewer than 20 instances per frame. In contrast, SANPO-Synthetic’s virtual environment offers pixel-accurate segmentation of most unique objects in the scene, which means that synthetic images frequently feature many more instances within each frame.

When considering per-frame instance counts, synthetic data frequently features many more instances per frame than the labeled portions of SANPO-Real.

Comparison to other datasets

We compare SANPO to other important video datasets in this field, including SCAND, MuSoHu, Ego4D, VIPSeg, and Waymo Open. Some of these are intended for robot navigation (SCAND) or autonomous driving (Waymo) tasks. Across these datasets, only Waymo Open and SANPO have both panoptic segmentations and depth maps, and only SANPO has both real and synthetic data.

Comparison to other video datasets. For stereo vs mono video, datasets marked with ★ have stereo video for all scenes and those marked ☆ provide stereo video for a subset. For depth maps, ★ indicates dense depth while ☆ represents sparse depth, e.g., from a lower-resolution LIDAR scanner.

Conclusion and future work

We present SANPO, a large-scale and challenging video dataset for human egocentric scene understanding, which includes real and synthetic samples with dense prediction annotations. We hope SANPO will help researchers build visual navigation systems for the visually impaired and advance visual scene understanding. Additional details are available in the preprint and on the SANPO dataset GitHub repository.

Acknowledgements

This dataset was the outcome of hard work of many individuals from various teams within Google and our external partner, Parallel Domain.

Core Team: Mikhail Sirotenko, Dave Hawkey, Sagar Waghmare, Kimberly Wilber, Xuan Yang, Matthew Wilson

Parallel Domain: Stuart Park, Alan Doucet, Alex Valence-Lanoue, & Lars Pandikow.

We would also like to thank the following team members: Hartwig Adam, Huisheng Wang, Lucian Ionita, Nitesh Bharadwaj, Suqi Liu, Stephanie Debats, Cattalyya Nuengsigkapian, Astuti Sharma, Alina Kuznetsova, Stefano Pellegrini, Yiwen Luo, Lily Pagan, Maxine Deines, Alex Siegman, Maura O’Brien, Rachel Stigler, Bobby Tran, Supinder Tohra, Umesh Vashisht, Sudhindra Kopalle, Reet Bhatia.


Scalable spherical CNNs for scientific applications


Typical deep learning models for computer vision, like convolutional neural networks (CNNs) and vision transformers (ViT), process signals assuming planar (flat) spaces. For example, digital images are represented as a grid of pixels on a plane. However, this type of data makes up only a fraction of the data we encounter in scientific applications. Variables sampled from the Earth’s atmosphere, like temperature and humidity, are naturally represented on the sphere. Some kinds of cosmological data and panoramic photos are also spherical signals, and are better treated as such.

Using methods designed for planar images to process spherical signals is problematic for a couple of reasons. First, there is a sampling problem, i.e., there is no way of defining uniform grids on the sphere, which are needed for planar CNNs and ViTs, without heavy distortion.

When projecting the sphere into a plane, the patch represented by the red circle is heavily distorted near the poles. This sampling problem hurts the accuracy of conventional CNNs and ViTs on spherical inputs.

Second, signals and local patterns on the sphere are often complicated by rotations, so models need a way to address that. We would like equivariance to 3D rotations, which ensures that learned features follow the rotations of the input. This leads to better utilization of the model parameters and allows training with less data. Equivariance to 3D rotations is also useful in most settings where inputs don’t have a preferred orientation, such as 3D shapes and molecules.

Drone racing with panoramic cameras. Here the sharp turns result in large 3D rotations of the spherical image. We would like our models to be robust to such rotations. Source: https://www.youtube.com/watch?v=_J7qXbbXY80 (licensed under CC BY)
In the atmosphere, it is common to see similar patterns appearing at different positions and orientations. We would like our models to share parameters to recognize these patterns.

With the above challenges in mind, in “Scaling Spherical CNNs”, presented at ICML 2023, we introduce an open-source library in JAX for deep learning on spherical surfaces. We demonstrate how applications of this library match or surpass state-of-the-art performance on weather forecasting and molecular property prediction benchmarks, tasks that are typically addressed with transformers and graph neural networks.

Background on spherical CNNs

Spherical CNNs solve both the sampling problem and the problem of robustness to rotation by leveraging spherical convolution and cross-correlation operations, which are typically computed via generalized Fourier transforms. For planar surfaces, however, convolution with small filters is faster, because it can be performed on regular grids without Fourier transforms. This higher computational cost for spherical inputs has so far restricted the application of spherical CNNs to small models and low-resolution datasets.
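
To give a flavor of the spectral approach, the sketch below implements the core identity behind spherical convolution with a zonal (rotation-symmetric) filter: in the spherical-harmonic domain it reduces to a per-degree multiplication of coefficients. The coefficient layout and normalization follow one common textbook convention and are not the internals or API of our open-source library.

```python
import jax
import jax.numpy as jnp

def zonal_spherical_conv(f_lm: jnp.ndarray, g_l: jnp.ndarray) -> jnp.ndarray:
    """f_lm: spherical-harmonic coefficients of the signal, shape (L, 2L-1),
    with order m indexed (and zero-padded) along the last axis.
    g_l: zonal filter coefficients (only m = 0 terms), shape (L,).
    Returns the coefficients of the convolved signal."""
    L = f_lm.shape[0]
    ell = jnp.arange(L)
    # Per-degree factor from the convolution theorem on the sphere.
    scale = 2.0 * jnp.pi * jnp.sqrt(4.0 * jnp.pi / (2.0 * ell + 1.0))
    return f_lm * (scale * g_l)[:, None]     # broadcast over orders m

# Tiny usage example with random band-limit-4 coefficients.
key = jax.random.PRNGKey(0)
f_lm = jax.random.normal(key, (4, 7))
g_l = jnp.array([1.0, 0.5, 0.25, 0.125])
print(zonal_spherical_conv(f_lm, g_l).shape)  # (4, 7)
```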

Our contributions

We have implemented the spherical convolutions from spin-weighted spherical CNNs in JAX with a focus on speed, and have enabled distributed training over a large number of TPUs using data parallelism. We also introduced a new phase-collapse activation, a spectral batch-normalization layer, and a new residual block that together improve accuracy and efficiency, allowing us to train more accurate models up to 100x larger than before. We apply these new models to molecular property regression and weather forecasting.

We scale spherical CNNs by up to two orders of magnitude in terms of feature sizes and model capacity, compared to the literature: Cohen ’18, Esteves ’18, Esteves ’20, and Cobb ’21. VGG-19 is included as a conventional CNN reference. Our largest model for weather forecasting has 256 x 256 x 78 inputs and outputs, and runs 96 convolutional layers during training with a lowest internal resolution of 128 x 128 x 256.

Molecular property regression

Predicting properties of molecules has applications in drug discovery, where the goal is to quickly screen numerous molecules in search of those with desirable properties. Similar models may also be relevant in the design of drugs targeting the interaction between proteins. Current methods in computational or experimental quantum chemistry are expensive, which motivates the use of machine learning.

Molecules can be represented by a set of atoms and their positions in 3D space; rotations of the molecule change the positions but not the molecular properties. This motivates the application of spherical CNNs because of their rotation equivariance. However, molecules are not defined as signals on the sphere so the first step is to map them to a set of spherical functions. We do so by leveraging physics-based interactions between the atoms of the molecule.

Each atom is represented by a set of spherical signals accumulating physical interactions with other atoms of each type (shown in the three panels on the right). For example, the oxygen atom (O; top panel) has a channel for oxygen (indicated by the sphere labeled “O” on the left) and hydrogen (“H”, right). The accumulated Coulomb forces on the oxygen atom with respect to the two hydrogen atoms is indicated by the red shaded regions on the bottom of the sphere labeled “H”. Because the oxygen atom contributes no forces to itself, the “O” sphere is uniform. We include extra channels for the Van der Waals forces.
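
A simplified version of this featurization is sketched below: for a chosen atom, each neighbor species gets its own spherical channel, and each neighbor adds a Coulomb-like bump in its direction on a spherical grid. The grid layout, 1/r² weighting, and smoothing width are simplifying assumptions for illustration, not the exact construction used in the paper.

```python
import jax.numpy as jnp

def atom_spherical_channels(positions, species, center_idx, n_lat=16, n_lon=32, width=0.3):
    """Returns a dict: species -> (n_lat, n_lon) spherical signal around atom `center_idx`."""
    # Equiangular grid of unit directions on the sphere.
    theta = jnp.linspace(0.0, jnp.pi, n_lat)
    phi = jnp.linspace(0.0, 2.0 * jnp.pi, n_lon, endpoint=False)
    t, p = jnp.meshgrid(theta, phi, indexing="ij")
    grid = jnp.stack([jnp.sin(t) * jnp.cos(p), jnp.sin(t) * jnp.sin(p), jnp.cos(t)], axis=-1)

    center = positions[center_idx]
    channels = {}
    for s in sorted(set(species)):
        signal = jnp.zeros((n_lat, n_lon))
        for j, (pos, sj) in enumerate(zip(positions, species)):
            if j == center_idx or sj != s:
                continue
            d = pos - center
            r = jnp.linalg.norm(d)
            # Coulomb-like magnitude, smeared as a bump around the neighbor's direction.
            align = grid @ (d / r)           # cosine between each grid point and the neighbor
            signal = signal + (1.0 / r**2) * jnp.exp((align - 1.0) / width)
        channels[s] = signal
    return channels

# Usage on a toy water-like geometry (positions in angstroms, species labels).
positions = jnp.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
species = ["O", "H", "H"]
feats = atom_spherical_channels(positions, species, center_idx=0)
print({k: v.shape for k, v in feats.items()})   # {'H': (16, 32), 'O': (16, 32)}
```

Note that, as in the figure above, the “O” channel of the oxygen atom stays uniform (here, all zeros) because an atom contributes no forces to itself.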

Spherical CNNs are applied to each atom’s features, and the results are later combined to produce the property predictions. This results in state-of-the-art performance on most properties as typically evaluated in the QM9 benchmark:

Error comparison against the state-of-the-art on 12 properties of QM9 (see the dataset paper for details). We show TorchMD-Net and PaiNN results, normalizing TorchMD-Net errors to 1.0 (lower is better). Our model, shown in green, outperforms the baselines in most targets.

Weather forecasting

Accurate climate forecasts serve as invaluable tools for providing timely warnings of extreme weather events, enabling effective water resource management, and guiding informed infrastructure planning. In a world increasingly threatened by climate disasters, there is an urgency to deliver forecasts much faster and more accurately over a longer time horizon than general circulation models. Forecasting models will also be important for predicting the safety and effectiveness of efforts intended to combat climate change, such as climate interventions. The current state-of-the-art uses costly numerical models based on fluid dynamics and thermodynamics, which tend to drift after a few days.

Given these challenges, there is an urgency for machine learning researchers to address climate forecasting problems, as data-driven techniques have the potential of both reducing the computational cost and improving long range accuracy. Spherical CNNs are suitable for this task since atmospheric data is natively presented on the sphere. They can also efficiently handle repeating patterns at different positions and orientations that are common in such data.

We apply our models to several weather forecasting benchmarks and outperform or match neural weather models based on conventional CNNs (specifically, 1, 2, and 3). Below we show results in a test setting where the model takes a number of atmospheric variables as input and predicts their values six hours ahead. The model is then iteratively applied on its own predictions to produce longer forecasts. During training, the model predicts up to three days ahead, and is evaluated up to five days. Keisler proposed a graph neural network for this task, but we show that spherical CNNs can match the GNN accuracy in the same setting.
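
The iterative rollout itself is straightforward; a sketch is shown below, where a placeholder model that maps the current atmospheric state to the state six hours later is applied repeatedly to its own output (20 steps for a five-day forecast). The `model` and state shape are stand-ins, not our actual architecture.

```python
import jax
import jax.numpy as jnp

def rollout(model_fn, initial_state, n_steps):
    """Apply a 6-hour forecast model `n_steps` times; 20 steps = a 5-day forecast."""
    def step(state, _):
        next_state = model_fn(state)
        return next_state, next_state          # carry the new state and record the prediction
    _, trajectory = jax.lax.scan(step, initial_state, xs=None, length=n_steps)
    return trajectory                           # shape: (n_steps, *state.shape)

# Toy usage: a stand-in "model" that just damps the state toward zero.
dummy_model = lambda s: 0.99 * s
state0 = jnp.ones((78, 256, 256))              # e.g., 78 atmospheric channels on a 256x256 grid
forecasts = rollout(dummy_model, state0, n_steps=20)
print(forecasts.shape)                          # (20, 78, 256, 256)
```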

Iterative weather forecasting up to five days (120h) ahead with spherical CNNs. The animations show the specific humidity forecast at a given pressure and its error.
Wind speed and temperature forecasts with spherical CNNs.

Additional resources

Our JAX library for efficient spherical CNNs is now available. We have shown applications to molecular property regression and weather forecasting, and we believe the library will be helpful in other scientific applications, as well as in computer vision and 3D vision.

Weather forecasting is an active area of research at Google with the goal of building more accurate and robust models — like Graphcast, a recent ML-based mid-range forecasting model — and to build tools that enable further advancement across the research community, such as the recently released WeatherBench 2.

Acknowledgements

This work was done in collaboration with Jean-Jacques Slotine, and is based on previous collaborations with Kostas Daniilidis and Christine Allen-Blanchette. We thank Stephan Hoyer, Stephan Rasp, and Ignacio Lopez-Gomez for helping with data processing and evaluation, and Fei Sha, Vivian Yang, Anudhyan Boral, Leonardo Zepeda-Núñez, and Avram Hershko for suggestions and discussions. We are thankful to Michael Riley and Corinna Cortes for supporting and encouraging this project.


Google at ICCV 2023


Google is proud to be a Platinum Sponsor of the International Conference on Computer Vision (ICCV 2023), a premier annual conference, which is being held this week in Paris, France. As a leader in computer vision research, Google has a strong presence at this year’s conference with 60 accepted papers and active involvement in 27 workshops and tutorials. Google is also proud to be a Platinum Sponsor for the LatinX in CV workshop. We look forward to sharing some of our extensive computer vision research and expanding our partnership with the broader research community.

Attending ICCV 2023? We hope you’ll visit the Google booth to chat with researchers who are actively pursuing the latest innovations in computer vision, and check out some of the scheduled booth activities (e.g., demos and Q&A sessions listed below). Visit the @GoogleAI Twitter account to find out more about the Google booth activities at ICCV 2023.

Take a look below to learn more about the Google research being presented at ICCV 2023 (Google affiliations in bold).

Board and Organizing Committee

General Chair: Cordelia Schmid

Finance Chair: Ramin Zabih

Industrial Relations Chair: Rahul Sukthankar

Publicity and Social Media Co-Chair: Boqing Gong

Google Research booth activities

Title: ImagenThings: Instant Personalized Image-to-Image Generation

Presenters: Xuhui Jia, Suraj Kothawade

Wednesday, October 4th at 12:30 PM CEST

Title: Open Images V7 (paper, dataset, blog post)

Presenters: Rodrigo Benenson, Jasper Uijlings, Jordi Pont-Tuset

Wednesday, October 4th at 3:30 PM CEST

Title: AI4Design (paper)

Presenters: Andrew Marmon, Peggy Chi, C.K. Ng

Thursday, October 5th at 10:30 AM CEST

Title: Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis

Presenters: Marcel Bühler, Kripasindhu Sarkar

Thursday, October 5th at 12:30 PM CEST

Title: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images

Presenters: Yonatan Bitton

Thursday, October 5th at 1:00 PM CEST

Title: Image Search in Fact Check Explorer (blog post)

Presenters: Yair Alon, Avneesh Sud

Thursday, October 5th at 3:30 PM CEST

Title: UnLoc: A Unified Framework for Video Localization Tasks (paper)

Presenters: Arsha Nagrani, Xuehan Xiong

Friday, October 6th at 10:30 AM CEST

Title: Prompt-Tuning Latent Diffusion Models for Inverse Problems

Presenters: Hyungjin Chung

Friday, October 6th at 12:30 PM CEST

Title: Neural Implicit Representations for Real World Applications

Presenters: Federico Tombari, Fabian Manhardt, Marie-Julie Rakotosaona

Friday, October 6th at 3:30 PM CEST

Accepted papers

Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor

Xinyang Liu, Yijin Li, Yanbin Teng, Hujun Bao, Guofeng Zhang, Yinda Zhang, Zhaopeng Cui

ITI-GEN: Inclusive Text-to-Image Generation

Cheng Zhang, Xuanbai Chen, Siqi Chai, Chen Henry Wu, Dmitry Lagun, Thabo Beeler, Fernando De la Torre

ASIC: Aligning Sparse in-the-wild Image Collections

Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar

VQ3D: Learning a 3D-Aware Generative Model on ImageNet

Kyle Sargent, Jing Yu Koh, Han Zhang, Huiwen Chang, Charles Herrmann, Pratul Srinivasan, Jiajun Wu, Deqing Sun

Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

Hexiang Hu, Yi Luan, Yang Chen*, Urvashi Khandelwal, Mandar Joshi, Kenton Lee, Kristina Toutanova, Ming-Wei Chang

Sigmoid Loss for Language Image Pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer

Tracking Everything Everywhere All at Once

Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, Noah Snavely

Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields

Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman

Delta Denoising Score

Amir Hertz*, Kfir Aberman, Daniel Cohen-Or*

DreamBooth3D: Subject-Driven Text-to-3D Generation

Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, Yuanzhen Li, Varun Jampani

Encyclopedic VQA: Visual Questions about Detailed Properties of Fine-grained Categories

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel*, Felipe Cadar*, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari

GECCO: Geometrically-Conditioned Point Diffusion Models

Michał J. Tyszkiewicz, Pascal Fua, Eduard Trulls

Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition

Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng

Neural Microfacet Fields for Inverse Rendering

Alexander Mai, Dor Verbin, Falko Kuester, Sara Fridovich-Keil

Rosetta Neurons: Mining the Common Units in a Model Zoo

Amil Dravid, Yossi Gandelsman, Alexei A. Efros, Assaf Shocher

Teaching CLIP to Count to Ten

Roni Paiss*, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, Tali Dekel

Vox-E: Text-guided Voxel Editing of 3D Objects

Etai Sella, Gal Fiebelman, Peter Hedman, Hadar Averbuch-Elor

CC3D: Layout-Conditioned Generation of Compositional 3D Scenes

Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein, Leonidas Guibas, Andrea Tagliasacchi

Delving into Motion-Aware Matching for Monocular 3D Object Tracking

Kuan-Chih Huang, Ming-Hsuan Yang, Yi-Hsuan Tsai

Generative Multiplane Neural Radiance for 3D-Aware Image Generation

Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

M2T: Masking Transformers Twice for Faster Decoding

Fabian Mentzer, Eirikur Agustsson, Michael Tschannen

MULLER: Multilayer Laplacian Resizer for Vision

Zhengzhong Tu, Peyman Milanfar, Hossein Talebi

SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

Ligong Han*, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, Feng Yang

Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond

Yang Zhao, Tingbo Hou, Yu-Chuan Su, Xuhui Jia, Yandong Li, Matthias Grundmann

Unified Visual Relationship Detection with Vision and Language Models

Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

3D Motion Magnification: Visualizing Subtle Motions from Time-Varying Radiance Fields

Brandon Y. Feng, Hadi Alzayer, Michael Rubinstein, William T. Freeman, Jia-Bin Huang

Global Features are All You Need for Image Retrieval and Reranking

Shihao Shao, Kaifeng Chen, Arjun Karpur, Qinghua Cui, André Araujo, Bingyi Cao

Introducing Language Guidance in Prompt-Based Continual Learning

Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, Muhammad Zeshan Afzal

Multiscale Structure Guided Diffusion for Image Deblurring

Mengwei Ren*, Mauricio Delbracio, Hossein Talebi, Guido Gerig, Peyman Milanfar

Robust Monocular Depth Estimation under Challenging Conditions

Stefano Gasperini, Nils Morbitzer, HyunJun Jung, Nassir Navab, Federico Tombari

Score-Based Diffusion Models as Principled Priors for Inverse Imaging

Berthy T. Feng*, Jamie Smith, Michael Rubinstein, Huiwen Chang, Katherine L. Bouman, William T. Freeman

Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations

Nikolaos-Antonios Ypsilantis, Kaifeng Chen, Bingyi Cao, Mario Lipovsky, Pelin Dogan-Schonberger, Grzegorz Makosa, Boris Bluntschli, Mojtaba Seyedhosseini, Ondrej Chum, André Araujo

U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds

Yan Di, Chenyangguang Zhang, Ruida Zhang, Fabian Manhardt, Yongzhi Su, Jason Rambach, Didier Stricker, Xiangyang Ji, Federico Tombari

AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control

Ruixiang Jiang, Can Wang, Jingbo Zhang, Menglei Chai, Mingming He, Dongdong Chen, Jing Liao

Learning Versatile 3D Shape Generation with Improved AR Models

Simian Luo, Xuelin Qian, Yanwei Fu, Yinda Zhang, Ying Tai, Zhenyu Zhang, Chengjie Wang, Xiangyang Xue

Novel-view Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views

Wentian Qu, Zhaopeng Cui, Yinda Zhang, Chenyu Meng, Cuixia Ma, Xiaoming Deng, Hongan Wang

PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil*, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects

Baowen Zhang, Jiahe Li, Xiaoming Deng, Yinda Zhang, Cuixia Ma, Hongan Wang

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Spectral Graphormer: Spectral Graph-Based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images

Tze Ho Elden Tse*, Franziska Mueller, Zhengyang Shen, Danhang Tang, Thabo Beeler, Mingsong Dou, Yinda Zhang, Sasa Petrovic, Hyung Jin Chang, Jonathan Taylor, Bardia Doosti

Synthesizing Diverse Human Motions in 3D Indoor Scenes

Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, Siyu Tang

Tracking by 3D Model Estimation of Unknown Objects in Videos

Denys Rozumnyi, Jiri Matas, Marc Pollefeys, Vittorio Ferrari, Martin R. Oswald

UnLoc: A Unified Framework for Video Localization Tasks

Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang*, Weina Ge, David Ross, Cordelia Schmid

Verbs in Action: Improving Verb Understanding in Video-language Models

Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, Cordelia Schmid

VLSlice: Interactive Vision-and-Language Slice Discovery

Eric Slyman, Minsuk Kahng, Stefan Lee

Yes, we CANN: Constrained Approximate Nearest Neighbors for Local Feature-Based Visual Localization

Dror Aiger, André Araujo, Simon Lynen

Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu*, Eduardo Fonseca, Radu Tudor Ionescu, Mario Lucic, Cordelia Schmid, Anurag Arnab

CLR: Channel-wise Lightweight Reprogramming for Continual Learning

Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti

LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs

Zezhou Cheng*, Carlos Esteves, Varun Jampani, Abhishek Kar, Subhransu Maji, Ameesh Makadia

Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering

Dongting Hu, Zhenkai Zhang, Tingbo Hou, Tongliang Liu, Huan Fu, Mingming Gong

Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs

Frederik Warburg, Ethan Weber, Matthew Tancik, Aleksander Holynski, Angjoo Kanazawa

Segmenting Known Objects and Unseen Unknowns without Prior Knowledge

Stefano Gasperini, Alvaro Marcos-Ramiro, Michael Schmidt, Nassir Navab, Benjamin Busam, Federico Tombari

SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection

Yichen Xie, Chenfeng Xu, Marie-Julie Rakotosaona, Patrick Rim, Federico Tombari, Kurt Keutzer, Masayoshi Tomizuka, Wei Zhan

SwiftFormer: Efficient Additive Attention for Transformer-Based Real-time Mobile Vision Applications

Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan

Agile Modeling: From Concept to Classifier in Minutes

Otilia Stretcu, Edward Vendrow, Kenji Hata, Krishnamurthy Viswanathan, Vittorio Ferrari, Sasan Tavakkol, Wenlei Zhou, Aditya Avinash, Enming Luo, Neil Gordon Alldrin, MohammadHossein Bateni, Gabriel Berger, Andrew Bunner, Chun-Ta Lu, Javier A Rey, Giulia DeSalvo, Ranjay Krishna, Ariel Fuxman

CAD-Estate: Large-Scale CAD Model Annotation in RGB Videos

Kevis-Kokitsi Maninis, Stefan Popov, Matthias Niessner, Vittorio Ferrari

Counting Crowds in Bad Weather

Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, Ming-Hsuan Yang

DreamPose: Fashion Video Synthesis with Stable Diffusion

Johanna Karras, Aleksander Holynski, Ting-Chun Wang, Ira Kemelmacher-Shlizerman

InfiniCity: Infinite-Scale City Synthesis

Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov

SAMPLING: Scene-Adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image

Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang

Tutorial

Learning with Noisy and Unlabeled Data for Large Models beyond Categorization

Sifei Liu, Hongxu Yin, Shalini De Mello, Pavlo Molchanov, Jose M. Alvarez, Jan Kautz, Xiaolong Wang, Anima Anandkumar, Ming-Hsuan Yang, Trevor Darrell

Speaker: Varun Jampani

Workshops

LatinX in AI

Platinum Sponsor

Panelists: Daniel Castro Chin, Andre Araujo

Invited Speaker: Irfan Essa

Volunteers: Ming-Hsuan Yang, Liangzhe Yuan, Pedro Velez, Vincent Etter

Scene Graphs and Graph Representation Learning

Organizer: Federico Tombari

International Workshop on Analysis and Modeling of Faces and Gestures

Speaker: Todd Zickler

3D Vision and Modeling Challenges in eCommerce

Speaker: Leonidas Guibas

BigMAC: Big Model Adaptation for Computer Vision

Organizer: Mathilde Caron

Adversarial Robustness In the Real World (AROW)

Organizer: Yutong Bai

GeoNet: 1st Workshop on Robust Computer Vision across Geographies

Speaker: Sara Beery

Organizer: Tarun Kalluri

Quo Vadis, Computer Vision?

Speaker: Bill Freeman

To NeRF or not to NeRF: A View Synthesis Challenge for Human Heads

Speaker: Thabo Beeler

Organizer: Stefanos Zafeiriou

New Ideas in Vision Transformers

Speaker: Cordelia Schmid

Organizer: Ming-Hsuan Yang

Representation Learning with Very Limited Images: The Potential of Self, Synthetic and Formula Supervision

Speaker: Manel Baradad Jurjo

Resource Efficient Deep Learning for Computer Vision

Speaker: Prateek Jain

Organizer: Jiahui Yu, Rishabh Tiwari, Jai Gupta

Computer Vision Aided Architectural Design

Speaker: Noah Snavely

AV4D: Visual Learning of Sounds in Spaces

Organizer: David Harwath

Vision-and-Language Algorithmic Reasoning

Speaker: François Chollet

Neural Fields for Autonomous Driving and Robotics

Speaker: Jon Barron

International Challenge on Compositional and Multimodal Perception

Organizer: Ranjay Krishna

Open-Vocabulary 3D Scene Understanding (OpenSUN3D)

Speaker: Thomas Funkhouser

Organizer: Francis Engelmann, Johanna Wald, Federico Tombari, Leonidas Guibas

Frontiers of Monocular 3D Perception: Geometric Foundation Models

Speaker: Leonidas Guibas

PerDream: PERception, Decision Making and REAsoning Through Multimodal Foundational Modeling

Organizer: Daniel McDuff

Recovering 6D Object Pose

Speaker: Fabian Manhardt, Martin Sundermeyer

Organizer: Martin Sundermeyer

Women in Computer Vision (WiCV)

Panelist: Arsha Nagrani

Language for 3D Scenes

Organizer: Leonidas Guibas

AI for 3D Content Creation

Speaker: Kai-Hung Chang

Organizer: Leonidas Guibas

Computer Vision for Metaverse

Speaker: Jon Barron, Thomas Funkhouser

Towards the Next Generation of Computer Vision Datasets

Speaker: Tom Duerig


* Work done while at Google
