Infinite Nature: Generating 3D Flythroughs from Still Photos

We live in a world of great natural beauty — of majestic mountains, dramatic seascapes, and serene forests. Imagine seeing this beauty as a bird does, flying past richly detailed, three-dimensional landscapes. Can computers learn to synthesize this kind of visual experience? Such a capability would allow for new kinds of content for games and virtual reality experiences: for instance, relaxing within an immersive flythrough of an infinite nature scene. But existing methods that synthesize new views from images tend to allow for only limited camera motion.

In a research effort we call Infinite Nature, we show that computers can learn to generate such rich 3D experiences simply by viewing nature videos and photographs. Our latest work on this theme, InfiniteNature-Zero (presented at ECCV 2022), can produce high-resolution, high-quality flythroughs starting from a single seed image, using a system trained only on still photographs, a capability not previously demonstrated. We call the underlying research problem perpetual view generation: given a single input view of a scene, how can we synthesize a photorealistic set of output views corresponding to an arbitrarily long, user-controlled 3D path through that scene? Perpetual view generation is very challenging because the system must generate new content on the other side of large landmarks (e.g., mountains), and render that new content with high realism and in high resolution.

Example flythrough generated with InfiniteNature-Zero. It takes a single input image of a natural scene and synthesizes a long camera path flying into that scene, generating new scene content as it goes.

Background: Learning 3D Flythroughs from Videos

To establish the basics of how such a system could work, we’ll describe our first version, “Infinite Nature: Perpetual View Generation of Natural Scenes from a Single Image” (presented at ICCV 2021). In that work we explored a “learn from video” approach, where we collected a set of online videos captured from drones flying along coastlines, with the idea that we could learn to synthesize new flythroughs that resemble these real videos. This set of online videos is called the Aerial Coastline Imagery Dataset (ACID). In order to learn how to synthesize scenes that respond dynamically to any desired 3D camera path, however, we couldn’t simply treat these videos as raw collections of pixels; we also had to compute their underlying 3D geometry, including the camera position at each frame.

The basic idea is that we learn to generate flythroughs step-by-step. Given a starting view, like the first image in the figure below, we first compute a depth map using single-image depth prediction methods. We then use that depth map to render the image forward to a new camera viewpoint, shown in the middle, resulting in a new image and depth map from that new viewpoint.

However, this intermediate image has some problems — it has holes where we can see behind objects into regions that weren’t visible in the starting image. It is also blurry, because we are now closer to objects, but are stretching the pixels from the previous frame to render these now-larger objects.

To handle these problems, we learn a neural image refinement network that takes this low-quality intermediate image and outputs a complete, high-quality image and corresponding depth map. These steps can then be repeated, with this synthesized image as the new starting point. Because we refine both the image and the depth map, this process can be iterated as many times as desired — the system automatically learns to generate new scenery, like mountains, islands, and oceans, as the camera moves further into the scene.

Our Infinite Nature methods take an input view and its corresponding depth map (left). Using this depth map, the system renders the input image to a new desired viewpoint (center). This intermediate image has problems, such as missing pixels revealed behind foreground content (shown in magenta). We learn a deep network that refines this image to produce a new high-quality image (right). This process can be repeated to produce a long trajectory of views. We thus call this approach “render-refine-repeat”.
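For concreteness, here is a minimal Python sketch of the render-refine-repeat loop described above. The functions predict_depth, render_to_view, and refine are placeholder stubs standing in for the single-image depth network, the geometric warp to the new viewpoint, and the neural refinement network; only the data flow mirrors the actual system.

```python
import numpy as np

# Placeholder stubs: the real system uses a single-image depth network, a
# geometric warp to the new viewpoint, and a neural refinement network.

def predict_depth(image):
    """Single-image depth prediction (stub: constant depth)."""
    return np.ones(image.shape[:2], dtype=np.float32)

def render_to_view(image, depth, camera_pose):
    """Warp image and depth to a new camera pose; also return a mask of
    disoccluded pixels that were hidden in the previous view (stub)."""
    hole_mask = np.zeros(image.shape[:2], dtype=bool)
    return image.copy(), depth.copy(), hole_mask

def refine(image, depth, hole_mask):
    """Neural refinement: inpaint holes and restore sharpness (stub)."""
    return image, depth

def render_refine_repeat(start_image, camera_path):
    """Generate a flythrough by iterating render -> refine along a path."""
    image, depth = start_image, predict_depth(start_image)
    frames = []
    for pose in camera_path:
        warped_img, warped_depth, holes = render_to_view(image, depth, pose)
        image, depth = refine(warped_img, warped_depth, holes)  # new start point
        frames.append(image)
    return frames

frames = render_refine_repeat(
    start_image=np.zeros((160, 256, 3), dtype=np.float32),
    camera_path=[np.eye(4)] * 8,  # one pose per generated frame
)
```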

We train this render-refine-repeat synthesis approach using the ACID dataset. In particular, we sample a video from the dataset and then a frame from that video. We then use this method to render several new views moving into the scene along the same camera trajectory as the ground truth video, as shown in the figure below, and compare these rendered frames to the corresponding ground truth video frames to derive a training signal. We also include an adversarial setup that tries to distinguish synthesized frames from real images, encouraging the generated imagery to appear more realistic.

Infinite Nature can synthesize views corresponding to any camera trajectory. During training, we run our system for T steps to generate T views along a camera trajectory calculated from a training video sequence, then compare the resulting synthesized views to the ground truth ones. In the figure, each camera viewpoint is generated from the previous one by performing a warp operation R, followed by the neural refinement operation gθ.

The resulting system can generate compelling flythroughs, as featured on the project webpage, along with a “flight simulator” Colab demo. Unlike prior methods on video synthesis, this method allows the user to interactively control the camera and can generate much longer camera paths.

InfiniteNature-Zero: Learning Flythroughs from Still Photos

One problem with this first approach is that video is difficult to work with as training data. High-quality video with the right kind of camera motion is challenging to find, and the aesthetic quality of an individual video frame generally cannot compare to that of an intentionally captured nature photograph. Therefore, in “InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images”, we build on the render-refine-repeat strategy above, but devise a way to learn perpetual view synthesis from collections of still photos — no videos needed. We call this method InfiniteNature-Zero because it learns from “zero” videos. At first, this might seem like an impossible task — how can we train a model to generate video flythroughs of scenes when all it’s ever seen are isolated photos?

To solve this problem, we had the key insight that if we take an image and render a camera path that forms a cycle — that is, where the path loops back such that the last image is from the same viewpoint as the first — then we know that the last synthesized image along this path should be the same as the input image. Such cycle consistency provides a training constraint that helps the model learn to fill in missing regions and increase image resolution during each step of view generation.
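A rough sketch of this training constraint is shown below, assuming a synthesize_along_path placeholder that performs one render-refine pass per camera pose. The cyclic path construction and the simple L2 penalty are illustrative only; the full objective also includes the adversarial terms described next.

```python
import numpy as np

def synthesize_along_path(image, poses):
    """Placeholder for one render-refine pass per pose (identity stand-in)."""
    return image

def cycle_consistency_loss(photo, num_steps=4):
    """Render along a camera loop that ends at the starting view and compare
    the final synthesized frame to the original photo."""
    # A toy cyclic path: poses going "out" and then retracing back, so the
    # final pose coincides with the first one.
    outbound = [np.eye(4) for _ in range(num_steps)]
    poses = outbound + outbound[::-1]

    final_frame = synthesize_along_path(photo, poses)
    return float(np.mean((final_frame - photo) ** 2))  # simple L2 stand-in

photo = np.random.rand(160, 256, 3).astype(np.float32)
print(cycle_consistency_loss(photo))  # 0.0 with the identity stand-in
```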

However, training with these camera cycles is insufficient for generating long and stable view sequences, so as in our original work, we include an adversarial strategy that considers long, non-cyclic camera paths, like the one shown in the figure above. In particular, if we render T frames from a starting frame, we optimize our render-refine-repeat model such that a discriminator network can’t tell which was the starting frame and which was the final synthesized frame. Finally, we add a component trained to generate high-quality sky regions to increase the perceived realism of the results.

With these insights, we trained InfiniteNature-Zero on collections of landscape photos, which are available in large quantities online. Several resulting videos are shown below — these demonstrate beautiful, diverse natural scenery that can be explored along arbitrarily long camera paths. Compared to our prior work — and to prior video synthesis methods — these results exhibit significant improvements in quality and diversity of content (details available in the paper).

Several nature flythroughs generated by InfiniteNature-Zero from single starting photos.

Conclusion

There are a number of exciting future directions for this work. For instance, our methods currently synthesize scene content based only on the previous frame and its depth map; there is no persistent underlying 3D representation. Our work points towards future algorithms that can generate complete, photorealistic, and consistent 3D worlds.

Acknowledgements

Infinite Nature and InfiniteNature-Zero are the result of a collaboration between researchers at Google Research, UC Berkeley, and Cornell University. The key contributors to the work represented in this post include Angjoo Kanazawa, Andrew Liu, Richard Tucker, Zhengqi Li, Noah Snavely, Qianqian Wang, Varun Jampani, and Ameesh Makadia.

Beyond Tabula Rasa: Reincarnating Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning that focuses on training intelligent agents using related experiences so they can learn to solve decision making tasks, such as playing video games, flying stratospheric balloons, and designing hardware chips. Due to the generality of RL, the prevalent trend in RL research is to develop agents that can efficiently learn tabula rasa, that is, from scratch without using previously learned knowledge about the problem. However, in practice, tabula rasa RL systems are typically the exception rather than the norm for solving large-scale RL problems. Large-scale RL systems, such as OpenAI Five, which achieves human-level performance on Dota 2, undergo multiple design changes (e.g., algorithmic or architectural changes) during their developmental cycle. This modification process can last months and necessitates incorporating such changes without re-training from scratch, which would be prohibitively expensive. 

Furthermore, the inefficiency of tabula rasa RL research can exclude many researchers from tackling computationally-demanding problems. For example, the quintessential benchmark of training a deep RL agent on 50+ Atari 2600 games in ALE for 200M frames (the standard protocol) requires 1,000+ GPU days. As deep RL moves towards more complex and challenging problems, the computational barrier to entry in RL research will likely become even higher.

To address the inefficiencies of tabula rasa RL, we present “Reincarnating Reinforcement Learning: Reusing Prior Computation To Accelerate Progress” at NeurIPS 2022. Here, we propose an alternative approach to RL research, where prior computational work, such as learned models, policies, logged data, etc., is reused or transferred between design iterations of an RL agent or from one agent to another. While some sub-areas of RL leverage prior computation, most RL agents are still largely trained from scratch. Until now, there has been no broader effort to leverage prior computational work for the training workflow in RL research. We have also released our code and trained agents to enable researchers to build on this work.

Tabula rasa RL vs. Reincarnating RL (RRL). While tabula rasa RL focuses on learning from scratch, RRL is based on the premise of reusing prior computational work (e.g., prior learned agents) when training new agents or improving existing agents, even in the same environment. In RRL, new agents need not be trained from scratch, except for initial forays into new problems.

Why Reincarnating RL?

Reincarnating RL (RRL) is a more compute- and sample-efficient workflow than training from scratch. RRL can democratize research by allowing the broader community to tackle complex RL problems without requiring excessive computational resources. Furthermore, RRL can enable a benchmarking paradigm where researchers continually improve and update existing trained agents, especially on problems where improving performance has real-world impact, such as balloon navigation or chip design. Finally, real-world RL use cases will likely be in scenarios where prior computational work is available (e.g., existing deployed RL policies).

RRL as an alternative research workflow. Imagine a researcher who has trained an agent A1 for some time, but now wants to experiment with better architectures or algorithms. While the tabula rasa workflow requires retraining another agent from scratch, RRL provides the more viable option of transferring the existing agent A1 to another agent and training this agent further, or simply fine-tuning A1.

While there have been some ad hoc large-scale reincarnation efforts with limited applicability, e.g., model surgery in Dota2, policy distillation in Rubik’s cube, PBT in AlphaStar, RL fine-tuning a behavior-cloned policy in AlphaGo / Minecraft, RRL has not been studied as a research problem in its own right. To this end, we argue for developing general-purpose RRL approaches as opposed to prior ad-hoc solutions.

Case Study: Policy to Value Reincarnating RL

Different RRL problems can be instantiated depending on the kind of prior computational work provided. As a step towards developing broadly applicable RRL approaches, we present a case study on the setting of Policy to Value reincarnating RL (PVRL) for efficiently transferring an existing sub-optimal policy (teacher) to a standalone value-based RL agent (student). While a policy directly maps a given environment state (e.g., a game screen in Atari) to an action, value-based agents estimate the effectiveness of an action at a given state in terms of achievable future rewards, which allows them to learn from previously collected data.

For a PVRL algorithm to be broadly useful, it should satisfy the following requirements:

  • Teacher Agnostic: The student shouldn’t be constrained by the existing teacher policy’s architecture or training algorithm.
  • Weaning off the teacher: It is undesirable to maintain dependency on past suboptimal teachers for successive reincarnations.
  • Compute / Sample Efficient: Reincarnation is only useful if it is cheaper than training from scratch.

Given the PVRL algorithm requirements, we evaluate whether existing approaches, designed with closely related goals, will suffice. We find that such approaches either result in small improvements over tabula rasa RL or degrade in performance when weaning off the teacher.

To address these limitations, we introduce a simple method, QDagger, in which the agent distills knowledge from the suboptimal teacher via an imitation algorithm while simultaneously using its environment interactions for RL. We start with a deep Q-network (DQN) agent trained for 400M environment frames (a week of single-GPU training) and use it as the teacher for reincarnating student agents trained on only 10M frames (a few hours of training), where the teacher is weaned off over the first 6M frames. For benchmark evaluation, we report the interquartile mean (IQM) metric from the RLiable library. As shown below for the PVRL setting on Atari games, we find that the QDagger RRL method outperforms prior approaches.
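Below is a hedged sketch of what such a combined objective can look like: a standard TD (RL) loss plus a distillation term toward the teacher's policy, with the distillation weight decayed to zero over the weaning period. The function names, temperature, and linear schedule are illustrative, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def qdagger_loss(student_q, teacher_q, td_loss, env_frames,
                 wean_frames=6_000_000, temperature=1.0):
    """Illustrative QDagger-style objective: TD (RL) loss plus a distillation
    term toward the teacher's policy, with the distillation weight decayed
    ("weaned off") over the first `wean_frames` environment frames."""
    teacher_pi = F.softmax(teacher_q / temperature, dim=-1)
    student_log_pi = F.log_softmax(student_q / temperature, dim=-1)

    # Cross-entropy of the student's policy under the teacher's policy.
    distill = -(teacher_pi * student_log_pi).sum(dim=-1).mean()

    # Linear weaning schedule (illustrative).
    lam = max(0.0, 1.0 - env_frames / wean_frames)
    return td_loss + lam * distill

# Toy usage: a batch of 32 states with 6 actions and a dummy TD loss.
student_q = torch.randn(32, 6, requires_grad=True)
teacher_q = torch.randn(32, 6)
loss = qdagger_loss(student_q, teacher_q, td_loss=torch.tensor(0.5),
                    env_frames=2_000_000)
loss.backward()
```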

Benchmarking PVRL algorithms on Atari, with teacher-normalized scores aggregated across 10 games. Tabula rasa DQN (–·–) obtains a normalized score of 0.4. Standard baseline approaches include kickstarting, JSRL, rehearsal, offline RL pre-training and DQfD. Among all methods, only QDagger surpasses teacher performance within 10 million frames and outperforms the teacher in 75% of the games.

Reincarnating RL in Practice

We further examine the RRL approach on the Arcade Learning Environment, a widely used deep RL benchmark. First, we take a Nature DQN agent that uses the RMSProp optimizer and fine-tune it with the Adam optimizer to create a DQN (Adam) agent. While it is possible to train a DQN (Adam) agent from scratch, we demonstrate that fine-tuning Nature DQN with the Adam optimizer matches the from-scratch performance using 40x less data and compute.
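The recipe can be summarized with a short sketch: reuse the trained network weights and replay data, but continue training with the Adam optimizer at a reduced learning rate. The network definition, file paths, and learning-rate value below are placeholders rather than the exact setup used in the paper.

```python
import torch

# Sketch of reincarnation via fine-tuning: reuse trained weights and replay
# data, switch the optimizer from RMSProp to Adam with a reduced learning
# rate, and keep training.

q_network = torch.nn.Sequential(              # stand-in for the DQN network
    torch.nn.Flatten(),
    torch.nn.Linear(84 * 84 * 4, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 18),                 # one Q-value per action
)
# q_network.load_state_dict(torch.load("nature_dqn_200M.pt"))  # reuse weights (hypothetical path)
# replay_buffer = load_replay("nature_dqn_replay/")            # reuse data (hypothetical helper)

optimizer = torch.optim.Adam(q_network.parameters(), lr=6.25e-5)  # reduced LR (illustrative value)

# ...continue standard DQN training for ~20M frames instead of 400M from scratch...
```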

Reincarnating DQN (Adam) via Fine-Tuning. The vertical separator corresponds to loading network weights and replay data for fine-tuning. Left: Tabula rasa Nature DQN nearly converges in performance after 200M environment frames. Right: Fine-tuning this Nature DQN agent using a reduced learning rate with the Adam optimizer for 20 million frames obtains similar results to DQN (Adam) trained from scratch for 400M frames.

Given the DQN (Adam) agent as a starting point, fine-tuning is restricted to the 3-layer convolutional architecture. So, we consider a more general reincarnation approach that leverages recent architectural and algorithmic advances without training from scratch. Specifically, we use QDagger to reincarnate another RL agent that uses a more advanced RL algorithm (Rainbow) and a better neural network architecture (Impala-CNN ResNet) from the fine-tuned DQN (Adam) agent.

Reincarnating a different architecture / algorithm via QDagger. The vertical separator is the point at which we apply offline pre-training using QDagger for reincarnation. Left: Fine-tuning DQN with Adam. Right: Comparison of a tabula rasa Impala-CNN Rainbow agent (sky blue) to an Impala-CNN Rainbow agent (pink) trained using QDagger RRL from the fine-tuned DQN (Adam). The reincarnated Impala-CNN Rainbow agent consistently outperforms its scratch counterpart. Note that further fine-tuning DQN (Adam) results in diminishing returns (yellow).

Overall, these results indicate that past research could have been accelerated by incorporating an RRL approach to designing agents, instead of re-training agents from scratch. Our paper also contains results on the Balloon Learning Environment, where we demonstrate that RRL allows us to make progress on the problem of navigating stratospheric balloons using only a few hours of TPU compute by reusing a distributed RL agent trained on TPUs for more than a month.

Discussion

Fairly comparing reincarnation approaches requires using the exact same computational work and workflow. Furthermore, the RRL research findings that generalize broadly will concern how effective an algorithm is given access to existing computational work; for example, we successfully applied QDagger, developed using Atari, for reincarnation on the Balloon Learning Environment. As such, we speculate that research in reincarnating RL can branch out in two directions:

  • Standardized benchmarks with open-sourced computational work: Akin to NLP and vision, where typically a small set of pre-trained models are common, research in RRL may also converge to a small set of open-sourced computational work (e.g., pre-trained teacher policies) on a given benchmark.
  • Real-world domains: Since obtaining higher performance has real-world impact in some domains, it incentivizes the community to reuse state-of-the-art agents and try to improve their performance.

See our paper for a broader discussion on scientific comparisons, generalizability and reproducibility in RRL. Overall, we hope that this work motivates researchers to release computational work (e.g., model checkpoints) on which others could directly build. In this regard, we have open-sourced our code and trained agents with their final replay buffers. We believe that reincarnating RL can substantially accelerate research progress by building on prior computational work, as opposed to always starting from scratch.

Acknowledgements

This work was done in collaboration with Pablo Samuel Castro, Aaron Courville and Marc Bellemare. We’d like to thank Tom Small for the animated figure used in this post. We are also grateful for feedback by the anonymous NeurIPS reviewers and several members of the Google Research team, DeepMind and Mila.

3 ways AI is scaling helpful technologies worldwide

I was first introduced to neural networks as an undergraduate in 1990. Back then, many people in the AI community were excited about the potential of neural networks, which were impressive, but couldn’t yet accomplish important, real-world tasks. I was excited, too! I did my senior thesis on using parallel computation to train neural networks, thinking we only needed 32X more compute power to get there. I was way off. At that time, we needed 1 million times as much computational power.

A short 21 years later, with exponentially more computational power, it was time to take another crack at neural networks. In 2011, I and a few others at Google started training very large neural networks using millions of randomly selected frames from videos online. The results were remarkable. Without explicit training, the system automatically learned to recognize different objects (especially cats, the Internet is full of cats). This was one transformational discovery in AI among a long string of successes that is still ongoing — at Google and elsewhere.

I share my own history of neural networks to illustrate that, while progress in AI might feel especially fast right now, it’s come from a long arc of progress. In fact, prior to 2012, computers had a really difficult time seeing, hearing, or understanding spoken or written language. Over the past 10 years we’ve made especially rapid progress in AI.

Today, we’re excited about many recent advances in AI that Google is leading — not just on the technical side, but in responsibly deploying it in ways that help people around the world. That means deploying AI in Google Cloud, in our products from Pixel phones to Google Search, and in many fields of science and other human endeavors.

We’re aware of the challenges and risks that AI poses as an emerging technology. We were the first major company to release and operationalize a set of AI Principles, and following them has actually (and some might think counterintuitively) allowed us to focus on making rapid progress on technologies that can be helpful to everyone. Getting AI right needs to be a collective effort — involving not just researchers, but domain experts, developers, community members, businesses, governments and citizens.

I’m happy to make announcements in three transformative areas of AI today: first, using AI to make technology accessible in many more languages. Second, exploring how AI might bolster creativity. And third, advancing AI for Social Good, including climate adaptation.

1. Supporting 1,000 languages with AI

Language is fundamental to how people communicate and make sense of the world. So it’s no surprise it’s also the most natural way people engage with technology. But more than 7,000 languages are spoken around the world, and only a few are well represented online today. That means traditional approaches to training language models on text from the web fail to capture the diversity of how we communicate globally. This has historically been an obstacle in the pursuit of our mission to make the world’s information universally accessible and useful.

That’s why today we’re announcing the 1,000 Languages Initiative, an ambitious commitment to build an AI model that will support the 1,000 most spoken languages, bringing greater inclusion to billions of people in marginalized communities all around the world. This will be an undertaking spanning many years – some may even call it a moonshot – but we are already making meaningful strides here and see the path clearly. Technology has been changing at a rapid clip – from the way people use it to what it’s capable of. Increasingly, we see people finding and sharing information via new modalities like images, videos, and speech. And our most advanced language models are multimodal – meaning they’re capable of unlocking information across these many different formats. With these seismic shifts come new opportunities.

As part of this initiative and our focus on multimodality, we’ve developed a Universal Speech Model — or USM — that’s trained on over 400 languages, the largest language coverage of any speech model to date. As we expand on this work, we’re partnering with communities across the world to source representative speech data. We recently announced voice typing for 9 more African languages on Gboard by working closely with researchers and organizations in Africa to create and publish data. And in South Asia, we are actively working with local governments, NGOs, and academic institutions to eventually collect representative audio samples from across all the region’s dialects and languages.

2. Empowering creators and artists with AI

AI-powered generative models have the potential to unlock creativity, helping people across cultures express themselves using video, imagery, and design in ways that they previously could not.

Our researchers have been hard at work developing models that lead the field in terms of quality, generating images that human raters prefer over other models. We recently shared important breakthroughs, applying our diffusion model to video sequences and generating long coherent videos for a sequence of text prompts. We can combine these techniques to produce video — for the first time, today we’re sharing AI-generated super-resolution video:

We’ll soon be bringing our text-to-image generation technologies to AI Test Kitchen, which provides a way for people to learn about, experience, and give feedback on emerging AI technology. We look forward to hearing feedback from users on these demos in AI Test Kitchen Season 2. You’ll be able to build themed cities with “City Dreamer” and design friendly monster characters that can move, dance, and jump with “Wobble” — all by using text prompts.

In addition to 2D images, text-to-3D is now a reality with DreamFusion, which produces a three-dimensional model that can be viewed from any angle and can be composited into any 3D environment. Researchers are also making significant progress in the audio generation space with AudioLM, a model that learns to generate realistic speech and piano music by listening to audio only. In the same way a language model might predict the words and sentences that follow a text prompt, AudioLM can predict which sounds should follow after a few seconds of an audio prompt.

We’re collaborating with creative communities globally as we develop these tools. For example, we’re working with writers using Wordcraft, which is built on our state-of-the-art dialog system LaMDA, to experiment with AI-powered text generation. You can read the first volume of these stories at the Wordcraft Writers Workshop.

3. Addressing climate change and health challenges with AI

AI also has great potential to address the effects of climate change, including helping people adapt to new challenges. One of the worst is wildfires, which affect hundreds of thousands of people today, and are increasing in frequency and scale.

Today, I’m excited to share that we’ve advanced our use of satellite imagery to train AI models to identify and track wildfires in real time, helping predict how they will evolve and spread. We’ve launched this wildfire tracking system in the U.S., Canada, and Mexico, and are rolling it out in parts of Australia. Since July, we’ve covered more than 30 big wildfire events in the U.S. and Canada, helping inform our users and firefighting teams with over 7 million views in Google Search and Maps.

We’re also using AI to forecast floods, another extreme weather pattern exacerbated by climate change. We’ve already helped communities to predict when floods will hit and how deep the waters will get — in 2021, we sent 115 million flood alert notifications to 23 million people over Google Search and Maps, helping save countless lives. Today, we’re sharing that we’re now expanding our coverage to more countries in South America (Brazil and Colombia), Sub-Saharan Africa (Burkina Faso, Cameroon, Chad, Democratic Republic of Congo, Ivory Coast, Ghana, Guinea, Malawi, Nigeria, Sierra Leone, Angola, South Sudan, Namibia, Liberia, and South Africa), and South Asia (Sri Lanka). We’ve used an AI technique called transfer learning to make it work in areas where there’s less data available. We’re also announcing the global launch of Google FloodHub, a new platform that displays when and where floods may occur. We’ll also be bringing this information to Google Search and Maps in the future to help more people to reach safety in flooding situations.

Finally, AI is helping provide ever more access to healthcare in under-resourced regions. For example, we’re researching ways AI can help read and analyze outputs from low-cost ultrasound devices, giving parents the information they need to identify issues earlier in a pregnancy. We also plan to continue to partner with caregivers and public health agencies to expand access to diabetic retinopathy screening through our Automated Retinal Disease Assessment tool (ARDA). Through ARDA, we’ve successfully screened more than 150,000 patients in countries like India, Thailand, Germany, the United States, and the United Kingdom across deployed use and prospective studies — more than half of those in 2022 alone. Further, we’re exploring how AI can help your phone detect respiratory and heart rates. This work is part of Google Health’s broader vision, which includes making healthcare more accessible for anyone with a smartphone.

AI in the years ahead

Our advancements in neural network architectures, machine learning algorithms and new approaches to hardware for machine learning have helped AI solve important, real-world problems for billions of people. Much more is to come. What we’re sharing today is a hopeful vision for the future — AI is letting us reimagine how technology can be helpful. We hope you’ll join us as we explore these new capabilities and use this technology to improve people’s lives around the world.

How we’re using AI to help address the climate crisis

Communities around the world are facing the effects of climate change — from devastating floods and wildfires to challenges around food security. As global leaders meet in Egypt for COP27, a key area of focus will be on how we can work together to adapt to climate change and implement sustainable solutions. At Google, we’re investing in technologies that can help communities prepare for and respond to climate-related disasters and threats.

Tools to alert people and governments about immediate risks

Natural disasters are increasing in frequency and intensity due to climate change. As part of our Crisis Response efforts, we’re working to bring trusted information to people in critical moments to keep them safe and informed. To do so, we rely on the research and development of our AI-powered technologies and longstanding partnerships with frontline emergency workers and organizations. Here’s a look at some of our crisis response efforts and new ways we’re expanding these tools.

  • Floods: Catastrophic damage from flooding affects more than 250 million people every year. In 2018, we launched our flood forecasting initiative that uses machine learning models to provide people with detailed alerts. In 2021, we sent 115 million flood alert notifications to 23 million people over Search and Maps, helping save countless lives. Today, we’re expanding our flood forecasts to river basins in 18 additional countries across Africa, Latin America and Southeast Asia. We’re also announcing the global launch of the new FloodHub, a platform that displays flood forecasts and shows when and where floods may occur to help people directly at risk and provide critical information to aid organizations and governments. This expansion in geographic coverage is possible thanks to our recent breakthroughs in AI-based flood forecasting models, and we’re committed to expanding to more countries.

The new Google FloodHub at g.co/floodhub shows forecasts for riverine floods. Forecasts are now available in 18 additional countries: Brazil, Colombia, Sri Lanka, Burkina Faso, Cameroon, Chad, Democratic Republic of Congo, Ivory Coast, Ghana, Guinea, Malawi, Nigeria, Sierra Leone, Angola, South Sudan, Namibia, Liberia, South Africa.

  • Wildfires: Wildfires affect hundreds of thousands of people each year, and are increasing in frequency and size. I experienced firsthand the need for accurate information when wildfires occur, and this inspired our crisis response work. We detect wildfire boundaries using new AI models based on satellite imagery and show their real-time location in Search and Maps. Since July, we’ve covered more than 30 big wildfire events in the U.S. and Canada, helping inform people and firefighting teams with over 7 million views in Search and Maps. Wildfire detection is now available in the U.S., Canada, Mexico and Australia.

The location of the Pukatawagan fire in Manitoba, Canada.

  • Hurricanes: Access to authoritative forecasts and safety information about hurricanes can be life-saving. In the days before a hurricane in North America or a typhoon in Japan, detailed forecasts from authoritative sources appear on SOS Alerts in Search and Maps to show a storm’s predicted trajectory. We’re also using machine learning to analyze satellite imagery after disasters and identify which areas need help. When Hurricane Ian hit Florida in September, this technology was deployed in partnership with Google.org grantee GiveDirectly to quickly allocate aid to those most affected.

Managing current and future climate impacts

Climate change poses a threat to our world’s natural resources and food security. We’re working with governments, organizations and communities to provide information and technologies to help adapt to these changes.

  • Keeping cities greener and healthier: Extreme temperatures and poor air quality are increasingly common in cities and can impact public health. To mitigate this, our Project Green Light uses AI to optimize traffic lights at intersections around the world with the aim to help minimize congestion and related pollution. Project Air View also brings detailed air quality maps to scientists, policymakers and communities. And we’re working to expand our Environmental Insights Explorer’s Tree Canopy Insights tool to hundreds of cities by the end of this year so they can use trees to lower street-level temperatures and improve quality of life.
  • Meeting the world’s growing demand for food: Mineral — a project from X, Alphabet’s moonshot factory — is working to build a more sustainable and productive food system. The team is joining diverse data sets in radically new ways — from soil and weather data to drone and satellite images — and using AI to reveal insights never before possible about what’s happening with crops. As part of our Startups For Sustainable Development program, we’re also supporting startups addressing food security. These include startups like OKO, which provides crop insurance to keep farmers in business in case of adverse weather events and has reached tens of thousands of farmers in Mali and Uganda.
  • Helping farmers protect their crops: Pest infestations can threaten entire crops and impact the livelihoods of millions. In collaboration with InstaDeep and the Food and Agriculture Organization of the United Nations, our team at the Google AI Center in Ghana is using AI to better detect locust outbreaks so that it’s possible to implement control measures. In India, Google.org Fellows worked with Wadhwani AI to create an AI-powered app that helps identify and treat infestations of pests, resulting in a 20% reduction in pesticide sprays and a 26% increase in profit margins for farmers. Google Cloud is also working with agricultural technology companies to use machine learning and cloud services to improve crop yields.
  • Analyzing a changing planet: Using Google Cloud and Google Earth Engine, organizations and businesses can better assess and manage climate risks. For example, the U.S. Forest Service uses these tools to analyze land-cover changes to better respond to new wildfire threats and monitor the impacts of invasive insects, diseases and droughts. Similarly, the Bank of Montreal is integrating climate data — like precipitation trends — into its business strategy and risk management for clients.

AI already plays a critical role in addressing many urgent, climate-related challenges. It is important that we continue to invest in research and raise awareness about why we are doing this work. Google Arts and Culture has collaborated with artists on the Culture meets Climate collection so everyone can explore more perspectives on climate change. And at COP27 we hope to generate more awareness and engage in productive discussions about how to use AI, innovations, and shared data to help global communities address the changing climate.

Robots That Write Their Own Code

A common approach used to control robots is to program them with code to detect objects, sequencing commands to move actuators, and feedback loops to specify how the robot should perform a task. While these programs can be expressive, re-programming policies for each new task can be time consuming, and requires domain expertise.

What if, when given instructions from people, robots could autonomously write their own code to interact with the world? It turns out that the latest generation of language models, such as PaLM, are capable of complex reasoning and have also been trained on millions of lines of code. Given natural language instructions, current language models are highly proficient at writing not only generic code but, as we’ve discovered, code that can control robot actions as well. When provided with several example instructions (formatted as comments) paired with corresponding code (via in-context learning), language models can take in new instructions and autonomously generate new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime. More broadly, this suggests an alternative approach to using machine learning for robots that (i) pursues generalization through modularity and (ii) leverages the abundance of open-source code and data available on the Internet.
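The prompting pattern is easy to illustrate with a sketch. The robot API names and the language-model call below are placeholders, not the actual interface used in this work; they only show how hints (import statements) and instruction-to-code examples are concatenated before a new instruction.

```python
# Sketch of few-shot prompting for robot code. The robot API names and the
# language-model call are placeholders used only to show how hints (imports)
# and instruction-to-code examples are combined with a new instruction.

PROMPT_PREFIX = '''
# Python robot control script.
import numpy as np
from robot_api import get_obj_pos, put_first_on_second  # hint: available APIs

# instruction: move the red block to the left of the blue bowl.
target = get_obj_pos("blue bowl") + np.array([-0.1, 0.0])
put_first_on_second("red block", target)

# instruction: stack the green block on the yellow block.
put_first_on_second("green block", "yellow block")
'''.strip()

def build_prompt(instruction: str) -> str:
    """Append a new instruction; the model continues the script with new code."""
    return f"{PROMPT_PREFIX}\n\n# instruction: {instruction}\n"

prompt = build_prompt("put the blocks in a diagonal line.")
# generated_code = code_llm.complete(prompt)     # hypothetical LLM call
# exec(generated_code, robot_api_namespace)      # run on the robot
print(prompt)
```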

Given code for an example task (left), language models can re-compose API calls to assemble new robot behaviors for new tasks (right) that use the same functions but in different ways.

To explore this possibility, we developed Code as Policies (CaP), a robot-centric formulation of language model-generated programs executed on physical systems. CaP extends our prior work, PaLM-SayCan, by enabling language models to complete even more complex robotic tasks with the full expression of general-purpose Python code. With CaP, we propose using language models to directly write robot code through few-shot prompting. Our experiments demonstrate that outputting code led to improved generalization and task performance over directly learning robot tasks and outputting natural language actions. CaP allows a single system to perform a variety of complex and varied robotic tasks without task-specific training.

We demonstrate, across several robot systems, including a robot from Everyday Robots, that language models can autonomously interpret language instructions to generate and execute CaPs that represent reactive low-level policies (e.g., proportional-derivative or impedance controllers) and waypoint-based policies (e.g., vision-based pick and place, trajectory-based control).

A Different Way to Think about Robot Generalization

To generate code for a new task given natural language instructions, CaP uses a code-writing language model that, when prompted with hints (i.e., import statements that inform which APIs are available) and examples (instruction-to-code pairs that present few-shot “demonstrations” of how instructions should be converted into code), writes new code for new instructions. Central to this approach is hierarchical code generation, which prompts language models to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. Hierarchical code generation improves state-of-the-art on both robotics as well as standard code-gen benchmarks in natural language processing (NLP) subfields, with 39.8% pass@1 on HumanEval, a benchmark of hand-written coding problems used to measure the functional correctness of synthesized programs.

Code-writing language models can express a variety of arithmetic operations and feedback loops grounded in language. Pythonic language model programs can use classic logic structures, e.g., sequences, selection (if/else), and loops (for/while), to assemble new behaviors at runtime. They can also use third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc. These models not only generalize to new instructions, but they can also translate precise values (e.g., velocities) to ambiguous descriptions (“faster” and “to the left”) depending on the context to elicit behavioral commonsense.
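As an illustration of the kind of program such a model might write, the sketch below combines an if/else mapping of an ambiguous speed description to a concrete value, NumPy interpolation over waypoints, and a loop over the resulting path. The control primitive it would call is left as a hypothetical comment.

```python
import numpy as np

# Illustrative example only: classic control flow, a context-dependent mapping
# of "faster" to a concrete value, and NumPy interpolation over waypoints.
# move_to() is a hypothetical control primitive, left as a comment.

def follow_path(waypoints, speed_description, default_speed=0.5):
    if speed_description == "faster":
        speed = default_speed * 1.5
    elif speed_description == "slower":
        speed = default_speed * 0.5
    else:
        speed = default_speed

    waypoints = np.asarray(waypoints, dtype=float)
    t = np.linspace(0.0, 1.0, num=20)
    xp = np.linspace(0.0, 1.0, num=len(waypoints))
    path = np.stack([np.interp(t, xp, waypoints[:, d])
                     for d in range(waypoints.shape[1])], axis=1)

    for point in path:        # feedback loop over the interpolated path
        pass                  # move_to(point, speed)  (hypothetical primitive)
    return path, speed

path, speed = follow_path([(0, 0), (0.2, 0.1), (0.4, 0.0)], "faster")
```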

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

CaP generalizes at a specific layer in the robot: interpreting natural language instructions, processing perception outputs (e.g., from off-the-shelf object detectors), and then parameterizing control primitives. This fits into systems with factorized perception and control, and imparts a degree of generalization (acquired from pre-trained language models) without the magnitude of data collection needed for end-to-end robot learning. CaP also inherits language model capabilities that are unrelated to code writing, such as supporting instructions with non-English languages and emojis.

CaP inherits the capabilities of language models, such as multilingual and emoji support.

By characterizing the types of generalization encountered in code generation problems, we can also study how hierarchical code generation improves generalization. For example, “systematicity” evaluates the ability to recombine known parts to form new sequences, “substitutivity” evaluates robustness to synonymous code snippets, while “productivity” evaluates the ability to write policy code longer than those seen in the examples (e.g., for new long horizon tasks that may require defining and nesting new functions). Our paper presents a new open-source benchmark to evaluate language models on a set of robotics-related code generation problems. Using this benchmark, we find that, in general, bigger models perform better across most metrics, and that hierarchical code generation improves “productivity” generalization the most.

Performance on our RoboCodeGen Benchmark across different generalization types. The larger model (Davinci) performs better than the smaller model (Cushman), with hierarchical code generation improving productivity the most.

We’re also excited about the potential for code-writing models to express cross-embodied plans for robots with different morphologies that perform the same task differently depending on the available APIs (perception action spaces), which is an important aspect of any robotics foundation model.

Language model code-generation exhibits cross-embodiment capabilities, completing the same task in different ways depending on the available APIs (that define perception action spaces).

Limitations

Code as policies today is restricted by the scope of (i) what the perception APIs can describe (e.g., few visual-language models to date can describe whether a trajectory is “bumpy” or “more C-shaped”), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. Our approach also assumes all given instructions are feasible, and we cannot tell a priori whether generated code will be useful. CaPs also struggle to interpret instructions that are significantly more complex or operate at a different abstraction level than the few-shot examples provided to the language model prompts. Thus, for example, in the tabletop domain, it would be difficult for our specific instantiation of CaPs to “build a house with the blocks” since there are no examples of building complex 3D structures. These limitations point to avenues for future work, including extending visual language models to describe low-level robot behaviors (e.g., trajectories) or combining CaPs with exploration algorithms that can autonomously add to the set of control primitives.

Open-Source Release

We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional real-world demos with videos and generated code.

Conclusion

Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.

Acknowledgements

This research was done by Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng. Special thanks to Vikas Sindhwani, Vincent Vanhoucke for helpful feedback on writing, Chad Boodoo for operations and hardware support. An early preprint is available on arXiv.

Natural Language Assessment: A New Framework to Promote Education

Whether it’s a professional honing their skills or a child learning to read, coaches and educators play a key role in assessing the learner’s answer to a question in a given context and guiding them towards a goal. These interactions have unique characteristics that set them apart from other forms of dialogue, yet are not available when learners practice alone at home. In the field of natural language processing, this type of capability has not received much attention and is technologically challenging. We set out to explore how we can use machine learning to assess answers in a way that facilitates learning.

In this blog, we introduce an important natural language understanding (NLU) capability called Natural Language Assessment (NLA), and discuss how it can be helpful in the context of education. While typical NLU tasks focus on the user’s intent, NLA allows for the assessment of an answer from multiple perspectives. In situations where a user wants to know how good their answer is, NLA can offer an analysis of how close the answer is to what is expected. In situations where there may not be a “correct” answer, NLA can offer subtle insights that include topicality, relevance, verbosity, and beyond. We formulate the scope of NLA, present a practical model for carrying out topicality NLA, and showcase how NLA has been used to help job seekers practice answering interview questions with Google’s new interview prep tool, Interview Warmup.

Overview of Natural Language Assessment (NLA)

The goal of NLA is to evaluate the user’s answer against a set of expectations. Consider the following components for an NLA system interacting with students:

  • A question presented to the student
  • Expectations that define what we expect to find in the answer (e.g., a concrete textual answer, a set of topics we expect the answer to cover, conciseness)
  • An answer provided by the student
  • An assessment output (e.g., correctness, missing information, too specific or general, stylistic feedback, pronunciation, etc.)
  • [Optional] A context (e.g., a chapter in a book or an article)

With NLA, both the expectations about the answer and the assessment of the answer can be very broad. This enables teacher-student interactions that are more expressive and subtle. Here are two examples:

  1. A question with a concrete correct answer: Even in situations where there is a clear correct answer, it can be helpful to assess the answer more subtly than simply correct or incorrect. Consider the following:

    Context: Harry Potter and the Philosopher’s Stone
    Question: “What is Hogwarts?”
    Expectation: “Hogwarts is a school of Witchcraft and Wizardry” [expectation is given as text]
    Answer: “I am not exactly sure, but I think it is a school.”

    The answer may be missing salient details but labeling it as incorrect wouldn’t be entirely true or useful to a user. NLA can offer a more subtle understanding by, for example, identifying that the student’s answer is too general, and also that the student is uncertain.

    Illustration of the NLA process from input question, answer and expectation to assessment output.

    This kind of subtle assessment, along with noting the uncertainty the student expressed, can be important in helping students build skills in conversational settings.

  2. Topicality expectations: There are many situations in which a concrete answer is not expected. For example, if a student is asked an opinion question, there is no concrete textual expectation. Instead, there’s an expectation of relevance and opinionation, and perhaps some level of succinctness and fluency. Consider the following interview practice setup:

    Question: “Tell me a little about yourself?”
    Expectations: { “Education”, “Experience”, “Interests” } (a set of topics)
    Answer: “Let’s see. I grew up in the Salinas valley in California and went to Stanford where I majored in economics but then got excited about technology so next I ….”

    In this case, a useful assessment output would map the user’s answer to a subset of the topics covered, possibly along with a markup of which parts of the text relate to which topic. This can be challenging from an NLP perspective as answers can be long, topics can be mixed, and each topic on its own can be multi-faceted.

A Topicality NLA Model

In principle, topicality NLA is a standard multi-class task for which one can readily train a classifier using standard techniques. However, training data for such scenarios is scarce and it would be costly and time consuming to collect for each question and topic. Our solution is to break each topic into granular components that can be identified using large language models (LLMs) with a straightforward generic tuning.

We map each topic to a list of underlying questions and define that if the sentence contains an answer to one of those underlying questions, then it covers that topic. For the topic “Experience” we might choose underlying questions such as:

  • Where did you work?
  • What did you study?

While for the topic “Interests” we might choose underlying questions such as:

  • What are you interested in?
  • What do you enjoy doing?

These underlying questions are designed through an iterative manual process. Importantly, since these questions are sufficiently granular, current language models (see details below) can capture their semantics. This allows us to offer a zero-shot setting for the NLA topicality task: once trained (more on the model below), it is easy to add new questions and new topics, or adapt existing topics by modifying their underlying content expectation without the need to collect topic specific data. See below the model’s predictions for the sentence “I’ve worked in retail for 3 years” for the two topics described above:

A diagram of how the model uses underlying questions to predict the topic most likely to be covered by the user’s answer.

Since an underlying question for the topic “Experience” was matched, the sentence would be classified as “Experience”.
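A minimal sketch of this underlying-question formulation is shown below. The compatibility scorer here is a keyword-matching stub purely for illustration; the actual system scores <underlying question, answer> compatibility with the trained encoder model described in the next section.

```python
# Keyword-matching stub purely for illustration; the real system scores
# <underlying question, answer> compatibility with a trained encoder model.

TOPIC_QUESTIONS = {
    "Experience": ["Where did you work?", "What did you study?"],
    "Interests": ["What are you interested in?", "What do you enjoy doing?"],
}

KEYWORDS = {
    "Where did you work?": ["worked", "job"],
    "What did you study?": ["studied", "majored"],
    "What are you interested in?": ["interested"],
    "What do you enjoy doing?": ["enjoy", "love"],
}

def compatibility(question: str, answer: str) -> float:
    """Stub scorer: 1.0 if any keyword for the question appears in the answer."""
    return float(any(k in answer.lower() for k in KEYWORDS.get(question, [])))

def detect_topics(sentence: str, threshold: float = 0.5):
    """A sentence covers a topic if any of its underlying questions is answered."""
    return [topic for topic, questions in TOPIC_QUESTIONS.items()
            if max(compatibility(q, sentence) for q in questions) > threshold]

print(detect_topics("I've worked in retail for 3 years"))  # ['Experience']
```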

Application: Helping Job Seekers Prepare for Interviews

Interview Warmup is a new tool developed in collaboration with job seekers to help them prepare for interviews in fast-growing fields of employment such as IT Support and UX Design. It allows job seekers to practice answering questions selected by industry experts and to become more confident and comfortable with interviewing. As we worked with job seekers to understand their challenges in preparing for interviews and how an interview practice tool could be most useful, it inspired our research and the application of topicality NLA.

We build the topicality NLA model (once for all questions and topics) as follows: we train an encoder-only T5 model (EncT5 architecture) with 350 million parameters on question-answer data to predict the compatibility of an <underlying question, answer> pair. We rely on data from SQuAD 2.0, which was processed to produce <question, answer, label> triplets.
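As one plausible way (an assumption, not necessarily the exact recipe used here) to build such triplets from the SQuAD 2.0 format: answered questions yield positive <question, answer, 1> pairs, and pairing questions with answers drawn from other questions yields negatives.

```python
import json
import random

def squad_to_triplets(path, num_negatives=1, seed=0):
    """Turn SQuAD 2.0 into <question, answer, label> triplets: answered
    questions give positives; mismatched question/answer pairs give negatives."""
    with open(path) as f:
        squad = json.load(f)

    pairs = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa.get("is_impossible", False) and qa["answers"]:
                    pairs.append((qa["question"], qa["answers"][0]["text"]))

    rng = random.Random(seed)
    triplets = [(q, a, 1) for q, a in pairs]                     # positives
    for q, _ in pairs:
        for _ in range(num_negatives):
            _, wrong_answer = rng.choice(pairs)
            triplets.append((q, wrong_answer, 0))                # negatives
    return triplets

# triplets = squad_to_triplets("train-v2.0.json")  # hypothetical local path
```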

In the Interview Warmup tool, users can switch between talking points to see which ones were detected in their answer.

The tool does not grade or judge answers. Instead it enables users to practice and identify ways to improve on their own. After a user replies to an interview question, their answer is parsed sentence-by-sentence with the Topicality NLA model. They can then switch between different talking points to see which ones were detected in their answer. We know that there are many potential pitfalls in signaling to a user that their response is “good”, especially as we only detect a limited set of topics. Instead, we keep the control in the user’s hands and only use ML to help users make their own discoveries about how to improve.

So far, the tool has had great results helping job seekers around the world, including in the US, and we have recently expanded it to Africa. We plan to continue working with job seekers to iterate and make the tool even more helpful to the millions of people searching for new jobs.

A short film showing how Interview Warmup and its NLA capabilities were developed in collaboration with job seekers.

Conclusion

Natural Language Assessment (NLA) is a technologically challenging and interesting research area. It paves the way for new conversational applications that promote learning by enabling the nuanced assessment and analysis of answers from multiple perspectives. Working together with communities, from job seekers and businesses to classroom teachers and students, we can identify situations where NLA has the potential to help people learn, engage, and develop skills across an array of subjects, and we can build applications in a responsible way that empower users to assess their own abilities and discover ways to improve.

Acknowledgements

This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Google Research Israel, Google Creative Lab, and Grow with Google teams among others.

A new genome sequencing tool powered with our technology

Genome sequencing provides a more complete description of cells and organisms, allowing scientists to uncover serious genetic conditions such as the elevated risk for breast cancer or pulmonary arterial hypertension. While researching genomics has the potential to save lives and preserve people’s quality of life, it’s incredibly challenging work.

Back in January, we announced a partnership with PacBio, a developer of genome sequencing instruments, to further advance genomic technologies. Today, PacBio is introducing the Revio sequencing system, an instrument that runs on our deep learning technology, DeepConsensus. With DeepConsensus built right into Revio, researchers can quickly and accurately identify genetic variants that cause diseases.

How Google Health’s technology works

Genome sequencing requires observing individual DNA molecules amidst a complex and noisy background. To address the problem, Google Health worked to adapt Transformer, one of our most influential deep learning methods, developed in 2017 primarily to understand language. We then applied it to PacBio’s data, which uses fluorescent light to encode DNA sequences. The result was DeepConsensus.

Last year, we demonstrated that DeepConsensus was capable of reducing sequencing errors by 42%, resulting in better genome assemblies and more accurate identification of genetic variants. Although this was a promising research demonstration, we knew that DeepConsensus could have the greatest impact if it was running directly on PacBio’s sequencing instrument. Over the last year, we’ve worked closely with PacBio to speed up DeepConsensus by over 500x from its initial release. We’ve also further improved its accuracy to reduce errors by 59%.

Combining our AI methods and genomics work with PacBio’s instruments and expertise, we were able to build better tools for the research community. Our partnership with PacBio doesn’t stop with Revio. There’s limitless potential to benefit researchers and improve healthcare and access for people around the world.


Open Images V7 — Now Featuring Point Labels

Open Images V7 — Now Featuring Point Labels

Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. Researchers around the world use Open Images to train and evaluate computer vision models. Since the initial release of Open Images in 2016, which included image-level labels covering 6k categories, we have provided multiple updates to enrich annotations and expand the potential use cases of the dataset. Through several releases, we have added image-level labels for over 20k categories on all images and bounding box annotations, visual relations, instance segmentations, and localized narratives (synchronized voice, mouse trace, and text caption) on a subset of 1.9M images.

Today, we are happy to announce the release of Open Images V7, which expands the Open Images dataset even further with a new annotation type called point-level labels and includes a new all-in-one visualization tool that allows a better exploration of the rich data available.

Point Labels

The main strategy used to collect the new point-level label annotations leveraged suggestions from a machine learning (ML) model combined with human verification. First, the ML model selected points of interest and posed a yes-or-no question, e.g., “is this point on a pumpkin?”. Then, human annotators spent an average of 1.1 seconds answering each question. We aggregated the answers from different annotators over the same question and assigned a final “yes”, “no”, or “unsure” label to each annotated point.
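
The exact aggregation rule isn’t spelled out here, so the following is a minimal sketch of one plausible scheme: answers with enough agreement collapse to “yes” or “no”, and contested points become “unsure”. The 2/3 agreement threshold is an illustrative assumption.

```python
# Sketch of one plausible answer-aggregation rule (the post does not specify
# the exact scheme): unanimous-enough answers become "yes"/"no", anything
# contested becomes "unsure". The 2/3 threshold is an assumption.
from collections import Counter

def aggregate_point_label(answers: list[str], min_agreement: float = 2 / 3) -> str:
    """Collapse several annotators' yes/no answers for one point into a final label."""
    counts = Counter(a.lower() for a in answers)
    label, votes = counts.most_common(1)[0]
    if label in ("yes", "no") and votes / len(answers) >= min_agreement:
        return label
    return "unsure"

print(aggregate_point_label(["yes", "yes", "no"]))  # yes (2/3 agree)
print(aggregate_point_label(["yes", "no"]))         # unsure (split vote)
```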

Illustration of the annotations interface.
(Image by Lenore Edman, under CC BY 2.0 license)

For each annotated image, we provide a collection of points, each with a “yes” or “no” label for a given class. These points provide sparse information that can be used for the semantic segmentation task. We collected a total of 38.6M new point annotations (12.4M with “yes” labels) that cover 5.8 thousand classes and 1.4M images.

By focusing on point labels, we expanded the number of images annotated and categories covered, while concentrating our annotators’ effort on efficiently collecting useful information. Compared to our instance segmentations, the new points cover 16x more classes and more images, and they cover 9x more classes than our box annotations. Compared to existing segmentation datasets such as PASCAL VOC, COCO, Cityscapes, LVIS, and ADE20K, our annotations cover more classes and more images. The new point-level labels are the first annotation type in Open Images that provides localization information for both things (countable objects, like cars, cats, and catamarans) and stuff categories (uncountable objects, like grass, granite, and gravel). Overall, the newly collected data is roughly equivalent to two years of human annotation effort.

Our initial experiments show that this type of sparse data is suitable for both training and evaluating segmentation models. Training a model directly on sparse data reaches quality comparable to training on dense annotations. Similarly, we show that one can directly compute the traditional semantic segmentation intersection-over-union (IoU) metric over sparse data: the ranking across different methods is preserved, and the sparse IoU values are an accurate estimate of their dense counterparts. See our paper for more details.
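
As a rough illustration of how an IoU-style metric can be computed from sparse points alone (a sketch of the idea, not necessarily the exact protocol in the paper): “yes” points the prediction hits count as true positives, “yes” points it misses as false negatives, and “no” points it claims as false positives.

```python
# Illustrative sparse IoU: score a dense prediction using only annotated
# points. This mirrors the usual IoU formula restricted to labeled points;
# the paper's exact protocol may differ, so treat this as a sketch.
import numpy as np

def sparse_iou(pred: np.ndarray, points: list[tuple[int, int, int, bool]],
               class_id: int) -> float:
    """pred: (H, W) array of predicted class ids.
    points: list of (row, col, class_id, is_yes) sparse annotations."""
    tp = fp = fn = 0
    for r, c, cls, is_yes in points:
        if cls != class_id:
            continue
        predicted_here = pred[r, c] == class_id
        if is_yes and predicted_here:
            tp += 1        # "yes" point covered by the prediction
        elif is_yes and not predicted_here:
            fn += 1        # "yes" point the prediction missed
        elif not is_yes and predicted_here:
            fp += 1        # "no" point the prediction wrongly claims
    denom = tp + fp + fn
    return tp / denom if denom else float("nan")

pred = np.zeros((4, 4), dtype=int)
pred[:2, :2] = 1  # predicted class 1 in the top-left corner
points = [(0, 0, 1, True), (3, 3, 1, True), (1, 1, 1, False)]
print(sparse_iou(pred, points, class_id=1))  # 1 TP, 1 FN, 1 FP -> 0.333...
```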

Below, we show four example images with their point-level labels, illustrating the rich and diverse information these annotations provide. Circles ⭘ are “yes” labels, and squares ◻ are “no” labels.

Four example images with point-level labels.
Images by Richie Diesterheft, John AM Nueva, Sarah Ackerman, and C Thomas, all under CC BY 2.0 license.

New Visualizers

In addition to the new data release, we also expanded the available visualizations of the Open Images annotations. The Open Images website now includes dedicated visualizers to explore the localized narratives annotations, the new point-level annotations, and a new all-in-one view. This all-in-one view is available for the subset of 1.9M densely annotated images and allows one to explore the rich annotations that Open Images has accumulated over seven releases. On average, these images have 6.7 image labels (classes), 8.3 boxes, 1.7 relations, 1.5 masks, 0.4 localized narratives, and 34.8 point labels each.

Below, we show two example images with various annotations in the all-in-one visualizer. The figures show the image-level labels, bounding boxes, box relations, instance masks, localized narrative mouse trace and caption, and point-level labels. The + classes have positive annotations (of any kind), while − classes have only negative annotations (image-level or point-level).

Two example images with various annotations in the all-in-one visualizer.
Images by Jason Paris, and Rubén Vique, all under CC BY 2.0 license.

Conclusion

We hope that this new data release will enable computer vision research to cover ever more diverse and challenging scenarios. As the quality of automated semantic segmentation models improves over common classes, we want to move towards the long tail of visual concepts, and sparse point annotations are a step in that direction. More and more works are exploring how to use such sparse annotations (e.g., as supervision for instance segmentation or semantic segmentation), and Open Images V7 contributes to this research direction. We are looking forward to seeing what you will build next.

Acknowledgements

Thanks to Vittorio Ferrari, Jordi Pont-Tuset, Alina Kuznetsova, Ashlesha Sadras, and the annotators team for their support creating this new data release.


Google at ECCV 2022

Google at ECCV 2022

Google is proud to be a Platinum Sponsor of the European Conference on Computer Vision (ECCV 2022), a premier forum for the dissemination of research in computer vision and machine learning (ML). This year, ECCV 2022 will be held as a hybrid event, in person in Tel Aviv, Israel with virtual attendance as an option. Google has a strong presence at this year’s conference with over 60 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive research and expanding our partnership with the broader ML research community. 

Registered for ECCV 2022? We hope you’ll visit our on-site or virtual booths to learn more about the research we’re presenting at ECCV 2022, including several demos and opportunities to connect with our researchers. Learn more about Google’s research being presented at ECCV 2022 below (Google affiliations in bold).

Organizing Committee

Program Chairs include: Moustapha Cissé

Awards Paper Committee: Todd Zickler

Area Chairs include: Ayan Chakrabarti, Tali Dekel, Alireza Fathi, Vittorio Ferrari, David Fleet, Dilip Krishnan, Michael Rubinstein, Cordelia Schmid, Deqing Sun, Federico Tombari, Jasper Uijlings, Ming-Hsuan Yang, Todd Zickler

Accepted Publications

NeuMesh: Learning Disentangled Neural Mesh-Based Implicit Field for Geometry and Texture Editing
Bangbang Yang, Chong Bao, Junyi Zeng, Hujun Bao, Yinda Zhang, Zhaopeng Cui, Guofeng Zhang

Anti-Neuron Watermarking: Protecting Personal Data Against Unauthorized Neural Networks
Zihang Zou, Boqing Gong, Liqiang Wang

Exploiting Unlabeled Data with Vision and Language Models for Object Detection
Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris N. Metaxas

Waymo Open Dataset: Panoramic Video Panoptic Segmentation
Jieru Mei, Alex Zhu, Xinchen Yan, Hang Yan, Siyuan Qiao, Yukun Zhu, Liang-Chieh Chen, Henrik Kretzschmar

PRIF: Primary Ray-Based Implicit Function
Brandon Yushan Feng, Yinda Zhang, Danhang Tang, Ruofei Du, Amitabh Varshney

LoRD: Local 4D Implicit Representation for High-Fidelity Dynamic Human Modeling
Boyan Jiang, Xinlin Ren, Mingsong Dou, Xiangyang Xue, Yanwei Fu, Yinda Zhang

k-Means Mask Transformer (see blog post)
Qihang Yu*, Siyuan Qiao, Maxwell D Collins, Yukun Zhu, Hartwig Adam, Alan Yuille, Liang-Chieh Chen

MaxViT: Multi-Axis Vision Transformer (see blog post)
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

E-Graph: Minimal Solution for Rigid Rotation with Extensibility Graphs
Yanyan Li, Federico Tombari

RBP-Pose: Residual Bounding Box Projection for Category-Level Pose Estimation
Ruida Zhang, Yan Di, Zhiqiang Lou, Fabian Manhardt, Federico Tombari, Xiangyang Ji

GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Huseyin Coskun, Alireza Zareian, Joshua L Moore, Federico Tombari, Chen Wang

Scaling Open-Vocabulary Image Segmentation with Image-Level Labels
Golnaz Ghiasi, Xiuye Gu, Yin Cui, Tsung-Yi Lin*

Adaptive Transformers for Robust Few-Shot Cross-Domain Face Anti-spoofing
Hsin-Ping Huang, Deqing Sun, Yaojie Liu, Wen-Sheng Chu, Taihong Xiao, Jinwei Yuan, Hartwig Adam, Ming-Hsuan Yang

DualPrompt: Complementary Prompting for Rehearsal-Free Continual Learning
Zifeng Wang*, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, Tomas Pfister

BLT: Bidirectional Layout Transformer for Controllable Layout Generation
Xiang Kong, Lu Jiang, Huiwen Chang, Han Zhang, Yuan Hao, Haifeng Gong, Irfan Essa

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, Jiaqi Ma

Learning Visibility for Robust Dense Human Body Estimation
Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, Ming-Hsuan Yang

Are Vision Transformers Robust to Patch Perturbations?
Jindong Gu, Volker Tresp, Yao Qin

PseudoAugment: Learning to Use Unlabeled Data for Data Augmentation in Point Clouds
Zhaoqi Leng, Shuyang Cheng, Ben Caine, Weiyue Wang, Xiao Zhang, Jonathon Shlens, Mingxing Tan, Dragomir Anguelov

Structure and Motion from Casual Videos
Zhoutong Zhang, Forrester Cole, Zhengqi Li, Noah Snavely, Michael Rubinstein, William T. Freeman

PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Chenfeng Xu, Tian Li, Chen Tang, Lingfeng Sun, Kurt Keutzer, Masayoshi Tomizuka, Alireza Fathi, Wei Zhan

Novel Class Discovery Without Forgetting
Joseph K J, Sujoy Paul, Gaurav Aggarwal, Soma Biswas, Piyush Rai, Kai Han, Vineeth N Balasubramanian

Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning
Yuxiao Chen, Long Zhao, Jianbo Yuan, Yu Tian, Zhaoyang Xia, Shijie Geng, Ligong Han, Dimitris N. Metaxas

PACTran: PAC-Bayesian Metrics for Estimating the Transferability of Pretrained Models to Classification Tasks
Nan Ding, Xi Chen, Tomer Levinboim, Soravit Changpinyo, Radu Soricut

InfiniteNature-Zero: Learning Perpetual View Generation of Natural Scenes from Single Images
Zhengqi Li, Qianqian Wang*, Noah Snavely, Angjoo Kanazawa*

Generalizable Patch-Based Neural Rendering (see blog post)
Mohammed Suhail*, Carlos Esteves, Leonid Sigal, Ameesh Makadia

LESS: Label-Efficient Semantic Segmentation for LiDAR Point Clouds
Minghua Liu, Yin Zhou, Charles R. Qi, Boqing Gong, Hao Su, Dragomir Anguelov

The Missing Link: Finding Label Relations Across Datasets
Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Learning Instance-Specific Adaptation for Cross-Domain Segmentation
Yuliang Zou, Zizhao Zhang, Chun-Liang Li, Han Zhang, Tomas Pfister, Jia-Bin Huang

Learning Audio-Video Modalities from Image Captions
Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
Medhini Narasimhan*, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid

On Label Granularity and Object Localization
Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, Oisin Mac Aodha

Disentangling Architecture and Training for Optical Flow
Deqing Sun, Charles Herrmann, Fitsum Reda, Michael Rubinstein, David J. Fleet, William T. Freeman

NewsStories: Illustrating Articles with Visual Summaries
Reuben Tan, Bryan Plummer, Kate Saenko, J.P. Lewis, Avneesh Sud, Thomas Leung

Improving GANs for Long-Tailed Data Through Group Spectral Regularization
Harsh Rangwani, Naman Jaswani, Tejan Karmali, Varun Jampani, Venkatesh Babu Radhakrishnan

Planes vs. Chairs: Category-Guided 3D Shape Learning Without Any 3D Cues
Zixuan Huang, Stefan Stojanov, Anh Thai, Varun Jampani, James Rehg

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch
Patsorn Sangkloy, Wittawat Jitkrittum, Diyi Yang, James Hays

Learned Monocular Depth Priors in Visual-Inertial Initialization
Yunwen Zhou, Abhishek Kar, Eric L. Turner, Adarsh Kowdle, Chao Guo, Ryan DuToit, Konstantine Tsotsos

How Stable are Transferability Metrics Evaluations?
Andrea Agostinelli, Michal Pandy, Jasper Uijlings, Thomas Mensink, Vittorio Ferrari

Data-Free Neural Architecture Search via Recursive Label Calibration
Zechun Liu*, Zhiqiang Shen, Yun Long, Eric Xing, Kwang-Ting Cheng, Chas H. Leichner

Fast and High Quality Image Denoising via Malleable Convolution
Yifan Jiang*, Bartlomiej Wronski, Ben Mildenhall, Jonathan T. Barron, Zhangyang Wang, Tianfan Xue

Concurrent Subsidiary Supervision for Unsupervised Source-Free Domain Adaptation
Jogendra Nath Kundu, Suvaansh Bhambri, Akshay R Kulkarni, Hiran Sarkar, Varun Jampani, Venkatesh Babu Radhakrishnan

Learning Online Multi-Sensor Depth Fusion
Erik Sandström, Martin R. Oswald, Suryansh Kumar, Silvan Weder, Fisher Yu, Cristian Sminchisescu, Luc Van Gool

Hierarchical Semantic Regularization of Latent Spaces in StyleGANs
Tejan Karmali, Rishubh Parihar, Susmit Agrawal, Harsh Rangwani, Varun Jampani, Maneesh K Singh, Venkatesh Babu Radhakrishnan

RayTran: 3D Pose Estimation and Shape Reconstruction of Multiple Objects from Videos with Ray-Traced Transformers
Michał J Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari

Neural Video Compression Using GANs for Detail Synthesis and Propagation
Fabian Mentzer, Eirikur Agustsson, Johannes Ballé, David Minnen, Nick Johnston, George Toderici

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset
Grant Van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

Implicit Neural Representations for Image Compression
Yannick Strümpler, Janis Postels, Ren Yang, Luc Van Gool, Federico Tombari

3D Compositional Zero-Shot Learning with DeCompositional Consensus
Muhammad Ferjad Naeem, Evin Pınar Örnek, Yongqin Xian, Luc Van Gool, Federico Tombari

FindIt: Generalized Localization with Natural Language Queries (see blog post)
Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation
Wuyang Chen*, Xianzhi Du, Fan Yang, Lucas Beyer, Xiaohua Zhai, Tsung-Yi Lin, Huizhong Chen, Jing Li, Xiaodan Song, Zhangyang Wang, Denny Zhou

Improved Masked Image Generation with Token-Critic
Jose Lezama, Huiwen Chang, Lu Jiang, Irfan Essa

Learning Discriminative Shrinkage Deep Networks for Image Deconvolution
Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang

AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Efthymios Tzinis*, Scott Wisdom, Tal Remez, John Hershey

Simple Open-Vocabulary Object Detection with Vision Transformers
Matthias Minderer, Alexey Gritsenko, Austin C Stone, Maxim Neumann, Dirk Weißenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, Neil Houlsby

COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality
Honglu Zhou, Asim Kadav, Aviv Shamsian, Shijie Geng, Farley Lai, Long Zhao, Ting Liu, Mubbasir Kapadia, Hans Peter Graf

Video Question Answering with Iterative Video-Text Co-tokenization (see blog post)
AJ Piergiovanni, Kairo Morton*, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Class-Agnostic Object Detection with Multi-modal Transformer
Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, Ming-Hsuan Yang

FILM: Frame Interpolation for Large Motion (see blog post)
Fitsum Reda, Janne Kontkanen, Eric Tabellion, Deqing Sun, Caroline Pantofaru, Brian Curless

Compositional Human-Scene Interaction Synthesis with Semantic Control
Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, Siyu Tang

Workshops

LatinX in AI
Mentors include: José Lezama
Keynote Speakers include: Andre Araujo

AI for Creative Video Editing and Understanding
Keynote Speakers include: Tali Dekel, Negar Rostamzadeh

Learning With Limited and Imperfect Data (L2ID)
Invited Speakers include: Xiuye Gu
Organizing Committee includes: Sadeep Jayasumana

International Challenge on Compositional and Multimodal Perception (CAMP)
Program Committee includes: Edward Vendrow

Self-Supervised Learning: What is Next?
Invited Speakers include: Mathilde Caron, Arsha Nagrani
Organizers include: Andrew Zisserman

3rd Workshop on Adversarial Robustness In the Real World
Invited Speakers include: Ekin Dogus Cubuk
Organizers include: Xinyun Chen, Alexander Robey, Nataniel Ruiz, Yutong Bai

AV4D: Visual Learning of Sounds in Spaces
Invited Speakers include: John Hershey

Challenge on Mobile Intelligent Photography and Imaging (MIPI)
Invited Speakers include: Peyman Milanfar

Robust Vision Challenge 2022
Organizing Committee includes: Alina Kuznetsova

Computer Vision in the Wild
Challenge Organizers include: Yi-Ting Chen, Ye Xia
Invited Speakers include: Yin Cui, Yongqin Xian, Neil Houlsby

Self-Supervised Learning for Next-Generation Industry-Level Autonomous Driving (SSLAD)
Organizers include: Fisher Yu

Responsible Computer Vision
Organizing Committee includes: Been Kim
Invited Speakers include: Emily Denton

Cross-Modal Human-Robot Interaction
Invited Speakers include: Peter Anderson

ISIC Skin Image Analysis
Organizing Committee includes: Yuan Liu
Steering Committee includes: Yuan Liu, Dale Webster
Invited Speakers include: Yuan Liu

Observing and Understanding Hands in Action
Sponsored by Google

Autonomous Vehicle Vision (AVVision)
Speakers include: Fisher Yu

Visual Perception for Navigation in Human Environments: The JackRabbot Human Body Pose Dataset and Benchmark
Organizers include: Edward Vendrow

Language for 3D Scenes
Invited Speakers include: Jason Baldridge
Organizers include: Leonidas Guibas

Designing and Evaluating Computer Perception Systems (CoPe)
Organizers include: Andrew Zisserman

Learning To Generate 3D Shapes and Scenes
Panelists include: Pete Florence

Advances in Image Manipulation
Program Committee includes: George Toderici, Ming-Hsuan Yang

TiE: Text in Everything
Challenge Organizers include: Shangbang Long, Siyang Qin
Invited Speakers include: Tali Dekel, Aishwarya Agrawal

Instance-Level Recognition
Organizing Committee: Andre Araujo, Bingyi Cao, Tobias Weyand
Invited Speakers include: Mathilde Caron

What Is Motion For?
Organizing Committee: Deqing Sun, Fitsum Reda, Charles Herrmann
Invited Speakers include: Tali Dekel

Neural Geometry and Rendering: Advances and the Common Objects in 3D Challenge
Invited Speakers include: Ben Mildenhall

Visual Object-Oriented Learning Meets Interaction: Discovery, Representations, and Applications
Invited Speakers include: Klaus Greff, Thomas Kipf
Organizing Committee includes: Leonidas Guibas

Vision with Biased or Scarce Data (VBSD)
Program Committee includes: Yizhou Wang

Multiple Object Tracking and Segmentation in Complex Environments
Invited Speakers include: Xingyi Zhou, Fisher Yu

3rd Visual Inductive Priors for Data-Efficient Deep Learning Workshop
Organizing Committee includes: Ekin Dogus Cubuk

DeeperAction: Detailed Video Action Understanding and Anomaly Recognition
Advisors include: Rahul Sukthankar

Sign Language Understanding Workshop and Sign Language Recognition, Translation & Production Challenge
Organizing Committee includes: Andrew Zisserman
Speakers include: Andrew Zisserman

Ego4D: First-Person Multi-Modal Video Understanding
Invited Speakers include: Michal Irani

AI-Enabled Medical Image Analysis: Digital Pathology & Radiology/COVID19
Program Chairs include: Po-Hsuan Cameron Chen
Workshop Partner: Google Health

Visual Object Tracking Challenge (VOT 2022)
Technical Committee includes: Christoph Mayer

Assistive Computer Vision and Robotics
Technical Committee includes: Maja Mataric

Human Body, Hands, and Activities from Egocentric and Multi-View Cameras
Organizers include: Francis Engelmann

Frontiers of Monocular 3D Perception: Implicit x Explicit
Panelists include: Pete Florence

Tutorials

Self-Supervised Representation Learning in Computer Vision
Invited Speakers include: Ting Chen

Neural Volumetric Rendering for Computer Vision
Organizers include: Ben Mildenhall, Pratul Srinivasan, Jon Barron
Presenters include: Ben Mildenhall, Pratul Srinivasan

New Frontiers in Efficient Neural Architecture Search!
Speakers include: Ruochen Wang



*Work done while at Google.  


How AI can help in the fight against breast cancer

How AI can help in the fight against breast cancer

In 2020, 2.3 million people were diagnosed with breast cancer and 685,000 died from the disease globally. Early detection is key to better health outcomes, but screenings are work-intensive, and patients often find getting mammograms and waiting for results stressful.

In response to these challenges, Google Health and Northwestern Medicine partnered in 2021 on a clinical research study to explore whether artificial intelligence (AI) models can reduce the time to diagnosis during the screening process, narrowing the assessment gap and improving the patient experience. This work is among the first prospective randomized controlled studies for AI in breast cancer screening, and the results will be published in early 2023.

Behind this work are scientists and researchers united in the fight against breast cancer. We spoke with Dr. Sunny Jansen, a technical program manager at Google, and Sally Friedewald, MD, the division chief of Breast and Women’s Imaging at Northwestern University Feinberg School of Medicine, about how they hope this work will help screening providers catch cancer earlier and improve the patient experience.

What were you hoping to achieve with this work in the fight against breast cancer?

Dr. Jansen: Like so many of us, I know how breast cancer can impact families and communities, and how critical early detection can be. The experiences of so many around me have influenced my work in this area. I hope that AI can make the future of breast cancer screening easier, faster, more accurate — and, ultimately, more accessible for women globally.

So we sought to understand how AI can reduce diagnostic delays and help patients receive diagnoses as soon as possible by streamlining care into a single visit. For patients with abnormal findings at screening, the diagnostic delay to get additional imaging tests is typically a couple of weeks in the U.S. Often, the results are normal after the additional imaging tests, but that waiting period can be nerve-racking. Additionally, it can be harder for some patients to come back to get additional imaging tests, which exacerbates delays and leads to disparities in the timeliness of care.

Dr. Friedewald: I anticipate an increase in the demand for screenings and challenges in having enough providers with the necessary specialized training. Using AI, we can identify patients who need additional imaging when they are still in the clinic. We can expedite their care, and, in many cases, eliminate the need for return visits. Patients who aren’t flagged still receive the care they need as well. This translates into operational efficiencies and ultimately leads to patients getting a breast cancer diagnosis faster. We already know the earlier treatment starts, the better.

What were your initial beliefs about applying AI to identify breast cancer? How have these changed through your work on this project?

Dr. Jansen: Most existing publications about AI and breast cancer analyze AI performance retrospectively by reviewing historical datasets. While retrospective studies have a lot of value, they don’t necessarily represent how AI works in the real world. Sally decided early on that it would be important to do a prospective study, incorporating AI into real-world clinical workflows and measuring the impact. I wasn’t sure what to expect!

Dr. Friedewald: Computer-aided detection (CAD), which was developed a few decades ago to help radiologists identify cancers via mammogram, has proven to be helpful in some environments. Overall, in the U.S., CAD has not resulted in increased cancer detection. I was concerned that AI would be similar to CAD in efficacy. However, AI gathers data in a fundamentally different way. I am hopeful that with this new information we can identify cancers earlier with the ultimate goal of saving lives.

The research will be published in early 2023. What did you find most inspiring and hopeful about what you learned?

Dr. Jansen: The patients who consented to participate in the study inspired me. Clinicians and scientists must conduct quality real-world research so that the best ideas can be identified and moved forward, and we need patients as equal partners in our research.

Dr. Friedewald: Agreed! There’s an appetite to improve our processes and make screening easier and less anxiety-provoking. I truly believe that if we can streamline care for our patients, we will decrease the stress associated with screening and hopefully improve access for those who need it.

Additionally, AI has the potential to go beyond the prioritization of patients who need care. By prospectively identifying patients who are at higher risk of developing breast cancer, AI could help us determine patients that might need a more rigorous screening regimen. I am looking forward to collaborating with Google on this topic and others that could ultimately improve cancer survival.
