Learning to Play Minecraft with Video PreTraining (VPT)

We trained a neural network to play Minecraft by Video PreTraining (VPT) on a massive unlabeled video dataset of human Minecraft play, while using only a small amount of labeled contractor data. With fine-tuning, our model can learn to craft diamond tools, a task that usually takes proficient humans over 20 minutes (24,000 actions). Our model uses the native human interface of keypresses and mouse movements, making it quite general, and represents a step towards general computer-using agents.


The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened but not precisely how it was achieved, i.e. you will not know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale foundation models in these domains as we’ve done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where “action labels” are simply the next words in a sentence.

In order to utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
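To make the two stages concrete, the following is a minimal PyTorch-style sketch, assuming placeholder architectures and a discretized action space; it is illustrative only and is not the actual VPT model code.

import torch
import torch.nn as nn

# Illustrative stand-ins for the VPT components; the real models are far larger.
class InverseDynamicsModel(nn.Module):
    """Predicts the action at the center frame of a clip using past AND future frames."""
    def __init__(self, n_actions, feat=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat), nn.ReLU())
        self.head = nn.Linear(feat, n_actions)

    def forward(self, clip):                      # clip: (batch, frames, C, H, W)
        b, t = clip.shape[:2]
        feats = self.encoder(clip.reshape(b * t, *clip.shape[2:]))
        return self.head(feats.reshape(b, t, -1).mean(dim=1))

class Policy(nn.Module):
    """Causal policy: predicts the next action from the current frame only."""
    def __init__(self, n_actions, feat=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat), nn.ReLU())
        self.head = nn.Linear(feat, n_actions)

    def forward(self, frame):                     # frame: (batch, C, H, W)
        return self.head(self.encoder(frame))

def pseudo_label_and_clone(idm, policy, unlabeled_clips, optimizer):
    """Stage 2: label online video with the trained IDM, then behaviorally clone it."""
    loss_fn = nn.CrossEntropyLoss()
    for clip in unlabeled_clips:                  # clip: (batch, frames, C, H, W)
        with torch.no_grad():
            pseudo_action = idm(clip).argmax(dim=-1)     # IDM guesses the action taken
        center_frame = clip[:, clip.shape[1] // 2]
        loss = loss_fn(policy(center_frame), pseudo_action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()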

VPT method overview

VPT Zero-Shot Results

We chose to validate our method in Minecraft because it (1) is one of the most actively played video games in the world and thus has a wealth of freely available video data and (2) is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. Unlike prior works in Minecraft that use simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: 20Hz framerate with the mouse and keyboard.

Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the “VPT foundation model”) accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds or 1,000 consecutive game actions.

Sequence of items required to craft a crafting table, labeled with the median time it takes proficient humans to reach each step
Crafting of a crafting table “zero shot” (i.e. after pre-training only without additional fine-tuning)

Additionally, the model performs other complex skills humans often do in the game, such as swimming, hunting animals for food, and eating that food. It also learned the skill of “pillar jumping”, a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself.

Swimming (zero-shot)

Hunting animals (zero-shot)

Eating food (zero-shot)

Pillar jumping (zero-shot)

Fine-tuning with Behavioral Cloning

Foundation models are designed to have a broad behavior profile and be generally capable across a wide variety of tasks. To incorporate new knowledge or allow them to specialize on a narrower task distribution, it is common practice to fine-tune these models to smaller, more specific datasets. As a case study into how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in brand new Minecraft worlds and build a house from basic Minecraft materials. We hoped that this would amplify the foundation model’s ability to reliably perform “early game” skills such as building crafting tables. When fine-tuning to this dataset, not only do we see a massive improvement in reliably performing the early game skills already present in the foundation model, but the fine-tuned model also learns to go even deeper into the technology tree by crafting both wooden and stone tools. Sometimes we even see some rudimentary shelter construction and the agent searching through villages, including raiding chests.

Sequence of items required to craft a stone pickaxe, labeled with the median time it takes proficient humans to reach each step
Improved early game behavior from BC fine-tuning

Crafting a stone pickaxe

Constructing a rudimentary wooden shelter

Searching through a village

Data Scaling

Perhaps the most important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM (as part of the VPT pipeline) than it is to directly train a BC foundation model from that same small contractor dataset. To validate this hypothesis we train foundation models on increasing amounts of data from 1 to 70,000 hours. Those trained on under 2,000 hours of data are trained on the contractor data with ground-truth labels that were originally collected to train the IDM, and those trained on over 2,000 hours are trained on internet data labeled with our IDM. We then take each foundation model and fine-tune it to the house building dataset described in the previous section.

Effect of foundation model training data on fine-tuning

As foundation model data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.

Fine-Tuning with Reinforcement Learning

When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL because emulating human behavior is likely much more helpful than taking random actions. We set our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult when using the native human interface.

Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make this task tractable, we reward agents for each item in the sequence.
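As a rough illustration of this kind of shaped reward (the item list and bonus values below are assumptions, not the exact reward function used in the paper), one can grant a one-time bonus the first time each item in the sequence appears in the agent's inventory:

# Hypothetical milestone rewards for the diamond-pickaxe task; item names and
# bonus values are illustrative, not the paper's actual reward function.
MILESTONES = [
    "log", "planks", "crafting_table", "wooden_pickaxe", "cobblestone",
    "stone_pickaxe", "furnace", "iron_ore", "iron_ingot", "iron_pickaxe",
    "diamond", "diamond_pickaxe",
]

def shaped_reward(inventory, already_rewarded):
    """Return a one-time bonus the first time each milestone item is obtained."""
    reward = 0.0
    for i, item in enumerate(MILESTONES):
        if inventory.get(item, 0) > 0 and item not in already_rewarded:
            reward += 2.0 ** i            # later milestones earn larger bonuses
            already_rewarded.add(item)
    return reward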

RL fine-tuned VPT model crafting a diamond pickaxe

We found that an RL policy trained from a random initialization (the standard RL method) barely achieves any reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from a VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes humans over 20 minutes (24,000 actions) on average.

Reward over episodes

Conclusion

VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet. Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large-scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.

For more information, please see our paper. We are also open sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year. Contestants can use and fine-tune our models to try to solve many difficult tasks in Minecraft. Those interested can check out the competition webpage and compete for a blue-sky prize of $100,000 in addition to a regular prize pool of $20,000. Grants are available to self-identified underrepresented groups and individuals.


Acknowledgments
This was a large effort by a dedicated team. Each author made huge contributions on many fronts over long time periods. All members were full time on the project for over six months. BB, IA, PZ, and JC were on the original VPT project team, and thus were involved for even longer (over a year). Aside from those original team members, author order is random. It was also randomized between IA and PZ.



Robots play with play dough

The inner child in many of us feels an overwhelming sense of joy when stumbling across a pile of the fluorescent, rubbery mixture of water, salt, and flour that put goo on the map: play dough. (Even if this happens rarely in adulthood.)

While manipulating play dough is fun and easy for 2-year-olds, the shapeless sludge is hard for robots to handle. Machines have become increasingly reliable with rigid objects, but manipulating soft, deformable objects comes with a laundry list of technical challenges. Most importantly, as with most flexible structures, if you move one part, you’re likely affecting everything else.

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University recently let robots take their hand at playing with the modeling compound, but not for nostalgia’s sake. Their new system learns directly from visual inputs to let a robot with a two-fingered gripper see, simulate, and shape doughy objects. “RoboCraft” could reliably plan a robot’s behavior to pinch and release play dough to make various letters, including ones it had never seen. With just 10 minutes of data, the two-finger gripper rivaled human counterparts that teleoperated the machine — performing on par, and at times even better, on the tested tasks.

“Modeling and manipulating objects with high degrees of freedom are essential capabilities for robots to learn how to enable complex industrial and household interaction tasks, like stuffing dumplings, rolling sushi, and making pottery,” says Yunzhu Li, CSAIL PhD student and author on a new paper about RoboCraft. “While there have been recent advances in manipulating clothes and ropes, we found that objects with high plasticity, like dough or plasticine — despite ubiquity in those household and industrial settings — were a largely underexplored territory. With RoboCraft, we learn the dynamics models directly from high-dimensional sensory data, which offers a promising data-driven avenue for us to perform effective planning.”

With undefined, smooth material, the whole structure needs to be accounted for before you can do any type of efficient and effective modeling and planning. By turning the images into graphs of small particles and coupling them with algorithms, RoboCraft uses a graph neural network as its dynamics model to make more accurate predictions about how the material changes shape.

Typically, researchers have used complex physics simulators to model and understand the forces and dynamics being applied to objects, but RoboCraft simply uses visual data. The inner workings of the system rely on three parts to shape soft material into, say, an “R.”

The first part — perception — is all about learning to “see.” It uses cameras to collect raw, visual sensor data from the environment, which are then turned into little clouds of particles to represent the shapes. A graph-based neural network then uses said particle data to learn to “simulate” the object’s dynamics, or how it moves. Then, algorithms help plan the robot’s behavior so it learns to “shape” a blob of dough, armed with the training data from the many pinches. While the letters are a bit loose, they’re indubitably representative. 
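Below is a minimal sketch of the planning step, assuming the learned graph-network dynamics model is available as a black-box function; the names and shapes are placeholders, not the RoboCraft codebase.

import numpy as np

def plan_pinch(dynamics_model, particles, target, candidate_actions, horizon=1):
    """Score each candidate gripper action by rolling out the learned dynamics
    and measuring how close the predicted particle cloud is to the target shape."""
    def cost(action):
        state = particles
        for _ in range(horizon):
            state = dynamics_model(state, action)   # GNN predicts next particle positions
        return float(np.linalg.norm(state - target))
    return min(candidate_actions, key=cost)

# Example usage with a trivial stand-in dynamics model; RoboCraft instead learns
# a graph neural network over particle clouds extracted from camera images.
dummy_dynamics = lambda state, action: state + action
particles = np.zeros((64, 3))                                # current dough shape
target = np.full((64, 3), 0.1)                               # desired dough shape
candidates = [np.full((64, 3), d) for d in (0.05, 0.1, 0.2)] # candidate pinches
best_pinch = plan_pinch(dummy_dynamics, particles, target, candidates)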

Besides cutesy shapes, the team is (actually) working on making dumplings from dough and a prepared filling. Right now, with just a two-finger gripper, it’s a big ask. RoboCraft would need additional tools (a baker needs multiple tools to cook; so do robots) — a rolling pin, a stamp, and a mold.

A domain the scientists envision further in the future is using RoboCraft to assist with household tasks and chores, which could be of particular help to the elderly or those with limited mobility. To accomplish this, given the many obstructions that could take place, a much more adaptive representation of the dough or item would be needed, as well as exploration into what class of models might be suitable to capture the underlying structural systems.

“RoboCraft essentially demonstrates that this predictive model can be learned in very data-efficient ways to plan motion. In the long run, we are thinking about using various tools to manipulate materials,” says Li. “If you think about dumpling or dough making, just one gripper wouldn’t be able to solve it. Helping the model understand and accomplish longer-horizon planning tasks, such as, how the dough will deform given the current tool, movements and actions, is a next step for future work.” 

Li wrote the paper alongside Haochen Shi, Stanford master’s student; Huazhe Xu, Stanford postdoc; Zhiao Huang, PhD student at the University of California at San Diego; and Jiajun Wu, assistant professor at Stanford. They will present the research at the Robotics: Science and Systems conference in New York City. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI), the Samsung Global Research Outreach (GRO) Program, the Toyota Research Institute (TRI), and Amazon, Autodesk, Salesforce, and Bosch.


GODEL: Combining goal-oriented dialog with real-world conversations

Diagram showing GODEL’s architecture. The environment of the dialog system consists of both structured and unstructured content, which it uses to retrieve information. This source content, which we term “grounding,” is updated and repeatedly used by GODEL to produce a new response after each user input.

They make restaurant recommendations, help us pay bills, and remind us of appointments. Many people have come to rely on virtual assistants and chatbots to perform a wide range of routine tasks. But what if a single dialog agent, the technology behind these language-based apps, could perform all these tasks and then take the conversation further? In addition to providing on-topic expertise, such as recommending a restaurant, it could engage in a conversation about the history of the neighborhood or a recent sports game, and then bring the conversation back on track. What if the agent’s responses continually reflect the latest world events? And what if it could do all of this without the need for any additional work by the designer?   

With GODEL, this may not be far off. GODEL stands for Grounded Open Dialogue Language Model, and it ushers in a new class of pretrained language models that enable both task-oriented and social conversation and are evaluated by the usefulness of their responses.  

Pretrained language models are among the engines that power conversational AI, the technology that underlies these dialog agents. They can either be task-oriented (“give me a job, and I’ll do it”) or engage in a conversation without a specified outcome, known as open-domain or chit-chat. GODEL combines both these capabilities, giving dialog agents the ability to generate responses based not just on the context of the conversation, but also on external information, content that was not part of the dataset when the model was trained. This includes both structured content, such as information stored in databases, and unstructured content, such as restaurant reviews, Wikipedia articles, and other publicly available material found on the web. This explains how a simple task-based query about restaurant recommendations can evolve into a dialog about ingredients, food, and even cooking techniques—the kind of winding path that real-world conversations take.  

In 2019, the Deep Learning and Natural Language Processing groups at Microsoft Research released DialoGPT, the first large-scale pretrained language model designed specifically for dialog. This helped make conversational AI more accessible and easier to work with, and it enabled the research community to make considerable progress in this area. With GODEL, our goal is to help further this progress by empowering researchers and developers to create dialog agents that are unrestricted in the types of queries they can respond to and the sources of information they can draw from. We also worked to ensure those responses are useful to the person making the query.    

In our paper, “GODEL: Large-Scale Pre-training for Goal-Directed Dialog,” we describe the technical details underlying GODEL, and we have made the code available on GitHub.

A grounded model

One of GODEL’s key features is the flexibility it provides users in defining their model’s grounding—the sources from which their dialog agents retrieve information. This flexibility informs GODEL’s versatility in diverse conversational settings. If someone were to inquire about a local restaurant, for example, GODEL would be able to provide specific and accurate responses even though that venue may not have been included in the data used to train it. Responses would vary depending on whether the grounding information is empty, a snippet of a document, a search result (unstructured text), or information drawn from a database about the restaurant (structured text). However, each response would be appropriate and useful.
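As a minimal sketch of how grounded generation might be invoked through the Hugging Face transformers library: the checkpoint name and prompt layout below are assumptions based on the public GODEL repository and may differ from the released models.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Assumed checkpoint ID; see the GODEL GitHub repository for the released models.
checkpoint = "microsoft/GODEL-v1_1-base-seq2seq"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

instruction = "Given the dialog context and the grounding, respond helpfully."
dialog = ["Can you recommend a restaurant nearby?",
          "Sure, what kind of food are you in the mood for?",
          "Something vegetarian."]
grounding = "Green Leaf Cafe serves vegetarian food and has a 4.6-star rating."

# Assumed prompt layout: instruction, then dialog turns, then grounding text.
prompt = f"{instruction} [CONTEXT] {' EOS '.join(dialog)} [KNOWLEDGE] {grounding}"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(inputs["input_ids"], max_length=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))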

In addition to specificity, grounded generation helps keep models up to date, as the grounded text can incorporate information that may not have been available at the time the model was trained. For example, if a model were developed before the 2022 Winter Olympics, GODEL would be able to provide details on those games and a list of winners even though all the data available to train it predates that event.

Broad application of GODEL

Another main feature of GODEL is its wide range of dialog applications. While its predecessor, DialoGPT, and other prior pretrained models for dialog have mostly focused on social bots, GODEL can be applied to a variety of dialogs, including those that are task-oriented, question-answering, and grounded chit-chat. In the same conversation, GODEL can produce reasonable responses for a variety of query types, including general questions or requests for specific actions.  

In addition, GODEL’s responses have been evaluated for their helpfulness. In our paper, we show that evaluation is done more reliably on datasets that are goal-directed, and that people generally agree on which responses are better when asked to judge their utility towards achieving certain goals. Equipped with this robust evaluation setup, we compare our model against several strong baselines and state-of-the-art approaches and show that GODEL is superior in terms of both human and automatic evaluation, as indicated in Figure 1. The paper describes extensive experiments against other state-of-the-art pretrained language models and demonstrates that performance gains are even larger in these cases.

Two bar graphs showing that GODEL outperforms the baseline, in terms of both human and automated dialog evaluation. For human evaluation, GODEL received much higher human ratings (47, 41, and 27), while the human ratings for the best baseline were low (30, 22, and 17). For automatic evaluation, differences are smaller yet still statistically significant.
Figure 1: These charts illustrate GODEL’s performance against T5, a pretrained model that performed best in our evaluation. They compare the aggregate performance of models fine-tuned from GODEL against that of models fine-tuned from T5. They show that GODEL performs much better in human evaluations and makes appreciable gains in the automatic evaluation. The test set for these experiments combines a variety of dialog genres, including task-oriented dialog, conversational question-answering, and grounded chit-chat.

The following examples illustrate different dialog scenarios where GODEL uses a variety of sources to respond to identical user queries. 

  • This example illustrates how GODEL responds in an open-ended scenario in which the user asks a question that is completely unrelated to the initial question. Despite the lack of relevance, GODEL responds appropriately while trying to bring the conversation back on track. 

    Figure showing how GODEL responds to a user who just changed the topic, demonstrating that it can bring the conversation back on track. While the initial query is about a restaurant, the user suddenly mentions a series of tornadoes that have recently affected the area. GODEL uses grounding from a recent news article to provide information about the tornadoes, as requested by the user. Finally, it asks the user if there is anything else it can help with.

  • This example illustrates how GODEL responds in a task-oriented setting in which the model is connected to the components of a traditional goal-oriented dialog system, such as a database. In this case, the relevant environment contains structured information: a database returning two restaurants relevant to the current conversation.

    Figure showing how GODEL responds appropriately to a user's request for a restaurant reservation. The user expresses a preference for a restaurant named Lucky Star, and GODEL extracts information from a database about that restaurant and retrieves relevant information, such as a reference number, to generate a response that flows naturally with the rest of the conversation.

  • This example illustrates how GODEL responds in a task-oriented setting in which traditional components of task-oriented dialog systems are not available. In this case, GODEL retrieves a restaurant review via a search engine. The response reflects both the context of the conversation and a snippet of the retrieved text, a restaurant review.  

    Figure showing how GODEL responds appropriately to a user's request for information about a specific restaurant. The user asks whether a given restaurant is good for groups, and GODEL uses text originating from restaurant reviews to infer that the restaurant is indeed good for groups. Also, GODEL provides additional information to address a concern with larger groups—that food is typically served quickly.

  •  This example illustrates how GODEL responds in a question-answering scenario, where the user asks a general question and the context provides the dialog agent with the words it needs to search for the relevant information on the web. 

    Figure showing how GODEL responds appropriately when asked to give an example of a popular Chinese dish. GODEL uses grounding originating from search results to respond to the question while focusing on the most relevant information of the retrieved document.

GODEL available as open source

To advance research, we believe it is crucial to make code and models publicly available, and we have released GODEL as fully open source. We have made three versions of GODEL available: base, large, and extra-large. We are also including the code needed to retrain all pretrained models and to fine-tune models for specific tasks: the CoQA dataset, intended for conversational question-answering; the Wizard of Wikipedia and Wizard of the Internet datasets, aimed at information-seeking chats; and the MultiWOZ dataset, intended for task-completion dialogs.

We hope GODEL helps numerous academic research teams advance the field of conversational AI with innovative dialog models while eliminating the need for significant GPU resources. We plan to continuously improve GODEL and make more models available to the research community. Please visit our project page to learn more about the GODEL project and new releases.

Acknowledgements

We would like to thank our fellow colleagues at Microsoft Research who contributed to this work and blog post: Bill Dolan, Pengcheng He, Elnaz Nouri, Clarisse Simoes Ribeiro. 

The post GODEL: Combining goal-oriented dialog with real-world conversations appeared first on Microsoft Research.


Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs

Thanks to the GeForce cloud, even Mac users can be PC gamers. This GFN Thursday, fire up your MacBook and get your game on.

This week brings eight more games to the GeForce NOW library. Plus, members can play Genshin Impact and claim a reward to start them out on their journeys streaming on GeForce NOW.

Mac User by Day, Gamer by Night

Love using a Mac, but can’t play the PC-only game that everyone’s talking about — like Genshin Impact or this week’s Epic Games Store free game, Car Mechanic Simulator 2018? GeForce NOW transforms nearly any Mac into a high-end gaming rig, rendering games at full quality and streaming them to MacBook Pros, MacBook Airs, iMacs and Mac Minis.

On GeForce NOW, you play the real PC versions of games without having to worry if something has been ported to Mac. Since the native PC version of games streams straight from the cloud, gamers can upgrade to the newest Apple hardware with confidence.

GeForce NOW RTX 3080 members can play on M1 Mac laptops at up to 1600p, or up to 4K resolution on supported external displays. Stream with even longer session lengths — up to eight hours. Members on RTX 3080 and Priority plans can even play with RTX ON for supported games, experiencing modern classics like Cyberpunk 2077 and Control with real-time ray tracing. No PC required.

Game saves are synced across each digital store for supported games, so members can play on Macs, as well as any other supported device, without losing progress.

Join today to see what it’s like to have the best of both PC and Mac worlds.

Get Started With Genshin Impact

This week brings the release of Genshin Impact, as well as rewards for Travelers playing on GeForce NOW.

Embark on a journey as a Traveler from another world and search for a missing sibling in the fantastic continent of Teyvat. Explore immersive landscapes, dive deep into rich quests alongside iconic characters and complete daily challenges, streaming across supported PCs, Macs and Chromebooks.

RTX 3080 members can even play with ultra-low latency, streaming at 1440p and 120 frames per second or in 4K resolution at 60 FPS on the PC and Mac apps.

Genshin Impact Reward on GeForce NOW
Start the adventure off right with rewards in “Genshin Impact.”

Members who’ve opted in to rewards will receive an email for a starter kit that can be claimed through the NVIDIA Rewards redemption portal. The kit will become available in game once players reach Adventure Rank 10.

The reward includes 10,000 Mora to purchase various items, five Fine Enhancement Ores to enhance weapons, three Squirrel Fish and three Northern Apple Stews for fuel, and 10 Adventurer’s Experience points to level up characters.

Getting membership rewards for streaming games on the cloud is easy. Log in to your NVIDIA account and select “GEFORCE NOW” from the header, then scroll down to “REWARDS” and click the “UPDATE REWARDS SETTINGS” button. Check the box in the dialogue window that shows up to start receiving special offers and in-game goodies.

Jump Into the Newest Games

Planet Zoo on GeForce NOW
Get a little wild this week with new endangered animals to care for and more in the Planet Zoo: Conservation Pack.

There’s something for everyone on GeForce NOW. This week brings new in-game content like the Planet Zoo: Conservation Pack, the newest DLC for Frontier Developments’ ultimate zoo sim.

Members can also stream the following eight new titles this week:

Finally, we’ve got a little challenge for you this week. Let us know your answer on Twitter or in the comments below.

The post Making an Impact: GFN Thursday Transforms Macs Into GeForce Gaming PCs appeared first on NVIDIA Blog.



Geospatial deep learning with TorchGeo

TorchGeo is a PyTorch domain library providing datasets, samplers, transforms, and pre-trained models specific to geospatial data.

https://github.com/microsoft/torchgeo

For decades, Earth observation satellites, aircraft, and more recently UAV platforms have been collecting increasing amounts of imagery of the Earth’s surface. With information about seasonal and long-term trends, remotely sensed imagery can be invaluable for solving some of the greatest challenges to humanity, including climate change adaptation, natural disaster monitoring, water resource management, and food security for a growing global population. From a computer vision perspective, this includes applications like land cover mapping (semantic segmentation), deforestation and flood monitoring (change detection), glacial flow (pixel tracking), hurricane tracking and intensity estimation (regression), and building and road detection (object detection, instance segmentation). By leveraging recent advancements in deep learning architectures, cheaper and more powerful GPUs, and petabytes of freely available satellite imagery datasets, we can come closer to solving these important problems.

National Oceanic and Atmospheric Administration satellite image of Hurricane Katrina, taken on August 28, 2005 (source). Geospatial machine learning libraries like TorchGeo can be used to detect, track, and predict future trajectories of hurricanes and other natural disasters.

The challenges

In traditional computer vision datasets, such as ImageNet, the image files themselves tend to be rather simple and easy to work with. Most images have 3 spectral bands (RGB), are stored in common file formats like PNG or JPEG, and can be easily loaded with popular software libraries like PIL or OpenCV. Each image in these datasets is usually small enough to pass directly into a neural network. Furthermore, most of these datasets contain a finite number of well-curated images that are assumed to be independent and identically distributed, making train-val-test splits straightforward. As a result of this relative homogeneity, the same pre-trained models (e.g., CNNs pretrained on ImageNet) have shown to be effective across a wide range of vision tasks using transfer learning methods. Existing libraries, such as torchvision, handle these simple cases well, and have been used to make large advances in vision tasks over the past decade.

Remote sensing imagery is not so uniform. Instead of simple RGB images, satellites tend to capture images that are multispectral (Landsat 8 has 11 spectral bands) or even hyperspectral (Hyperion has 242 spectral bands). These images capture information at a wider range of wavelengths (400 nm–15 µm), far outside of the visible spectrum. Different satellites also have very different spatial resolutions—GOES has a resolution of 4 km/px, Maxar imagery is 30 cm/px, and drone imagery resolution can be as high as 7 mm/px. These datasets almost always have a temporal component, with satellite revisits that are daily, weekly, or biweekly. Images often have overlap with other images in the dataset, and need to be stitched together based on geographic metadata. These images tend to be very large (e.g., 10K x 10K pixels), so it isn’t possible to pass an entire image through a neural network. This data is distributed in hundreds of different raster and vector file formats like GeoTIFF and ESRI Shapefile, requiring specialty libraries like GDAL to load.

From left to right: Mercator, Albers Equal Area, and Interrupted Goode Homolosine projections (source). Geospatial data is associated with one of many different types of reference systems that project the 3D Earth onto a 2D representation. Combining data from different sources often involves re-projecting to a common reference system in order to ensure that all layers are aligned.

Although each image is 2D, the Earth itself is 3D. In order to stitch together images, they first need to be projected onto a 2D representation of the Earth, called a coordinate reference system (CRS). Most people are familiar with equal angle representations like Mercator that distort the size of regions (Greenland looks larger than Africa even though Africa is 15x larger), but there are many other CRSs that are commonly used. Each dataset may use a different CRS, and each image within a single dataset may also be in a unique CRS. In order to use data from multiple layers, they must all share a common CRS, otherwise the data won’t be properly aligned. For those who aren’t familiar with remote sensing data, this can be a daunting task.

Even if you correctly georeference images during indexing, if you don’t project them to a common CRS, you’ll end up with rotated images with nodata values around them, and the images won’t be pixel-aligned.
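TorchGeo handles this alignment for you, but as a point of reference, the following is a minimal sketch of what re-projecting a single raster to a common CRS looks like with rasterio; the file paths and target CRS are placeholders.

import rasterio
from rasterio.warp import calculate_default_transform, reproject, Resampling

dst_crs = "EPSG:32617"   # placeholder: a common CRS shared by all layers

with rasterio.open("scene.tif") as src:
    # Compute the transform and output size for the target CRS
    transform, width, height = calculate_default_transform(
        src.crs, dst_crs, src.width, src.height, *src.bounds)
    profile = src.profile.copy()
    profile.update(crs=dst_crs, transform=transform, width=width, height=height)

    with rasterio.open("scene_reprojected.tif", "w", **profile) as dst:
        for band in range(1, src.count + 1):
            reproject(
                source=rasterio.band(src, band),
                destination=rasterio.band(dst, band),
                src_transform=src.transform, src_crs=src.crs,
                dst_transform=transform, dst_crs=dst_crs,
                resampling=Resampling.nearest)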

The solution

At the moment, it can be quite challenging to work with both deep learning models and geospatial data without having expertise in both of these very different fields. To address these challenges, we’ve built TorchGeo, a PyTorch domain library for working with geospatial data. TorchGeo is designed to make it simple:

  1. for machine learning experts to work with geospatial data, and
  2. for remote sensing experts to explore machine learning solutions.

TorchGeo is not just a research project, but a production-quality library that uses continuous integration to test every commit with a range of Python versions on a range of platforms (Linux, macOS, Windows). It can be easily installed with any of your favorite package managers, including pip, conda, and spack:

$ pip install torchgeo

TorchGeo is designed to have the same API as other PyTorch domain libraries like torchvision, torchtext, and torchaudio. If you already use torchvision in your workflow for computer vision datasets, you can switch to TorchGeo by changing only a few lines of code. All TorchGeo datasets and samplers are compatible with the PyTorch DataLoader class, meaning that you can take advantage of wrapper libraries like PyTorch Lightning for distributed training. In the following sections, we’ll explore possible use cases for TorchGeo to show how simple it is to use.

Geospatial datasets and samplers

Example application in which we combine A) a scene from Landsat 8 and B) Cropland Data Layer labels, even though these files are in different EPSG projections. We want to sample patches C) and D) from these datasets using a geospatial bounding box as an index.

Many remote sensing applications involve working with geospatial datasets: datasets with geographic metadata. In TorchGeo, we define a GeoDataset class to represent these kinds of datasets. Instead of being indexed by an integer, each GeoDataset is indexed by a spatiotemporal bounding box, meaning that two or more datasets covering a different geographic extent can be intelligently combined.
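For example, a GeoDataset can be queried with a bounding box in space and time rather than an integer index. This is a rough sketch: the coordinate values below are placeholders, and coordinates are interpreted in the dataset's CRS.

from torchgeo.datasets import BoundingBox, Landsat8

dataset = Landsat8(root="...")

# Query by geographic extent and time range instead of an integer index.
# mint/maxt are UNIX timestamps; minx/maxx/miny/maxy are in the dataset's CRS.
query = BoundingBox(minx=300000, maxx=310000, miny=4500000, maxy=4510000,
                    mint=0, maxt=2**31)
sample = dataset[query]
image = sample["image"]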

In this example, we show how easy it is to work with geospatial data and to sample small image patches from a combination of Landsat and Cropland Data Layer (CDL) data using TorchGeo. First, we assume that the user has Landsat 7 and 8 imagery downloaded. Since Landsat 8 has more spectral bands than Landsat 7, we’ll only use the bands that both satellites have in common. We’ll create a single dataset including all images from both Landsat 7 and 8 data by taking the union between these two datasets.

from torch.utils.data import DataLoader
from torchgeo.datasets import CDL, Landsat7, Landsat8, stack_samples
from torchgeo.samplers import RandomGeoSampler

landsat7 = Landsat7(root="...")
landsat8 = Landsat8(root="...", bands=Landsat8.all_bands[1:-2])
landsat = landsat7 | landsat8

Next, we take the intersection between this dataset and the CDL dataset. We want to take the intersection instead of the union to ensure that we only sample from regions where we have both Landsat and CDL data. Note that we can automatically download and checksum CDL data. Also note that each of these datasets may contain files in different CRSs or resolutions, but TorchGeo automatically ensures that a matching CRS and resolution is used.

cdl = CDL(root="...", download=True, checksum=True)
dataset = landsat & cdl

This dataset can now be used with a PyTorch data loader. Unlike benchmark datasets, geospatial datasets often include very large images. For example, the CDL dataset consists of a single image covering the entire contiguous United States. In order to sample from these datasets using geospatial coordinates, TorchGeo defines a number of samplers. In this example, we’ll use a random sampler that returns 256 x 256 pixel images and 10,000 samples per epoch. We’ll also use a custom collation function to combine each sample dictionary into a mini-batch of samples.

sampler = RandomGeoSampler(dataset, size=256, length=10000)
dataloader = DataLoader(dataset, batch_size=128, sampler=sampler, collate_fn=stack_samples)

This data loader can now be used in your normal training/evaluation pipeline.

for batch in dataloader:
    image = batch["image"]
    mask = batch["mask"]

    # train a model, or make predictions using a pre-trained model

Many applications involve intelligently composing datasets based on geospatial metadata like this. For example, users may want to:

  • Combine datasets for multiple image sources and treat them as equivalent (e.g., Landsat 7 and 8)
  • Combine datasets for disparate geospatial locations (e.g., Chesapeake NY and PA)

These combinations require that all queries are present in at least one dataset, and can be created using a UnionDataset. Similarly, users may want to:

  • Combine image and target labels and sample from both simultaneously (e.g., Landsat and CDL)
  • Combine datasets for multiple image sources for multimodal learning or data fusion (e.g., Landsat and Sentinel)

These combinations require that all queries are present in both datasets, and can be created using an IntersectionDataset. TorchGeo automatically composes these datasets for you when you use the intersection (&) and union (|) operators.

Multispectral and geospatial transforms

In deep learning, it’s common to augment and transform the data so that models are robust to variations in the input space. Geospatial data can have variations such as seasonal changes and warping effects, as well as image processing and capture issues like cloud cover and atmospheric distortion. TorchGeo utilizes augmentations and transforms from the Kornia library, which supports GPU acceleration and multispectral imagery with more than 3 channels.

Traditional geospatial analyses compute and visualize spectral indices which are combinations of multispectral bands. Spectral indices are designed to highlight areas of interest in a multispectral image relevant to some application, such as vegetation health, areas of man-made change or increasing urbanization, or snow cover. TorchGeo supports numerous transforms, which can compute common spectral indices and append them as additional bands to a multispectral image tensor.

Below, we show a simple example where we compute the Normalized Difference Vegetation Index (NDVI) on a Sentinel-2 image. NDVI measures the presence of vegetation and vegetation health and is computed as the normalized difference between the red and near-infrared (NIR) spectral bands. Spectral index transforms operate on sample dictionaries returned from TorchGeo datasets and append the resulting spectral index to the image channel dimension.
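Concretely, NDVI = (NIR - Red) / (NIR + Red), computed per pixel from the near-infrared and red reflectance values.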

First, we instantiate a Sentinel-2 dataset and load a sample image. Then, we plot the true color (RGB) representation of this data to see the region we are looking at.

import matplotlib.pyplot as plt
from torchgeo.datasets import Sentinel2
from torchgeo.transforms import AppendNDVI

dataset = Sentinel2(root="...")
sample = dataset[...]
fig = dataset.plot(sample)
plt.show()

Next, we instantiate and compute an NDVI transform, appending this new channel to the end of the image. Sentinel-2 imagery uses index 0 for its red band and index 3 for its NIR band. In order to visualize the data, we also normalize the image. NDVI values can range from -1 to 1, but we want to use the range 0 to 1 for plotting.

transform = AppendNDVI(index_red=0, index_nir=3)
sample = transform(sample)
sample["image"][-1] = (sample["image"][-1] + 1) / 2
plt.imshow(sample["image"][-1], cmap="RdYlGn_r")
plt.show()

True color (left) and NDVI (right) of the Texas Hill Region, taken on November 16, 2018 by the Sentinel-2 satellite. In the NDVI image, red indicates water bodies, yellow indicates barren soil, light green indicates unhealthy vegetation, and dark green indicates healthy vegetation.

Benchmark datasets

One of the driving factors behind progress in computer vision is the existence of standardized benchmark datasets like ImageNet and MNIST. Using these datasets, researchers can directly compare the performance of different models and training procedures to determine which perform the best. In the remote sensing domain, there are many such datasets, but due to the aforementioned difficulties of working with this data and the lack of existing libraries for loading these datasets, many researchers opt to use their own custom datasets.

One of the goals of TorchGeo is to provide easy-to-use data loaders for these existing datasets. TorchGeo includes a number of benchmark datasets: datasets that include both input images and target labels. This includes datasets for tasks like image classification, regression, semantic segmentation, object detection, instance segmentation, change detection, and more.

If you’ve used torchvision before, these types of datasets should be familiar. In this example, we’ll create a dataset for the Northwestern Polytechnical University (NWPU) very-high-resolution ten-class (VHR-10) geospatial object detection dataset. This dataset can be automatically downloaded, checksummed, and extracted, just like with torchvision.

from torch.utils.data import DataLoader
from torchgeo.datasets import VHR10

dataset = VHR10(root="...", download=True, checksum=True)
dataloader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=4)

for batch in dataloader:
    image = batch["image"]
    label = batch["label"]

    # train a model, or make predictions using a pre-trained model

All TorchGeo datasets are compatible with PyTorch data loaders, making them easy to integrate into existing training workflows. The only difference between a benchmark dataset in TorchGeo and a similar dataset in torchvision is that each dataset returns a dictionary with keys for each PyTorch Tensor.

Example predictions from a Mask R-CNN model trained on the NWPU VHR-10 dataset. The model predicts sharp bounding boxes and masks for all objects with high confidence scores.

Reproducibility with PyTorch Lightning

Another key goal of TorchGeo is reproducibility. For many of these benchmark datasets, there is no predefined train-val-test split, or the predefined split has issues with class imbalance or geographic distribution. As a result, the performance metrics reported in the literature either can’t be reproduced, or aren’t indicative of how well a pre-trained model would work in a different geographic location.

In order to facilitate direct comparisons between results published in the literature and further reduce the boilerplate code needed to run experiments with datasets in TorchGeo, we have created PyTorch Lightning datamodules with well-defined train-val-test splits and trainers for various tasks like classification, regression, and semantic segmentation. These datamodules show how to incorporate augmentations from the Kornia library, include preprocessing transforms (with pre-calculated channel statistics), and let users easily experiment with hyperparameters related to the data itself (as opposed to the modeling process). Training a semantic segmentation model on the Inria Aerial Image Labeling dataset is as easy as a few imports and four lines of code.

from pytorch_lightning import Trainer
from torchgeo.datamodules import InriaAerialImageLabelingDataModule
from torchgeo.trainers import SemanticSegmentationTask

datamodule = InriaAerialImageLabelingDataModule(root_dir="...", batch_size=64, num_workers=6)
task = SemanticSegmentationTask(model="resnet18", pretrained=True, learning_rate=0.1)
trainer = Trainer(gpus=1, default_root_dir="...")

trainer.fit(model=task, datamodule=datamodule)

Building segmentations produced by a U-Net model trained on the Inria Aerial Image Labeling dataset. Reproducing these results is as simple as a few imports and four lines of code, making comparison of different models and training techniques simple and easy.

In our preprint we show a set of results that use the aforementioned datamodules and trainers to benchmark simple modeling approaches for several of the datasets in TorchGeo. For example, we find that a simple ResNet-50 can achieve state-of-the-art performance on the So2Sat dataset. These types of baseline results are important for evaluating the contribution of different modeling choices when tackling problems with remotely sensed data.

Future work and contributing

There is still a lot of remaining work to be done in order to make TorchGeo as easy to use as possible, especially for users without prior deep learning experience. One of the ways in which we plan to achieve this is by expanding our tutorials to include subjects like “writing a custom dataset” and “transfer learning”, or tasks like “land cover mapping” and “object detection”.

Another important project we are working on is pre-training models. Most remote sensing researchers work with very small labeled datasets, and could benefit from pre-trained models and transfer learning approaches. TorchGeo is the first deep learning library to provide models pre-trained on multispectral imagery. Our goal is to provide models for different image modalities (optical, SAR, multispectral) and specific platforms (Landsat, Sentinel, MODIS) as well as benchmark results showing their performance with different amounts of training data. Self-supervised learning is a promising method for training such models. Satellite imagery datasets often contain petabytes of imagery, but accurately labeled datasets are much harder to come by. Self-supervised learning methods will allow us to train directly on the raw imagery without needing large labeled datasets.

Aside from these larger projects, we’re always looking to add new datasets, data augmentation transforms, and sampling strategies. If you’re Python savvy and interested in contributing to TorchGeo, we would love to see contributions! TorchGeo is open source under an MIT license, so you can use it in almost any project.


If you like TorchGeo, give us a star on GitHub! And if you use TorchGeo in your work, please cite our paper.

Acknowledgments

We would like to thank all TorchGeo contributors for their efforts in creating the library, the Microsoft AI for Good program for support, and the PyTorch Team for their guidance. This research is part of the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993), the State of Illinois, and as of December, 2019, the National Geospatial-Intelligence Agency. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center for Supercomputing Applications. The research was supported in part by NSF grants IIS-1908104, OAC-1934634, and DBI-2021898.


Visual inspection automation using Amazon SageMaker JumpStart

According to Gartner, hyperautomation is the number one trend in 2022 and will continue advancing in the future. One of the main barriers to hyperautomation is in areas where we’re still struggling to reduce human involvement. Intelligent systems have a hard time matching human visual recognition abilities, despite great advancements in deep learning in computer vision. This is mainly due to the lack of annotated data (or when data is sparse) and in areas such as quality control, where trained human eyes still dominate. Another reason is the feasibility of human access in all areas of the product supply chain, such as quality control inspection on the production line. Visual inspection is widely used for performing internal and external assessment of various equipment in a production facility, such as storage tanks, pressure vessels, piping, vending machines, and other equipment, which extends to many industries, such as electronics, medical, CPG, raw materials, and more.

Using Artificial Intelligence (AI) for automated visual inspection or augmenting the human visual inspection process with AI can help address the challenges outlined below.

Challenges of human visual inspection

Human-led visual inspection has the following high-level issues:

  • Scale – Most products go through multiple stages, from assembly to supply chain to quality control, before being made available to the end consumer. Defects can occur during the manufacturing process or assembly at different points in space and time. Therefore, it’s not always feasible or cost-effective to use in-person human visual inspection. This inability to scale can result in disasters such as the BP Deepwater Horizon oil spill and Challenger space shuttle explosion, the overall negative impact of which (to humans and nature) overshoots the monetary cost by quite a distance.
  • Human visual error – In areas where human-led visual inspection can be conveniently performed, human error is a major factor that often goes overlooked. According to one report, most inspection tasks are complex and typically exhibit error rates of 20–30%, which directly translates to cost and undesirable outcomes.
  • Personnel and miscellaneous costs – Although the overall cost of quality control can vary greatly depending on industry and location, according to some estimates, a trained quality inspector salary ranges between $26,000–60,000 (USD) per year. There are also other miscellaneous costs that may not always be accounted for.

SageMaker JumpStart is a great place to get started with various Amazon SageMaker features and capabilities through curated one-click solutions, example notebooks, and pre-trained computer vision, natural language processing, and tabular data models that users can choose, fine-tune (if needed), and deploy on SageMaker infrastructure.

In this post, we walk through how to quickly deploy an automated defect detection solution, from data ingestion to model inferencing, using a publicly available dataset and SageMaker JumpStart.

Solution overview

This solution uses a state-of-the-art deep learning approach to automatically detect surface defects using SageMaker. The Defect Detection Network (DDN) model enhances Faster R-CNN and identifies possible defects in an image of a steel surface. The NEU surface defect database is a balanced dataset that contains six kinds of typical surface defects of a hot-rolled steel strip: rolled-in scale (RS), patches (Pa), crazing (Cr), pitted surface (PS), inclusion (In), and scratches (Sc). The database includes 1,800 grayscale images: 300 samples of each type of defect.
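For reference only, a plain Faster R-CNN baseline for the six NEU defect classes can be set up with torchvision as follows; this is a generic sketch, not the DDN model shipped with the JumpStart solution.

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 1 + 6   # background + {RS, Pa, Cr, PS, In, Sc}

# Generic baseline sketch, not the DDN: start from a COCO-pretrained Faster R-CNN
# and swap in a box predictor sized for the NEU defect classes.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# During training, the model expects a list of image tensors plus a list of
# target dicts containing "boxes" (N x 4) and "labels" (N,) per image.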

Content

The JumpStart solution contains the following artifacts, which are available to you from the JupyterLab File Browser:

  • cloudformation/ – AWS CloudFormation configuration files to create relevant SageMaker resources and apply permissions. Also includes cleanup scripts to delete created resources.
  • src/ – Contains the following:

    • prepare_data/ – Data preparation for NEU datasets.
    • sagemaker_defect_detection/ – Main package containing the following:

      • dataset – Contains NEU dataset handling.
      • models – Contains the Automated Defect Inspection (ADI) system, called the Defect Detection Network. See the accompanying paper for details.
      • utils – Various utilities for visualization and COCO evaluation.
      • classifier.py – For the classification task.
      • detector.py – For the detection task.
      • transforms.py – Contains the image transformations used in training.
  • notebooks/ – The individual notebooks, discussed in more detail later in this post.
  • scripts/ – Various scripts for training and building.

Default dataset

This solution trains a classifier on the NEU-CLS dataset and a detector on the NEU-DET dataset. This dataset contains 1,800 images and 4,189 bounding boxes in total. The types of defects in our dataset are as follows:

  • Crazing (class: Cr, label: 0)
  • Inclusion (class: In, label: 1)
  • Pitted surface (class: PS, label: 2)
  • Patches (class: Pa, label: 3)
  • Rolled-in scale (class: RS, label: 4)
  • Scratches (class: Sc, label: 5)

The following are sample images of the six classes.

The following images are sample detection results. From left to right, we have the original image, the ground truth detection, and the SageMaker DDN model output.

Architecture

The JumpStart solution comes pre-packaged with Amazon SageMaker Studio notebooks that download the required datasets and contain the code and helper functions for training the models and deploying them using a real-time SageMaker endpoint.

All notebooks download the dataset from a public Amazon Simple Storage Service (Amazon S3) bucket and import helper functions to visualize the images. The notebooks allow the user to customize the solution, such as hyperparameters for model training or perform transfer learning in case you choose to use the solution for your defect detection use case.

The solution contains the following four Studio notebooks:

  • 0_demo.ipynb – Creates a model object from a pre-trained DDN model on the NEU-DET dataset and deploys it behind a real-time SageMaker endpoint. Then we send some image samples with defects for detection and visualize the results.
  • 1_retrain_from_checkpoint.ipynb – Retrains our pre-trained detector for a few more epochs and compares results. You can also bring your own dataset; however, we use the same dataset in the notebook. Also included is a step to perform transfer learning by fine-tuning the pre-trained model. Fine-tuning a deep learning model on one particular task involves using the learned weights from a particular dataset to enhance the performance of the model on another dataset. You can also perform fine-tuning over the same dataset used in the initial training but perhaps with different hyperparameters.
  • 2_detector_from_scratch.ipynb – Trains our detector from scratch to identify if defects exist in an image.
  • 3_classification_from_scratch.ipynb – Trains our classifier from scratch to classify the type of defect in an image.

Each notebook contains boilerplate code that deploys a SageMaker real-time endpoint for model inference. You can view the list of notebooks by going to the JupyterLab file browser and navigating to the notebooks folder in the JumpStart solution directory, or by choosing Open Notebook on the Product Defect Detection solution page in JumpStart (see below).

Prerequisites

The solution outlined in this post is part of Amazon SageMaker JumpStart. To run this SageMaker JumpStart 1P Solution and have the infrastructure deploy to your AWS account, you need to create an active Amazon SageMaker Studio instance (see Onboard to Amazon SageMaker Domain).

JumpStart features are not available in SageMaker notebook instances, and you can’t access them through the AWS Command Line Interface (AWS CLI).

Deploy the solution

We provide walkthrough videos of the high-level steps in this solution. To start, launch SageMaker JumpStart and choose the Product Defect Detection solution on the Solutions tab.

The provided SageMaker notebooks download the input data and launch the later stages. The input data is located in an S3 bucket.

We train the classifier and detector models and evaluate the results in SageMaker. If desired, you can deploy the trained models and create SageMaker endpoints.

The SageMaker endpoint created from the previous step is an HTTPS endpoint and is capable of producing predictions.
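For example, a deployed endpoint can be invoked with a few lines of boto3; the endpoint name, content type, and payload format below are assumptions, and the solution's notebooks show the exact request format the endpoint expects.

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("steel_surface_sample.jpg", "rb") as f:   # hypothetical test image
    payload = f.read()

response = runtime.invoke_endpoint(
    EndpointName="defect-detection-endpoint",       # hypothetical endpoint name
    ContentType="application/x-image",              # assumed content type
    Body=payload,
)
print(response["Body"].read().decode("utf-8"))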

You can monitor the model training and deployment via Amazon CloudWatch.

Clean up

When you’re finished with this solution, make sure that you delete all unwanted AWS resources. You can use AWS CloudFormation to automatically delete all standard resources that were created by the solution and notebook. On the AWS CloudFormation console, delete the parent stack. Deleting the parent stack automatically deletes the nested stacks.

You need to manually delete any extra resources that you may have created in this notebook, such as extra S3 buckets in addition to the solution’s default bucket or extra SageMaker endpoints (using a custom name).

Conclusion

In this post, we introduced a solution using SageMaker JumpStart to address issues with the current state of visual inspection, quality control, and defect detection in various industries. We recommended a novel approach called Automated Defect Inspection system built using a pre-trained DDN model for defect detection on steel surfaces. After you launched the JumpStart solution and downloaded the public NEU datasets, you deployed a pre-trained model behind a SageMaker real-time endpoint and analyzed the endpoint metrics using CloudWatch. We also discussed other features of the JumpStart solution, such as how to bring your own training data, perform transfer learning, and retrain the detector and classifier.

Try out this JumpStart solution in SageMaker Studio, either by retraining the existing model on a new dataset for defect detection or by picking from SageMaker JumpStart’s library of computer vision, NLP, and tabular models and deploying them for your specific use case.


About the Authors

Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant has held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using Science to lead a meaningful life & exploring delicious vegetarian cuisine from around the world.

Tao Sun is an Applied Scientist in AWS. He obtained his Ph.D. in Computer Science from University of Massachusetts, Amherst. His research interests lie in deep reinforcement learning and probabilistic modeling. He contributed to AWS DeepRacer, AWS DeepComposer. He likes ballroom dance and reading during his spare time.
