World-Class: NVIDIA Research Builds AI Model to Populate Virtual Worlds With 3D Objects, Characters

The massive virtual worlds created by growing numbers of companies and creators could be more easily populated with a diverse array of 3D buildings, vehicles, characters and more — thanks to a new AI model from NVIDIA Research.

Trained using only 2D images, NVIDIA GET3D generates 3D shapes with high-fidelity textures and complex geometric details. These 3D objects are created in the same format used by popular graphics software applications, allowing users to immediately import their shapes into 3D renderers and game engines for further editing.

The generated objects could be used in 3D representations of buildings, outdoor spaces or entire cities, designed for industries including gaming, robotics, architecture and social media.

GET3D can generate a virtually unlimited number of 3D shapes based on the data it’s trained on. Like an artist who turns a lump of clay into a detailed sculpture, the model transforms numbers into complex 3D shapes.

With a training dataset of 2D car images, for example, it creates a collection of sedans, trucks, race cars and vans. When trained on animal images, it comes up with creatures such as foxes, rhinos, horses and bears. Given chairs, the model generates assorted swivel chairs, dining chairs and cozy recliners.

“GET3D brings us a step closer to democratizing AI-powered 3D content creation,” said Sanja Fidler, vice president of AI research at NVIDIA, who leads the Toronto-based AI lab that created the tool. “Its ability to instantly generate textured 3D shapes could be a game-changer for developers, helping them rapidly populate virtual worlds with varied and interesting objects.”

GET3D is one of more than 20 NVIDIA-authored papers and workshops accepted to the NeurIPS AI conference, taking place in New Orleans and virtually, Nov. 26-Dec. 4.

It Takes AI Kinds to Make a Virtual World

The real world is full of variety: streets are lined with unique buildings, with different vehicles whizzing by and diverse crowds passing through. Manually modeling a 3D virtual world that reflects this is incredibly time consuming, making it difficult to fill out a detailed digital environment.

Though quicker than manual methods, prior 3D generative AI models were limited in the level of detail they could produce. Even recent inverse rendering methods can only generate 3D objects based on 2D images taken from various angles, requiring developers to build one 3D shape at a time.

GET3D can instead churn out some 20 shapes a second when running inference on a single NVIDIA GPU — working like a generative adversarial network for 2D images, while generating 3D objects. The larger, more diverse the training dataset it’s learned from, the more varied and detailed the output.

NVIDIA researchers trained GET3D on synthetic data consisting of 2D images of 3D shapes captured from different camera angles. It took the team just two days to train the model on around 1 million images using NVIDIA A100 Tensor Core GPUs.

Enabling Creators to Modify Shape, Texture, Material

GET3D gets its name from its ability to Generate Explicit Textured 3D meshes — meaning that the shapes it creates are in the form of a triangle mesh, like a papier-mâché model, covered with a textured material. This lets users easily import the objects into game engines, 3D modelers and film renderers — and edit them.

Once creators export GET3D-generated shapes to a graphics application, they can apply realistic lighting effects as the object moves or rotates in a scene. By incorporating another AI tool from NVIDIA Research, StyleGAN-NADA, developers can use text prompts to add a specific style to an image, such as modifying a rendered car to become a burned car or a taxi, or turning a regular house into a haunted one.

The researchers note that a future version of GET3D could use camera pose estimation techniques to allow developers to train the model on real-world data instead of synthetic datasets. It could also be improved to support universal generation — meaning developers could train GET3D on all kinds of 3D shapes at once, rather than needing to train it on one object category at a time.

For the latest news from NVIDIA AI research, watch the replay of NVIDIA founder and CEO Jensen Huang’s keynote address at GTC.



TensorStore for High-Performance, Scalable Array Storage

Many exciting contemporary applications of computer science and machine learning (ML) manipulate multidimensional datasets that span a single large coordinate system, for example, weather modeling from atmospheric measurements over a spatial grid or medical imaging predictions from multi-channel image intensity values in a 2d or 3d scan. In these settings, even a single dataset may require terabytes or petabytes of data storage. Such datasets are also challenging to work with as users may read and write data at irregular intervals and varying scales, and are often interested in performing analyses using numerous machines working in parallel.

Today we are introducing TensorStore, an open-source C++ and Python software library designed for storage and manipulation of n-dimensional data.

TensorStore has already been used to solve key engineering challenges in scientific computing (e.g., management and processing of large datasets in neuroscience, such as peta-scale 3d electron microscopy data and “4d” videos of neuronal activity). TensorStore has also been used in the creation of large-scale machine learning models such as PaLM by addressing the problem of managing model parameters (checkpoints) during distributed training.

Familiar API for Data Access and Manipulation

TensorStore provides a simple Python API for loading and manipulating large array data. In the following example, we create a TensorStore object that represents a 56 trillion voxel 3d image of a fly brain and access a small 100×100 patch of the data as a NumPy array:

>>> import tensorstore as ts
>>> import numpy as np

# Create a TensorStore object to work with fly brain data.
>>> dataset = ts.open({
...     'driver': 'neuroglancer_precomputed',
...     'kvstore': 'gs://neuroglancer-janelia-flyem-hemibrain/' +
...                'v1.1/segmentation/',
... }).result()

# Create a 3-d view (remove singleton 'channel' dimension):
>>> dataset_3d = dataset[ts.d['channel'][0]]
>>> dataset_3d.domain
{ "x": [0, 34432), "y": [0, 39552), "z": [0, 41408) }

# Convert a 100x100x1 slice of the data to a numpy ndarray
>>> slice = np.array(dataset_3d[15000:15100, 15000:15100, 20000])

Crucially, no actual data is accessed or stored in memory until the specific 100×100 slice is requested; hence arbitrarily large underlying datasets can be loaded and manipulated without having to store the entire dataset in memory, using indexing and manipulation syntax largely identical to standard NumPy operations. TensorStore also provides extensive support for advanced indexing features, including transforms, alignment, broadcasting, and virtual views (data type conversion, downsampling, lazily on-the-fly generated arrays).
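
For instance, a downsampled virtual view can be layered on top of the fly-brain volume opened above. The following is a minimal sketch (the downsample factors are arbitrary, and the 'stride' method simply keeps every eighth voxel); as with ordinary indexing, no voxels are fetched until the view is actually read:

# Build a lazily evaluated 8x-downsampled virtual view of the segmentation volume.
>>> downsampled = ts.downsample(dataset_3d, [8, 8, 8], method='stride')

# Reading a region of the view fetches only the underlying chunks it touches.
>>> patch = np.array(downsampled[1000:1100, 1000:1100, 2500])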

The following example demonstrates how TensorStore can be used to create a zarr array, and how its asynchronous API enables higher throughput:

>>> import tensorstore as ts
>>> import numpy as np

>>> # Create a zarr array on the local filesystem
>>> dataset = ts.open({
...     'driver': 'zarr',
...     'kvstore': 'file:///tmp/my_dataset/',
... },
...     dtype=ts.uint32,
...     chunk_layout=ts.ChunkLayout(chunk_shape=[256, 256, 1]),
...     create=True,
...     shape=[5000, 6000, 7000]).result()

>>> # Create two numpy arrays with example data to write.
>>> a = np.arange(100*200*300, dtype=np.uint32).reshape((100, 200, 300))
>>> b = np.arange(200*300*400, dtype=np.uint32).reshape((200, 300, 400))

>>> # Initiate two asynchronous writes, to be performed concurrently.
>>> future_a = dataset[1000:1100, 2000:2200, 3000:3300].write(a)
>>> future_b = dataset[3000:3200, 4000:4300, 5000:5400].write(b)

>>> # Wait for the asynchronous writes to complete
>>> future_a.result()
>>> future_b.result()

Safe and Performant Scaling

Processing and analyzing large numerical datasets requires significant computational resources. This is typically achieved through parallelization across numerous CPU or accelerator cores spread across many machines. Therefore a fundamental goal of TensorStore has been to enable parallel processing of individual datasets that is both safe (i.e., avoids corruption or inconsistencies arising from parallel access patterns) and high performance (i.e., reading and writing to TensorStore is not a bottleneck during computation). In fact, in a test within Google’s datacenters, we found nearly linear scaling of read and write performance as the number of CPUs was increased:

Read and write performance for a TensorStore dataset in zarr format residing on Google Cloud Storage (GCS) accessed concurrently using a variable number of single-core compute tasks in Google data centers. Both read and write performance scales nearly linearly with the number of compute tasks.

Performance is achieved by implementing core operations in C++, extensive use of multithreading for operations such as encoding/decoding and network I/O, and partitioning large datasets into much smaller units through chunking to enable efficiently reading and writing subsets of the entire dataset. TensorStore also provides configurable in-memory caching (which reduces slower storage system interactions for frequently accessed data) and an asynchronous API that enables a read or write operation to continue in the background while a program completes other work.
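
As a simplified sketch of those two mechanisms, the zarr array created above can be reopened with an explicit cache pool, and a read can be started asynchronously so that other work proceeds while the data is fetched (the 100 MB cache limit is an arbitrary example value):

# Reopen the zarr array with an in-memory cache of up to ~100 MB.
>>> context = ts.Context({'cache_pool': {'total_bytes_limit': 100_000_000}})
>>> dataset = ts.open({
...     'driver': 'zarr',
...     'kvstore': 'file:///tmp/my_dataset/',
... }, context=context).result()

# Start an asynchronous read; the call returns a future immediately.
>>> read_future = dataset[0:256, 0:256, 0].read()

# ... other computation can run here while the read is in flight ...
>>> patch = read_future.result()  # block only when the data is needed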

Safety of parallel operations when many machines are accessing the same dataset is achieved through the use of optimistic concurrency, which maintains compatibility with diverse underlying storage layers (including Cloud storage platforms, such as GCS, as well as local filesystems) without significantly impacting performance. TensorStore also provides strong ACID guarantees for all individual operations executing within a single runtime.
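
The following is a minimal sketch of the transactional write path, continuing the same zarr example (the regions and values are arbitrary): writes staged under a single transaction are committed atomically, so concurrent readers never observe a partially applied update.

# Stage two writes under a single transaction and commit them together.
>>> txn = ts.Transaction()
>>> dataset.with_transaction(txn)[100:200, 100:200, 0].write(7).result()
>>> dataset.with_transaction(txn)[200:300, 200:300, 0].write(8).result()
>>> txn.commit_async().result()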

To make distributed computing with TensorStore compatible with many existing data processing workflows, we have also integrated TensorStore with parallel computing libraries such as Apache Beam (example code) and Dask (example code).

Use Case: Language Models

An exciting recent development in ML is the emergence of more advanced language models such as PaLM. These neural networks contain hundreds of billions of parameters and exhibit some surprising capabilities in natural language understanding and generation. These models also push the limits of computational infrastructure; in particular, training a language model such as PaLM requires thousands of TPUs working in parallel.

One challenge that arises during this training process is efficiently reading and writing the model parameters. Training is distributed across many separate machines, but parameters must be regularly saved to a single object (“checkpoint”) on a permanent storage system without slowing down the overall training process. Individual training jobs must also be able to read just the specific set of parameters they are concerned with in order to avoid the overhead that would be required to load the entire set of model parameters (which could be hundreds of gigabytes).

TensorStore has already been used to address these challenges. It has been applied to manage checkpoints associated with large-scale (“multipod”) models trained with JAX (code example) and has been integrated with frameworks such as T5X (code example) and Pathways. Model parallelism is used to partition the full set of parameters, which can occupy more than a terabyte of memory, over hundreds of TPUs. Checkpoints are stored in zarr format using TensorStore, with a chunk structure chosen to allow the partition for each TPU to be read and written independently in parallel.

When saving a checkpoint, each model parameter is written using TensorStore in zarr format using a chunk grid that further subdivides the grid used to partition the parameter over TPUs. The host machines write in parallel the zarr chunks for each of the partitions assigned to TPUs attached to that host. Using TensorStore’s asynchronous API, training proceeds even while the data is still being written to persistent storage. When resuming from a checkpoint, each host reads only the chunks that make up the partitions assigned to that host.
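
The snippet below is a heavily simplified sketch of that pattern; the bucket path, parameter shape, and partition boundaries are made up for illustration, and real checkpoints are written through the training framework’s checkpointing utilities rather than hand-rolled calls like these:

# Hypothetical parameter of shape (16384, 4096), partitioned into 8 row blocks,
# with the zarr chunk grid aligned to those blocks so each host writes independently.
>>> param = ts.open({
...     'driver': 'zarr',
...     'kvstore': 'gs://my-bucket/checkpoint/step_1000/layer0.kernel/',  # made-up path
... },
...     dtype=ts.float32,
...     chunk_layout=ts.ChunkLayout(chunk_shape=[2048, 4096]),
...     create=True,
...     open=True,
...     shape=[16384, 4096]).result()

# local_shard: a NumPy array holding the (2048, 4096) partition owned by this host.
>>> pending = param[2048:4096, :].write(local_shard)

# Training continues; block on the future before starting the next checkpoint.
>>> pending.result()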

Use Case: 3D Brain Mapping

The field of synapse-resolution connectomics aims to map the wiring of animal and human brains at the detailed level of individual synaptic connections. This requires imaging the brain at extremely high resolution (nanometers) over fields of view of up to millimeters or more, which yields datasets that can span petabytes in size. In the future these datasets may extend to exabytes as scientists contemplate mapping entire mouse or primate brains. However, even current datasets pose significant challenges related to storage, manipulation, and processing; in particular, even a single brain sample may require millions of gigabytes with a coordinate system (pixel space) of hundreds of thousands of pixels in each dimension.

We have used TensorStore to solve computational challenges associated with large-scale connectomic datasets. Specifically, TensorStore has managed some of the largest and most widely accessed connectomic datasets, with Google Cloud Storage as the underlying object storage system. For example, it has been applied to the human cortex “h01” dataset, which is a 3d nanometer-resolution image of human brain tissue. The raw imaging data is 1.4 petabytes (roughly 500,000 × 350,000 × 5,000 pixels) and is further associated with additional content such as 3d segmentations and annotations that reside in the same coordinate system. The raw data is subdivided into individual chunks 128×128×16 pixels large and stored in the “Neuroglancer precomputed” format, which is optimized for web-based interactive viewing and can be easily manipulated from TensorStore.

A fly brain reconstruction for which the underlying data can be easily accessed and manipulated using TensorStore.

Getting Started

To get started using the TensorStore Python API, you can install the tensorstore PyPI package using:

pip install tensorstore

Refer to the tutorials and API documentation for usage details. For other installation options and for using the C++ API, refer to installation instructions.

Acknowledgements

Thanks to Tim Blakely, Viren Jain, Yash Katariya, Jan-Matthis Luckmann, Michał Januszewski, Peter Li, Adam Roberts, Brain Williams, and Hector Yee from Google Research, and Davis Bennet, Stuart Berg, Eric Perlman, Stephen Plaza, and Juan Nunez-Iglesias from the broader scientific community for valuable feedback on the design, early testing and debugging.


Detect population variance of endangered species using Amazon Rekognition

Our planet faces a global extinction crisis. A UN report shows a staggering number: more than a million species are feared to be on the path to extinction. The most common reasons for extinction include loss of habitat, poaching, and invasive species. Several wildlife conservation foundations, research scientists, volunteers, and anti-poaching rangers have been working tirelessly to address this crisis. Having accurate and regular information about endangered animals in the wild will improve wildlife conservationists’ ability to study and conserve endangered species. Wildlife scientists and field staff use cameras equipped with infrared triggers, called camera traps, and place them in the most effective locations in forests to capture images of wildlife. These images are then manually reviewed, which is a very time-consuming process.

In this post, we demonstrate a solution using Amazon Rekognition Custom Labels along with motion sensor camera traps to automate this process of recognizing endangered species and studying them. Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to their use case. We detail how to recognize endangered animal species from images collected from camera traps, draw insights about their population count, and detect humans around them. This information will be helpful to conservationists, who can make proactive decisions to save them.

Solution overview

The following diagram illustrates the architecture of the solution.
Solution overview
This solution uses the following AI services, serverless technologies, and managed services to implement a scalable and cost-effective architecture:

  • Amazon Athena – A serverless interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
  • Amazon CloudWatch – A monitoring and observability service that collects monitoring and operational data in the form of logs, metrics, and events
  • Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale
  • AWS Lambda – A serverless compute service that lets you run code in response to triggers such as changes in data, shifts in system state, or user actions
  • Amazon QuickSight – A serverless, machine learning (ML)-powered business intelligence service that provides insights, interactive dashboards, and rich analytics
  • Amazon Rekognition – Uses ML to identify objects, people, text, scenes, and activities in images and videos, as well as detect any inappropriate content
  • Amazon Rekognition Custom Labels – Uses AutoML to help train custom models to identify the objects and scenes in images that are specific to your business needs
  • Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications
  • Amazon Simple Storage Service (Amazon S3) – Serves as an object store for documents and allows for central management with fine-tuned access controls.

The high-level steps in this solution are as follows:

  1. Train and build a custom model using Rekognition Custom Labels to recognize endangered species in the area. For this post, we train on images of rhinoceros.
  2. Images that are captured through the motion sensor camera traps are uploaded to an S3 bucket, which publishes an event for every uploaded image.
  3. A Lambda function is triggered for every event published, which retrieves the image from the S3 bucket and passes it to the custom model to detect the endangered animal.
  4. The Lambda function uses the Amazon Rekognition API to identify the animals in the image (a minimal Python sketch of this call appears after this list).
  5. If the image contains any of the endangered rhinoceros species, the function updates the DynamoDB database with the count of the animal, the date the image was captured, and other useful metadata that can be extracted from the image EXIF header.
  6. QuickSight is used to visualize the animal count and location data collected in the DynamoDB database to understand the variance of the animal population over time. By looking at the dashboards regularly, conservation groups can identify patterns and isolate probable causes like diseases, climate, or poaching that could be causing this variance and proactively take steps to address the issue.
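
The implementation later in this post runs on the Node.js Lambda runtime. Purely for orientation, the following is a minimal Python (boto3) sketch of the Custom Labels call from step 4; the model version ARN, bucket, and object key are placeholders:

import boto3

# Placeholders: substitute your Region, trained model version ARN, and uploaded image location.
rekognition = boto3.client("rekognition", region_name="us-east-1")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:111122223333:project/endangered-species/version/1/1",
    Image={"S3Object": {"Bucket": "camera-trap-uploads", "Name": "images/rhino-001.jpg"}},
    MinConfidence=80,
)

# Each detected custom label carries a name (for example, "Rhino") and a confidence score.
for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 1))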

Prerequisites

A good training set is required to build an effective model using Rekognition Custom Labels. We have used the images from AWS Marketplace (Animals & Wildlife Data Set from Shutterstock) and Kaggle to build the model.

Implement the solution

Our workflow includes the following steps:

  1. Train a custom model to classify the endangered species (rhino in our example) using the AutoML capability of Rekognition Custom Labels.

You can also perform these steps from the Rekognition Custom Labels console. For instructions, refer to Creating a project, Creating training and test datasets, and Training an Amazon Rekognition Custom Labels model.

In this example, we use the dataset from Kaggle. The following table summarizes the dataset contents.

Label              Training Set   Test Set
Lion               625            156
Rhino              608            152
African_Elephant   368            92
  2. Upload the pictures captured from the camera traps to a designated S3 bucket.
  3. Define the event notifications in the Permissions section of the S3 bucket to send a notification to a defined SQS queue when an object is added to the bucket.


The upload action triggers an event that is queued in Amazon SQS using the Amazon S3 event notification.

  4. Add the appropriate permissions via the access policy of the SQS queue to allow the S3 bucket to send the notification to the queue.


  5. Configure a Lambda trigger for the SQS queue so the Lambda function is invoked when a new message is received.


  6. Modify the access policy to allow the Lambda function to access the SQS queue.


The Lambda function should now have the right permissions to access the SQS queue.


  7. Set up the environment variables so they can be accessed in the code.


Lambda function code

The Lambda function performs the following tasks on receiving a notification from the SQS queue:

  1. Make an API call to Amazon Rekognition to detect labels from the custom model that identify the endangered species:
const AWS = require('aws-sdk');
const sharp = require('sharp');
const exifReader = require('exif-reader');

// Configuration comes from the Lambda environment variables set up earlier.
const { REGION, REK_CUSTOMMODEL, HUMAN, ANIMAL_TABLENAME } = process.env;
const MIN_CONFIDENCE = Number(process.env.MIN_CONFIDENCE);

const dynamo = new AWS.DynamoDB({ region: REGION });

exports.handler = async (event) => {
    const id = AWS.util.uuid.v4();
    const bucket = event.Records[0].s3.bucket.name;
    // S3 object keys arrive URL-encoded, with '+' standing in for spaces.
    const photo = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const client = new AWS.Rekognition({ region: REGION });
    const paramsCustomLabel = {
        Image: {
            S3Object: {
                Bucket: bucket,
                Name: photo
            },
        },
        ProjectVersionArn: REK_CUSTOMMODEL,
        MinConfidence: MIN_CONFIDENCE
    };
    let response = await client.detectCustomLabels(paramsCustomLabel).promise();
    console.log("Rekognition customLabels response = ", response);
  2. Fetch the EXIF tags from the image to get the date when the picture was taken and other relevant EXIF data. The following code uses the dependencies (package – version) exif-reader – ^1.0.3, sharp – ^0.30.7:
const getExifMetaData = async (bucket, key) => {
    return new Promise((resolve) => {
        const s3 = new AWS.S3({ region: REGION });
        const param = {
            Bucket: bucket,
            Key: key
        };

        s3.getObject(param, (error, data) => {
            if (error) {
                console.log("Error getting S3 file", error);
                resolve({ status: false, errorText: error.message });
            } else {
                // Extract the EXIF block from the image bytes and decode it.
                sharp(data.Body)
                    .metadata()
                    .then(({ exif }) => {
                        const exifProperties = exifReader(exif);
                        resolve({ status: true, exifProp: exifProperties });
                    }).catch(err => { console.log("Error Processing Exif "); resolve({ status: false }); });
            }
        });
    });
};

var gpsData = "";
var createDate = "";
const imageS3 = await getExifMetaData(bucket, photo);
if (imageS3.status) {
    gpsData = imageS3.exifProp.gps;
    createDate = imageS3.exifProp.image.CreateDate;
} else {
    // Fall back to the S3 event time if the image carries no EXIF data.
    createDate = event.Records[0].eventTime;
    console.log("No exif found in image, setting createDate as the date of event", createDate);
}

The solution outlined here is asynchronous; the images are captured by the camera traps and then at a later time uploaded to an S3 bucket for processing. If the camera trap images are uploaded more frequently, you can extend the solution to detect humans in the monitored area and send notifications to concerned activists to indicate possible poaching in the vicinity of these endangered animals. This is implemented through the Lambda function that calls the Amazon Rekognition API to detect labels for the presence of a human. If a human is detected, an error message is logged to CloudWatch Logs. A filtered metric on the error log triggers a CloudWatch alarm that sends an email to the conservation activists, who can then take further action.

  3. Expand the solution with the following code:
const paramHumanLabel = {
    Image: {
        S3Object: {
            Bucket: bucket,
            Name: photo
        },
    },
    MinConfidence: MIN_CONFIDENCE
};

let humanLabel = await client.detectLabels(paramHumanLabel).promise();
let humanFound = humanLabel.Labels.filter(obj => obj.Name === HUMAN);
var humanDetected = false;
if (humanFound.length > 0) {
    console.error("Human Face Detected");
    humanDetected = true;
}
  4. If any endangered species is detected, the Lambda function updates DynamoDB with the count, date, and other optional metadata obtained from the image EXIF tags:
let dbresponse = await dynamo.putItem({
    Item: {
        id: { S: id },
        type: { S: response.CustomLabels[0].Name },
        image: { S: photo },
        createDate: { S: createDate.toString() },
        confidence: { S: response.CustomLabels[0].Confidence.toString() },
        gps: { S: gpsData.toString() },
        humanDetected: { BOOL: humanDetected }
    },
    TableName: ANIMAL_TABLENAME,
}).promise();

Query and visualize the data

You can now use Athena and QuickSight to visualize the data.

  1. Set the DynamoDB table as the data source for Athena.
  2. Add the data source details.

The next important step is to define a Lambda function that connects to the data source.

  3. Choose Create Lambda function.


  4. Enter names for AthenaCatalogName and SpillBucket; the rest can be default settings.
  5. Deploy the connector function.


After all the images are processed, you can use QuickSight to visualize the data for the population variance over time from Athena.

  6. On the Athena console, choose a data source and enter the details.
  7. Choose Create Lambda function to provide a connector to DynamoDB.


  8. On the QuickSight dashboard, choose New Analysis and New Dataset.
  9. Choose Athena as the data source.


  10. Enter the catalog, database, and table to connect to and choose Select.


  11. Complete dataset creation.


The following chart shows the number of endangered species captured on a given day.

QuickSight chart

GPS data is presented as part of the EXIF tags of a captured image. Due to the sensitivity of the location of these endangered animals, our dataset didn’t have the GPS location. However, we created a geospatial chart using simulated data to show how you can visualize locations when GPS data is available.

Geospatial chart

Clean up

To avoid incurring unexpected costs, be sure to turn off the AWS services you used as part of this demonstration—the S3 buckets, DynamoDB table, QuickSight, Athena, and the trained Rekognition Custom Labels model. You should delete these resources directly via their respective service consoles if you no longer need them. Refer to Deleting an Amazon Rekognition Custom Labels model for more information about deleting the model.

Conclusion

In this post, we presented an automated system that identifies endangered species, records their population count, and provides insights about variance in population over time. You can also extend the solution to alert the authorities when humans (possible poachers) are in the vicinity of these endangered species. With the AI/ML capabilities of Amazon Rekognition, we can support the efforts of conservation groups to protect endangered species and their ecosystems.

For more information about Rekognition Custom Labels, refer to Getting started with Amazon Rekognition Custom Labels and Moderating content. If you’re new to Rekognition Custom Labels, you can use our Free Tier, which lasts 3 months and includes 10 free training hours per month and 4 free inference hours per month. The Amazon Rekognition Free Tier includes processing 5,000 images per month for 12 months.


About the Authors

Jyothi Goudar is a Partner Solutions Architect Manager at AWS. She works closely with global system integrator partners to enable and support customers moving their workloads to AWS.

Jay Rao is a Principal Solutions Architect at AWS. He enjoys providing technical and strategic guidance to customers and helping them design and implement solutions on AWS.


How Amazon Search reduced ML inference costs by 85% with AWS Inferentia

Amazon’s product search engine indexes billions of products, serves hundreds of millions of customers worldwide, and is one of the most heavily used services in the world. The Amazon Search team develops machine learning (ML) technology that powers the Amazon.com search engine and helps customers search effortlessly. To deliver a great customer experience and operate at the massive scale required by the Amazon.com search engine, this team is always looking for ways to build more cost-effective systems with real-time latency and throughput requirements. The team constantly explores hardware and compilers optimized for deep learning to accelerate model training and inference, while reducing operational costs across the board.

In this post, we describe how Amazon Search uses AWS Inferentia, a high-performance accelerator purpose built by AWS to accelerate deep learning inference workloads. The team runs low-latency ML inference with Transformer-based NLP models on AWS Inferentia-based Amazon Elastic Compute Cloud (Amazon EC2) Inf1 instances, and saves up to 85% in infrastructure costs while maintaining strong throughput and latency performance.

Deep learning for duplicate and query intent prediction

Searching the Amazon Marketplace is a multi-task, multi-modal problem, dealing with several inputs such as ASINs (Amazon Standard Identification Numbers, 10-character alphanumeric identifiers that uniquely identify products), product images, textual descriptions, and queries. To create a tailored user experience, predictions from many models are used for different aspects of search. This is a challenge because the search system has thousands of models with tens of thousands of transactions per second (TPS) at peak load. We focus on two components of that experience:

  • Customer-perceived duplicate predictions – To show the most relevant list of products that match a user’s query, it’s important to identify products that customers have a hard time differentiating between
  • Query intent prediction – To adapt the search page and product layout to better suit what the customer is looking for, it’s important to predict the intent and type of the user’s query (for example, a media-related query, help query, and other query types)

Both of these predictions are made using Transformer model architectures, namely BERT-based models. In fact, both share the same BERT-based model as a basis, and each one stacks a classification/regression head on top of this backbone.

Duplicate prediction takes in various textual features for a pair of evaluated products as inputs (such as product type, title, description, and so on) and is computed periodically for large datasets. This model is trained end to end in a multi-task fashion. Amazon SageMaker Processing jobs are used to run these batch workloads periodically to automate their launch and only pay for the processing time that is used. For this batch workload use case, the requirement for inference throughput was 8,800 total TPS.
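
As a rough sketch of how such a recurring batch job can be launched with the SageMaker Python SDK (the container image, role, instance settings, script name, and S3 paths below are placeholders rather than the team’s actual configuration):

from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

# Placeholders: container image, IAM role, instance settings, script, and S3 paths.
processor = ScriptProcessor(
    image_uri="<ecr-image-with-model-and-dependencies>",
    command=["python3"],
    role="<sagemaker-execution-role-arn>",
    instance_type="ml.c5.4xlarge",
    instance_count=4,
)

# Each run reads product pairs from S3, scores them, and writes predictions back to S3.
processor.run(
    code="batch_duplicate_prediction.py",
    inputs=[ProcessingInput(source="s3://<bucket>/product-pairs/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://<bucket>/duplicate-predictions/")],
)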

Intent prediction takes the user’s textual query as input and is needed in real time to dynamically serve everyday traffic and enhance the user experience on the Amazon Marketplace. The model is trained on a multi-class classification objective. This model is then deployed on Amazon Elastic Container Service (Amazon ECS), which enables quick auto scaling and easy deployment definition and management. Because this is a real-time use case, it required the P99 latency to be under 10 milliseconds to ensure a delightful user experience.

AWS Inferentia and the AWS Neuron SDK

EC2 Inf1 instances are powered by AWS Inferentia, the first ML accelerator purpose built by AWS to accelerate deep learning inference workloads. Inf1 instances deliver up to 2.3 times higher throughput and up to 70% lower cost per inference than comparable GPU-based EC2 instances. You can keep training your models using your framework of choice (PyTorch, TensorFlow, MXNet), and then easily deploy them on AWS Inferentia to benefit from the built-in performance optimizations. You can deploy a wide range of model types on Inf1 instances, from image recognition and object detection to natural language processing (NLP) and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the EC2 Inf1 instances. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch. Therefore, you can deploy deep learning models on AWS Inferentia with the same familiar APIs provided by your framework of choice, and benefit from the boost in performance and lowest cost-per-inference in the cloud.

Since its launch, the Neuron SDK has continued to increase the breadth of models it supports while continuing to improve performance and reduce inference costs. This includes NLP models (BERTs), image classification models (ResNet, VGG), and object detection models (OpenPose and SSD).

Deploy on Inf1 instances for low latency, high throughput, and cost savings

The Amazon Search team wanted to save costs while meeting their high throughput requirement on duplication prediction, and the low latency requirement on query intent prediction. They chose to deploy on AWS Inferentia-based Inf1 instances and not only met the high performance requirements, but also saved up to 85% on inference costs.

Customer-perceived duplicate predictions

Prior to the usage of Inf1, a dedicated Amazon EMR cluster was running using CPU-based instances. Without relying on hardware acceleration, a large number of instances were necessary to meet the high throughput requirement of 8,800 total transactions per second. The team switched to inf1.6xlarge instances, each with 4 AWS Inferentia accelerators and 16 NeuronCores (4 cores per AWS Inferentia chip). They traced the Transformer-based model for a single NeuronCore and loaded one model per NeuronCore to maximize throughput. By taking advantage of the 16 available NeuronCores, they decreased inference costs by 85% (based on the current public Amazon EC2 on-demand pricing).

Query intent prediction

Given the P99 latency requirement of 10 milliseconds or less, the team loaded the model to every available NeuronCore on inf1.6xlarge instances. You can easily do this with PyTorch Neuron using the torch.neuron.DataParallel API. With the Inf1 deployment, the model latency was 3 milliseconds, end-to-end latency was approximately 10 milliseconds, and maximum throughput at peak load reached 16,000 TPS.
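
The following is a minimal sketch of that pattern with PyTorch Neuron; the model file name and input shapes are placeholders, and the real query intent model expects its own tokenized inputs:

import torch
import torch.neuron

# Load a model previously compiled with torch.neuron.trace (placeholder file name).
model_neuron = torch.jit.load("intent_model_neuron.pt")

# Replicate the compiled model across all NeuronCores visible to this process;
# inputs are split along the batch dimension and run in parallel.
model_parallel = torch.neuron.DataParallel(model_neuron)

# Placeholder batch of 16 tokenized queries, each padded to 128 tokens.
input_ids = torch.zeros((16, 128), dtype=torch.int32)
attention_mask = torch.ones((16, 128), dtype=torch.int32)

with torch.no_grad():
    scores = model_parallel(input_ids, attention_mask)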

Get started with sample compilation and deployment code

The following is some sample code to help you get started on Inf1 instances and realize the performance and cost benefits like the Amazon Search team. We show how to compile and perform inference with a PyTorch model, using PyTorch Neuron.

First, the model is compiled with torch.neuron.trace():

m = torch.jit.load(f="./cpu_model.pt", map_location=torch.device('cpu'))
m.eval()
model_neuron = torch.neuron.trace(
    m,
    inputs,
    compiler_workdir="work_" + str(cores) + "_" + str(batch_size),
    compiler_args=[
        '--fp32-cast=all', '--neuroncore-pipeline-cores=' + str(cores)
    ])
model_neuron.save("m5_batch" + str(batch_size) + "_cores" + str(cores) +
                  "_with_extra_op_and_fp32cast.pt")

For the full list of possible arguments to the trace method, refer to PyTorch-Neuron trace Python API. As you can see, compiler arguments can be passed to the torch.neuron API directly. All FP32 operators are cast to BF16 with --fp32-cast=all, providing the highest performance while preserving dynamic range. More casting options are available to let you control the performance to model precision trade-off. The models used for both use cases were compiled for a single NeuronCore (no pipelining).

We then load the model on Inferentia with torch.jit.load, and use it for prediction. The Neuron runtime automatically loads the model to NeuronCores.

cm_cpd_preprocessing_jit = torch.jit.load(f=CM_CPD_PROC,
                                          map_location=torch.device('cpu'))
cm_cpd_preprocessing_jit.eval()
m5_model = torch.jit.load(f=CM_CPD_M5)
m5_model.eval()

input = get_input()
with torch.no_grad():
    batch_cm_cpd = cm_cpd_preprocessing_jit(input)
    input_ids, attention_mask, position_ids, valid_length, token_type_ids = (
        batch_cm_cpd['input_ids'].type(torch.IntTensor),
        batch_cm_cpd['attention_mask'].type(torch.HalfTensor),
        batch_cm_cpd['position_ids'].type(torch.IntTensor),
        batch_cm_cpd['valid_length'].type(torch.IntTensor),
        batch_cm_cpd['token_type_ids'].type(torch.IntTensor))
    model_res = m5_model(input_ids, attention_mask, position_ids, valid_length,
                         token_type_ids)

Conclusion

The Amazon Search team was able to reduce their inference costs by 85% using AWS Inferentia-based Inf1 instances, under heavy traffic and demanding performance requirements. AWS Inferentia and the Neuron SDK provided the team the flexibility to optimize the deployment process separately from training, and put forth a shallow learning curve via well-rounded tools and familiar framework APIs.

You can unlock performance and cost benefits by getting started with the sample code provided in this post. Also, check out the end-to-end tutorials to run ML models on Inferentia with PyTorch and TensorFlow.


About the authors

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use cases and helping customers optimize deep learning model training and deployment. He is also an active proponent of ML-specialized hardware and low-code ML solutions.

Weiqi Zhang is a Software Engineering Manager at Search M5, where he works on productizing large-scale models for Amazon machine learning applications. His interests include information retrieval and machine learning infrastructure.

Jason Carlson is a Software Engineer for developing machine learning pipelines to help reduce the number of stolen search impressions due to customer-perceived duplicates. He mostly works with Apache Spark, AWS, and PyTorch to help deploy and feed/process data for ML models. In his free time, he likes to read and go on runs.

Shaohui Xi is an SDE at the Search Query Understanding Infra team. He leads the effort for building large-scale deep learning online inference services with low latency and high availability. Outside of work, he enjoys skiing and exploring good foods.

Zhuoqi Zhang is a Software Development Engineer at the Search Query Understanding Infra team. He works on building model serving frameworks to improve latency and throughput for deep learning online inference services. Outside of work, he likes playing basketball, snowboarding, and driving.

Haowei Sun is a software engineer in the Search Query Understanding Infra team. She works on designing APIs and infrastructure supporting deep learning online inference services. Her interests include service API design, infrastructure setup, and maintenance. Outside of work, she enjoys running, hiking, and traveling.

Jaspreet Singh is an Applied Scientist on the M5 team, where he works on large-scale foundation models to improve the customer shopping experience. His research interests include multi-task learning, information retrieval, and representation learning.

Shruti Koparkar is a Senior Product Marketing Manager at AWS. She helps customers explore, evaluate, and adopt EC2 accelerated computing infrastructure for their machine learning needs.


Go Hands On: Logitech G CLOUD Launches With Support for GeForce NOW

When it rains, it pours. And this GFN Thursday brings a downpour of news for GeForce NOW members.

The Logitech G CLOUD is the latest gaming handheld device to support GeForce NOW, giving members a brand new way to keep the gaming going.

But that’s not all: Portal with RTX joins GeForce NOW in November, free for Portal owners. Find out more about this ray-traced reimagining of Valve’s classic game, and more titles like A Plague Tale: Requiem and Warhammer 40,000: Darktide, streaming this fall.

Plus, conquer a breathtaking fantasy world and engage in colossal real-time battles as Creative Assembly’s Total Warhammer series comes to GeForce NOW — included in the eight new titles joining the GeForce NOW library this week.

Finally, an update to the GeForce NOW app on PC and Mac begins rolling out this week with optimized streaming resolution support for 16:10 4K displays at up to 60 frames per second, perfect for RTX 3080 members streaming on MacBooks.

A New Way to Play

The just-announced G CLOUD is the latest way to stream your PC library from the cloud on GeForce NOW.

GeForce NOW Streams to Logitech G CLOUD
G CLOUD is the latest way to stream your PC games wherever there’s WiFi.

Developed in partnership with Tencent Games, the G CLOUD is an Android device with a seven-inch 1080p 16:9 touchscreen, fully customizable controls and support for GeForce NOW right out of the box.

Members can instantly stream GeForce NOW’s library of 1,000+ games that support gamepad, using touch controls or G CLOUD’s built-in, customizable precision gaming controls. Its lightweight design makes it a joy to hold during the most frantic action. And thanks to its 12+ hour battery life, the action can last all day.

The G CLOUD is available to preorder today at $299.99 for a limited time, with full availability in October for $349.99. Check out the device.

The Hottest Games, Streaming Soon

Get ready to play three new release titles coming to the cloud in the near future.

Portal with RTX releases in November as free downloadable content for all Portal owners, and will be streaming on GeForce NOW. It’s a ray-traced reimagining of Valve’s classic game, built using a revolutionary modding tool called NVIDIA RTX Remix.

In Portal with RTX, full ray tracing transforms each scene, enabling light to bounce and be affected by each area’s new high-resolution, physically based textures and enhanced high-poly models. Every light is ray traced and casts shadows, global illumination indirect lighting naturally illuminates and darkens rooms, volumetric ray-traced lighting scatters through fog and smoke, and shadows are pixel perfect.

Wishlist the Portal with RTX DLC on Steam now to be notified the second it’s released.

A tale continues when A Plague Tale: Requiem launches Tuesday, Oct. 18, enhanced with ray-traced effects.

A Plague Tale Requiem on GeForce NOW

After escaping their devastated homeland in the critically acclaimed A Plague Tale: Innocence, siblings Amicia and Hugo venture south of 14th-century France to new regions and vibrant cities. But when Hugo’s powers reawaken, death and destruction return in a flood of devouring rats. Forced to flee once more, the siblings place their hopes in a prophesied island that may hold the key to saving Hugo.

On Tuesday, Nov. 30, Fatshark leaps thousands of years into the future to bring gamers Warhammer 40,000: Darktide with NVIDIA DLSS and advanced ray-tracing effects.

Warhammer 40K Darktide on GeForce NOW

Head to the industrial city of Tertium to combat the forces of Chaos, using Vermintide 2’s lauded melee system and a range of deadly Warhammer 40,000 weapons. Personalize your play style with a character-creation system and delve deep into the city to put a stop to the horrors that lurk.

GeForce NOW members can stream all of these great games when they’re released. RTX 3080 members can level up their gaming experience even further with 4K resolution and 60 frames per second on the PC and Mac apps, ultra-low latency, dedicated RTX 3080 servers and eight-hour sessions.

Charge Into the ‘Total Warhammer’ Series This Week

Make your move in the incoming additions from the Total War series by SEGA and Creative Assembly – Total War: WARHAMMER, Total War: WARHAMMER II and Total War: WARHAMMER III are streaming this week.

Total War Warhammer Series on GeForce NOW
It’s all about making big-brain plays in these “Total War” titles.

Explore and expand across fantasy lands in this Total War series. Combine turn-based civilization management and real-time epic strategy battles in this fantastic franchise, streaming from underpowered PCs, Macs and more. Command otherworldly troops, send forth ferocious monsters and harness powerful magic to pave your way to victory.

In addition, members can look for several more new titles streaming from the cloud this week.

With all of these new games streaming across GeForce NOW compatible devices, you can have your cake and eat it, too. Speaking of cake, we have a question for you. Let us know your answer on Twitter or in the comments below.



Continental and AEye Join NVIDIA DRIVE Sim Sensor Ecosystem, Providing Rich Capabilities for AV Development

Autonomous vehicle sensors require the same rigorous testing and validation as the car itself, and one simulation platform is up to the task.

Global tier-1 supplier Continental and software-defined lidar maker AEye announced this week at NVIDIA GTC that they will migrate their intelligent lidar sensor model into NVIDIA DRIVE Sim. The companies are the latest to join the extensive ecosystem of sensor makers using NVIDIA’s end-to-end, cloud-based simulation platform for technology development.

Continental offers a full suite of cameras, radars and ultrasonic sensors, as well as its recently launched short-range flash lidar, some of which are incorporated into the NVIDIA Hyperion autonomous-vehicle development platform.

Last year, Continental and AEye announced a collaboration in which the tier-1 supplier would use the lidar maker’s software-defined architecture to produce a long-range sensor. Now, the companies are contributing this sensor model to DRIVE Sim, helping to bring their vision to the industry.

DRIVE Sim is built on the NVIDIA Omniverse platform for connecting and building custom 3D pipelines, providing physically based digital twin environments to develop and validate autonomous vehicles. DRIVE Sim is open and modular — users can create their own extensions or choose from a rich library of sensor plugins from ecosystem partners.

In addition to providing sensor models, partners use the platform to validate their own sensor architectures.

By joining this rich community of DRIVE Sim users, Continental and AEye can now rapidly simulate edge cases in varying environments to test and validate lidar performance.

A Lidar for All Seasons

AEye and Continental are creating HRL 131, a high-performance, long-range lidar for both passenger cars and commercial vehicles that’s software configurable and can adapt to various driving environments.

The lidar incorporates dynamic performance modes in which the laser scan pattern adapts to any automated driving application, from highway driving to dense urban environments, in all weather conditions, including direct sun, night, rain, snow, fog, dust and smoke. It features a range of more than 300 meters for detecting vehicles and 200 meters for detecting pedestrians, and is slated for mass production in 2024.

The simulated Continental HRL131 long-range lidar sensor, built on AEye’s 4Sight intelligent sensing platform, running in NVIDIA DRIVE Sim.

With DRIVE Sim, developers can recreate obstacles with their exact physical properties and place them in complex highway environments. They can determine which lidar performance modes are suitable for the chosen application based on uncertainties experienced in a particular scenario.

Once identified and tuned, performance modes can be activated on the fly using external cues such as speed, location or even vehicle pitch, which can change with loading conditions, tire-pressure variations and suspension modes.

The ability to simulate performance characteristics of a software-defined lidar model adds even greater flexibility to DRIVE Sim, further accelerating robust autonomous vehicle development.

“With the scalability and accuracy of NVIDIA DRIVE Sim, we’re able to validate our long-range lidar technology efficiently,” said Gunnar Juergens, head of product line, lidar, at Continental. “It’s a robust tool for the industry to train, test and validate safe self-driving solutions.”

