Announcing the launch of the model copy feature for Amazon Rekognition Custom Labels

Amazon Rekognition Custom Labels is a fully managed computer vision service that lets you build custom models to classify and identify objects in images that are specific and unique to your business. Rekognition Custom Labels doesn’t require any prior computer vision expertise. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.

Developing a custom model to analyze images is a significant undertaking that requires time, expertise, and resources, often taking months to complete. It also typically requires thousands or tens of thousands of hand-labeled images to give the model enough data to make accurate decisions. Gathering this data can take months and requires large teams of labelers to prepare it for use in machine learning (ML).

Rekognition Custom Labels builds on the existing capabilities of Amazon Rekognition, which are already trained on tens of millions of images across many categories. Instead of thousands of images, you only need to upload a small set of training images (typically a few hundred or fewer) that are specific to your use case using the Amazon Rekognition console. If the images are already labeled, you can begin training a model in just a few clicks. If not, you can label them directly on the Rekognition Custom Labels console or use Amazon SageMaker Ground Truth to label them. Rekognition Custom Labels uses transfer learning to automatically inspect the training data, select the right model framework and algorithm, optimize the hyperparameters, and train the model. When you’re satisfied with the model accuracy, you can start hosting the trained model with just one click.

Today, we’re happy to announce the launch of the Rekognition Custom Labels model copy feature. This feature allows you to copy a trained Rekognition Custom Labels model across projects, either within the same AWS account or across AWS accounts in the same AWS Region, without retraining the model from scratch. This makes it easier to move models through environments such as development, quality assurance, integration, and production without copying the original training and test datasets or retraining the model. You can use the AWS Command Line Interface (AWS CLI) to copy trained models across projects.

In this post, we show you how to copy models between different AWS accounts in the same AWS Region.

Benefits of the model copy feature

This new feature has the following benefits:

  • Multi-account MLOps best practices – You can train a model one time and ensure predictable deployment with consistent results across multiple accounts mapped to environments such as development, quality assurance, integration, and production, allowing you to follow MLOps best practices within your organization.
  • Cost savings and faster deployment – You can quickly copy a trained model between accounts, avoiding the time taken to retrain in every account and saving on model retraining costs.
  • Protect sensitive datasets – You no longer need to share datasets between different AWS accounts or users. The training data needs to be available only in the AWS account where model training is done. This is especially important for industries where data isolation is essential to meet business or regulatory requirements.
  • Easy collaboration – Partners or vendors can now easily train Rekognition Custom Labels models in their own AWS accounts and share the models with users in other AWS accounts.
  • Consistent performance – Model performance is now consistent across different AWS accounts. Model training is generally non-deterministic, and two models trained on the same dataset aren’t guaranteed to produce the same performance scores or the same predictions. Copying the model helps ensure that the behavior of the copied model is consistent with the source model, eliminating the need to retest it.

Solution overview

The following diagram illustrates our solution architecture.

This post assumes you have trained a Rekognition Custom Labels model in your source account. For instructions, refer to Training a custom single class object detection model with Amazon Rekognition Custom Labels. In this post, we used the image classification “Rooms” project from the Rekognition Custom Labels sample projects list and trained a room classification model in the source account to classify images of kitchens, bathrooms, living rooms, and more.

To demonstrate the functionality of the model copy feature, we go through the following steps in the source account:

  1. Start the model and run inferences on sample images.
  2. Define a resource-based policy to allow cross-account access to copy the Rekognition Custom Labels model.

Then we complete the following steps to copy the source model to the target account:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket, which serves as a container for the model evaluation and performance statistics.
  2. Create a project.
  3. Copy the trained model from the source account to the target account.
  4. Start the model and run inference on the sample images.
  5. Verify the inference results match the results of the source account model.

Prerequisites

In addition to having a trained model in your source account, make sure you complete the following prerequisite steps:

  1. Install the AWS CLI V2.
  2. Configure your AWS CLI with the following code and enter your Region:
    aws configure

  3. Run the following command to confirm that you have AWS CLI version 2.x installed on your local host:
    aws --version

  4. Update the AWS credentials file under $HOME/.aws/credentials with the following entry:
    [source-account]
    aws_access_key_id = ####
    aws_secret_access_key = #######
    
    [target-account]
    aws_access_key_id = ####
    aws_secret_access_key = #######

  5. Get the ProjectArn and ProjectVersionArn from the source AWS account. ProjectArn is the project associated with your source model, and ProjectVersionArn is the version of the model you’re interested in copying to the target account. You can find the SourceProjectArn using the following command:
    aws rekognition describe-projects \
    --region us-east-1 \
    --profile source-account
    
    {
        "ProjectDescriptions": [{
            "ProjectArn": "arn:aws:rekognition:us-east-1::111111111111:project/rooms_1/1657588855531",
            .
            .
        }]
    }

    If you see multiple lines of output, pick the ProjectArn associated with the model you’re going to copy.

    You can find the SourceProjectVersionArn for the model you trained using the SourceProjectArn (the preceding output). Replace the SourceProjectArn in the following command:

    aws rekognition describe-project-versions \
    --project-arn SourceProjectArn \
    --region us-east-1 \
    --profile source-account

    The command returns the SourceProjectVersionArn. If you see multiple lines of output, pick the ProjectVersionArn of interest.

    {
        "ProjectVersionDescriptions": [
            {
                "ProjectVersionArn": "arn:aws:rekognition:us-east-1:111111111111:project/rooms_1/version/rooms_1.2022-07-12T09.39.36/1657643976475",
                .
                .
            }
        ]
    }

You’re now ready to run the steps to implement the solution. Replace the values of SourceProjectArn and SourceProjectVersionArn in the following commands with the values you generated.

1. Start the model and run inference on sample images

In the source account, enter the following code to start the model:

aws rekognition start-project-version \
--project-version-arn SourceProjectVersionArn \
--min-inference-units 1 \
--region us-east-1 \
--profile source-account
{
    "Status": "STARTING"
}

After the model is hosted and in the running state, you can run inference.
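
Before you run inference, you can confirm that the model has reached the RUNNING state. One way to check, reusing the SourceProjectArn placeholder from earlier, is to query just the Status field (if your AWS CLI version includes the Rekognition waiters, aws rekognition wait project-version-running offers an alternative that blocks until the model is running):

aws rekognition describe-project-versions \
--project-arn SourceProjectArn \
--region us-east-1 \
--profile source-account \
--query 'ProjectVersionDescriptions[0].Status'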

We used the following images (demo1.jpeg and demo2.jpeg) to run inference. These images are located on our local file system, in the same directory from which the AWS CLI commands are run.

The following image is demo1.jpeg, which shows a backyard.

See the following inference code and output:

aws rekognition detect-custom-labels \
--project-version-arn SourceProjectVersionArn \
--image-bytes fileb://demo1.jpeg \
--region us-east-1 \
--profile source-account
{
    "Name": "backyard",
    "Confidence": 45.77000045776367
 }

The following image is demo2.jpeg, which shows a bedroom.

See the following inference code and output:

aws rekognition detect-custom-labels \
--project-version-arn SourceProjectVersionArn \
--image-bytes fileb://demo2.jpeg \
--region us-east-1 \
--profile source-account
{
    "Name": "bedroom",
    "Confidence": 61.84600067138672
 }

The inference results show that the images belong to the classes backyard and bedroom, with confidence scores of 45.77 and 61.84, respectively.

2. Define the IAM resource policy for the trained model to allow cross-account access

To create your resource-based IAM policy, complete the following steps in the source account:

  1. Allow the target AWS account to access the model by using the following IAM resource policy (for more information, refer to Creating a project policy document). Replace the values for TargetAWSAccountId and SourceProjectVersionArn in the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {
                    "AWS": [ "TargetAWSAccountId" ]
                },
                "Action": "Rekognition:CopyProjectVersion",
                "Resource": "SourceProjectVersionArn",
                "Effect": "Allow"
            }
        ]
    }

  2. Attach the policy to the project in the source account by calling the following command.
    aws rekognition put-project-policy \
    --project-arn SourceProjectArn \
    --policy-name PolicyName \
    --policy-document '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Principal": {
                    "AWS": [ "TargetAWSAccountId" ]
                },
                "Action": "Rekognition:CopyProjectVersion",
                "Resource": "SourceProjectVersionArn",
                "Effect": "Allow"
            }
        ]
    }' \
    --region us-east-1 \
    --profile source-account

    Replace SourceProjectArn, PolicyName, TargetAWSAccountId, and SourceProjectVersionArn.

    The output shows the policy revision ID created:

    {
        "PolicyRevisionId": "f95907f9c1472c114f61b0e1f31ed131"
    }

Now we’re ready to copy the trained model from the source account to the target account.

3. Create an S3 bucket in the target account

You can use an existing S3 bucket in your account or create a new S3 bucket. For this post, we call this S3 bucket DestinationS3Bucket.
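
If you need to create a new bucket, the following is one way to do it from the target account profile (DestinationS3Bucket is a placeholder; bucket names must be globally unique, and Regions other than us-east-1 additionally require a LocationConstraint):

aws s3api create-bucket \
--bucket DestinationS3Bucket \
--region us-east-1 \
--profile target-account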

4. Create a new Rekognition Custom Labels project

Create a new project with the following code:

aws rekognition create-project \
--project-name target_rooms_1 \
--region us-east-1 \
--profile target-account

This creates a TargetProjectArn in the target account:

{
    "ProjectArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/1657599660206"
}

Note the value of the destination project ProjectArn field. We use this value in the following copy model command.

5. Copy the model from the source account to the target account

Supply the source and target ProjectArn, source ProjectVersionArn, and target S3 bucket and S3 key prefix in the following code:

aws rekognition copy-project-version \
--source-project-arn SourceProjectArn \
--source-project-version-arn SourceProjectVersionArn \
--destination-project-arn TargetProjectArn \
--version-name TargetVersionName \
--output-config '{"S3Bucket":"DestinationS3Bucket", "S3KeyPrefix":"DestinationS3BucketPrefix"}' \
--region us-east-1 \
--profile target-account

This creates a copied model with a TargetProjectVersionArn in the target account. In our case, we named the TargetVersionName copy_rooms_1:

{
    "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079"
}

Check the status of the model copy process:

aws rekognition describe-project-versions \
--project-arn TargetProjectArn \
--version-names TargetVersionName \
--region us-east-1 \
--profile target-account

The model copy from the source account to the target account is complete when the Status changes to COPYING_COMPLETED:

 {
    "ProjectVersionDescriptions": [
        {
            "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079",
            "CreationTimestamp": "2022-07-12T16:17:57.079000-07:00",
            "Status": "COPYING_COMPLETED",
            "StatusMessage": "Model copy operation was successful",
            ..........
            ..........
            "EvaluationResult": {
                "F1Score": 0.0,
                "Summary": {

6. Start the model and run inference

Enter the following code to start the model in the target account:

aws rekognition start-project-version \
--project-version-arn TargetProjectVersionArn \
--min-inference-units 1 \
--region us-east-1 \
--profile target-account
{
    "Status": "STARTING"
}

Check the status of the model:

aws rekognition describe-project-versions \
--project-arn TargetProjectArn \
--version-names copy_rooms_1 \
--region us-east-1 \
--profile target-account

The model is now hosted and running:

{
    "ProjectVersionDescriptions": [
        {
            "ProjectVersionArn": "arn:aws:rekognition:us-east-1:222222222222:project/target_rooms_1/version/copy_rooms_1/1657667877079",
            "CreationTimestamp": "2022-07-12T16:17:57.079000-07:00",
            "MinInferenceUnits": 1,
            "Status": "RUNNING",
            "StatusMessage": "The model is running.",
            ..........
            ..........
        }
    ]
}

Run inference with the following code:

aws rekognition detect-custom-labels \
 --project-version-arn TargetProjectVersionArn \
 --image-bytes fileb://demo1.jpeg \
 --region us-east-1 \
 --profile target-account
{
    "Name": "backyard",
    "Confidence": 45.77000045776367
 }
aws rekognition detect-custom-labels \
 --project-version-arn TargetProjectVersionArn \
 --image-bytes fileb://demo2.jpeg \
 --region us-east-1 \
 --profile target-account
{
    "Name": "bedroom",
    "Confidence": 61.84600067138672
 }
7. Verify the inference results match

The classes and the confidence scores for the images demo1.jpeg and demo2.jpeg in the target account should match the results in the source account.
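
For a quick check, you can capture the responses from both accounts and compare them. The following sketch reuses the placeholder ARNs from earlier and assumes the demo images are in the current directory:

for img in demo1.jpeg demo2.jpeg; do
  aws rekognition detect-custom-labels \
    --project-version-arn SourceProjectVersionArn \
    --image-bytes fileb://$img \
    --region us-east-1 --profile source-account > source_$img.json
  aws rekognition detect-custom-labels \
    --project-version-arn TargetProjectVersionArn \
    --image-bytes fileb://$img \
    --region us-east-1 --profile target-account > target_$img.json
  diff source_$img.json target_$img.json && echo "$img: results match"
done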

Conclusion

In this post, we demonstrated the Rekognition Custom Labels model copy feature. This feature enables you to train a classification or object detection model in one account and then share the model with another account in the same Region. This simplifies a multi-account strategy in which the model is trained one time and shared between accounts within the same Region without having to retrain it or share the training datasets. This allows for predictable deployment in every account as part of your MLOps workflow. For more information, refer to Copying an Amazon Rekognition Custom Labels model, or try out the walkthrough in this post using a cloud shell with the AWS CLI.

As of this writing, the model copy feature in Amazon Rekognition Custom Labels is available in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • EU (Frankfurt)
  • EU (Ireland)
  • EU (London)

Give the feature a try, and please send us feedback either via the AWS forum for Amazon Rekognition or through your AWS support contacts.


About the authors

Amit Gupta is a Senior AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Yogesh Chaturvedi is a Solutions Architect at AWS with a focus in computer vision. He works with customers to address their business challenges using cloud technologies. Outside of work, he enjoys hiking, traveling, and watching sports.

Aakash Deep is a Senior Software Engineer with AWS. He enjoys working on computer vision, AI, and distributed systems. Outside of work, he enjoys hiking and traveling.

Pashmeen Mistry is the Senior Product Manager for Amazon Rekognition Custom Labels. Outside of work, Pashmeen enjoys adventurous hikes, photography, and spending time with his family.

Read More

Cloud-based medical imaging reconstruction using deep neural networks

Medical imaging techniques like computed tomography (CT), magnetic resonance imaging (MRI), medical x-ray imaging, ultrasound imaging, and others are commonly used by doctors for various reasons. Some examples include detecting changes in the appearance of organs, tissues, and vessels, and detecting abnormalities such as tumors and various other types of pathologies.

Before doctors can use the data from those techniques, the data needs to be transformed from its native raw form to a form that can be displayed as an image on a computer screen.

This process is known as image reconstruction, and it plays a crucial role in a medical imaging workflow—it’s the step that creates diagnostic images that can be then reviewed by doctors.

In this post, we discuss a use case of MRI reconstruction, but the architectural concepts can be applied to other types of image reconstruction.

Advances in the field of image reconstruction have led to the successful application of AI-based techniques within magnetic resonance (MR) imaging. These techniques aim to increase the accuracy of the reconstruction and, in the case of the MR modality, to decrease the time required for a full scan.

Within MR, applications using AI to work with under-sampled acquisitions have been successfully employed, achieving nearly a tenfold reduction in scan times.

Waiting times for tests like MRIs and CT scans have increased rapidly in the last couple of years, leading to waits as long as 3 months. To ensure good patient care, the need for quickly available reconstructed images, together with the pressure to reduce operational costs, has driven the need for a solution capable of scaling according to storage and computational demands.

In addition to computational needs, data volumes have grown steadily in the last few years. For example, the datasets made available by the Medical Image Computing and Computer-Assisted Intervention (MICCAI) conference show annual growth of 21% for MRI, 24% for CT, and 31% for functional MRI (fMRI). (For more information, refer to Dataset Growth in Medical Image Analysis Research.)

In this post, we show you a solution architecture that addresses these challenges. This solution can give research centers, medical institutions, and modality vendors access to unlimited storage capabilities, scalable GPU power, fast data access for machine learning (ML) training and reconstruction tasks, simple and fast ML development environments, and on-premises caching for fast, low-latency availability of image data.

Solution overview

This solution uses an MRI reconstruction technique known as Robust Artificial-neural-networks for k-space Interpolation (RAKI). This approach is advantageous because it’s scan-specific and doesn’t require prior data to train the neural network. The drawback to this technique is that it requires a lot of computational power to be effective.

The AWS architecture outlined shows how a cloud-based reconstruction approach can effectively perform computational-heavy tasks like the one required by the RAKI neural network, scaling according to the load and accelerating the reconstruction process. This opens the door to techniques that can’t realistically be implemented on premises.

Data layer

The data layer has been architected around the following principles:

  • Seamless integration with modalities that store data generated into an attached storage drive via a network share on a NAS device
  • Limitless and secure data storage capabilities to scale to the continuous demand of storage space
  • Fast storage availability for ML workloads such as deep neural training and neural image reconstruction
  • The ability to archive historic data using a low-cost, scalable approach
  • Keep the most frequently accessed reconstructed data readily available while keeping less frequently accessed data archived at a lower cost

The following diagram illustrates this architecture.

This approach uses the following services:

  • AWS Storage Gateway for a seamless integration with the on-premises modality that exchanges information via a file share system. This allows transparent access to the following AWS Cloud storage capabilities while maintaining how the modality exchanges data:

    • Fast cloud upload of the volumes generated by the MR modality.
    • Low-latency access to frequently used reconstructed MR studies via local caching offered by Storage Gateway.
  • Amazon Simple Storage Service (Amazon S3) for unlimited and scalable cloud storage. Amazon S3 also provides low-cost deep archiving of historical raw MRI data with Amazon S3 Glacier, and an intelligent storage tier for the reconstructed MRI data with Amazon S3 Intelligent-Tiering.
  • Amazon FSx for Lustre for fast and scalable intermediate storage used for ML training and reconstruction tasks.

The following figure shows a concise architecture describing the data exchange between the cloud environments.

Using Storage Gateway with the caching mechanism allows on-premises applications to quickly access data that’s available on the local cache. This occurs while simultaneously giving access to scalable storage space on the cloud.

With this approach, modalities can generate raw data from acquisition jobs and write that raw data to a network share handled by Storage Gateway.

If the modality generates multiple files that belong to the same scan, it’s recommended to create a single archive (a .tar file, for example) and perform a single transfer to the network share to accelerate the data transfer.
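
As a minimal illustration, assuming the files from one acquisition are in a local directory named scan_0001 and the Storage Gateway file share is mounted at /mnt/mri-share (both hypothetical paths), the archive-and-transfer step could look like the following:

# Bundle all files from one acquisition into a single archive
tar -cf scan_0001.tar scan_0001/
# Perform a single transfer to the Storage Gateway network share
cp scan_0001.tar /mnt/mri-share/incoming/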

Data decompression and transformation layer

The data decompression layer receives the raw data, automatically performs decompression, and applies potential transformations to the raw data before submitting the preprocessed data to the reconstruction layer.

The adopted architecture is outlined in the following figure.

In this architecture, raw MRI data lands in the raw MRI S3 bucket, thereby triggering a new entry in Amazon Simple Queue Service (Amazon SQS).
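
One way to wire this up is an S3 event notification that targets an SQS queue. The bucket name, queue ARN, and Region below are hypothetical, and the queue also needs an access policy that allows Amazon S3 to send messages to it:

aws s3api put-bucket-notification-configuration \
  --bucket raw-mri-bucket \
  --notification-configuration '{
    "QueueConfigurations": [{
      "QueueArn": "arn:aws:sqs:eu-west-1:111111111111:raw-mri-queue",
      "Events": ["s3:ObjectCreated:*"]
    }]
  }'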

An AWS Lambda function retrieves the raw MRI Amazon SQS queue depth, which represents the amount of raw MRI acquisitions uploaded to the AWS Cloud. This is used with AWS Fargate to automatically modulate the size of an Amazon Elastic Container Service (Amazon ECS) cluster.
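
The queue depth itself is a single attribute read, and the scaling action is an update to the service’s desired count. The Lambda function would use the equivalent SDK calls; as a rough sketch with the AWS CLI (the queue URL, cluster, and service names are hypothetical):

# Approximate number of raw scans waiting to be processed
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-west-1.amazonaws.com/111111111111/raw-mri-queue \
  --attribute-names ApproximateNumberOfMessages

# Scale the Fargate-backed ECS service in proportion to the backlog
aws ecs update-service \
  --cluster mri-decompression-cluster \
  --service mri-decompression-service \
  --desired-count 4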

This approach lets the cluster automatically scale up and down according to the number of raw scans landing in the raw input bucket.

After the raw MRI data is decompressed and preprocessed, it’s saved into another S3 bucket so that it can be reconstructed.

Neural model development layer

The neural model development layer consists of a RAKI implementation. This creates a neural network model to allow the fast image reconstruction of under-sampled magnetic resonance raw data.

The following figure shows the architecture that realizes the neural model development and container creation.

In this architecture, Amazon SageMaker is used to develop the RAKI neural model, and simultaneously to create the container that is later used to perform the MRI reconstruction.

Then, the created container is pushed to a fully managed Amazon Elastic Container Registry (Amazon ECR) repository so that it can later be used to spin up reconstruction tasks.

Fast data storage is guaranteed by the adoption of Amazon FSx for Lustre. It provides sub-millisecond latencies, up to hundreds of GBps of throughput, and up to millions of IOPS. This approach gives SageMaker access to a cost-effective, high-performance, and scalable storage solution.

MRI reconstruction layer

The MRI reconstruction based on the RAKI neural network is handled by the architecture shown in the following diagram.

With the same architectural pattern adopted in the decompression and preprocessing layer, the reconstruction layer automatically scales up and down by analyzing the depth of the queue responsible for holding all the reconstruction requests. In this case, to enable GPU support, AWS Batch is used to run the MRI reconstruction jobs.

Amazon FSx for Lustre is used to exchange the large amount of data involved in MRI acquisition. Furthermore, when a reconstruction job is complete and the reconstructed MRI data is stored in the target S3 bucket, the architecture employed automatically requests a refresh of the storage gateway. This makes the reconstructed data available to the on-premises facility.
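
The cache refresh is a single Storage Gateway API call against the file share that exposes the reconstructed data; for example (the file share ARN is a placeholder):

aws storagegateway refresh-cache \
  --file-share-arn arn:aws:storagegateway:eu-west-1:111111111111:share/share-EXAMPLE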

Overall architecture and results

The overall architecture is shown in the following figure.

We applied the described architecture on MRI reconstruction tasks with datasets approximately 2.4 GB in size.

It took approximately 210 seconds to train 221 datasets, for a total of 514 GB of raw data, on a single node equipped with an NVIDIA Tesla V100-SXM2-16GB.

The reconstruction, after the RAKI network has been trained, took an average of 40 seconds on a single node equipped with an NVIDIA Tesla V100-SXM2-16GB.

The application of the preceding architecture to a reconstruction job can yield the results in the following figure.

The image shows that good results can be obtained via reconstruction techniques such as RAKI. Moreover, adopting cloud technology can make these computation-heavy approaches available without the limitations found in on-premises solutions where storage and computational resources are always limited.

Conclusions

With tools such as Amazon SageMaker, Amazon FSx for Lustre, AWS Batch, Fargate, and Lambda, we can create a managed environment that is scalable, secure, cost-effective, and capable of performing complex tasks such as image reconstruction at scale.

In this post, we explored a possible solution for image reconstruction from raw modality data using a computationally intensive technique known as RAKI: a database-free deep learning technique for fast image reconstruction.

To learn more about how AWS is accelerating innovation in healthcare, visit AWS for Health.

About the author

Benedetto Carollo is the Senior Solution Architect for medical imaging and healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping medical imaging and healthcare customers solve business problems by leveraging technology. Benedetto has over 15 years of experience in technology and medical imaging and has worked for companies like Canon Medical Research and Vital Images. Benedetto received his summa cum laude MSc in Software Engineering from the University of Palermo – Italy.

Read More

Smart Devices, Smart Manufacturing: Pegatron Taps AI, Digital Twins

In the fast-paced field of making the world’s tech devices, Pegatron Corp. initially harnessed AI to gain an edge. Now, it’s on the cusp of creating digital twins to further streamline its efficiency.

Whether or not they’re familiar with the name, most people have probably used smartphones, tablets, Wi-Fi routers or other products that Taiwan-based Pegatron makes in nearly a dozen factories across seven countries. Last year, it made more than 10 million notebook computers.

Andrew Hsiao, associate vice president of Pegatron’s software R&D division, is leading the company’s move into machine learning and the 3D internet known as the metaverse.

Building an AI Platform

“We’ve been collecting factory data since 2012 to find patterns and insights that enhance operations,” said Hsiao, a veteran tech manager who’s been with the company for 14 years, since it spun out of ASUS, one of the world’s largest PC makers.

In 2016, Pegatron’s COO, Denese Yao, launched a task force to apply new technology to improve operations. Hsiao’s team of AI experts collaborated with factory workers to find use cases for AI. One of their first pilot projects used deep learning to detect anomalies in products as they came down the line.

It got solid results using modified versions of neural network models like ResNet, so they stepped on the gas.

Today, Pegatron uses Cambrian, an AI platform it built for automated inspection, deployed in most of its factories. It maintains hundreds of AI models, trained and running in production on NVIDIA GPUs.

Fewer Defects, More Consistency

The new platform catches up to 60% more defects with 30% fewer variations than human inspectors, and factory employees appreciate it.

“Manual inspection is a boring, repetitive job, so it’s not surprising employees don’t like it,” he said. “Now, we’re seeing employees motivated to learn about the new technology, so it’s empowering people to do more value-added work.”

The system may also improve throughput as factories adjust workflows on assembly and packing stations to account for faster inspection lines.

Models Deployed 50x Faster

Pegatron’s system uses NVIDIA A100 Tensor Core GPUs to deploy AI models up to 50x faster than when it trained them on workstations, cutting weeks of work down to a few hours.

“With our unified platform based on DGX, we have our data lake, datasets and training all in one place, so we can deploy a model in one click,” Hsiao said.

Using the Multi-Instance GPU capability in A100 GPUs, Pegatron cut developers’ wait time for access to an accelerator from nearly an hour to 30 seconds. “That lets us dynamically schedule jobs like AI inference and lightweight model training,” he said.
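
For readers unfamiliar with MIG, the partitioning is configured with nvidia-smi. The following is a generic example, not Pegatron’s specific setup; profile IDs vary by GPU and driver, and nvidia-smi mig -lgip lists the ones available on a given card:

sudo nvidia-smi -i 0 -mig 1       # enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi mig -lgip         # list the GPU instance profiles this GPU supports
sudo nvidia-smi mig -cgi 9,9 -C   # create two GPU instances (profile 9) plus their compute instances
sudo nvidia-smi mig -lgi          # confirm the instances that were created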

As part of its AI inference work, the system analyzes more than 10 million images a day using NVIDIA A40 and other GPUs.

Triton, NGC Simplify AI Jobs

Pegatron uses NVIDIA Triton Inference Server, open-source software that helps deploy, run and scale AI models from any framework across all types of processors. It works hand in hand with NVIDIA TensorRT, software that optimizes neural networks to reduce latency.

“Triton and TensorRT make it easy to serve multiple clients and convert jobs to the most cost-effective precision levels,” he said.
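
As a generic example of serving models this way (not Pegatron’s deployment; the release tag and model repository path are placeholders), Triton can be pulled from NGC and started against a local model repository:

docker pull nvcr.io/nvidia/tritonserver:22.07-py3
docker run --gpus=all --rm \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:22.07-py3 \
  tritonserver --model-repository=/models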

Hsiao’s team optimizes pretrained AI models that it downloads, packaged in Kubernetes-ready containers, from NGC, NVIDIA’s hub for GPU-optimized software.

“NGC is very helpful because we get with one click the deep learning frameworks and all the other software components we need, stuff that used to take us a lot of time to pull together,” he said.

Next Step: Digital Twins

Taking another step toward smarter manufacturing, Pegatron is piloting NVIDIA Omniverse, a platform for developing digital twins.

It has two use cases so far. First, it’s testing Omniverse Replicator to generate synthetic data showing what products coming down the inspection line might look like under different lighting conditions or orientations. This information will make its perception models smarter.

Second, it’s creating digital twins of inspection machines. That lets workers manage the machines remotely, gain better insight into predictive maintenance, and simulate software updates before deploying them to a physical machine.

“Today, when a system goes down, we can only check logs that might be incomplete, but with Omniverse, we can replay what happened to understand how to fix it, or, run simulations to predict how it will behave in the future,” he said.

A Pegatron engineer monitors an inspection machine remotely with Omniverse.

What’s more, industrial engineers who care about throughput, automation engineers responsible for downtime, and equipment engineers who handle maintenance can work together on the same virtual system at the same time, even when logging in from different countries.

Vision of a Virtual Factory

If all goes well, Pegatron could have Omniverse available on its inspection machines before the end of the year.

Meanwhile, Hsiao is looking for partners who can help build virtual versions of a whole production line in Omniverse. Longer term, his vision is to create a digital twin of an entire factory.

“In my opinion, the greatest impact will come from building a full virtual factory so we can try out things like new ways to route products through the plant,” he said. “When you just build it out without a simulation first, your mistakes are very costly.”

The post Smart Devices, Smart Manufacturing: Pegatron Taps AI, Digital Twins appeared first on NVIDIA Blog.

Read More

AI Shows the Way: Seoul Robotics Helps Cars Move, Park on Their Own

Imagine driving a car — one without self-driving capabilities — to a mall, airport or parking garage, and using an app to have the car drive off to park itself.

Software company Seoul Robotics is using NVIDIA technology to make this possible — turning non-autonomous cars into self-driving vehicles.

Headquartered in Korea, the company’s initial focus is on improving first- and last-mile logistics such as parking. Its Level 5 Control Tower is a mesh network of sensors and computers placed on infrastructure around a facility, like buildings or light poles — rather than on individual cars — to capture an unobstructed view of the environment.

The system enables cars to move autonomously by directing their vehicle-to-everything, or so-called V2X, communication systems. These systems pass information from a vehicle to infrastructure, other vehicles, any surrounding entities — and vice versa. V2X technology, which comes standard in many modern cars, is used to improve road safety, traffic efficiency and energy savings.

Seoul Robotics’ platform, dubbed LV5 CTRL TWR, collects 3D data from the environment using cameras and lidar. Computer vision and deep learning-based AI analyze the data, determining the most efficient and safest paths for vehicles within the covered area.

Then, through V2X, the platform can manage a car’s existing features, such as adaptive-cruise-control, lane-keeping and brake-assist functions, to safely get it from place to place.

LV5 CTRL TWR is built using NVIDIA CUDA libraries for creating GPU-accelerated applications, as well as the Jetson AGX Orin module for high-performance AI at the edge. NVIDIA GPUs are used in the cloud for global fleet path planning.

Seoul Robotics is a member of NVIDIA Metropolis — a partner program centered on an application framework and set of developer tools that supercharge vision AI applications — and NVIDIA Inception, a free, global program that nurtures cutting-edge startups.

Autonomy Through Infrastructure

Seoul Robotics is pioneering a new path to level 5 autonomy, or full driving automation, with what’s known as “autonomy through infrastructure.”

“Instead of outfitting the vehicles themselves with sensors, we’re outfitting the surrounding infrastructure with sensors,” said Jerone Floor, vice president of product and solutions at Seoul Robotics.

Using V2X capabilities, LV5 CTRL TWR sends commands from infrastructure to cars, making vehicles turn right or left, move from point A to B, brake and more. It achieves an accuracy in positioning a car of plus or minus four centimeters.

“No matter how smart a vehicle is, if another car is coming from around a corner, for example, it won’t be able to see it,” Floor said. “LV5 CTRL TWR provides vehicles with the last bits of information gathered from having a holistic view of the environment, so they’re never ‘blind.’”

These communication protocols already exist in most vehicles, he added. LV5 CTRL TWR acts as the AI-powered brain of the instructive mechanisms, requiring nothing more than a firmware update in cars.

“From the beginning, we knew we needed deep learning in the system in order to achieve the really high performance required to reach safety goals — and for that, we needed GPU acceleration,” Floor said. “So, we designed the system from the ground up based on NVIDIA GPUs and CUDA.”

NVIDIA CUDA libraries help the Seoul Robotics team render massive amounts of data from the 3D sensors in real time, as well as accelerate training and inference for its deep learning models.

As a Metropolis member, Seoul Robotics received early access to software development kits and the NVIDIA Jetson AGX Orin for edge AI.

“The compute capabilities of Jetson AGX Orin allow us to have the LV5 CTRL TWR cover more area with a single module,” Floor added. “Plus, it handles a wide temperature range, enabling our system to work in both indoor and outdoor units, rain or shine.”

Deployment Across the Globe

LV5 CTRL TWR is in early commercial deployment at a BMW manufacturing facility in Munich.

According to Floor, cars must often change locations once they’re manufactured, from electrical repair stations to parking lots for test driving and more.

Equipped with LV5 CTRL TWR, the BMW facility has automated such movement of cars — resulting in time and cost savings. Automating car transfers also enhances safety for employees and frees them up to focus on other tasks, like headlight alignment and more, Floor said.

And from the moment a vehicle is fully manufactured until it’s delivered to the customer, it moves on average through up to seven parking lots. Moving cars manually costs manufacturers anywhere from $30 to $60, per car, per lot — meaning LV5 CTRL TWR can address a $30 billion market.

The technology behind LV5 CTRL TWR can be used across industries, Floor highlighted. Beyond automotive factories, Seoul Robotics envisions its platform to be deployed across the globe — at retail stores, airports, traffic intersections and more.

NVIDIA Jetson AGX Orin 32GB production modules are now available.

Learn more about NVIDIA Metropolis and apply to join NVIDIA Inception.

Feature image courtesy of BMW Group.

The post AI Shows the Way: Seoul Robotics Helps Cars Move, Park on Their Own appeared first on NVIDIA Blog.

Read More

Towards Helpful Robots: Grounding Language in Robotic Affordances

Over the last several years, we have seen significant progress in applying machine learning to robotics. However, robotic systems today are capable of executing only very short, hard-coded commands, such as “Pick up an apple,” because they tend to perform best with clear tasks and rewards. They struggle with learning to perform long-horizon tasks and reasoning about abstract goals, such as a user prompt like “I just worked out, can you get me a healthy snack?”

Meanwhile, recent progress in training language models (LMs) has led to systems that can perform a wide range of language understanding and generation tasks with impressive results. However, these language models are inherently not grounded in the physical world due to the nature of their training process: a language model generally does not interact with its environment nor observe the outcome of its responses. This can result in it generating instructions that may be illogical, impractical or unsafe for a robot to complete in a physical context. For example, when prompted with “I spilled my drink, can you help?” the language model GPT-3 responds with “You could try using a vacuum cleaner,” a suggestion that may be unsafe or impossible for the robot to execute. When asking the FLAN language model the same question, it apologizes for the spill with “I’m sorry, I didn’t mean to spill it,” which is not a very useful response. Therefore, we asked ourselves, is there an effective way to combine advanced language models with robot learning algorithms to leverage the benefits of both?

In “Do As I Can, Not As I Say: Grounding Language in Robotic Affordances”, we present a novel approach, developed in partnership with Everyday Robots, that leverages advanced language model knowledge to enable a physical agent, such as a robot, to follow high-level textual instructions for physically-grounded tasks, while grounding the language model in tasks that are feasible within a specific real-world context. We evaluate our method, which we call PaLM-SayCan, by placing robots in a real kitchen setting and giving them tasks expressed in natural language. We observe highly interpretable results for temporally-extended complex and abstract tasks, like “I just worked out, please bring me a snack and a drink to recover.” Specifically, we demonstrate that grounding the language model in the real world nearly halves errors over non-grounded baselines. We are also excited to release a robot simulation setup where the research community can test this approach.

With PaLM-SayCan, the robot acts as the language model’s “hands and eyes,” while the language model supplies high-level semantic knowledge about the task.

A Dialog Between User and Robot, Facilitated by the Language Model
Our approach uses the knowledge contained in language models (Say) to determine and score actions that are useful toward high-level instructions. It also uses an affordance function (Can) that enables real-world grounding and determines which actions are possible to execute in a given environment. Using the PaLM language model, we call this PaLM-SayCan.

Our approach selects skills based on what the language model scores as useful to the high level instruction and what the affordance model scores as possible.

Our system can be seen as a dialog between the user and robot, facilitated by the language model. The user starts by giving an instruction that the language model turns into a sequence of steps for the robot to execute. This sequence is filtered using the robot’s skillset to determine the most feasible plan given its current state and environment. The model determines the probability of a specific skill successfully making progress toward completing the instruction by multiplying two probabilities: (1) task-grounding (i.e., a skill language description) and (2) world-grounding (i.e., skill feasibility in the current state).
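
In notation of our own (not taken from the paper), the combined score for a candidate skill $\pi_k$ with language description $\ell_k$, given a high-level instruction $i$ and the robot’s current state $s$, can be written as

$$p(\pi_k \mid i, s) \;\propto\; \underbrace{p_{\mathrm{LM}}(\ell_k \mid i)}_{\text{task-grounding}} \cdot \underbrace{p(c_k \mid s, \ell_k)}_{\text{world-grounding}},$$

where $c_k$ denotes successful completion of skill $\pi_k$ from state $s$. The skill with the highest combined score is executed, and the process repeats with the updated state.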

There are additional benefits of our approach in terms of its safety and interpretability. First, by allowing the LM to score different options rather than generate the most likely output, we effectively constrain the LM to only output one of the pre-selected responses. In addition, the user can easily understand the decision making process by looking at the separate language and affordance scores, rather than a single output.

PaLM-SayCan is also interpretable: at each step, we can see the top options it considers based on their language score (blue), affordance score (red), and combined score (green).

Training Policies and Value Functions
Each skill in the agent’s skillset is defined as a policy with a short language description (e.g., “pick up the can”), represented as embeddings, and an affordance function that indicates the probability of completing the skill from the robot’s current state. To learn the affordance functions, we use sparse reward functions set to 1.0 for a successful execution, and 0.0 otherwise.

We use image-based behavioral cloning (BC) to train the language-conditioned policies and temporal-difference-based (TD) reinforcement learning (RL) to train the value functions. To train the policies, we collected data from 68,000 demos performed by 10 robots over 11 months and added 12,000 successful episodes, filtered from a set of autonomous episodes of learned policies. We then learned the language conditioned value functions using MT-Opt in the Everyday Robots simulator. The simulator complements our real robot fleet with a simulated version of the skills and environment, which is transformed using RetinaGAN to reduce the simulation-to-real gap. We bootstrapped simulation policies’ performance by using demonstrations to provide initial successes, and then continuously improved RL performance with online data collection in simulation.

Given a high-level instruction, our approach combines the probabilities from the language model with the probabilities from the value function (VF) to select the next skill to perform. This process is repeated until the high-level instruction is successfully completed.

Performance on Temporally-Extended, Complex, and Abstract Instructions
To test our approach, we use robots from Everyday Robots paired with PaLM. We place the robots in a kitchen environment containing common objects and evaluate them on 101 instructions to test their performance across various robot and environment states, instruction language complexity and time horizon. Specifically, these instructions were designed to showcase the ambiguity and complexity of language rather than to provide simple, imperative queries, enabling queries such as “I just worked out, how would you bring me a snack and a drink to recover?” instead of “Can you bring me water and an apple?”

We use two metrics to evaluate the system’s performance: (1) the plan success rate, indicating whether the robot chose the right skills for the instruction, and (2) the execution success rate, indicating whether it performed the instruction successfully. We compare two language models, PaLM and FLAN (a smaller language model fine-tuned on instruction answering) with and without the affordance grounding as well as the underlying policies running directly with natural language (Behavioral Cloning in the table below). The results show that the system using PaLM with affordance grounding (PaLM-SayCan) chooses the correct sequence of skills 84% of the time and executes them successfully 74% of the time, reducing errors by 50% compared to FLAN and compared to PaLM without robotic grounding. This is particularly exciting because it represents the first time we can see how an improvement in language models translates to a similar improvement in robotics. This result indicates a potential future where robotics is able to ride the wave of progress that we have been observing in language models, bringing these subfields of research closer together.

Algorithm            Plan    Execute
PaLM-SayCan          84%     74%
PaLM                 67%
FLAN-SayCan          70%     61%
FLAN                 38%
Behavioral Cloning    0%      0%

PaLM-SayCan halves errors compared to PaLM without affordances and compared to FLAN over 101 tasks.

SayCan demonstrated successful planning for 84% of the 101 test instructions when combined with PaLM.

If you’re interested in learning more about this project from the researchers themselves, please check out the video below:

Conclusion and Future Work
We’re excited about the progress that we’ve seen with PaLM-SayCan, an interpretable and general approach to leveraging knowledge from language models that enables a robot to follow high-level textual instructions to perform physically-grounded tasks. Our experiments on a number of real-world robotic tasks demonstrate the ability to plan and complete long-horizon, abstract, natural language instructions at a high success rate. We believe that PaLM-SayCan’s interpretability allows for safe real-world user interaction with robots. As we explore future directions for this work, we hope to better understand how information gained via the robot’s real-world experience could be leveraged to improve the language model and to what extent natural language is the right ontology for programming robots. We have open-sourced a robot simulation setup, which we hope will provide researchers with a valuable resource for future research that combines robotic learning with advanced language models. The research community can visit the project’s GitHub page and website to learn more.

Acknowledgements
We’d like to thank our coauthors Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Kelly Fu, Keerthana Gopalakrishnan, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, and Andy Zeng. We’d also like to thank Yunfei Bai, Matt Bennice, Maarten Bosma, Justin Boyd, Bill Byrne, Kendra Byrne, Noah Constant, Pete Florence, Laura Graesser, Rico Jonschkowski, Daniel Kappler, Hugo Larochelle, Benjamin Lee, Adrian Li, Suraj Nair, Krista Reymann, Jeff Seto, Dhruv Shah, Ian Storz, Razvan Surdulescu, and Vincent Zhao for their help and support in various aspects of the project. And we’d like to thank Tom Small for creating many of the animations in this post.

Read More

Making robots more helpful with language

Even the simplest human tasks are unbelievably complex. The way we perceive and interact with the world requires a lifetime of accumulated experience and context. For example, if a person tells you, “I am running out of time,” you don’t immediately worry they are jogging on a street where the space-time continuum ceases to exist. You understand that they’re probably coming up against a deadline. And if they hurriedly walk toward a closed door, you don’t brace for a collision, because you trust this person can open the door, whether by turning a knob or pulling a handle.

A robot doesn’t innately have that understanding. And that’s the inherent challenge of programming helpful robots that can interact with humans. We know it as “Moravec’s paradox” — the idea that in robotics, it’s the easiest things that are the most difficult to program a robot to do. This is because we’ve had all of human evolution to master our basic motor skills, but relatively speaking, humans have only just learned algebra.

In other words, there’s a genius to human beings — from understanding idioms to manipulating our physical environments — where it seems like we just “get it.” The same can’t be said for robots.

Today, robots by and large exist in industrial environments, and are painstakingly coded for narrow tasks. This makes it impossible for them to adapt to the unpredictability of the real world. That’s why Google Research and Everyday Robots are working together to combine the best of language models with robot learning.

Called PaLM-SayCan, this joint research uses PaLM — or Pathways Language Model — in a robot learning model running on an Everyday Robots helper robot. This effort is the first implementation that uses a large-scale language model to plan for a real robot. It not only makes it possible for people to communicate with helper robots via text or speech, but also improves the robot’s overall performance and ability to execute more complex and abstract tasks by tapping into the world knowledge encoded in the language model.

Using language to improve robots

PaLM-SayCan enables the robot to understand the way we communicate, facilitating more natural interaction. Language is a reflection of the human mind’s ability to assemble tasks, put them in context and even reason through problems. Language models also contain enormous amounts of information about the world, and it turns out that can be pretty helpful to the robot. PaLM can help the robotic system process more complex, open-ended prompts and respond to them in ways that are reasonable and sensible.

PaLM-SayCan shows that a robot’s performance can be improved simply by enhancing the underlying language model. When the system was integrated with PaLM, compared to a less powerful baseline model, we saw a 14% improvement in the planning success rate, or the ability to map a viable approach to a task. We also saw a 13% improvement on the execution success rate, or ability to successfully carry out a task. This is half the number of planning mistakes made by the baseline method. The biggest improvement, at 26%, is in planning long horizon tasks, or those in which eight or more steps are involved. Here’s an example: “I left out a soda, an apple and water. Can you throw them away and then bring me a sponge to wipe the table?” Pretty demanding, if you ask me.

Making sense of the world through language

With PaLM, we’re seeing new capabilities emerge in the language domain such as reasoning via chain of thought prompting. This allows us to see and improve how the model interprets the task. For example, if you show the model a handful of examples with the thought process behind how to respond to a query, it learns to reason through those prompts. This is similar to how we learn by showing our work on our algebra homework.

PaLM-SayCan uses chain of thought prompting, which interprets the instruction in order to score the likelihood of completing the task

So if you ask PaLM-SayCan, “Bring me a snack and something to wash it down with,” it uses chain of thought prompting to recognize that a bag of chips may be a good snack, and that “wash it down” means bring a drink. Then PaLM-SayCan can respond with a series of steps to accomplish this. While we’re early in our research, this is promising for a future where robots can handle complex requests.

Grounding language through experience

Complexity exists in both language and the environments around us. That’s why grounding artificial intelligence in the real world is a critical part of what we do in Google Research. A language model may suggest something that appears reasonable and helpful, but may not be safe or realistic in a given setting. Robots, on the other hand, have been trained to know what is possible given the environment. By fusing language and robotic knowledge, we’re able to improve the overall performance of a robotic system.

Here’s how this works in PaLM-SayCan: PaLM suggests possible approaches to the task based on language understanding, and the robot models do the same based on the feasible skill set. The combined system then cross-references the two to help identify more helpful and achievable approaches for the robot.

By combining language and robotic affordances, PaLM-SayCan breaks down the requested task to perform it successfully

For example, if you ask the language model, “I spilled my drink, can you help?,” it may suggest you try using a vacuum. This seems like a perfectly reasonable way to clean up a mess, but generally, it’s probably not a good idea to use a vacuum on a liquid spill. And if the robot can’t pick up a vacuum or operate it, it’s not a particularly helpful way to approach the task. Together, the two may instead be able to realize “bring a sponge” is both possible and more helpful.

Experimenting responsibly

We take a responsible approach to this research and follow Google’s AI Principles in the development of our robots. Safety is our number-one priority and especially important for a learning robot: it may act clumsily while exploring, but it should always be safe. We follow all the tried-and-true principles of robot safety, including risk assessments, physical controls, safety protocols and emergency stops. We also always implement multiple levels of safety such as force limitations and algorithmic protections to mitigate risky scenarios. PaLM-SayCan is constrained to commands that are safe for a robot to perform and was also developed to be highly interpretable, so we can clearly examine and learn from every decision the system makes.

Making sense of our worlds

Whether it’s moving about busy offices — or understanding common sayings — we still have many mechanical and intelligence challenges to solve in robotics. So, for now, these robots are just getting better at grabbing snacks for Googlers in our micro-kitchens.

But as we continue to uncover ways for robots to interact with our ever-changing world, we’ve found that language and robotics show enormous potential for the helpful, human-centered robots of tomorrow.

Read More

Digital Art Professor Kate Parsons Inspires Next Generation of Creators This Week ‘In the NVIDIA Studio’

Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology accelerates creative workflows.

Many artists can edit a video, paint a picture or build a model — but transforming one’s imagination into stunning creations can now involve breakthrough design technologies.

Kate Parsons, a digital art professor at Pepperdine University and this week’s featured In the NVIDIA Studio artist, helped bring a music video for How Do I Get to Invincible to life using virtual reality and NVIDIA GeForce RTX GPUs.

The project, a part of electronic music trio The Glitch Mob’s visual album See Without Eyes, quickly and seamlessly moved from 3D to VR, thanks to NVIDIA GPU acceleration.

We All FLOAT On

Parsons has blended her passions for art and technology as co-founder of FLOAT LAND, a pioneering studio that innovates across digital media, including video, as well as virtual and augmented reality.

She and co-founder Ben Vance collaborate on projects at the intersection of art and interactivity. They design engaging animations, VR art exhibits and futuristic interactive AR displays.

FLOAT LAND embraces the latest advances in GPU technology to push creative boundaries. Photo courtesy of United Nude.

When The Glitch Mob set out to turn See Without Eyes into a state-of-the-art visual album, the group tapped long-term collaborators Strangeloop Studios, who in turn reached out to FLOAT LAND to create art for the song How Do I Get to Invincible. Parsons and her team brought a dreamlike feel to the project.

Working with the team at Strangeloop Studios, FLOAT LAND created the otherworldly landscapes for How Do I Get to Invincible, harnessing the power of NVIDIA RTX GPUs.

FLOAT LAND is a collaborative studio focused on the intersection of art and interactivity, founded by Kate Parsons and Ben Vance. Photo by Nicole Gawalis.

“We have a long history of using NVIDIA GPUs due to our early work in the VR space,” Parsons said. “The early days of VR were a bit like the Wild West, and it was really important for us to have reliable systems — we consider NVIDIA GPUs to be a key part of our rigs.”

Where Dreams Become (Virtual) Reality

The FLOAT LAND team used several creative applications for the visual album. They began by researching techniques in real-time visual effects to work within Unity software. This included using custom shaders inspired by the Shadertoy computer graphics tool and exploring different looks to create a surreal mix of dark and moody.

Then, the artists built test terrains using Cinema 4D, a professional 3D animation, simulation and rendering solution, and Unity, a leading platform for creating and operating interactive, real-time 3D content, to explore post-effects like tilt shift, ambient occlusion and chromatic aberration. They also used the Unity plugin Fog Volume 3 to create rich, dynamic clouds to quickly explore many options.

Using NVIDIA RTX GPUs in Unity accelerated the work of Parsons’s team through advanced shading techniques. Plus, NVIDIA DLSS increased the interactivity of the viewport.

“Unity was central to our production process, and we iterated both in editor and in real time to get the look we wanted,” Parsons said. “Some of the effects really pushed the limits of our GPUs. It wouldn’t have been possible to work in real time without GPU acceleration – we would’ve had to render out clips, which takes anywhere from 10 to thousands of times longer.”

And like all great projects, even once it was done, the visual album wasn’t done, Parsons said. Working with virtual entertainment company Wave, FLOAT LAND used its work for the visual album to turn the entire piece into a VR experience. The Unity- and GPU-native groundwork greatly accelerated this process, Parsons added.

The Glitch Mob called it “a completely new way to experience music.”

Best in Class

When she isn’t making her own breathtaking creations, Parsons helps her students grow as creators. She teaches basic and advanced digital art at Pepperdine — including how to use emerging technologies to transform creative workflows.

“Many of my students get really obsessed with learning certain kinds of software — as if learning the software will automatically bypass the need to think creatively,” she said. “In this sense, software is just a tool.”

Parsons advises her students to try a bit of everything and see what sticks. “If there’s something you want to learn, spend about three weeks with it and see if it’s a tool that will be useful for you,” she said.

Vibrant Matter courtesy of FLOAT LAND.

While many of her projects dip into new, immersive fields like AR and VR, Parsons highlighted the importance of understanding the fundamentals, like workflows in Adobe Photoshop and Illustrator. “Students should learn the difference between bitmap and vector images early on,” she said.

Parsons works across multiple systems powered by NVIDIA GPUs — a Dell Alienware PC with a GeForce RTX 2070 GPU in her classroom; a custom PC with a GeForce RTX 2080 in her home office; and a Razer Blade 15 with a GeForce RTX 3070 Laptop GPU for projects on the go. When students ask which laptop they should use for their creative education, Parsons points them to NVIDIA Studio-validated PCs.

Parsons and Vance’s creative workspace, powered by an NVIDIA GeForce RTX 2070 GPU.

Creatives going back to school can start off on the right foot with an NVIDIA Studio-validated laptop. Whether for 3D modeling, VR, video and photo editing or any other creative endeavor, a powerful laptop is ready to be the backbone of creativity. Explore these task-specific recommendations for NVIDIA Studio laptops.

#CreatorJourney Challenge

In the spirit of learning, the NVIDIA Studio team is posing a challenge for the community to show off personal growth. Participate in the #CreatorJourney challenge for a chance to be showcased on NVIDIA Studio social media channels.

Entering is easy. Post an older piece of artwork alongside a more recent one to showcase your growth as an artist. Follow and tag NVIDIA Studio on Instagram, Twitter or Facebook, and use the #CreatorJourney tag to join.

Learn something new today: Access tutorials on the Studio YouTube channel and get creativity-inspiring updates directly to your inbox by subscribing to the NVIDIA Studio newsletter.

Read More

Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16

Overview

In recent years, the growing complexity of AI models has placed ever-increasing compute demands on hardware. Reduced-precision numeric formats have been proposed to address this problem. Bfloat16 is a custom 16-bit floating point format for AI that consists of one sign bit, eight exponent bits, and seven mantissa bits. Because it has the same dynamic range as float32, bfloat16 doesn’t require special handling such as loss scaling. Therefore, bfloat16 is a drop-in replacement for float32 when running deep neural networks for both inference and training.
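
You can see the “same dynamic range” property directly by comparing the numeric limits PyTorch reports for the two formats (a quick illustrative check, not part of the original benchmarks):

import torch

# bfloat16 keeps float32's 8 exponent bits, so its representable range is
# essentially the same; it only loses mantissa precision.
print(torch.finfo(torch.float32).max)   # ~3.40e+38
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38
print(torch.finfo(torch.float32).eps)   # ~1.19e-07
print(torch.finfo(torch.bfloat16).eps)  # ~7.81e-03 (fewer mantissa bits)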

The 3rd Gen Intel® Xeon® Scalable processor (codenamed Cooper Lake) is the first general-purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel® Advanced Vector Extensions-512 (Intel® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Theoretical bfloat16 compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel® Xeon® Scalable Processors, bfloat16 compute throughput will be further enhanced through the Advanced Matrix Extensions (Intel® AMX) instruction set extension.

Intel and Meta previously collaborated to enable bfloat16 in PyTorch, and the related work was published in an earlier blog during the launch of Cooper Lake. In that blog, we introduced the hardware advancements for native bfloat16 support and showcased a 1.4x to 1.6x performance boost of bfloat16 over float32 on DLRM, ResNet-50, and ResNeXt-101-32x4d.

In this blog, we introduce the latest software enhancements for bfloat16 in PyTorch 1.12, which apply to a much broader scope of user scenarios and deliver even higher performance gains.

Native Level Optimization on Bfloat16

On the PyTorch CPU bfloat16 path, the compute-intensive operators, e.g., convolution, linear, and bmm, use oneDNN (oneAPI Deep Neural Network Library) to achieve optimal performance on Intel CPUs with AVX512_BF16 or AMX support. The other operators, such as tensor operators and neural network operators, are optimized at the PyTorch native level. We have extended bfloat16 kernel-level optimizations to the majority of operators on dense tensors, applicable to both inference and training (sparse tensor bfloat16 support will be covered in future work), specifically:

  • Bfloat16 vectorization: Bfloat16 is stored as an unsigned 16-bit integer, so it must be cast to float32 for arithmetic operations such as add, mul, etc. Specifically, each bfloat16 vector is converted to two float32 vectors, processed accordingly, and then converted back. Non-arithmetic operations such as cat, copy, etc. are straight memory copies and involve no data type conversion.
  • Bfloat16 reduction: Reductions on bfloat16 data use float32 as the accumulation type to guarantee numerical stability, e.g., sum, BatchNorm2d, MaxPool2d, etc.
  • Channels Last optimization: For vision models, Channels Last is the preferable memory format over Channels First from a performance perspective. We have implemented fully optimized CPU kernels for all the commonly used CV modules on the Channels Last memory format, covering both float32 and bfloat16 (see the sketch after this list).
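
Below is a minimal sketch of opting a model and its input into the Channels Last memory format, assuming a recent torchvision build; the model choice is arbitrary and used purely for illustration:

import torch
import torchvision

# Any CV model works here; ResNet-50 is used purely for illustration.
model = torchvision.models.resnet50(weights=None).eval()

# Convert the model weights and the input activations to the Channels Last layout.
model = model.to(memory_format=torch.channels_last)
x = torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)  # runs through the Channels Last optimized CPU kernels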

Run Bfloat16 with Auto Mixed Precision

To run a model in bfloat16, you can either explicitly convert the data and model to bfloat16, for example:

# with explicit conversion (assumes `model` and `input` are already defined)
import torch

input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)

or use the torch.amp (Automatic Mixed Precision) package. An autocast instance serves as a context manager or decorator that allows regions of your script to run in mixed precision, for example:

# with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)

Generally, the explicit conversion approach and the AMP approach deliver similar performance. Even so, we recommend running bfloat16 models with AMP, because:

  • Better user experience with automatic fallback: If your script includes operators that don’t have bfloat16 support, autocast implicitly falls back to float32 for them, while the explicitly converted model would raise a runtime error.

  • Mixed data types for activations and parameters: Unlike explicit conversion, which converts all the model parameters to bfloat16, AMP mode runs in mixed data types. Specifically, input/output activations are kept in bfloat16 while parameters, e.g., weight/bias, are kept in float32. This mix of activation and parameter data types helps improve performance while maintaining accuracy (illustrated in the sketch after this list).
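
A quick way to observe this mixed data type behavior is the small illustrative check below, which assumes a toy convolutional model:

import torch

# A toy model; under CPU autocast its convolution runs in bfloat16 while
# its parameters stay in float32.
model = torch.nn.Conv2d(3, 8, kernel_size=3).eval()
x = torch.randn(1, 3, 32, 32)

with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(next(model.parameters()).dtype)  # torch.float32
print(y.dtype)                         # torch.bfloat16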

Performance Gains

We benchmarked the inference performance of TorchVision models on an Intel® Xeon® Platinum 8380H CPU @ 2.90GHz (codenamed Cooper Lake), with a single instance per socket (batch size = 2 x number of physical cores). The results show that bfloat16 achieves a 1.4x to 2.2x performance gain over float32.
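
As a rough sketch of how such a comparison can be reproduced (illustrative only: the model, batch size, and iteration counts here are arbitrary, not the exact benchmark configuration used for the numbers above):

import time
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
model = model.to(memory_format=torch.channels_last)
x = torch.randn(32, 3, 224, 224).to(memory_format=torch.channels_last)

def bench(use_bf16):
    # Warm up, then time the average latency of a forward pass.
    with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=use_bf16):
        for _ in range(5):
            model(x)
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        return (time.perf_counter() - start) / 20

fp32_latency = bench(use_bf16=False)
bf16_latency = bench(use_bf16=True)
print(f"float32: {fp32_latency:.3f}s/iter, bfloat16 AMP: {bf16_latency:.3f}s/iter, "
      f"speedup: {fp32_latency / bf16_latency:.2f}x")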

The performance boost of bfloat16 over float32 primarily comes from 3 aspects:

  • The compute-intensive operators take advantage of the new native bfloat16 instruction VDPBF16PS, which doubles the hardware compute throughput.
  • Bfloat16 has only half the memory footprint of float32, so memory-bandwidth-bound operators can theoretically run up to twice as fast.
  • On Channels Last, we intentionally keep the same parallelization scheme for all memory-format-aware operators (this isn’t possible on Channels First), which increases data locality when passing each layer’s output to the next: the data stays closer to the CPU cores and tends to remain in cache. In such scenarios, bfloat16 also achieves a higher cache hit rate than float32 thanks to its smaller memory footprint.

Conclusion & Future Work

In this blog, we introduced the recent software optimizations for bfloat16 included in PyTorch 1.12. Results on the 3rd Gen Intel® Xeon® Scalable processor show that bfloat16 achieves a 1.4x to 2.2x performance gain over float32 on TorchVision models. Further improvement is expected on the next generation of Intel® Xeon® Scalable Processors with AMX instruction support. Although the performance numbers for this blog were collected with TorchVision models, the benefit is broad across all topologies. We will continue to extend the bfloat16 optimization effort to a broader scope in the future!

Acknowledgement

The results presented in this blog are a joint effort of the Meta and Intel PyTorch teams. Special thanks to Vitaly Fedyunin and Wei Wei from Meta, who spent precious time and gave substantial assistance! Together we made one more step on the path of improving the PyTorch CPU ecosystem.

Read More