Unlocking the Full Potential of Datacenter ML Accelerators with Platform-Aware Neural Architecture Search

Continuing advances in the design and implementation of datacenter (DC) accelerators for machine learning (ML), such as TPUs and GPUs, have been critical for powering modern ML models and applications at scale. These improved accelerators exhibit peak performance (e.g., FLOPs) that is orders of magnitude better than traditional computing systems. However, there is a fast-widening gap between the available peak performance offered by state-of-the-art hardware and the actual achieved performance when ML models run on that hardware.

One approach to address this gap is to design hardware-specific ML models that optimize both performance (e.g., throughput and latency) and model quality. Recent applications of neural architecture search (NAS), an emerging paradigm to automate the design of ML model architectures, have employed a platform-aware multi-objective approach that includes a hardware performance objective. While this approach has yielded improved model performance in practice, the details of the underlying hardware architecture are opaque to the model. As a result, there is untapped potential to build full capability hardware-friendly ML model architectures, with hardware-specific optimizations, for powerful DC ML accelerators.

In “Searching for Fast Model Families on Datacenter Accelerators”, published at CVPR 2021, we advanced the state of the art of hardware-aware NAS by automatically adapting model architectures to the hardware on which they will be executed. The approach we propose finds optimized families of models for which additional hardware performance gains cannot be achieved without loss in model quality (called Pareto optimization). To accomplish this, we infuse a deep understanding of hardware architecture into the design of the NAS search space for discovery of both single models and model families. We provide quantitative analysis of the performance gap between hardware and traditional model architectures and demonstrate the advantages of using true hardware performance (i.e., throughput and latency), instead of the performance proxy (FLOPs), as the performance optimization objective. Leveraging this advanced hardware-aware NAS and building upon the EfficientNet architecture, we developed a family of models, called EfficientNet-X, that demonstrates the effectiveness of this approach for Pareto-optimized ML models on TPUs and GPUs.

Platform-Aware NAS for DC ML Accelerators
To achieve high performance, ML models need to adapt to modern ML accelerators. Platform-aware NAS integrates knowledge of the hardware accelerator properties into all three pillars of NAS: (i) the search objectives; (ii) the search space; and (iii) the search algorithm (shown below). We focus on the new search space because it contains the building blocks needed to compose the models and is the key link between the ML model architectures and accelerator hardware architectures.

We construct TPU/GPU specialized search spaces with TPU/GPU-friendly operations to infuse hardware awareness into NAS. For example, a key adaptation is maximizing parallelism to ensure different hardware components inside the accelerators work together as efficiently as possible. This includes the matrix multiplication units (MXUs) in TPUs and the TensorCore in GPUs for matrix/tensor computation, as well as the vector processing units (VPUs) in TPUs and CUDA cores in GPUs for vector processing. Maximizing model arithmetic intensity (i.e., optimizing the parallelism between computation and operations on the high bandwidth memory) is also critical to achieve top performance. To tap into the full potential of the hardware, it is crucial for ML models to achieve high parallelism inside and across these hardware components.
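
As a rough illustration of this idea (not code from the NAS system itself), arithmetic intensity is the ratio of compute to memory traffic: total FLOPs divided by the bytes moved to and from high bandwidth memory. A model or layer is memory-bound whenever this ratio falls below the accelerator's FLOPs-to-bandwidth ratio. The sketch below shows the back-of-the-envelope calculation; the hardware numbers are made-up assumptions for illustration, not published TPU or GPU specifications.

# Rough roofline-style estimate of whether a conv layer is compute- or memory-bound.
# All hardware numbers are illustrative assumptions, not real TPU/GPU specifications.

def conv_arithmetic_intensity(n, h, w, c_in, c_out, k, bytes_per_elem=2):
    """FLOPs per byte of memory traffic for a k x k convolution (stride 1, same padding)."""
    flops = 2 * n * h * w * c_in * c_out * k * k              # multiply-accumulates
    bytes_moved = bytes_per_elem * (
        n * h * w * c_in            # input activations
        + n * h * w * c_out         # output activations
        + k * k * c_in * c_out      # weights
    )
    return flops / bytes_moved

machine_balance = 100e12 / 1e12    # hypothetical accelerator: 100 TFLOP/s peak, 1 TB/s memory bandwidth

ai = conv_arithmetic_intensity(n=8, h=56, w=56, c_in=64, c_out=64, k=3)
print(f"arithmetic intensity: {ai:.1f} FLOPs/byte")
print("memory-bound" if ai < machine_balance else "compute-bound")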

Overview of platform-aware NAS on TPUs/GPUs, highlighting the search space and search objectives.

Advanced platform-aware NAS has an optimized search space containing a set of complementary techniques to holistically improve parallelism for ML model execution on TPUs and GPUs:

  1. It uses specialized tensor reshaping techniques to maximize the parallelism in the MXUs / TensorCores.
  2. It dynamically selects different activation functions depending on matrix operation types to ensure overlapping of vector and matrix/tensor processing.
  3. It employs hybrid convolutions and a novel fusion strategy to strike a balance between total compute and arithmetic intensity to ensure that computation and memory access happens in parallel and to reduce the contention on VPUs / CUDA cores.
  4. With latency-aware compound scaling (LACS), which uses hardware performance instead of FLOPs as the performance objective to search for model depth, width and resolutions, we ensure parallelism at all levels for the entire model family on the Pareto-front.

EfficientNet-X: Platform-Aware NAS-Optimized Computer Vision Models for TPUs and GPUs
Using this approach to platform-aware NAS, we have designed EfficientNet-X, an optimized computer vision model family for TPUs and GPUs. This family builds upon the EfficientNet architecture, which itself was originally designed by traditional multi-objective NAS without true hardware awareness and serves as our baseline. The resulting EfficientNet-X model family achieves an average speedup of ~1.5x and ~2x over EfficientNet on TPUv3 and GPUv100, respectively, with comparable accuracy.

In addition to the improved speeds, EfficientNet-X has shed light on the non-proportionality between FLOPs and true performance. Many think FLOPs are a good ML performance proxy (i.e., FLOPs and performance are proportional), but they are not. While FLOPs are a good performance proxy for simple hardware such as scalar machines, they can exhibit a margin of error of up to 400% on advanced matrix/tensor machines. For example, because of its hardware-friendly model architecture, EfficientNet-X requires ~2x more FLOPs than EfficientNet, but is ~2x faster on TPUs and GPUs.

EfficientNet-X family achieves 1.5x–2x speedup on average over the state-of-the-art EfficientNet family, with comparable accuracy on TPUv3 and GPUv100.

Self-Driving ML Model Performance on New Accelerator Hardware Platforms
Platform-aware NAS exposes the inner workings of the hardware and leverages these properties when designing hardware-optimized ML models. In a sense, the “platform-awareness” of the model is a “gene” that preserves knowledge of how to optimize performance for a hardware family, even on new generations, without the need to redesign the models. For example, TPUv4i delivers up to 3x higher peak performance (FLOPS) than its predecessor TPUv2, but EfficientNet performance only improves by 30% when migrating from TPUv2 to TPUv4i. In comparison, EfficientNet-X retains its platform-aware properties even on new hardware and achieves a 2.6x speedup when migrating from TPUv2 to TPUv4i, utilizing almost all of the 3x peak performance gain expected when upgrading between the two generations.

Hardware peak performance ratio of TPUv2 to TPUv4i and the geometric mean speedup of EfficientNet-X and EfficientNet families, respectively, when migrating from TPUv2 to TPUv4i.

Conclusion and Future Work
We demonstrate how to improve the capabilities of platform-aware NAS for datacenter ML accelerators, especially TPUs and GPUs. Both platform-aware NAS and the EfficientNet-X model family have been deployed in production and materialize up to ~40% efficiency gains and significant quality improvements for various internal computer vision projects across Google. Additionally, because of its deep understanding of accelerator hardware architecture, platform-aware NAS was able to identify critical performance bottlenecks on TPUv2-v4i architectures and has enabled design enhancements to future TPUs with significant potential performance uplift. As next steps, we are working on expanding platform-aware NAS’s capabilities to the ML hardware and model design beyond computer vision.

Acknowledgements
Special thanks to our co-authors: Mingxing Tan, Ruoming Pang, Andrew Li, Liqun Cheng, Quoc Le. We also thank many collaborators including Jeff Dean, David Patterson, Shengqi Zhu, Yun Ni, Gang Wu, Tao Chen, Xin Li, Yuan Qi, Amit Sabne, Shahab Kamali, and many others from the broad Google research and engineering teams who helped on the research and the subsequent broad production deployment of platform-aware NAS.

Read More

Implement MLOps using AWS pre-trained AI Services with AWS Organizations

The AWS Machine Learning Operations (MLOps) framework is an iterative and repetitive process for evolving AI models over time. Like DevOps, practitioners gain efficiencies promoting their artifacts through various environments (such as quality assurance, integration, and production) for quality control. In parallel, customers rapidly adopt multi-account strategies through AWS Organizations and AWS Control Tower to create secure, isolated environments. This combination can introduce challenges for implementing MLOps with AWS pre-trained AI Services, such as Amazon Rekognition Custom Labels. This post discusses design patterns for reducing that complexity while still maintaining security best practices.

Overview

Customers across every industry vertical recognize the value of operationalizing machine learning (ML) efficiently and reducing the time to deliver business value. Most AWS pre-trained AI Services address this situation through out-of-the-box capabilities for computer vision, translation, and fraud detection, among other common use cases. Many use cases require domain-specific predictions that go beyond generic answers. The AI Services can fine-tune the predictive model results using customer-labeled data for those scenarios.

Over time, the domain-specific vocabulary changes and evolves. For example, suppose a tool manufacturer creates a computer vision model to detect its products in images (such as hammers and screwdrivers). In a future release, the business adds support for wrenches and saws. These new labels necessitate code changes on the manufacturer’s websites and custom applications. Now, there are dependencies that both artifacts must release simultaneously.

The AWS MLOps framework addresses these release challenges through iterative and repetitive processes. Before reaching production end-users, the model artifacts must traverse various quality gates like application code. You typically implement those quality gates using multiple AWS accounts within an AWS organization. This approach gives the flexibility to centrally manage these application domains and enforce guardrails and business requirements. It’s becoming increasingly common to have tens or even hundreds of accounts within your organization. However, you must balance your workload isolation needs against the team size and complexity.

MLOps practitioners have standard procedures for promoting artifacts between accounts (such as QA to production). These patterns are straightforward to implement, relying on copying code and binary resources between Amazon Simple Storage Service (Amazon S3) buckets. However, AWS pre-trained AI Services don’t currently support copying the trained custom model across AWS accounts. Until such a mechanism exists, you need to retrain the models in each AWS account using the same dataset. This approach involves time and cost for retraining the model in a new account. This mechanism can be a viable option for some customers. However in this post, we demonstrate the means to define and evolve these custom models centrally while securely sharing them across an AWS organization’s accounts.

Solution overview

This post discusses design patterns for securely sharing AWS pre-trained AI Service domain-specific models. These services include Amazon Fraud Detector, Amazon Transcribe, and Amazon Rekognition, to name a few. Although these strategies are broadly applicable, we focus on Rekognition Custom Labels as a concrete example. We intentionally avoid diving too deep into Rekognition Custom Labels-specific nuances.

The architecture begins with a configured AWS Control Tower in the management account. AWS Control Tower provides the easiest way to set up and govern a secure, multi-account AWS environment. As shown in the following diagram, we use Account Factory in AWS Control Tower to create five AWS accounts:

  • CI/CD account for deployment orchestration (for example, with AWS CodeStar)
  • Production account for external end-users (for example, a public website)
  • Quality assurance account for internal development teams (such as preproduction)
  • ML account for custom models and supporting systems
  • AWS Lake House account holding proprietary customer data

This configuration might be too granular or coarse, depending on your regulatory requirements, industry, and size. Refer to Managing the multi-account environment using AWS Organizations and AWS Control Tower for more guidance.

Example AWS Organizations account configuration

Create the Rekognition Custom Labels model

The first step to creating a Rekognition Custom Labels model is choosing the AWS account to host it. You might begin your ML journey using a single ML account. This approach consolidates any tooling and procedures into one place. However, this centralization can cause bloat in the individual account and lead to monolithic environments. More mature enterprises segment this ML account by team or workload. Regardless of the granularity, the objective is the same: define models centrally and train them once.

This post demonstrates using a Rekognition Custom Labels model with a single ML account and a separate data lake account (see the following diagram). When the data resides in a different account, you must configure a resource policy to provide cross-account access to the S3 bucket objects. This procedure securely shares the bucket’s contents with the ML account. See the quick start samples for more information on creating an Amazon Rekognition domain-specific model.

Sharing data across accounts

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AWSRekognitionS3AclBucketRead20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": [
                "s3:GetBucketAcl",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::S3:"
        },
        {
            "Sid": "AWSRekognitionS3GetBucket20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": [
                "s3:GetObject",
                "s3:GetObjectAcl",
                "s3:GetObjectVersion",
                "s3:GetObjectTagging"
            ],
            "Resource": "arn:aws:s3:::S3:/*"
        },
        {
            "Sid": "AWSRekognitionS3ACLBucketWrite20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": "s3:GetBucketAcl",
            "Resource": "arn:aws:s3:::S3:"
        },
        {
            "Sid": "AWSRekognitionS3PutObject20191011",
            "Effect": "Allow",
            "Principal": {
                "Service": "rekognition.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::S3:/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control"
                }
            }
        }
    ]
}

Enable cross-account access

After you build and deploy the model, the endpoint is only available within the ML account. Do not use a static key to share access. You must delegate access to the production (or QA) account using AWS Identity and Access Management (IAM) roles. To create a cross-account role in the ML account, complete the following steps:

  1. On the Rekognition Custom Labels console, choose Projects and choose your project name.
  2. Choose Models and your model name.
  3. On the Use model tab, scroll down to the Use your model section.
  4. Copy the model Amazon Resource Name (ARN). It should be formatted as follows: arn:aws:rekognition:region-name:account-id:project/model-name/version/version-id/timestamp.
  5. Create a role with rekognition:DetectCustomLabels permissions to the model ARN and a trust policy allowing sts:AssumeRole from the production (or QA) account (for example, arn:aws:iam::PROD_ACCOUNT_ID_HERE:root); a sketch of both policy documents follows this list.
  6. Optionally, attach additional policies for any workload-specific actions (such as accessing S3 buckets).
  7. Optionally, configure the condition element to enforce additional delegation requirements.
  8. Record the new role’s ARN to use in the next section.
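
A minimal sketch of the two policy documents for step 5 is shown below; the account IDs, Region, project name, and version identifier are placeholders, not values from this post. The trust policy lets principals in the production (or QA) account assume the role, and the permissions policy limits the role to inference on a specific model version.

Trust policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::PROD_ACCOUNT_ID_HERE:root"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

Permissions policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "rekognition:DetectCustomLabels",
            "Resource": "arn:aws:rekognition:REGION:ML_ACCOUNT_ID_HERE:project/PROJECT_NAME/version/VERSION_NAME/TIMESTAMP"
        }
    ]
}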

Invoke the endpoint

With the security policies in place, it’s time to test the configuration. A simple approach involves creating an Amazon Elastic Compute Cloud (Amazon EC2) instance and using the AWS Command Line Interface (AWS CLI). Invoke the endpoint with the following steps:

  1. In the production (or QA) account, create a role for Amazon EC2.
  2. Attach a policy allowing sts:AssumeRole to the ML account’s cross-role ARN.
  3. Launch an Amazon Linux 2 instance with the role from the previous step.
  4. Wait for it to provision, then connect to the Linux instance using SSH.
  5. Invoke the command aws sts assume-role to switch to the cross-account role from the previous section.
  6. Start the model endpoint, if not already running, using the Rekognition console or the start-project-version AWS CLI command.
  7. Invoke the command aws rekognition detect-custom-labels to test the operation.

You can also perform this test using the AWS SDK and another compute resource (for example, AWS Lambda).
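
For reference, a minimal sketch of the same test using the AWS SDK for Python (Boto3) might look like the following. The role ARN, model version ARN, bucket, and object key are placeholders you would replace with your own values.

import boto3

# Placeholder values -- replace with your own role ARN, model version ARN, bucket, and key.
CROSS_ACCOUNT_ROLE_ARN = "arn:aws:iam::ML_ACCOUNT_ID_HERE:role/CrossAccountRekognitionRole"
MODEL_VERSION_ARN = "arn:aws:rekognition:REGION:ML_ACCOUNT_ID_HERE:project/PROJECT_NAME/version/VERSION_NAME/TIMESTAMP"

# Assume the cross-account role in the ML account to get temporary credentials.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=CROSS_ACCOUNT_ROLE_ARN,
    RoleSessionName="detect-custom-labels-test",
)["Credentials"]

# Call the shared Rekognition Custom Labels endpoint with those credentials.
rekognition = boto3.client(
    "rekognition",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
response = rekognition.detect_custom_labels(
    ProjectVersionArn=MODEL_VERSION_ARN,
    Image={"S3Object": {"Bucket": "BUCKET_NAME_HERE", "Name": "images/sample.jpg"}},
    MinConfidence=80,
)
for label in response["CustomLabels"]:
    print(label["Name"], label["Confidence"])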

Avoiding the public internet

In the previous section, the detect-custom-labels request uses the virtual private cloud’s (VPC) internet gateway and traverses the public internet. TLS/SSL encryption sufficiently secures the communication channel for many workloads. You can use AWS PrivateLink to enable connections between the VPC and supporting services without requiring an internet gateway, NAT device, VPN connection, transit gateway, or AWS Direct Connect connection. Then, the detect-custom-labels request never leaves the AWS network and is not exposed to the public internet. AWS PrivateLink supports all services used within this post. You can also use IAM conditions in the cross-account role policy to enforce that the pre-trained AI Services are accessed only over private connectivity. This control adds another level of protection that prevents misconfigured clients from using the pre-trained AI Service’s internet-facing endpoint. For additional information, see Using Amazon Rekognition with Amazon VPC endpoints, AWS PrivateLink for Amazon S3, and Using AWS STS interface VPC endpoints.

The following diagram illustrates the VPC endpoint configuration between the production account, ML account, and QA account.

Using VPC Endpoints across accounts

Build a CI/CD pipeline for promoting models

AWS recommends continuously adding more training and test data to the Amazon Rekognition Custom Labels project dataset to improve models. After you add more data to a project, training a new model version can enhance accuracy or alter labels.

In MLOps, model artifacts must be consistent. To accomplish this with pre-trained AI Services, AWS recommends promoting the model endpoint by updating the code’s reference to the new model version’s ARN. This approach avoids retraining the domain-specific model in each environment (such as QA and production accounts). Your applications can use the new model’s ARN as a runtime variable using AWS Systems Manager within a multi-account or multi-stage environment.

Access to the cross-account model can be limited at three granularity levels: the account, the project, and the model version. Model versions are immutable and have a unique ARN that maps to a specific point-in-time training run: arn:aws:rekognition:region:account-id:project/project_name/version/version_name/timestamp.

The following diagram illustrates the model rotation from QA to production.

Promoting the model version

In the preceding architecture, the production and QA applications make API calls to use the v2 or v3 model endpoints through their respective VPC endpoints. Each application receives the ARN from its configuration store (for example, AWS Systems Manager Parameter Store or AWS AppConfig). This process works with any number of environments, but we demonstrate only two accounts for simplicity. Optionally, removing the superseded model versions prevents additional consumption of those resources.
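
As a small illustration, an application might resolve the active model version at runtime like this; the parameter name is a hypothetical example, and AWS AppConfig would work equally well.

import boto3

# Hypothetical parameter holding the currently promoted model version ARN.
ssm = boto3.client("ssm")
model_version_arn = ssm.get_parameter(
    Name="/ml/rekognition/custom-labels/model-version-arn"
)["Parameter"]["Value"]

# The application then passes model_version_arn as ProjectVersionArn when calling
# detect-custom-labels, so promoting a model only changes the parameter value.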

The ML account has an IAM role for each environment-specific account (such as the production account) that requires access. As part of each deployment, the CI/CD pipeline updates that role’s inline policy to allow access to the appropriate model version.

Consider the scenario of promoting Model-v2 from the QA account to the production account. This process requires the following steps:

  1. On the Rekognition Custom Labels console, transition the Model-v2 endpoint into a running state.
  2. Grant the IAM cross-account role in the ML account access to the new version of Model-v2. Note that the resource element supports wildcards in the ARN (see the example policy after this list).
  3. Send a test invocation to Model-v2 from the production application using the delegation role.
  4. Optionally, remove the cross-account role’s access to Model-v1.
  5. Optionally, repeat steps 2–3 for each additional AWS account.
  6. Optionally, stop the Model-v1 endpoint to avoid incurring costs.
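
A sketch of such an inline policy is shown below; the Region, account ID, and project name are placeholders. The wildcard on the version segment is what lets the pipeline roll the role forward without enumerating every timestamped version ARN.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowInferenceOnPromotedModelVersion",
            "Effect": "Allow",
            "Action": "rekognition:DetectCustomLabels",
            "Resource": "arn:aws:rekognition:REGION:ML_ACCOUNT_ID_HERE:project/PROJECT_NAME/version/Model-v2/*"
        }
    ]
}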

Global policy propagation from the IAM control plane to the IAM data plane in every Region is an eventually consistent operation. This design can create slight delays for multi-Regional configurations.

Create guardrails through service control policies

Using cross-account roles creates a secure mechanism for sharing pre-trained managed AI resources. But what happens when that role’s policy is too permissive? You can mitigate these risks by using service control policies (SCPs) to set permission guardrails across accounts. Guardrails specify the maximum permissions available for an IAM identity. These capabilities can prevent a model consumer account from, for example, stopping the shared Amazon Rekognition endpoint. After defining appropriate guardrail requirements, organizational units within Organizations allow centrally managing those policies across multiple accounts.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyModifyingRekognitionProjects",
      "Effect": "Deny",
      "Action": [
        "rekognition:CreateProject*",
        "rekognition:DeleteProject*",
        "rekognition:StartProject*",
        "rekognition:StopProject*"
      ],
      "Resource": [
        "arn:aws:rekognition:*:*:project/*"
      ]
    }
  ]
}

You can also configure detective controls to monitor resource configurations and make sure they don’t drift out of compliance. AWS IAM Access Analyzer supports assessing policies across the organization and reporting unused permissions. Additionally, AWS Config enables assessing, auditing, and evaluating configurations of AWS resources. This capability supports standard security and compliance requirements, such as verifying and remediating the S3 bucket’s encryption settings.

Conclusion

You need out-of-the-box solutions to add ML capabilities like computer vision, translation, and fraud detection. You also need security boundaries that isolate your different environments for quality control, compliance, and regulatory purposes. AWS pre-trained AI services and AWS Control Tower deliver that functionality in a manner that is easily accessible and secure.

AWS pre-trained AI services don’t currently support copying the trained custom model across AWS accounts. Until such a mechanism exists, you need to retrain the models in each AWS account using the same dataset. This post demonstrates an alternative design approach using IAM cross-account policies to share model endpoints while maintaining robust security control. Furthermore, you can stop paying for redundant training jobs! For more information on cross-account policies, see IAM tutorial: Delegate access across AWS accounts using IAM roles.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect that nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.

Mario Bourgoin is a Senior Partner Solutions Architect for AWS, an AI/ML specialist, and the global tech lead for MLOps.  He works with enterprise customers and partners deploying AI solutions in the cloud.  He has more than 30 years experience doing machine learning and AI at startups and in enterprises, starting with creating one of the first commercial machine learning systems for big data.  Mario spends the balance of his time playing with his three Belgian Tervurens, cooking dinners for his family, and learning about mathematics and cosmology.

Tim Murphy is a Senior Solutions Architect for AWS, working with enterprise customers in various industries to build business based solutions in the cloud. He has spent the last decade working with startups, non-profits, commercial enterprise, and government agencies, deploying infrastructure at scale. In his spare time when he isn’t tinkering with technology, you’ll most likely find him in far flung areas of the earth hiking mountains, surfing waves, or biking through a new city.

Read More

Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience

Eating into open hours and menus, a labor shortage has gobbled up fast-food services employees, but some restaurants are trying out a new staff member to bring back the drive-thru good times: AI.

Toronto startup HuEx is in pilot tests with a conversational AI assistant for drive-thrus to help support service at several popular Canadian chains.

Chronically understaffed, food services jobs have among the highest rate of employee departures, according to the U.S. Bureau of Labor Statistics.

HuEx’s voice service — dubbed AiDA — is helping behind the drive-up window at popular fast-service chains across North America.

AiDA handles order requests from customers at the drive-thru speaker box. Driven by HuEx’s proprietary models running on the NVIDIA Jetson edge AI platform, AiDA transcribes the voice orders to text for staff members to see and serve. And it can reply with voice in response.

It can understand 300,000-plus product combinations. “Things like ‘coffee with milk, coffee with sugar’ are common, but some people even order coffee with butter — it can handle that, too,” said Anik Seth, founder and CEO of HuEx.

The company is a member of NVIDIA Inception, a program that offers go-to-market support, expertise and technology for AI, data science and HPC startups.

All in the Family

Seth is intimately familiar with fast-service restaurants. He is part of a family business operating multiple quick-service restaurant locations.

He noticed a common problem: team members and guests struggling during drive-thru interactions, something he aims to address.

“AiDA’s voice recognition technology is easily handled by the NVIDIA Jetson for real-time interactions, which helps smooth the ordering process,” he said.

Talk AI to Me

The technology, integrated with the existing drive-thru headset system, allows team members to hear the orders and jump in to assist if needed.

AiDA, first deployed in 2018, has been used in “thousands of transactions” in implementations in Canada, said Seth.

The system promises to help improve service time by taking on the drive-thru while other team members focus on fulfilling orders. Its natural language processing system is capable of 90 percent accuracy when taking orders, he said.

As new menu items, specials and promotions are introduced, the database is updated constantly to answer questions about them.

“The team is always in the know,” Seth said. “The moment you order a coffee, the AI is taking the order, while simultaneously, there’s a team member fulfilling it.”

Image credit: Robert Penaloza via Unsplash.

The post Burgers, Fries and a Side of AI: Startup Offers Taste of Drive-Thru Convenience appeared first on The Official NVIDIA Blog.

Read More

Ask a Techspert: What does AI do when it doesn’t know?

As humans, we constantly learn from the world around us. We experience inputs that shape our knowledge — including the boundaries of both what we know and what we don’t know.

Many of today’s machines also learn by example. However, these machines are typically trained on datasets and information that doesn’t always include rare or out-of-the-ordinary examples that inevitably come up in real-life scenarios. What is an algorithm to do when faced with the unknown?

I recently spoke with Abhijit Guha Roy, an engineer on the Health AI team, and Ian Kivlichan, an engineer on the Jigsaw team, to hear more about using AI in real-world scenarios and better understand the importance of training it to know when it doesn’t know.

Abhijit, tell me about your recent research in the dermatology space.

We’re applying deep learning to a number of areas in health, including in medical imaging where it can be used to aid in the identification of health conditions and diseases that might require treatment. In the dermatological field, we have shown that AI can be used to help identify possible skin issues and are in the process of advancing research and products, including DermAssist, that can support both clinicians and people like you and me.

In these real-world settings, the algorithm might come up against something it’s never seen before. Rare conditions, while individually infrequent, might not be so rare in aggregate. These so-called “out-of-distribution” examples are a common problem for AI systems, which can perform less well when exposed to things they haven’t seen before in training.

Can you explain what “out-of-distribution” means for AI?

Most traditional machine learning examples that are used to train AI deal with fairly unsubtle — or obvious — changes. For example, if an algorithm that is trained to identify cats and dogs comes across a car, then it can typically detect that the car — which is an “out-of-distribution” example — is an outlier. Building an AI system that can recognize the presence of something it hasn’t seen before in training is called “out-of-distribution detection,” and is an active and promising field of AI research.

Okay, let’s go back to how this applies to AI in medical settings.

Going back to our research in the dermatology space, the differences between skin conditions can be much more subtle than recognizing a car from a dog or a cat, even more subtle than recognizing a previously unseen “pick-up truck” from a “truck”. As such, the out-of-distribution detection task in medical AI demands even more of our focused attention.

This is where our latest research comes in. We trained our algorithm to recognize even the most subtle of outliers (a so-called “near-out of distribution” detection task). Then, instead of the model inaccurately guessing, it can take a safer course of action — like deferring to human experts.

Ian, you’re working on another area where AI needs to know when it doesn’t know something. What’s that?

The field of content moderation. Our team at Jigsaw used AI to build a free tool called Perspective that scores comments according to how likely they are to be considered toxic by readers. Our AI algorithms help identify toxic language and online harassment at scale so that human content moderators can make better decisions for their online communities. A range of online platforms use Perspective more than 600 million times a day to reduce toxicity and the human time required to moderate content.

In the real world, online conversations — both the things people say and even the ways they say them — are continually changing. For example, two years ago, nobody would have understood the phrase “non-fungible token (NFT).” Our language is always evolving, which means a tool like Perspective doesn’t just need to identify potentially toxic or harassing comments, it also needs to “know when it doesn’t know,” and then defer to human moderators when it encounters comments very different from anything it has encountered before.

In our recent research, we trained Perspective to identify comments it was uncertain about and flag them for separate human review. By prioritizing these comments, human moderators can correct more than 80% of the mistakes the AI might otherwise have made.

What connects these two examples?

We have more in common with the dermatology problem than you’d expect at first glance — even though the problems we try to solve are so different.

Building AI that knows when it doesn’t know something means you can prevent certain errors that might have unintended consequences. In both cases, the safest course of action for the algorithm entails deferring to human experts rather than trying to make a decision that could lead to potentially negative effects downstream.

There are some fields where this isn’t as important and others where it’s critical. You might not care if an automated vegetable sorter incorrectly sorts a purple carrot after being trained on orange carrots, but you would definitely care if an algorithm didn’t know what to do about an abnormal shadow on an X-ray that a doctor might recognize as an unexpected cancer.

How is AI uncertainty related to AI safety?

Most of us are familiar with safety protocols in the workplace. In safety-critical industries like aviation or medicine, protocols like “safety checklists” are routine and very important in order to prevent harm to both the workers and the people they serve.

It’s important that we also think about safety protocols when it comes to machines and algorithms, especially when they are integrated into our daily workflow and aid in decision-making or triaging that can have a downstream impact.

Teaching algorithms to refrain from guessing in unfamiliar scenarios and to ask for help from human experts falls within these protocols, and is one of the ways we can reduce harm and build trust in our systems. This is something Google is committed to, as outlined in its AI Principles.

Read More

Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop

Editor’s note: This is one in a series of Meet the Omnivore posts, featuring individual creators and developers who use the NVIDIA Omniverse 3D simulation and collaboration platform to boost their artistic or engineering processes.

Don’t be fooled by the candy canes, hot cocoa and CEO’s jolly demeanor.

Santa’s workshop is the very model of a 21st-century enterprise: pioneering mass customization and perfecting a worldwide distribution system able to meet almost bottomless global demand.

Michael Wagner

So it makes sense that Michael Wagner, CTO of ipolog, a digital twin software company for assembly and logistics planning, would make a virtual representation, or digital twin, of Santa’s workshop.

Digital twins like Wagner’s “santa-factory” can be used “to map optimal employee paths around a facility, simulate processes like material flow, as well as detect bottlenecks before they occur,” he said.

Wagner built an NVIDIA Omniverse Extension — a tool to use in conjunction with Omniverse apps — for what he calls the science of santa-facturing.

A rendering of the assembly room in Santa’s workshop, created with NVIDIA Omniverse.

Creating the ‘Santa-Facturing’ Extension

To deck the halls of the santa-factory, Wagner needed a virtual environment where he could depict the North Pole, Santa himself, hundreds of elves and millions of toy parts. Omniverse provided the tools to create such a highly detailed environment.

“Omniverse is the only platform that’s able to visualize such a vast amount of components in high fidelity and make the simulation physically accurate,” Wagner said. “My work is a proof of concept — if Omniverse is fit to visualize Santa’s factory, it’s fit to visualize the daily material provisioning load for a real-world automotive factory, for example, which has a similar order of complexity.”

Ipolog recently provided BMW with highly detailed elements like racks and boxes for a digital twin of the automaker’s factory.

With the help of ipolog software and other tools, BMW is creating a digital twin-based factory of the future with NVIDIA Omniverse, which enables the automaker to simulate complex production scenarios taking place in more than 6 million square meters of factory space.

Digital twin simulation speeds output and increases efficiency for BMW’s entire production cycle — from the examination of engineering detail for vehicle parts to the optimization of workflow at the factory-plant level.

Wagner used Omniverse Kit, a toolkit for building Omniverse-native extensions and applications, to create the santa-facturing environment.

The developer is also exploring Omniverse Code — a recently launched app that serves as an integrated development environment for developers to easily build Omniverse extensions, apps or microservices.

“The principle of building on the shoulders of giants is in the DNA of the Omniverse ecosystem and the kit-based environment,” Wagner said. “Existing open-source extensions, which any developer can contribute to, provide a good base from which to start off and quickly create a dedicated app or extension for digital twins.”

Visualizing the ‘Santa-Factory’

Using Omniverse, which includes PhysX — a software development kit that provides advanced physics simulation — Wagner transformed 2D illustrations of the santa-factory into a physically accurate 3D scene. The process was simple, he said. He “piled up a lot of elements and let PhysX work its magic.”

A 2D representation of Santa’s workshop turned into a 3D rendering using Omniverse.

To create the glacial North Pole environment, Wagner used the Unreal Engine 4 Omniverse Connector. To bring the trusty elves to life, he brought in animations from Blender. And to convert the huge datasets to Universal Scene Description format, Wagner worked with Germany-based 3D software development company NetAllied Systems.

A rendering of elves tending to reindeer near Santa’s workshop, created with NVIDIA Omniverse.

What better example of material supply and flow in manufacturing than millions of toy parts getting delivered to Santa’s workshop? Watch Wagner’s stunning demo of this, created in Omniverse:

Such use of digital twin simulations, Wagner said, allows manufacturers to visualize and plan their most efficient workflow, often reducing the time it takes to complete a manufacturing project by 30 percent.

Looking forward, Wagner and his team at ipolog plan to create a full suite of apps, extensions and backend services to enable a manufacturing virtual world entirely based on Omniverse.

Learn more about the santa-facturing project and how Wagner uses Omniverse Kit.

Attend Wagner’s session on digital twins for manufacturing at GTC, which will take place March 21-24.

Creators and developers can download NVIDIA Omniverse for free and get started with step-by-step tutorials on the Omniverse YouTube channel. Follow Omniverse on Instagram, Twitter and Medium for additional resources and inspiration. Check out the Omniverse forums and join our Discord Server to chat with the community.

The post Meet the Omnivore: Developer Sleighs Complex Manufacturing Workflows With Digital Twin of Santa’s Workshop appeared first on The Official NVIDIA Blog.

Read More

Practical Quantization in PyTorch

Quantization is a cheap and easy way to make your DNN run faster and with lower memory requirements. PyTorch offers a few different approaches to quantize your model. In this blog post, we’ll lay a (quick) foundation of quantization in deep learning, and then take a look at how each technique looks in practice. Finally, we’ll end with recommendations from the literature for using quantization in your workflows.



Fig 1. PyTorch <3 Quantization

Fundamentals of Quantization

If someone asks you what time it is, you don’t respond “10:14:34:430705”, but you might say “a quarter past 10”.

Quantization has roots in information compression; in deep networks it refers to reducing the numerical precision of its weights and/or activations.

Overparameterized DNNs have more degrees of freedom and this makes them good candidates for information compression [1]. When you quantize a model, two things generally happen – the model gets smaller and runs with better efficiency. Hardware vendors explicitly allow for faster processing of 8-bit data (than 32-bit data) resulting in higher throughput. A smaller model has lower memory footprint and power consumption [2], crucial for deployment at the edge.

Mapping function

The mapping function is what you might guess – a function that maps values from floating-point to integer space. A commonly used mapping function is a linear transformation given by Q(r) = round(r/S + Z), where r is the input and S, Z are quantization parameters.

To reconvert to floating point space, the inverse function is given by r̃ = (Q(r) - Z) * S.

The reconstructed value r̃ only approximates the original input r, and their difference constitutes the quantization error.

Quantization Parameters

The mapping function is parameterized by the scaling factor S and the zero-point Z.

S is simply the ratio of the input range to the output range:

S = (β - α) / (β_q - α_q)

where [α, β] is the clipping range of the input, i.e. the boundaries of permissible inputs, and [α_q, β_q] is the range in quantized output space that it is mapped to. For 8-bit quantization, the output range spans 2^8 = 256 values (for example, [0, 255] for unsigned integers).

Z acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space: Z = round(α_q - α / S).
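
As a minimal sketch of these formulas (with a made-up clipping range rather than observed statistics), quantizing a tensor to unsigned 8-bit integers might look like:

import torch

# Assumed clipping range and unsigned 8-bit output range (illustrative values).
alpha, beta = -1.0, 1.0        # [alpha, beta]: input clipping range
alpha_q, beta_q = 0, 255       # [alpha_q, beta_q]: quantized output range

S = (beta - alpha) / (beta_q - alpha_q)    # scale
Z = round(alpha_q - alpha / S)             # zero-point

x = torch.tensor([-1.0, -0.5, 0.0, 0.5, 1.0])
x_q = torch.clamp(torch.round(x / S + Z), alpha_q, beta_q)   # quantize: integers in [0, 255]
x_hat = (x_q - Z) * S                                        # dequantize
print(x_q, x_hat - x)                                        # quantized values and quantization error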

Calibration

The process of choosing the input clipping range is known as calibration. The simplest technique (also the default in PyTorch) is to record the running minimum and maximum values and assign them to α and β. TensorRT also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.

In PyTorch, Observer modules (docs, code) collect statistics on the input values and calculate the qparams S and Z. Different calibration schemes result in different quantized outputs, and it’s best to empirically verify which scheme works best for your application and architecture (more on that later).

import torch
from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver
C, L = 3, 4
normal = torch.distributions.normal.Normal(0,1)
inputs = [normal.sample((C, L)), normal.sample((C, L))]
print(inputs)

# >>>>>
# [tensor([[-0.0590,  1.1674,  0.7119, -1.1270],
#          [-1.3974,  0.5077, -0.5601,  0.0683],
#          [-0.0929,  0.9473,  0.7159, -0.4574]]),

# tensor([[-0.0236, -0.7599,  1.0290,  0.8914],
#          [-1.1727, -1.2556, -0.2271,  0.9568],
#          [-0.2500,  1.4579,  1.4707,  0.4043]])]

observers = [MinMaxObserver(), MovingAverageMinMaxObserver(), HistogramObserver()]
for obs in observers:
  for x in inputs: obs(x) 
  print(obs.__class__.__name__, obs.calculate_qparams())

# >>>>>
# MinMaxObserver (tensor([0.0112]), tensor([124], dtype=torch.int32))
# MovingAverageMinMaxObserver (tensor([0.0101]), tensor([139], dtype=torch.int32))
# HistogramObserver (tensor([0.0100]), tensor([106], dtype=torch.int32))

Affine and Symmetric Quantization Schemes

Affine or asymmetric quantization schemes assign the input range to the min and max observed values. Affine schemes generally offer tighter clipping ranges and are useful for quantizing non-negative activations (you don’t need the input range to contain negative values if your input tensors are never negative). The range is calculated as α = min(r), β = max(r). Affine quantization leads to more computationally expensive inference when used for weight tensors [3].

Symmetric quantization schemes center the input range around 0, eliminating the need to calculate a zero-point offset. The range is calculated as -α = β = max(|max(r)|, |min(r)|). For skewed signals (like non-negative activations) this can result in bad quantization resolution because the clipping range includes values that never show up in the input (see the pyplot below).

import torch
import numpy as np
import matplotlib.pyplot as plt

act = torch.distributions.pareto.Pareto(1, 10).sample((1, 1024)).flatten()
weights = torch.distributions.normal.Normal(0, 0.12).sample((3, 64, 7, 7)).flatten()

def get_symmetric_range(x):
  beta = torch.max(x.max(), x.min().abs())
  return -beta.item(), beta.item()

def get_affine_range(x):
  return x.min().item(), x.max().item()

def plot(ax, data, scheme):
  # draw the histogram of the data and mark the clipping range for the chosen scheme
  boundaries = get_affine_range(data) if scheme == 'affine' else get_symmetric_range(data)
  a, _, _ = ax.hist(data, density=True, bins=100)
  ymin, ymax = np.quantile(a[a > 0], [0.25, 0.95])
  ax.vlines(x=boundaries, ls='--', colors='purple', ymin=ymin, ymax=ymax)

fig, axs = plt.subplots(2,2)
plot(axs[0, 0], act, 'affine')
axs[0, 0].set_title("Activation, Affine-Quantized")

plot(axs[0, 1], act, 'symmetric')
axs[0, 1].set_title("Activation, Symmetric-Quantized")

plot(axs[1, 0], weights, 'affine')
axs[1, 0].set_title("Weights, Affine-Quantized")

plot(axs[1, 1], weights, 'symmetric')
axs[1, 1].set_title("Weights, Symmetric-Quantized")
plt.show()



Fig 2. Clipping ranges (in purple) for affine and symmetric schemes

In PyTorch, you can specify affine or symmetric schemes while initializing the Observer. Note that not all observers support both schemes.

for qscheme in [torch.per_tensor_affine, torch.per_tensor_symmetric]:
  obs = MovingAverageMinMaxObserver(qscheme=qscheme)
  for x in inputs: obs(x)
  print(f"Qscheme: {qscheme} | {obs.calculate_qparams()}")

# >>>>>
# Qscheme: torch.per_tensor_affine | (tensor([0.0101]), tensor([139], dtype=torch.int32))
# Qscheme: torch.per_tensor_symmetric | (tensor([0.0109]), tensor([128]))

Per-Tensor and Per-Channel Quantization Schemes

Quantization parameters can be calculated for the layer’s entire weight tensor as a whole, or separately for each channel. In per-tensor quantization, the same clipping range is applied to all the channels in a layer.



Fig 3. Per-Channel uses one set of qparams for each channel. Per-tensor uses the same qparams for the entire tensor.

For weight quantization, symmetric-per-channel quantization provides better accuracies; per-tensor quantization performs poorly, possibly due to high variance in conv weights across channels from batchnorm folding [3].

from torch.quantization.observer import MovingAveragePerChannelMinMaxObserver
obs = MovingAveragePerChannelMinMaxObserver(ch_axis=0)  # calculate qparams for all `C` channels separately
for x in inputs: obs(x)
print(obs.calculate_qparams())

# >>>>>
# (tensor([0.0090, 0.0075, 0.0055]), tensor([125, 187,  82], dtype=torch.int32))

Backend Engine

Currently, quantized operators run on x86 machines via the FBGEMM backend, or use QNNPACK primitives on ARM machines. Backend support for server GPUs (via TensorRT and cuDNN) is coming soon. Learn more about extending quantization to custom backends: RFC-0019.

x86 = True  # assume an x86 CPU; set to False when running on ARM
backend = 'fbgemm' if x86 else 'qnnpack'
qconfig = torch.quantization.get_default_qconfig(backend)  
torch.backends.quantized.engine = backend

QConfig

The QConfig (code, docs) NamedTuple stores the Observers and the quantization schemes used to quantize activations and weights.

Be sure to pass the Observer class (not the instance), or a callable that can return Observer instances. Use with_args() to override the default arguments.

my_qconfig = torch.quantization.QConfig(
  activation=MovingAverageMinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
  weight=MovingAveragePerChannelMinMaxObserver.with_args(dtype=torch.qint8)
)
# >>>>>
# QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, qscheme=torch.per_tensor_affine){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MovingAveragePerChannelMinMaxObserver'>, dtype=torch.qint8){})

In PyTorch

PyTorch allows you a few different ways to quantize your model depending on

  • if you prefer a flexible but manual, or a restricted automagic process (Eager Mode v/s FX Graph Mode)
  • if qparams for quantizing activations (layer outputs) are precomputed for all inputs, or calculated afresh with each input (static v/s dynamic),
  • if qparams are computed with or without retraining (quantization-aware training v/s post-training quantization)

FX Graph Mode automatically fuses eligible modules, inserts Quant/DeQuant stubs, calibrates the model and returns a quantized module – all in two method calls – but only for networks that are symbolically traceable. The examples below contain the calls using Eager Mode and FX Graph Mode for comparison.

In DNNs, eligible candidates for quantization are the FP32 weights (layer parameters) and activations (layer outputs). Quantizing weights reduces the model size. Quantized activations typically result in faster inference.

As an example, the 50-layer ResNet network has ~26 million weight parameters and computes ~16 million activations in the forward pass.
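
As a back-of-the-envelope sketch (not a measurement), here is roughly what quantizing those weights from FP32 to INT8 means for storage:

n_weights = 26_000_000           # ~26M parameters in ResNet-50
fp32_mb = n_weights * 4 / 1e6    # 4 bytes per FP32 weight
int8_mb = n_weights * 1 / 1e6    # 1 byte per INT8 weight
print(f"FP32: ~{fp32_mb:.0f} MB, INT8: ~{int8_mb:.0f} MB (~4x smaller)")
# FP32: ~104 MB, INT8: ~26 MB (~4x smaller)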

Post-Training Dynamic/Weight-only Quantization

Here the model’s weights are pre-quantized; the activations are quantized on-the-fly (“dynamic”) during inference. The simplest of all approaches, it has a one line API call in torch.quantization.quantize_dynamic. Currently only Linear and Recurrent (LSTM, GRU, RNN) layers are supported for dynamic quantization.

(+) Can result in higher accuracies since the clipping range is exactly calibrated for each input [1].

(+) Dynamic quantization is preferred for models like LSTMs and Transformers where loading the model’s weights from memory dominates the bandwidth [4].

(-) Calibrating and quantizing the activations at each layer during runtime can add to the compute overhead.

import torch
from torch import nn

# toy model
m = nn.Sequential(
  nn.Conv2d(2, 64, 8),
  nn.ReLU(),
  nn.Linear(16,10),
  nn.LSTM(10, 10))

m.eval()

## EAGER MODE
from torch.quantization import quantize_dynamic
model_quantized = quantize_dynamic(
    model=m, qconfig_spec={nn.LSTM, nn.Linear}, dtype=torch.qint8, inplace=False
)

## FX MODE
from torch.quantization import quantize_fx
qconfig_dict = {"": torch.quantization.default_dynamic_qconfig}  # An empty key denotes the default applied to all modules
model_prepared = quantize_fx.prepare_fx(m, qconfig_dict)
model_quantized = quantize_fx.convert_fx(model_prepared)

Post-Training Static Quantization (PTQ)

PTQ also pre-quantizes model weights but instead of calibrating activations on-the-fly, the clipping range is pre-calibrated and fixed (“static”) using validation data. Activations stay in quantized precision between operations during inference. About 100 mini-batches of representative data are sufficient to calibrate the observers [2]. The examples below use random data in calibration for convenience – using that in your application will result in bad qparams.

Fig 4. Steps in Post-Training Static Quantization

Module fusion combines multiple sequential modules (eg: [Conv2d, BatchNorm, ReLU]) into one. Fusing modules means the compiler needs to only run one kernel instead of many; this speeds things up and improves accuracy by reducing quantization error.

(+) Static quantization has faster inference than dynamic quantization because it eliminates the float<->int conversion costs between layers.

(-) Static quantized models may need regular re-calibration to stay robust against distribution-drift.

# Static quantization of a model consists of the following steps:

#     Fuse modules
#     Insert Quant/DeQuant Stubs
#     Prepare the fused module (insert observers before and after layers)
#     Calibrate the prepared module (pass it representative data)
#     Convert the calibrated module (replace with quantized version)

import torch
from torch import nn

backend = "fbgemm"  # running on a x86 CPU. Use "qnnpack" if running on ARM.

m = nn.Sequential(
     nn.Conv2d(2,64,3),
     nn.ReLU(),
     nn.Conv2d(64, 128, 3),
     nn.ReLU()
)

## EAGER MODE
"""Fuse
- Inplace fusion replaces the first module in the sequence with the fused module, and the rest with identity modules
"""
torch.quantization.fuse_modules(m, ['0','1'], inplace=True) # fuse first Conv-ReLU pair
torch.quantization.fuse_modules(m, ['2','3'], inplace=True) # fuse second Conv-ReLU pair

"""Insert stubs"""
m = nn.Sequential(torch.quantization.QuantStub(), 
                  *m, 
                  torch.quantization.DeQuantStub())

"""Prepare"""
m.qconfig = torch.quantization.get_default_qconfig(backend)
torch.quantization.prepare(m, inplace=True)

"""Calibrate
- This example uses random data for convenience. Use representative (validation) data instead.
"""
with torch.inference_mode():
  for _ in range(10):
    x = torch.rand(1,2, 28, 28)
    m(x)
    
"""Convert"""
torch.quantization.convert(m, inplace=True)

"""Check"""
print(m[1].weight().element_size()) # 1 byte instead of 4 bytes for FP32


## FX GRAPH
from torch.quantization import quantize_fx
m.eval()  # in practice, start from a fresh FP32 copy of the model rather than the eager-quantized one above
qconfig_dict = {"": torch.quantization.get_default_qconfig(backend)}
# Prepare
model_prepared = quantize_fx.prepare_fx(m, qconfig_dict)
# Calibrate - Use representative (validation) data.
with torch.inference_mode():
  for _ in range(10):
    x = torch.rand(1,2,28, 28)
    model_prepared(x)
# quantize
model_quantized = quantize_fx.convert_fx(model_prepared)

Quantization-aware Training (QAT)

Fig 5. Steps in Quantization-Aware Training

The PTQ approach is great for large models, but accuracy suffers in smaller models [6]. This is of course due to the loss in numerical precision when adapting a model from FP32 to the INT8 realm (Figure 6(a)). QAT tackles this by including this quantization error in the training loss, thereby training an INT8-first model.

Fig 6. Comparison of PTQ and QAT convergence [3]

All weights and biases are stored in FP32, and backpropagation happens as usual. However in the forward pass, quantization is internally simulated via FakeQuantize modules. They are called fake because they quantize and immediately dequantize the data, adding quantization noise similar to what might be encountered during quantized inference. The final loss thus accounts for any expected quantization errors. Optimizing on this allows the model to identify a wider region in the loss function (Figure 6(b)), and identify FP32 parameters such that quantizing them to INT8 does not significantly affect accuracy.
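
To make the “quantize and immediately dequantize” idea concrete, here is a small sketch using torch.fake_quantize_per_tensor_affine with hand-picked qparams; in a real QAT run, FakeQuantize modules derive these from observers instead.

import torch

x = torch.randn(4)

# Hand-picked scale and zero-point for illustration only.
x_fq = torch.fake_quantize_per_tensor_affine(x, 0.05, 0, -128, 127)

print(x)          # original FP32 values
print(x_fq)       # still FP32, but snapped to the INT8 grid (quantization noise added)
print(x_fq - x)   # the simulated quantization error that the training loss sees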

Fig 7. Fake Quantization in the forward and backward pass

Image source: https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt

(+) QAT yields higher accuracies than PTQ.

(+) Qparams can be learned during model training for more fine-grained accuracy (see LearnableFakeQuantize)

(-) Computational cost of retraining a model in QAT can be several hundred epochs [1]

# QAT follows the same steps as PTQ, with the exception of the training loop before you actually convert the model to its quantized version

import torch
from torch import nn

backend = "fbgemm"  # running on a x86 CPU. Use "qnnpack" if running on ARM.

m = nn.Sequential(
     nn.Conv2d(2,64,8),
     nn.ReLU(),
     nn.Conv2d(64, 128, 8),
     nn.ReLU()
)

"""Fuse"""
torch.quantization.fuse_modules(m, ['0','1'], inplace=True) # fuse first Conv-ReLU pair
torch.quantization.fuse_modules(m, ['2','3'], inplace=True) # fuse second Conv-ReLU pair

"""Insert stubs"""
m = nn.Sequential(torch.quantization.QuantStub(), 
                  *m, 
                  torch.quantization.DeQuantStub())

"""Prepare"""
m.train()
m.qconfig = torch.quantization.get_default_qat_qconfig(backend)
torch.quantization.prepare_qat(m, inplace=True)

"""Training Loop"""
n_epochs = 10
opt = torch.optim.SGD(m.parameters(), lr=0.1)
loss_fn = lambda out, tgt: torch.pow(tgt-out, 2).mean()
for epoch in range(n_epochs):
  x = torch.rand(10,2,24,24)
  out = m(x)
  loss = loss_fn(out, torch.rand_like(out))
  opt.zero_grad()
  loss.backward()
  opt.step()

"""Convert"""
m.eval()
torch.quantization.convert(m, inplace=True)

Sensitivity Analysis

Not all layers respond to quantization equally; some are more sensitive to precision drops than others. Identifying the optimal combination of layers that minimizes accuracy drop is time-consuming, so [3] suggest a one-at-a-time sensitivity analysis to identify which layers are most sensitive, and retaining FP32 precision on those. In their experiments, skipping just 2 conv layers (out of a total 28 in MobileNet v1) gives them near-FP32 accuracy. Using FX Graph Mode, we can create custom qconfigs to do this easily:

# ONE-AT-A-TIME SENSITIVITY ANALYSIS 

for quantized_layer, _ in model.named_modules():
  print("Only quantizing layer: ", quantized_layer)

  # The module_name key allows module-specific qconfigs.
  qconfig_dict = {
      "": None,
      "module_name": [(quantized_layer, torch.quantization.get_default_qconfig(backend))],
  }

  model_prepared = quantize_fx.prepare_fx(model, qconfig_dict)
  # calibrate model_prepared on representative data, then convert
  model_quantized = quantize_fx.convert_fx(model_prepared)
  # evaluate(model_quantized) and record the accuracy with only this layer quantized

Another approach is to compare the statistics of the FP32 and INT8 layers; commonly used metrics are SQNR (Signal-to-Quantized-Noise Ratio) and Mean-Squared Error (MSE). Such a comparative analysis can also help guide further optimizations.


Fig 8. Comparing model weights and activations

PyTorch provides tools to help with this analysis under the Numeric Suite. Learn more about using Numeric Suite from the full tutorial.

# extract from https://pytorch.org/tutorials/prototype/numeric_suite_tutorial.html
import torch.quantization._numeric_suite as ns

def SQNR(x, y):
    # Signal-to-Quantized-Noise Ratio; higher is better
    Ps = torch.norm(x)
    Pn = torch.norm(x - y)
    return 20 * torch.log10(Ps / Pn)

wt_compare_dict = ns.compare_weights(fp32_model.state_dict(), int8_model.state_dict())
for key in wt_compare_dict:
    print(key, SQNR(wt_compare_dict[key]['float'], wt_compare_dict[key]['quantized'].dequantize()))

act_compare_dict = ns.compare_model_outputs(fp32_model, int8_model, input_data)
for key in act_compare_dict:
    print(key, SQNR(act_compare_dict[key]['float'][0], act_compare_dict[key]['quantized'][0].dequantize()))

Recommendations for your workflow


Fig 9. Suggested quantization workflow


Points to note

  • Large (10M+ parameters) models are more robust to quantization error. [2]
  • Quantizing a model from an FP32 checkpoint provides better accuracy than training an INT8 model from scratch. [2]
  • Profiling the model runtime is optional but it can help identify layers that bottleneck inference.
  • Dynamic Quantization is an easy first step, especially if your model has many Linear or Recurrent layers.
  • Use symmetric-per-channel quantization with MinMax observers for quantizing weights, and affine-per-tensor quantization with MovingAverageMinMax observers for quantizing activations [2, 3] (a sketch of such a qconfig follows this list).
  • Use metrics like SQNR to identify which layers are most susceptible to quantization error. Turn off quantization on these layers.
  • Use QAT to fine-tune for around 10% of the original training schedule with an annealing learning rate schedule starting at 1% of the initial training learning rate. [3]
  • If the above workflow didn’t work for you, we want to know more. Post a thread with details of your code (model architecture, accuracy metric, techniques tried). Feel free to cc me @suraj.pt.
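As a rough sketch of the weight/activation qconfig recommended above, assuming the eager-mode torch.quantization API used in the earlier snippets (the observer classes and qschemes are the standard ones shipped with PyTorch):

import torch
from torch.quantization import QConfig, MovingAverageMinMaxObserver, PerChannelMinMaxObserver

# Weights: symmetric, per-channel quantization with a MinMax-style observer (qint8)
weight_observer = PerChannelMinMaxObserver.with_args(
    dtype=torch.qint8, qscheme=torch.per_channel_symmetric)

# Activations: affine, per-tensor quantization with a MovingAverageMinMax observer (quint8)
activation_observer = MovingAverageMinMaxObserver.with_args(
    dtype=torch.quint8, qscheme=torch.per_tensor_affine)

my_qconfig = QConfig(activation=activation_observer, weight=weight_observer)
# Attach it just like the default qconfigs used earlier, e.g. m.qconfig = my_qconfig,
# or per-module via an FX qconfig_dict.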

That was a lot to digest; congratulations on sticking with it! Next, we’ll take a look at quantizing a “real-world” model that uses dynamic control structures (if-else, loops). These elements prevent symbolic tracing of the model, which makes it tricky to quantize the model directly out of the box. In the next post of this series, we’ll get our hands dirty with a model that is chock-full of loops and if-else blocks, and even uses third-party libraries in the forward call.

We’ll also cover a cool new feature in PyTorch Quantization called Define-by-Run, which tries to ease this constraint by requiring only subsets of the model’s computational graph to be free of dynamic control flow. Check out the Define-by-Run poster at PTDD’21 for a preview.

Thanks to Mark Saroufim for useful comments and feedback!

References

[1] Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M. W., & Keutzer, K. (2021). A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630.

[2] Krishnamoorthi, R. (2018). Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342.

[3] Wu, H., Judd, P., Zhang, X., Isaev, M., & Micikevicius, P. (2020). Integer quantization for deep learning inference: Principles and empirical evaluation. arXiv preprint arXiv:2004.09602.

[4] PyTorch Quantization Docs


Improve high-value research with Hugging Face and Amazon SageMaker asynchronous inference endpoints

Many of our AWS customers provide research, analytics, and business intelligence as a service. This type of research and business intelligence enables their end customers to stay ahead of markets and competitors, identify growth opportunities, and address issues proactively. For example, some of our financial services sector customers do research for equities, hedge funds, and investment management companies to help them understand trends and identify portfolio strategies. In the health industry, an increasingly large portion of health research is now information-based. A great deal of research entails the analysis of data that was initially collected for diagnostic or treatment purposes, or for other research projects, and is now being used for new research purposes. These forms of health research have led to effective primary prevention to avoid new cases, secondary prevention for early detection, and tertiary prevention for better disease management. The research outcomes not only improve quality of life but also help reduce healthcare expenses.

Customers tend to digest information from public and private sources. They then apply established or custom natural language processing (NLP) models to summarize it, identify trends, and generate insights. The NLP models used for these types of research tasks are large, the articles to summarize are long (given the size of the corpus), and the dedicated endpoints that serve them aren’t cost-optimized at the moment. These applications receive bursts of incoming traffic at different times of the day.

We believe customers would greatly benefit from the ability to scale down to zero and ramp up their inference capability on an as-needed basis. This optimizes research cost without compromising the quality of inferences. This post discusses how Hugging Face along with Amazon SageMaker asynchronous inference can help achieve this.

You can build text summarization models with multiple deep-learning frameworks like TensorFlow, PyTorch, and Apache MXNet. These models typically have a large input payload of multiple text documents of varying size. Advanced deep learning models require compute-intensive preprocessing before model inference. Processing times can be as long as a few minutes, which removes the option to run real-time inference by passing payloads over an HTTP API. Instead, you need to process input payloads asynchronously from an object store like Amazon Simple Storage Service (Amazon S3) with automatic queuing and a predefined concurrency threshold. The system should be able to receive status notifications and reduce unnecessary costs by cleaning up resources when the tasks are complete.

SageMaker helps data scientists and developers prepare, build, train, and deploy high-quality machine learning (ML) models quickly by bringing together a broad set of capabilities purpose-built for ML. SageMaker provides the most advanced open-source model-serving containers for XGBoost (container, SDK), Scikit-Learn (container, SDK), PyTorch (container, SDK), TensorFlow (container, SDK), and Apache MXNet (container, SDK).

SageMaker provides four options to deploy trained ML models for generating inferences on new data.
  1. Real-time inference endpoints are suitable for workloads with low-latency requirements, on the order of milliseconds to seconds.
  2. Batch transform is ideal for offline predictions on large batches of data.
  3. Amazon SageMaker Serverless Inference (in preview mode and not recommended for production workloads as of this writing) is a purpose-built inference option that makes it easy for you to deploy and scale ML models. Serverless Inference is ideal for workloads that have idle periods between traffic spurts and can tolerate cold starts.
  4. Asynchronous Inference endpoints queue incoming requests. They’re ideal for workloads where request sizes are large (up to 1 GB) and inference processing times are on the order of minutes (up to 15 minutes). Asynchronous inference enables you to save on costs by auto scaling the instance count to zero when there are no requests to process.

Solution overview

In this post, we deploy a PEGASUS model that was pre-trained to do text summarization from Hugging Face to SageMaker hosting services. We use the model as is from Hugging Face for simplicity. However, you can fine-tune the model based on a custom dataset. You can also try out other models available in the Hugging Face Model Hub. We also provision an asynchronous inference endpoint that hosts this model, from which you can get predictions.

The asynchronous inference endpoint’s inference handler expects an article as the input payload and returns the summarized text of the article as output. The output is stored in a database for trend analysis or fed downstream for further analytics. This downstream analysis derives data insights that help with the research.

We demonstrate how asynchronous inference endpoints enable you to have user-defined concurrency and completion notifications. We configure auto scaling of instances behind the endpoint to scale down to zero when traffic subsides and scale back up as the request queue fills up.

We also use Amazon CloudWatch metrics to monitor the queue size, total processing time, and invocations processed.

In the following diagram, we show the steps involved while performing inference using an asynchronous inference endpoint.

  1. Our pre-trained PEGASUS ML model is first hosted on the scaling endpoint.
  2. The user uploads the article to be summarized to an input S3 bucket.
  3. The asynchronous inference endpoint is invoked using an API.
  4. After the inference is complete, the result is saved to the output S3 bucket.
  5. An Amazon Simple Notification Service (Amazon SNS) notification is sent to the user, notifying them of success or failure.

Create an asynchronous inference endpoint

We create the asynchronous inference endpoint similar to a real-time hosted endpoint. The steps include creating a SageMaker model, followed by configuring the endpoint and deploying the endpoint. The difference between the two types of endpoints is that the asynchronous inference endpoint configuration contains an AsyncInferenceConfig section. Here we specify the S3 output path for the results from the endpoint invocation and optionally include SNS topics for notifications on success and failure. We also specify the maximum number of concurrent invocations per instance as determined by the customer. See the following code:

AsyncInferenceConfig={
    "OutputConfig": {
        "S3OutputPath": f"s3://{bucket}/{bucket_prefix}/output",
        # Optionally specify Amazon SNS topics for notifications
        "NotificationConfig": {
            "SuccessTopic": success_topic,
            "ErrorTopic": error_topic,
        },
    },
    "ClientConfig": {
        # Increase this value up to the throughput peak for ideal performance
        "MaxConcurrentInvocationsPerInstance": 2,
    },
}

For details on the API to create an endpoint configuration for asynchronous inference, see Create an Asynchronous Inference Endpoint.
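Putting those steps together, the following is a hedged sketch of the three API calls involved. The model name, endpoint config name, container image URI, model artifact path, role ARN, and instance type are placeholders rather than the exact values from the notebook; bucket, bucket_prefix, and endpoint_name are assumed to be defined as in the snippet above.

import boto3

sm_client = boto3.client("sagemaker")

# 1. Create the SageMaker model (image URI and artifact path are placeholders)
sm_client.create_model(
    ModelName="pegasus-summarization-model",
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": huggingface_inference_image_uri,
        "ModelDataUrl": f"s3://{bucket}/{bucket_prefix}/model.tar.gz",
    },
)

# 2. Create an endpoint config that carries the AsyncInferenceConfig shown above
sm_client.create_endpoint_config(
    EndpointConfigName="pegasus-async-endpoint-config",
    ProductionVariants=[{
        "VariantName": "variant1",
        "ModelName": "pegasus-summarization-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    AsyncInferenceConfig={
        "OutputConfig": {"S3OutputPath": f"s3://{bucket}/{bucket_prefix}/output"},
        "ClientConfig": {"MaxConcurrentInvocationsPerInstance": 2},
    },
)

# 3. Deploy the endpoint
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName="pegasus-async-endpoint-config",
)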

Invoke the asynchronous inference endpoint

The following screenshot shows a brief article that we use as our input payload:

The following code uploads the article as an input.json file to Amazon S3:

# upload_data returns the S3 URI that we pass to the endpoint invocation below
input_1_s3_location = sm_session.upload_data(
    input_location,
    bucket=sm_session.default_bucket(),
    key_prefix=prefix,
    extra_args={"ContentType": "text/plain"})

We use the Amazon S3 URI to the input payload file to invoke the endpoint. The response object contains the output location in Amazon S3 to retrieve the results after completion:

response = sm_runtime.invoke_endpoint_async(
    EndpointName=endpoint_name,
    InputLocation=input_1_s3_location)
output_location = response['OutputLocation']
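Once processing completes, the summary can be read back from that output location. The helper below is a minimal sketch that simply splits the S3 URI returned by the API; until the result object exists, S3 raises a NoSuchKey error, so in practice you would poll or wait for the SNS notification.

import boto3
from urllib.parse import urlparse

def get_async_result(output_location):
    # output_location is an s3:// URI returned by invoke_endpoint_async
    parsed = urlparse(output_location)
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=parsed.netloc, Key=parsed.path.lstrip("/"))
    return obj["Body"].read().decode("utf-8")

summary = get_async_result(output_location)
print(summary)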

The following screenshot shows the sample output post summarization:

For details on the API to invoke an asynchronous inference endpoint, see Invoke an Asynchronous Inference Endpoint.

Queue the invocation requests with user-defined concurrency

The asynchronous inference endpoint automatically queues the invocation requests. This is a fully managed queue with various monitoring metrics and doesn’t require any further configuration. It uses the MaxConcurrentInvocationsPerInstance parameter in the preceding endpoint configuration to process new requests from the queue after previous requests are complete. MaxConcurrentInvocationsPerInstance is the maximum number of concurrent requests sent by the SageMaker client to the model container. If no value is provided, SageMaker chooses an optimal value for you.

Auto scaling instances within the asynchronous inference endpoint

We set the auto scaling policy with a minimum capacity of zero and a maximum capacity of five instances. Unlike real-time hosted endpoints, asynchronous inference endpoints support scaling down instances to zero by setting the minimum capacity to zero. We use the ApproximateBacklogSizePerInstance metric for the scaling policy configuration with a target queue backlog of five per instance to scale out further. We set the cooldown period for ScaleInCooldown to 120 seconds and the ScaleOutCooldown to 120 seconds. The value for ApproximateBacklogSizePerInstance is chosen based on the traffic and your sensitivity to scaling speed. The faster you scale in, the less cost you incur, but the more likely you’ll have to scale up again when new requests come in. The slower you scale in, the more cost you incur, but you’re less likely to have a request come in when you’re under-scaled.

client = boto3.client('application-autoscaling')  # Application Auto Scaling client (covers SageMaker among other services)

# Format in which Application Auto Scaling references the endpoint variant
resource_id = 'endpoint/' + endpoint_name + '/variant/' + 'variant1'

response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=0,
    MaxCapacity=5
)

response = client.put_scaling_policy(
    PolicyName='Invocations-ScalingPolicy',
    ServiceNamespace='sagemaker',  # The namespace of the AWS service that provides the resource
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # SageMaker supports only instance count
    PolicyType='TargetTrackingScaling',  # 'StepScaling' | 'TargetTrackingScaling'
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # The target value for the metric
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [{'Name': 'EndpointName', 'Value': endpoint_name}],
            'Statistic': 'Average',
        },
        'ScaleInCooldown': 120,   # Seconds to wait after a scale-in activity before another scale-in can start
        'ScaleOutCooldown': 120,  # Seconds to wait after a scale-out activity before another scale-out can start
        # 'DisableScaleIn': True|False - if True, the target tracking policy won't remove capacity from the scalable resource
    }
)

For details on the API to auto scale an asynchronous inference endpoint, see the Autoscale an Asynchronous Inference Endpoint.

Configure notifications from the asynchronous inference endpoint

We create two separate SNS topics for success and error notifications for each endpoint invocation result:

sns_client = boto3.client('sns')
response = sns_client.create_topic(Name="Async-Demo-ErrorTopic2")
error_topic = response['TopicArn']
response = sns_client.create_topic(Name="Async-Demo-SuccessTopic2")
success_topic = response['TopicArn']
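To actually receive these notifications, each topic needs at least one subscriber. A minimal sketch using an email subscription follows; the address is a placeholder, and the recipient must confirm the subscription before messages are delivered.

for topic_arn in (success_topic, error_topic):
    sns_client.subscribe(
        TopicArn=topic_arn,
        Protocol="email",
        Endpoint="ml-team@example.com",  # placeholder address
    )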

Other options for notifications include periodically checking the output S3 bucket, or using S3 bucket notifications to trigger an AWS Lambda function on file upload. SNS notifications are included in the endpoint configuration section as described previously.

For details on how to set up notifications from an asynchronous inference endpoint, see Check Prediction Results.

Monitor the asynchronous inference endpoint

We monitor the asynchronous inference endpoint with additional built-in CloudWatch metrics that are specific to asynchronous inference. For example, we monitor the queue length on each instance with ApproximateBacklogSizePerInstance and the total queue length with ApproximateBacklogSize.

For a complete list of metrics, refer to Monitoring Asynchronous Inference Endpoints.
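The same backlog metrics can also be pulled programmatically. The following is a rough sketch using the boto3 CloudWatch client; the one-hour window and 60-second period are arbitrary choices for illustration.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ApproximateBacklogSizePerInstance",
    Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=60,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])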

We can optimize the endpoint configuration to get the most cost-effective instance with high performance. For example, we can use an instance with Amazon Elastic Inference or AWS Inferentia. We can also gradually increase the concurrency level up to the throughput peak while adjusting other model server and container parameters.

CloudWatch graphs

We simulated traffic of 10,000 inference requests flowing in over a period of time to the asynchronous inference endpoint, with the auto scaling policy described in the previous section enabled.

The following screenshot shows instance metrics before requests started flowing in. We start with a live endpoint with zero instances running:

The following graph shows how the BacklogSize and BacklogSizePerInstance metrics change as auto scaling kicks in and the load on the endpoint is shared by the additional instances provisioned by the auto scaling process.

As shown in the following screenshot, the number of instances increased as the inference count scaled up:

The following screenshot shows how the scaling in brings back the endpoint to the initial state of zero running instances:

Clean up

After all the requests are complete, we can delete the endpoint just as we would delete a real-time hosted endpoint. Note that if we set the minimum capacity of an asynchronous inference endpoint to zero, there are no instance charges after it scales down to zero.

If you enabled auto scaling for your endpoint, make sure you deregister the endpoint as a scalable target before deleting the endpoint. To do this, run the following code:

response = client.deregister_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount')

Remember to delete your endpoint after use as you will be charged for the instances used in this demo.

sm_client.delete_endpoint(EndpointName=endpoint_name)

You also need to delete the S3 objects and SNS topics. If you created any other AWS resources to consume and act on the SNS notifications, you may also want to delete them.
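A hedged sketch of that additional cleanup follows, reusing the sm_client, sns_client, bucket, bucket_prefix, and topic ARNs from earlier; endpoint_config_name and model_name are placeholders for whatever you named those resources.

import boto3

# Delete the endpoint configuration and model behind the endpoint
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)  # placeholder name
sm_client.delete_model(ModelName=model_name)  # placeholder name

# Delete the SNS topics used for notifications
for topic_arn in (success_topic, error_topic):
    sns_client.delete_topic(TopicArn=topic_arn)

# Remove the demo's output objects from S3 (repeat for the input prefix if needed)
s3 = boto3.resource("s3")
s3.Bucket(bucket).objects.filter(Prefix=f"{bucket_prefix}/output").delete()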

Conclusion

In this post, we demonstrated how to use the new asynchronous inference capability from SageMaker to process the large input payloads typical of a summarization task. For inference, we used a model from Hugging Face and deployed it on an asynchronous inference endpoint. We explained the common challenges of burst traffic, high model processing times, and large payloads involved in research analytics. The asynchronous inference endpoint’s built-in ability to manage internal queues, enforce predefined concurrency limits, send response notifications, and automatically scale down to zero helped us address these challenges. The complete code for this example is available on GitHub.

To get started with SageMaker asynchronous inference, check out Asynchronous Inference.


About the Authors

Dinesh Kumar Subramani is a Senior Solutions Architect with the UKIR SMB team, based in Edinburgh, Scotland. He specializes in artificial intelligence and machine learning. Dinesh enjoys working with customers across industries to help them solve their problems with AWS services. Outside of work, he loves spending time with his family, playing chess and enjoying music across genres.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
