Content Creation ‘In the NVIDIA Studio’ Gets Boost From New Professional GPUs, AI Tools, Omniverse and OpenUSD Collaboration Features

AI and accelerated computing were in the spotlight at SIGGRAPH — the world’s largest gathering of computer graphics experts — as NVIDIA founder and CEO Jensen Huang announced during his keynote address updates to NVIDIA Omniverse, a platform for building and connecting 3D tools and applications, as well as acceleration for Universal Scene Description (known as OpenUSD), the open and extensible ecosystem for 3D worlds.

This follows the recent announcement of NVIDIA joining Pixar, Adobe, Apple and Autodesk to form the Alliance for OpenUSD. It marks a major leap toward unlocking the next era of 3D graphics, design and simulation by ensuring compatibility in 3D tools and content for digitalization across industries.

NVIDIA launched three new desktop workstation Ada Generation GPUs — the NVIDIA RTX 5000, RTX 4500 and RTX 4000 — which deliver the latest AI, graphics and real-time rendering technology to professionals worldwide.

Shutterstock is bringing generative AI to 3D scene backgrounds with a foundation model trained using NVIDIA Picasso, a cloud-based foundry for building visual generative AI models. Picasso-trained models can now generate photorealistic, 8K, 360-degree high-dynamic-range imaging (HDRi) environment maps for quicker scene development. Autodesk will also integrate generative AI content-creation services — developed using foundation models in Picasso — with its popular Autodesk Maya software.

Each month, NVIDIA Studio Driver releases provide artists, creators and 3D developers with the best performance and reliability when working with creative applications. Available today, the August NVIDIA Studio Driver gives creators peak reliability for using their favorite creative apps. It includes support for updates to Omniverse, XSplit Broadcaster and Reallusion iClone.

Plus, this week’s featured In the NVIDIA Studio artist Andrew Averkin shows how AI influenced his process in building a delightful cup of joe for his Natural Coffee piece.

Omniverse Expands

Omniverse received a major upgrade, bringing new connectors and advancements to the platform.

These updates are showcased in Omniverse foundation applications, which are fully customizable reference applications that creators, enterprises and developers can copy, extend or enhance.

Upgraded Omniverse applications include Omniverse USD Composer, which lets 3D users assemble large-scale, OpenUSD-based scenes. Omniverse Audio2Face — which provides generative AI application programming interfaces that create realistic facial animations and gestures from only an audio file — now includes multilingual support and a new female base model.

The update brings boosted efficiency and an improved user experience. New rendering optimizations take full advantage of the NVIDIA Ada Lovelace architecture enhancements in NVIDIA RTX GPUs with DLSS 3 technology fully integrated into the Omniverse RTX Renderer. In addition, a new AI denoiser enables real-time 4K path tracing of massive industrial scenes.

New application and experience templates give developers getting started with OpenUSD and Omniverse a major head start with minimal coding.

A new Omniverse Kit Extension Registry, a central repository for accessing, sharing and managing Omniverse extensions, lets developers easily turn functionality on and off in their application, making it easier than ever to build custom apps from over 500 core Omniverse extensions provided by NVIDIA.
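
For developers new to Kit, the sketch below shows a minimal extension skeleton (the module and class names are hypothetical); extensions of this shape are what the registry catalogs and what the extension manager toggles on and off when assembling a custom app.

import omni.ext

# Minimal Kit extension sketch; Kit discovers the IExt subclass and calls
# these hooks when the extension is enabled or disabled.
class HelloWorldExtension(omni.ext.IExt):
    def on_startup(self, ext_id):
        print(f"[{ext_id}] extension started")

    def on_shutdown(self):
        print("extension shut down")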

New extended-reality developer tools let users build spatial-computing options natively into their Omniverse-based applications, giving users the flexibility to experience their 3D projects and virtual worlds however they like.

Expanding their collaboration across Adobe Substance 3D, generative AI and OpenUSD initiatives, Adobe and NVIDIA announced plans to make Adobe Firefly, Adobe’s family of creative generative AI models, available as APIs in Omniverse, enabling developers and creators to enhance their design processes.

Developers and industrial enterprises have new foundation apps and services to optimize and enhance 3D pipelines with the OpenUSD framework and generative AI.

Studio professionals can connect the world of generative AI to their workflows to accelerate entire projects — from environment creation and character animation to scene-setting and more. With Kit AI Agent, OpenUSD Connectors and extensions to prompt top generative AI tools and APIs, Omniverse can aggregate the final result in a unified viewport — collectively reducing the time from conception to creation.

RTX: The Next Generation

The new NVIDIA RTX 5000, RTX 4500 and RTX 4000 Ada Generation professional desktop GPUs feature the latest NVIDIA Ada Lovelace architecture technologies, including DLSS 3, for smoother rendering and real-time interactivity in 3D applications such as Unreal Engine.

These workstation-class GPUs feature third-generation RT Cores with up to 2x the throughput of the previous generation. This enables users to work with larger, higher-fidelity images in real time, helping artists and designers maintain their creative flow.

Fourth-generation Tensor Cores deliver up to 2x the AI performance of the previous generation for AI training and development as well as inferencing and generative AI workloads. Large GPU memory enables AI-augmented multi-application workflows with the latest generative AI-enabled tools and applications.

The Ada architecture provides these new GPUs with up to twice the video encode and decode capability of the previous generation, encoding up to 8K60 video in real time, with support for AV1 encode and decode. Combined with next-generation AI performance, these capabilities make the new professional GPUs ideal for multi-stream video editing workflows with high-resolution content using AI-augmented video editing applications such as Adobe Premiere and DaVinci Resolve.

Designed for high-end creative, multi-application professional workflows that require large models and datasets, these new GPUs provide large GDDR6 memory: 20GB for the RTX 4000, 24GB for the RTX 4500 and 32GB for the RTX 5000 — all supporting error-correcting code for error-free computing.

A Modern-Day Picasso

3D artists regularly face the monumental task of bringing scenes to life by artistically mixing hero assets with props, materials, backgrounds and lighting. Generative AI technologies can help streamline this workflow by generating secondary assets, like environment maps that light the scene.

At SIGGRAPH, Shutterstock announced that it’s tapping into NVIDIA Picasso to train a generative AI model that can create 360 HDRi photorealistic environment maps. The model is built using Shutterstock’s responsibly licensed libraries.

Shutterstock using NVIDIA Picasso to create 360 HDRi photorealistic environment maps.

Previously, artists needed to use expensive 360-degree cameras to create backgrounds and environment maps from scratch, or choose from fixed options that may not precisely match their 3D scene. Now, from simple prompts or using their desired background as a reference, the Shutterstock generative AI feature will quickly generate custom 360-degree, 8K-resolution, HDRi environment maps, which artists can use to set a background and light a scene. This allows more time to work on hero 3D assets, which are the primary assets of a 3D scene that viewers will focus on.

Autodesk also announced that it will integrate generative AI content-creation services — developed using foundation models in Picasso — with its popular 3D software Autodesk Maya.

Autodesk Maya generative AI content-creation services developed using foundation models in Picasso.

August Studio Driver Delivers

The August Studio Driver supports these updates and more, including the latest release of XSplit Broadcaster, the popular streaming software that lets users simultaneously stream to multiple platforms.

XSplit Broadcaster 4.5 introduces NVIDIA Encoder (NVENC) AV1 support. GeForce and NVIDIA RTX 40 Series GPU users can now stream in high-quality 4K 60 frames per second directly to YouTube Live, dramatically improving video quality.

XSplit Broadcaster 4.5 adds AV1 livestreaming support for YouTube.

Streaming in AV1 with RTX GPUs provides 40% better efficiency than H.264, reducing bandwidth requirements for livestreaming or reducing file size for high-quality local captures.

H.264 vs. AV1: 4K60 source encoded at 8 Mbps.

An update to the Reallusion iClone Omniverse Connector includes new features such as real-time synchronization of projects, as well as enhanced import functionality for OpenUSD. This makes work between iClone and Omniverse quicker, smoother and more efficient.

Brew-tiful Artwork

Words can’t espresso the stunning 3D scene Natural Coffee.

Do they accept reservations?

NVIDIA artist Andrew Averkin has over 15 years of experience in the creative field. He finds joy in a continuous journey — blending art and technology — to bring his vivid imagination to life.

His work, Natural Coffee, has a compelling origin story. Once upon a time, in a bustling office, there was a cup of “natural coffee” known for its legendary powers. It gave artists nerves of steel at work, improved performance across the board and, as a small bonus, offered magical music therapy.

Averkin used an image generator to quickly cycle through visual ideas created from simple text-based prompts. Using AI to brainstorm imagery at the beginning of creative workflows is becoming more popular among artists looking to save time on iteration.

Averkin iterates for inspiration.

With a visual foundation, Averkin speeds up the process by acquiring 3D assets from online stores to quickly build a 3D blockout of the scene, a rough draft built from simple 3D shapes without fine detail or polish.

Next, Averkin polished individual assets in Autodesk 3ds Max, sculpting models with fine detail, testing and applying different textures and materials. His GeForce RTX 4090 GPU unlocked RTX-accelerated AI denoising — with the default Autodesk Arnold renderer — delivering interactive 3D modeling, which helped tremendously while composing the scene.

Averkin working in Autodesk 3ds Max.

“I chose a GeForce RTX graphics card for quality, speed and safety, plain and simple,” said Averkin.

Averkin then exported Natural Coffee to the NVIDIA Omniverse USD Composer app via the Autodesk 3ds Max Connector. “Inside USD Composer I added more details, played a lot with a built-in collection of materials, plus did a lot of lighting work to make composition look more realistic,” he explained.

Real-time rendering in Omniverse USD Composer.

One of the biggest benefits of USD Composer is the ability to review scenes rendering in real time with photorealistic light, shadows, textures and more. This dramatically improves the process of editing massive 3D scenes, making it quicker and easier. Averkin was even able to add a camera fly animation, further elevating the scene.

The final step was to add a few touch-ups in Adobe Photoshop. Over 30 GPU-accelerated features gave Averkin plenty of options for playing with colors and contrast, and making final image adjustments smoothly and quickly.

Averkin encourages advanced 3D artists to experiment with the OpenUSD framework. “I use it a lot in my work at NVIDIA and in personal projects,” he said. “OpenUSD is very powerful. It helps with work in multiple creative apps in a non-destructive way, and other great features make the entire process easier and more flexible.”

NVIDIA artist Andrew Averkin.

Check out Averkin’s portfolio on ArtStation.

Follow NVIDIA Studio on Instagram, Twitter and Facebook. Access tutorials on the Studio YouTube channel and get updates directly in your inbox by subscribing to the Studio newsletter.

Read More

Shutterstock Brings Generative AI to 3D Scene Backgrounds With NVIDIA Picasso

Picture this: Creators can quickly create and customize 3D scene backgrounds with the help of generative AI, thanks to cutting-edge tools from Shutterstock.

The visual-content provider is building services using NVIDIA Picasso — a cloud-based foundry for developing generative AI models for visual design.

The work incorporates Picasso’s latest feature — announced today during NVIDIA founder and CEO Jensen Huang’s SIGGRAPH keynote — which will help artists enhance and light 3D scenes based on simple text or image prompts, all with AI models built using fully licensed, rights-reserved data.

From these prompts, the new gen AI feature quickly generates custom 360-degree, 8K-resolution, high-dynamic-range imaging (HDRi) environment maps, which artists can use to set a background and light a scene.

This expands on NVIDIA’s collaboration with Shutterstock to empower the next generation of digital content-creation tools and accelerate 3D model generation.

To meet a surging demand for immersive visuals in films, games, virtual worlds, advertising and more, the 3D artist community is rapidly expanding, with over 20% growth in the past year.

Many of these artists are tapping generative AI to bolster their complex workflows — and will be able to use the technology to quickly create and customize environment maps. This allows more time to work on hero 3D assets, which are the primary assets of a 3D scene that viewers will focus on. It makes a panoramic difference when creating compelling 3D visuals.

“We’re committed to hyper-enabling 3D artists and collaborators — helping them build the immersive environments they envision faster than ever before and streamlining their content-creation workflows using NVIDIA Picasso,” said Dade Orgeron, vice president of 3D innovation at Shutterstock.

Generating Photorealistic Environment Maps

Previously, artists needed to buy expensive 360-degree cameras to create backgrounds and environment maps from scratch, or choose from fixed options that may not precisely match their 3D scene.

Now, users can simply provide a prompt — whether that’s text or a reference image — and the 360 HDRi services built on Picasso will quickly generate panoramic images. Plus, thanks to generative AI, the custom environment map can automatically match the background image that’s provided as a prompt.

Users can then customize the maps and quickly iterate on ideas until they achieve the vision they want.

Collaboration to Boost 3D World-Building

Autodesk, a provider of 3D software and tools for creators in media and entertainment, is focused on giving artists the creative freedom to inspire and delight audiences worldwide.

Enabling artists to trade mundane tasks for unbridled creativity, Autodesk will integrate generative AI content-creation services — developed using foundation models in Picasso — with its popular 3D software Maya.

Supercharging Autodesk customer workflows with AI allows artists to focus on creating — and to ultimately produce content faster.

Generative AI Model Foundry

Picasso is part of NVIDIA AI Foundations, which advances enterprise-level generative AI for text, visual content and even biology.

The foundry will also adopt new NVIDIA research to generate physics-based rendering materials from text and image prompts, demonstrated at SIGGRAPH’s Real-Time Live competition. This will enable content providers to create 3D services, software and tools that enhance and expedite the simulation of diverse physical materials, such as tiles, metals and wood — complete with texture-mapping techniques, including normal, roughness and ambient occlusion.

Picasso runs on the NVIDIA Omniverse Cloud platform-as-a-service and is accessible via a serverless application programming interface that content and service providers like Shutterstock can easily connect to their websites and applications.

Learn about the latest advances in generative AI, graphics and more by joining NVIDIA at SIGGRAPH, running through Thursday, Aug. 10.

Read More

A Textured Approach: NVIDIA Research Shows How Gen AI Helps Create and Edit Photorealistic Materials

NVIDIA researchers are taking the stage at SIGGRAPH, the world’s largest computer graphics conference, to demonstrate a generative AI workflow that helps artists rapidly create and iterate on materials for 3D scenes.

The research demo, which will be presented today at the show’s Real-Time Live event, showcases how artists can use text or image prompts to generate custom textured materials — such as fabric, wood and stone — faster and with finer creative control. These capabilities will be coming to NVIDIA Picasso, allowing enterprises, software creators and service providers to create custom generative AI models for materials, developed using their own fully licensed data.

This set of AI models will facilitate iterative creation and editing of materials, enabling companies to offer new tools that’ll help artists rapidly refine a 3D object’s appearance until they achieve the desired result.

In the demo, NVIDIA researchers experiment with a living-room scene, like an interior designer assisted by AI might do in any 3D rendering application. In this case, researchers use NVIDIA Omniverse USD Composer — a reference application for scene assembly and composition using Universal Scene Description, known as OpenUSD — to add a brick-textured wall, to create and modify fabric choices for the sofa and throw pillows, and to incorporate an abstract animal design in a specific area of the wall.

Generative AI Enables Iterative Design 

The Real-Time Live demo combines several optimized AI models — a palette of tools that developers using Picasso will be able to customize and integrate into creative applications for artists.

Once integrated into creative applications, these features will allow artists to enter a brief text prompt to generate materials — such as a brick or a mosaic pattern — that are tileable, meaning they can be seamlessly replicated over a surface of any size. Or, they can import a reference image, such as a swatch of flannel fabric, and apply it to any object in the virtual scene.

An AI editing tool lets artists modify a specific area of the material they’re working on, such as the center of a coffee table texture.

The AI-generated materials support physics-based rendering, responding realistically to changes in the scene’s lighting. They include normal, roughness and ambient occlusion maps — features that are critical to creating and fine-tuning materials for photorealistic 3D scenes.

When accelerated on NVIDIA Tensor Core GPUs, materials can be generated in near real time, and can be upscaled in the background, achieving up to 4K resolution while creators continue to refine other parts of the scene.

Across creative industries — including architecture, game development and interior design — these capabilities could help artists quickly explore ideas and experiment with different aesthetic styles to create multiple versions of a scene.

A game developer, for example, could use these generative AI features to speed up the process of designing an open world environment or creating a character’s wardrobe. An architect could experiment with different styles of building facades in various lighting environments.

Build Generative AI Services With NVIDIA Picasso 

These capabilities for physics-based material generation will be made available in NVIDIA Picasso, a cloud-based foundry that allows companies to build, optimize and fine-tune their own generative AI foundational models for visual content.

Picasso enables content providers to develop generative AI tools and services trained on fully licensed, rights-reserved data. It’s part of NVIDIA AI Foundations, a set of model-making services that advance generative AI across text, visual content and biology.

At today’s SIGGRAPH keynote, NVIDIA founder and CEO Jensen Huang also announced a new Picasso feature to generate photorealistic 360 HDRi environment maps to light 3D scenes using simple text or image prompts.

See This Research at SIGGRAPH’s Real-Time Live 

Real-Time Live is one of the most anticipated events at SIGGRAPH. This year, the showcase features more than a dozen jury-reviewed projects, including those from teams at Roblox, the University of Utah and Metaphysic, a member of the NVIDIA Inception program for cutting-edge startups.

At the event, NVIDIA researchers will present this interactive materials research live, including a demo of the super resolution tool. Conference attendees can catch the session today at 6 p.m. PT in West Hall B at the Los Angeles Convention Center.

Learn about the latest advances in generative AI, graphics and more by joining NVIDIA at SIGGRAPH, running through Thursday, Aug. 10.

Read More

DENZA Collaborates With WPP to Build and Deploy Advanced Car Configurators on NVIDIA Omniverse Cloud

DENZA, the luxury EV brand joint venture between BYD and Mercedes-Benz, has collaborated with marketing and communications giant WPP and NVIDIA Omniverse Cloud to build and deploy its next generation of car configurators, NVIDIA founder and CEO Jensen Huang announced at SIGGRAPH.

WPP is using Omniverse Cloud — a platform for developing, deploying and managing industrial digitalization applications — to help unify the automaker’s highly complex design and marketing pipeline.

Omniverse Cloud enables WPP to build a single, physically accurate, real-time digital twin of the DENZA N7 model by integrating full-fidelity design data from the EV maker’s preferred computer-aided design tools via Universal Scene Description, or OpenUSD.

OpenUSD is a 3D framework that enables interoperability between software tools and data types for the building of virtual worlds.

The implementation of a new unified asset pipeline breaks down proprietary data silos, fostering enhanced data accessibility and facilitating collaborative, iterative reviews for the organization’s large design teams and stakeholders. It enables WPP to work on launch campaigns earlier in the design process, making iterations faster and less costly.

Unifying Asset Pipelines With Omniverse Cloud

Using Omniverse Cloud, WPP’s teams can connect their own pipeline of OpenUSD-enabled design and content creation tools such as Autodesk Maya and Adobe Substance 3D Painter to develop a new configurator for the DENZA N7. With a unified asset pipeline in Omniverse, WPP’s teams of artists can iterate and edit in real time a path-traced view of the full engineering dataset of the DENZA N7 — ensuring the virtual car accurately represents the physical car.

Traditional car configurators require hundreds of thousands of images to be prerendered to represent all possible options and variants. OpenUSD makes it possible for WPP to create a digital twin of the car that includes all possible variants in a single asset. No prerendered images are required.
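
As a rough illustration of the idea (not WPP's actual pipeline), a single OpenUSD asset can carry every option as a variant set. The sketch below uses the pxr Python API with hypothetical prim paths and variant names.

from pxr import Sdf, Usd, UsdGeom

# Author one asset that encodes paint-color options as a variant set,
# so a configurator switches variants instead of loading prerendered images.
stage = Usd.Stage.CreateNew("denza_n7_configurator.usda")
car = UsdGeom.Xform.Define(stage, "/DENZA_N7").GetPrim()

paint = car.GetVariantSets().AddVariantSet("paintColor")
for color in ["glacier_white", "midnight_blue"]:
    paint.AddVariant(color)
    paint.SetVariantSelection(color)
    with paint.GetVariantEditContext():
        # Per-color opinions (materials, textures) are authored inside the variant.
        car.CreateAttribute("userProperties:paintName", Sdf.ValueTypeNames.String).Set(color)

stage.GetRootLayer().Save()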

In parallel, WPP’s environmental artists create fully interactive, live 3D virtual sets. These can start with a scan of a real-world environment, such as those WPP captures with their robot dog, or tap into generative AI tools from providers such as Shutterstock to instantly generate 360-degree HDRi backgrounds to maximize opportunity for personalization.

Shutterstock is using NVIDIA Picasso — a foundry for building generative AI visual models — to develop a variety of generative AI services to accelerate 3D workflows. At SIGGRAPH, Shutterstock announced the first offering of these new services – 360 HDRi – to create photorealistic HDR environment maps to relight a scene. With this feature, artists can rapidly create custom environments that fit their needs.

One-Click Publish to GDN

Once the 3D experience is complete, with just one click, WPP can publish it to Graphics Delivery Network (GDN), part of NVIDIA Omniverse Cloud. GDN is a network of data centers built to serve real-time, high-fidelity 3D content to nearly any web device, enabling interactive experiences in the dealer showroom as well as on consumers’ mobile devices.

This eliminates the tedious process of manually packaging, deploying, hosting and managing the experience themselves. If updates are needed, just like with the initial deployment, WPP can publish them with a single click.

Learn more about Omniverse Cloud and GDN.

Read More

Host the Spark UI on Amazon SageMaker Studio

Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.

You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.

Alternately, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option allows you to select several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training.

Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).

All these options allow you to generate and store Spark event logs and analyze them through the web-based user interface commonly known as the Spark UI, which runs on a Spark History Server, to monitor the progress of Spark applications, track resource usage, and debug errors.

In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.

Solution overview

The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:

  • Accessing logs generated by SageMaker Processing Spark jobs
  • Accessing logs generated by AWS Glue Spark applications
  • Accessing logs generated by self-managed Spark clusters and Amazon EMR

A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli enables managing Spark History Server without leaving SageMaker Studio.

The solution consists of shell scripts that perform the following actions:

  • Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
  • Install the sm-spark-cli for a user profile or shared space

Install the Spark UI manually in a SageMaker Studio domain

To host Spark UI on SageMaker Studio, complete the following steps:

  1. Choose System terminal from the SageMaker Studio launcher.

  2. Run the following commands in the system terminal:
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh

The commands will take a few seconds to complete.

  3. When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:

sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>

The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.

For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.

The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic (%spark, %sql) commands to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.
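
For example, a minimal sketch using the %%configure magic to point Spark at an S3 event log location through the standard spark.eventLog properties (the bucket path is a placeholder, and the exact magic syntax can vary with the kernel and session setup):

%%configure -f
{
    "conf": {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": "s3://DOC-EXAMPLE-BUCKET/spark-events/"
    }
}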

For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.
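
The following is a minimal sketch of that configuration, assuming the PySparkProcessor from the SageMaker Python SDK; the script name, instance sizing and bucket are placeholders:

from sagemaker import get_execution_role
from sagemaker.spark.processing import PySparkProcessor

# Run a Spark processing job and write its event logs to Amazon S3,
# where the Spark History Server in Studio can read them.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess",
    framework_version="3.1",
    role=get_execution_role(),
    instance_count=2,
    instance_type="ml.m5.xlarge",
)

spark_processor.run(
    submit_app="./preprocess.py",
    spark_event_logs_s3_uri="s3://DOC-EXAMPLE-BUCKET/spark-events",
)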

Refer to the AWS documentation for additional information.

You can choose the generated URL to access the Spark UI.

The following screenshot shows an example of the Spark UI.

You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio System terminal.

You can also stop the Spark History Server when needed.
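
For example, from the Studio system terminal (the stop subcommand name follows the CLI's start/status pattern and is assumed here):

sm-spark-cli status
sm-spark-cli stop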

Automate the Spark UI installation for users in a SageMaker Studio domain

As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.

From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz

cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

LCC_CONTENT=`openssl base64 -A -in install-history-server.sh`

aws sagemaker create-studio-lifecycle-config \
	--studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
	--studio-lifecycle-config-content $LCC_CONTENT \
	--studio-lifecycle-config-app-type JupyterServer \
	--query 'StudioLifecycleConfigArn'

aws sagemaker update-domain \
	--region {YOUR_AWS_REGION} \
	--domain-id {YOUR_STUDIO_DOMAIN_ID} \
	--default-user-settings \
	'{
	"JupyterServerAppSettings": {
	"DefaultResourceSpec": {
	"LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
	"InstanceType": "system"
	},
	"LifecycleConfigArns": [
	"arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
	]
	}}'

After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.

Clean up

In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.

Manually uninstall the Spark UI

To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:

  1. Choose System terminal in the SageMaker Studio launcher.

  2. Run the following commands in the system terminal:
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts

chmod +x uninstall-history-server.sh
./uninstall-history-server.sh

Uninstall the Spark UI automatically for all SageMaker Studio user profiles

To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:

  1. On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.

  2. On the domain details page, navigate to the Environment tab.
  3. Select the lifecycle configuration for the Spark UI on SageMaker Studio.
  4. Choose Detach.

  5. Delete and restart the Jupyter Server apps for the SageMaker Studio user profiles.

Conclusion

In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.

All the code shown as part of this post is available in the GitHub repository.


About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His areas of expertise include end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Read More

Deploy thousands of model ensembles with Amazon SageMaker multi-model endpoints on GPU to minimize your hosting costs

Artificial intelligence (AI) adoption is accelerating across industries and use cases. Recent scientific breakthroughs in deep learning (DL), large language models (LLMs), and generative AI are allowing customers to use advanced state-of-the-art solutions with almost human-like performance. These complex models often require hardware acceleration because it enables not only faster training but also faster inference when using deep neural networks in real-time applications. GPUs’ large number of parallel processing cores makes them well-suited for these DL tasks.

However, in addition to model invocation, those DL applications often entail preprocessing or postprocessing in an inference pipeline. For example, input images for an object detection use case might need to be resized or cropped before being served to a computer vision model, and text inputs might need to be tokenized before being used in an LLM. NVIDIA Triton is an open-source inference server that enables users to define such inference pipelines as an ensemble of models in the form of a Directed Acyclic Graph (DAG). It is designed to run models at scale on both CPU and GPU. Amazon SageMaker supports deploying Triton seamlessly, allowing you to use Triton’s features while also benefiting from SageMaker capabilities: a managed, secured environment with MLOps tools integration, automatic scaling of hosted models, and more.

AWS, in its dedication to helping customers achieve the highest savings, has continuously innovated not only in pricing options and proactive cost-optimization services, but also in launching cost-saving features like multi-model endpoints (MMEs). MMEs are a cost-effective solution for deploying a large number of models using the same fleet of resources and a shared serving container to host all of your models. Instead of using multiple single-model endpoints, you can reduce your hosting costs by deploying multiple models while paying only for a single inference environment. Additionally, MMEs reduce deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

In this post, we show how to run multiple deep learning ensemble models on a GPU instance with a SageMaker MME. To follow along with this example, you can find the code on the public SageMaker examples repository.

How SageMaker MMEs with GPU work

With MMEs, a single container hosts multiple models. SageMaker controls the lifecycle of models hosted on the MME by loading and unloading them into the container’s memory. Instead of downloading all the models to the endpoint instance, SageMaker dynamically loads and caches the models as they are invoked.

When an invocation request for a particular model is made, SageMaker does the following:

  1. It first routes the request to the endpoint instance.
  2. If the model has not been loaded, it downloads the model artifact from Amazon Simple Storage Service (Amazon S3) to that instance’s Amazon Elastic Block Store (Amazon EBS) volume.
  3. It loads the model to the container’s memory on the GPU-accelerated compute instance. If the model is already loaded in the container’s memory, invocation is faster because no further steps are needed.

When an additional model needs to be loaded and the instance’s memory utilization is high, SageMaker will unload unused models from that instance’s container to ensure that there is enough memory. These unloaded models will remain on the instance’s EBS volume so that they can be loaded into the container’s memory later, thereby removing the need to download them again from the S3 bucket. However, if the instance’s storage volume reaches its capacity, SageMaker will delete the unused models from the storage volume. In cases where the MME receives many invocation requests, and additional instances (or an auto-scaling policy) are in place, SageMaker routes some requests to other instances in the inference cluster to accommodate the high traffic.

This not only provides a cost saving mechanism, but also enables you to dynamically deploy new models and deprecate old ones. To add a new model, you upload it to the S3 bucket the MME is configured to use and invoke it. To delete a model, stop sending requests and delete it from the S3 bucket. Adding models or deleting them from an MME doesn’t require updating the endpoint itself!
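
The following is a minimal sketch of that flow with boto3; the bucket, endpoint name, artifact name and payload are placeholders, and the prefix must match the one the MME was configured with:

import json
import boto3

s3_client = boto3.client("s3")
runtime_sm_client = boto3.client("sagemaker-runtime")

# Upload a new ensemble artifact to the S3 prefix the MME reads from.
s3_client.upload_file(
    "model_new_ensemble.tar.gz",
    "DOC-EXAMPLE-BUCKET",
    "triton-mme-gpu-ensemble/model_new_ensemble.tar.gz",
)

# Invoke the new model by name; no endpoint update is required.
payload = {"inputs": []}  # replace with inputs matching the new model's schema
response = runtime_sm_client.invoke_endpoint(
    EndpointName="my-mme-endpoint",
    ContentType="application/octet-stream",
    Body=json.dumps(payload),
    TargetModel="model_new_ensemble.tar.gz",
)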

Triton ensembles

The Triton model ensemble represents a pipeline that consists of one model, preprocessing and postprocessing logic, and the connection of input and output tensors between them. A single inference request to an ensemble triggers the run of the entire pipeline as a series of steps using the ensemble scheduler. The scheduler collects the output tensors in each step and provides them as input tensors for other steps according to the specification. Note that from an external point of view, the ensemble is still treated as a single model.

Triton server architecture includes a model repository: a file system-based repository of the models that Triton will make available for inferencing. Triton can access models from one or more locally accessible file paths or from remote locations like Amazon S3.

Each model in a model repository must include a model configuration that provides required and optional information about the model. Typically, this configuration is provided in a config.pbtxt file specified as ModelConfig protobuf. A minimal model configuration must specify the platform or backend (like PyTorch or TensorFlow), the max_batch_size property, and the input and output tensors of the model.

Triton on SageMaker

SageMaker enables model deployment using Triton server with custom code. This functionality is available through the SageMaker managed Triton Inference Server Containers. These containers support common machine learning (ML) frameworks (like TensorFlow, ONNX, and PyTorch, as well as custom model formats) and useful environment variables that let you optimize performance on SageMaker. Using SageMaker Deep Learning Containers (DLC) images is recommended because they’re maintained and regularly updated with security patches.

Solution walkthrough

For this post, we deploy two different types of ensembles on a GPU instance, using Triton and a single SageMaker endpoint.

The first ensemble consists of two models: a DALI model for image preprocessing and a TensorFlow Inception v3 model for actual inference. The pipeline ensemble takes encoded images as an input, which will have to be decoded, resized to 299×299 resolution, and normalized. This preprocessing will be handled by the DALI model. DALI is an open-source library for common image and speech preprocessing tasks such as decoding and data augmentation. Inception v3 is an image recognition model that consists of symmetric and asymmetric convolutions, average and max pooling, and fully connected layers (and is therefore well suited to GPU acceleration).

The second ensemble transforms raw natural language sentences into embeddings and consists of three models. First, a preprocessing model is applied to the input text tokenization (implemented in Python). Then we use a pre-trained BERT (uncased) model from the Hugging Face Model Hub to extract token embeddings. BERT is an English language model that was trained using a masked language modeling (MLM) objective. Finally, we apply a postprocessing model where the raw token embeddings from the previous step are combined into sentence embeddings.

After we configure Triton to use these ensembles, we show how to configure and run the SageMaker MME.

Finally, we provide an example of each ensemble invocation, as can be seen in the following diagram:

  • Ensemble 1 – Invoke the endpoint with an image, specifying DALI-Inception as the target ensemble
  • Ensemble 2 – Invoke the same endpoint, this time with text input and requesting the preprocess-BERT-postprocess ensemble

MME with 2 ensembles

Set up the environment

First, we set up the needed environment. This includes updating AWS libraries (like Boto3 and the SageMaker SDK) and installing the dependencies required to package our ensembles and run inferences using Triton. We also use the SageMaker SDK default execution role. We use this role to enable SageMaker to access Amazon S3 (where our model artifacts are stored) and the container registry (where the NVIDIA Triton image will be used from). See the following code:

import boto3, json, sagemaker, time
from sagemaker import get_execution_role
import nvidia.dali as dali
import nvidia.dali.types as types

# SageMaker variables
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
role = get_execution_role()

# Other Variables
instance_type = "ml.g4dn.4xlarge"
sm_model_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_config_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
endpoint_name = "triton-tf-dali-ensemble-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

Prepare ensembles

In this next step, we prepare the two ensembles: the TensorFlow (TF) Inception with DALI preprocessing and BERT with Python preprocessing and postprocessing.

This entails downloading the pre-trained models, providing the Triton configuration files, and packaging the artifacts to be stored in Amazon S3 before deploying.

Prepare the TF and DALI ensemble

First, we prepare the directories for storing our models and configurations: for the TF Inception (inception_graphdef), for DALI preprocessing (dali), and for the ensemble (ensemble_dali_inception). Because Triton supports model versioning, we also add the model version to the directory path (denoted as 1 because we only have one version). To learn more about the Triton version policy, refer to Version Policy. Next, we download the Inception v3 model, extract it, and copy to the inception_graphdef model directory. See the following code:

!mkdir -p model_repository/inception_graphdef/1
!mkdir -p model_repository/dali/1
!mkdir -p model_repository/ensemble_dali_inception/1

!wget -O /tmp/inception_v3_2016_08_28_frozen.pb.tar.gz \
https://storage.googleapis.com/download.tensorflow.org/models/inception_v3_2016_08_28_frozen.pb.tar.gz

!(cd /tmp && tar xzf inception_v3_2016_08_28_frozen.pb.tar.gz)
!mv /tmp/inception_v3_2016_08_28_frozen.pb model_repository/inception_graphdef/1/model.graphdef

Now, we configure Triton to use our ensemble pipeline. In a config.pbtxt file, we specify the input and output tensor shapes and types, and the steps the Triton scheduler needs to take (DALI preprocessing and the Inception model for image classification):

%%writefile model_repository/ensemble_dali_inception/config.pbtxt
name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "InceptionV3/Predictions/Softmax"
        value: "OUTPUT"
      }
    }
  ]
}

Next, we configure each of the models. First, the model config for DALI backend:

%%writefile model_repository/dali/config.pbtxt
name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
parameters: [
  {
    key: "num_threads"
    value: { string_value: "12" }
  }
]

Next, the model configuration for TensorFlow Inception v3 we downloaded earlier:

%%writefile model_repository/inception_graphdef/config.pbtxt
name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]
instance_group [
    {
      kind: KIND_GPU
    }
]

Because this is a classification model, we also need to copy the Inception model labels to the inception_graphdef directory in the model repository. These labels include 1,000 class labels from the ImageNet dataset.

!aws s3 cp s3://sagemaker-sample-files/datasets/labels/inception_labels.txt model_repository/inception_graphdef/inception_labels.txt

Next, we configure and serialize the DALI pipeline that will handle our preprocessing to file. The preprocessing includes reading the image (using CPU), decoding (accelerated using GPU), and resizing and normalizing the image.

@dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe():
    """Create a pipeline which reads images and masks, decodes the images and returns them."""
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT_0")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299) #resize image to the default 299x299 size
    images = dali.fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="HWC",
        crop=(299, 299),  #crop image to the default 299x299 size
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255], # normalize with ImageNet channel means (scaled to 0-255)
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255], # normalize with ImageNet channel standard deviations
    )
    return images

pipe().serialize(filename="model_repository/dali/1/model.dali")

Finally, we package the artifacts together and upload them as a single object to Amazon S3:

!tar -cvzf model_tf_dali.tar.gz -C model_repository .
model_uri = sagemaker_session.upload_data(
    path="model_tf_dali.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)
print("S3 model uri: {}".format(model_uri))

Prepare the TensorRT and Python ensemble

For this example, we use a pre-trained model from the transformers library.

You can find all models (preprocess and postprocess, along with config.pbtxt files) in the folder ensemble_hf. Our file system structure will include four directories (three for the individual model steps and one for the ensemble) as well as their respective versions:


ensemble_hf
├── bert-trt
│   ├── model.pt
│   └── config.pbtxt
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── postprocess
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── preprocess
    ├── 1
    │   └── model.py
    └── config.pbtxt

In the workspace folder, we provide two scripts: one to convert the model into ONNX format (onnx_exporter.py) and a TensorRT compilation script (generate_model_trt.sh).

Triton natively supports the TensorRT runtime, which enables you to easily deploy a TensorRT engine, thereby optimizing for a selected GPU architecture.

To make sure we use the TensorRT version and dependencies that are compatible with the ones in our Triton container, we compile the model using the corresponding version of NVIDIA’s PyTorch container image:

model_id = "sentence-transformers/all-MiniLM-L6-v2"
! docker run --gpus=all --rm -it -v `pwd`/workspace:/workspace nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash generate_model_trt.sh $model_id

We then copy the model artifacts to the directory we created earlier and add a version to the path:

! mkdir -p ensemble_hf/bert-trt/1 && mv workspace/model.plan ensemble_hf/bert-trt/1/model.plan && rm -rf workspace/model.onnx workspace/core*

We use a Conda pack to generate a Conda environment that the Triton Python backend will use in preprocessing and postprocessing:

!bash conda_dependencies.sh
!cp processing_env.tar.gz ensemble_hf/postprocess/ && cp processing_env.tar.gz ensemble_hf/preprocess/
!rm processing_env.tar.gz

Finally, we upload the model artifacts to Amazon S3:

!tar -C ensemble_hf/ -czf model_trt_python.tar.gz .
model_uri = sagemaker_session.upload_data(
    path="model_trt_python.tar.gz", key_prefix="triton-mme-gpu-ensemble"
)

print("S3 model uri: {}".format(model_uri))

Run ensembles on a SageMaker MME GPU instance

Now that our ensemble artifacts are stored in Amazon S3, we can configure and launch the SageMaker MME.

We start by retrieving the container image URI for the Triton DLC image that matches the one in our Region’s container registry (and is used for TensorRT model compilation):

account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}
region = boto3.Session().region_name
if region not in account_id_map.keys():
    raise ("UNSUPPORTED REGION")
base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:23.03-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

Next, we create the model in SageMaker. In the create_model request, we describe the container to use and the location of model artifacts, and we use the Mode parameter to specify that this is a multi-model endpoint.

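# models_s3_location is assumed to be the S3 prefix that holds both ensemble tarballs uploaded earlier (for example, "s3://<bucket>/triton-mme-gpu-ensemble/")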
container = {
    "Image": triton_image_uri,
    "ModelDataUrl": models_s3_location,
    "Mode": "MultiModel",
}

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

To host our ensembles, we create an endpoint configuration with the create_endpoint_config API call, and then create an endpoint with the create_endpoint API. SageMaker then deploys all the containers that you defined for the model in the hosting environment.

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Although in this example we are setting a single instance to host our model, SageMaker MMEs fully support setting an auto scaling policy. For more information on this feature, see Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.
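
As an illustrative sketch only (not part of this walkthrough), a target-tracking policy on invocations per instance could be attached with Application Auto Scaling; the capacity limits and target value below are placeholders:

import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = f"endpoint/{endpoint_name}/variant/AllTraffic"

# Register the endpoint variant as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=2,
)

# Scale out and in based on the number of invocations per instance.
autoscaling.put_scaling_policy(
    PolicyName="mme-invocations-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)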

Create request payloads and invoke the MME for each model

After our real-time MME is deployed, it’s time to invoke our endpoint with each of the model ensembles we used.

First, we create a payload for the DALI-Inception ensemble. We use the shiba_inu_dog.jpg image from the SageMaker public dataset of pet images. We load the image as an encoded array of bytes to use in the DALI backend (to learn more, see Image Decoder examples).

sample_img_fname = "shiba_inu_dog.jpg"

import numpy as np

s3_client = boto3.client("s3")
s3_client.download_file(
    "sagemaker-sample-files", "datasets/image/pets/shiba_inu_dog.jpg", sample_img_fname
)

def load_image(img_path):
    """
    Loads image as an encoded array of bytes.
    This is a typical approach you want to use in DALI backend
    """
    with open(img_path, "rb") as f:
        img = f.read()
        return np.array(list(img)).astype(np.uint8)
    
rv = load_image(sample_img_fname)
print(f"Shape of image {rv.shape}")

rv2 = np.expand_dims(rv, 0)
print(f"Shape of expanded image array {rv2.shape}")

payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": rv2.shape,
            "datatype": "UINT8",
            "data": rv2.tolist(),
        }
    ]
}

With our encoded image and payload ready, we invoke the endpoint.

Note that we specify our target ensemble to be the model_tf_dali.tar.gz artifact. The TargetModel parameter is what differentiates MMEs from single-model endpoints and enables us to direct the request to the right model.

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload), TargetModel="model_tf_dali.tar.gz"
)

The response includes metadata about the invocation (such as model name and version) and the actual inference response in the data part of the output object. In this example, we get an array of 1,001 values, where each value is the probability of the class the image belongs to (1,000 classes and 1 extra for others).
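
A small parsing sketch, assuming the response body follows the standard Triton (KServe v2) JSON layout returned for the JSON request above:

result = json.loads(response["Body"].read().decode("utf8"))
probabilities = np.array(result["outputs"][0]["data"])
print("Predicted class index:", int(probabilities.argmax()))
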
Next, we invoke our MME again, but this time target the second ensemble. Here the data is just two simple text sentences:

text_inputs = ["Sentence 1", "Sentence 2"]

To simplify communication with Triton, the Triton project provides several client libraries. We use the HTTP client library to prepare the payload in our request:

import tritonclient.http as http_client

text_inputs = ["Sentence 1", "Sentence 2"]
inputs = []
inputs.append(http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES"))
batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]
input0_real = np.array(batch_request, dtype=np.object_)
inputs[0].set_data_from_numpy(input0_real, binary_data=True)
outputs = []
outputs.append(http_client.InferRequestedOutput("finaloutput"))
request_body, header_length = http_client.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

Now we are ready to invoke the endpoint—this time, the target model is the model_trt_python.tar.gz ensemble:

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
    TargetModel="model_trt_python.tar.gz"
)

The response contains the sentence embeddings, which can be used in a variety of natural language processing (NLP) applications.
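
To extract those embeddings as a NumPy array, the binary+json response can be decoded with the Triton client library. The following is a minimal sketch, not part of the original walkthrough, that assumes the response object from the call above and the output name finaloutput requested earlier:

header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix):]

# Parse the binary response body back into an InferResult object.
result = http_client.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)
embeddings = result.as_numpy("finaloutput")
print(f"Shape of sentence embeddings: {embeddings.shape}")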

Clean up

Lastly, we clean up and delete the endpoint, endpoint configuration, and model:

sm_client.delete_endpoint(EndpointName=endpoint_name)
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=sm_model_name)

Conclusion

In this post, we showed how to configure, deploy, and invoke a SageMaker MME with Triton ensembles on a GPU-accelerated instance. We hosted two ensembles in a single real-time inference environment, which cut our cost by 50% (the cost of one g4dn.4xlarge instance, representing over $13,000 in yearly savings). Although this example used only two pipelines, SageMaker MMEs can support thousands of model ensembles, making them an extraordinary cost-savings mechanism. Furthermore, you can use the ability of SageMaker MMEs to dynamically load and unload models to minimize the operational overhead of managing model deployments in production.


About the authors

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Nikhil Kulkarni is a software developer with AWS Machine Learning, focusing on making machine learning workloads more performant on the cloud, and is a co-creator of AWS Deep Learning Containers for training and inference. He’s passionate about distributed Deep Learning Systems. Outside of work, he enjoys reading books, fiddling with the guitar, and making pizza.

Uri Rosenberg is the AI & ML Specialist Technical Manager for Europe, Middle East, and Africa. Based out of Israel, Uri works to empower enterprise customers to design, build, and operate ML workloads at scale. In his spare time, he enjoys cycling, backpacking, and backpropagating.

Eliuth Triana Isaza is a Developer Relations Manager on the NVIDIA-AWS team. He connects Amazon and AWS product leaders, developers, and scientists with NVIDIA technologists and product leaders to accelerate Amazon ML/DL workloads, EC2 products, and AWS AI services. In addition, Eliuth is a passionate mountain biker, skier, and poker player.

Read More

NVIDIA H100 Tensor Core GPU Used on New Microsoft Azure Virtual Machine Series Now Generally Available

Microsoft Azure users can now turn to the latest NVIDIA accelerated computing technology to train and deploy their generative AI applications.

Available today, the Microsoft Azure ND H100 v5 VMs, which use NVIDIA H100 Tensor Core GPUs and NVIDIA Quantum-2 InfiniBand networking, enable scaling of generative AI, high performance computing (HPC) and other applications with a click from a browser.

Available to customers across the U.S., the new instance arrives as developers and researchers are using large language models (LLMs) and accelerated computing to uncover new consumer and business use cases.

The NVIDIA H100 GPU delivers supercomputing-class performance through architectural innovations, including fourth-generation Tensor Cores, a new Transformer Engine for accelerating LLMs and the latest NVLink technology that lets GPUs talk to each other at 900GB/sec.

The inclusion of NVIDIA Quantum-2 CX7 InfiniBand with 3,200 Gbps cross-node bandwidth ensures seamless performance across the GPUs at massive scale, matching the capabilities of top-performing supercomputers globally.

Scaling With v5 VMs

ND H100 v5 VMs are ideal for training and running inference for increasingly complex LLMs and computer vision models. These neural networks drive the most demanding and compute-intensive generative AI applications, including question answering, code generation, audio, video and image generation, speech recognition and more.

The ND H100 v5 VMs achieve up to 2x speedup in LLMs like the BLOOM 175B model for inference versus previous generation instances, demonstrating their potential to further optimize AI applications.

NVIDIA and Azure

NVIDIA H100 Tensor Core GPUs on Azure provide enterprises the performance, versatility and scale to supercharge their AI training and inference workloads. The combination streamlines the development and deployment of production AI with the NVIDIA AI Enterprise software suite integrated with Azure Machine Learning for MLOps, and delivers record-setting AI performance in industry-standard MLPerf benchmarks.

In addition, by connecting the NVIDIA Omniverse platform to Azure, NVIDIA and Microsoft are providing hundreds of millions of Microsoft enterprise users with access to powerful industrial digitalization and AI supercomputing resources.

Learn more about new Azure v5 instances powered by NVIDIA H100 GPUs.

Read More

AWS performs fine-tuning on a Large Language Model (LLM) to classify toxic speech for a large gaming company

The video gaming industry has an estimated user base of over 3 billion worldwide1. It consists of massive amounts of players virtually interacting with each other every single day. Unfortunately, as in the real world, not all players communicate appropriately and respectfully. In an effort to create and maintain a socially responsible gaming environment, AWS Professional Services was asked to build a mechanism that detects inappropriate language (toxic speech) within online gaming player interactions. The overall business outcome was to improve the organization’s operations by automating an existing manual process and to improve user experience by increasing speed and quality in detecting inappropriate interactions between players, ultimately promoting a cleaner and healthier gaming environment.

The customer ask was to create an English language detector that classifies voice and text excerpts into their own custom defined toxic language categories. They wanted to first determine if the given language excerpt is toxic, and then classify the excerpt in a specific customer-defined category of toxicity such as profanity or abusive language.

AWS ProServe solved this use case through a joint effort between the Generative AI Innovation Center (GAIIC) and the ProServe ML Delivery Team (MLDT). The AWS GAIIC is a group within AWS ProServe that pairs customers with experts to develop generative AI solutions for a wide range of business use cases using proof of concept (PoC) builds. AWS ProServe MLDT then takes the PoC through production by scaling, hardening, and integrating the solution for the customer.

This customer use case will be showcased in two separate posts. This post (Part 1) serves as a deep dive into the scientific methodology. It will explain the thought process and experimentation behind the solution, including the model training and development process. Part 2 will delve into the productionized solution, explaining the design decisions, data flow, and illustration of the model training and deployment architecture.

This post covers the following topics:

  • The challenges AWS ProServe had to solve for this use case
  • Historical context about large language models (LLMs) and why this technology is a perfect fit for this use case
  • AWS GAIIC’s PoC and AWS ProServe MLDT’s solution from a data science and machine learning (ML) perspective

Data challenge

The main challenge AWS ProServe faced with training a toxic language classifier was obtaining enough labeled data from the customer to train an accurate model from scratch. AWS received about 100 samples of labeled data from the customer, far fewer than the 1,000 samples generally recommended in the data science community for fine-tuning an LLM.

As an added inherent challenge, natural language processing (NLP) classifiers are historically known to be very costly to train and require a large set of vocabulary, known as a corpus, to produce accurate predictions. A rigorous and effective NLP solution, if provided sufficient amounts of labeled data, would be to train a custom language model using the customer’s labeled data. The model would be trained solely with the players’ game vocabulary, making it tailored to the language observed in the games. The customer had both cost and time constraints that made this solution unviable. AWS ProServe was forced to find a solution to train an accurate language toxicity classifier with a relatively small labeled dataset. The solution lay in what’s known as transfer learning.

The idea behind transfer learning is to use the knowledge of a pre-trained model and apply it to a different but relatively similar problem. For example, if an image classifier was trained to predict if an image contains a cat, you could use the knowledge that the model gained during its training to recognize other animals like tigers. For this language use case, AWS ProServe needed to find a previously trained language classifier that was trained to detect toxic language and fine-tune it using the customer’s labeled data.

The solution was to find and fine-tune an LLM to classify toxic language. LLMs are neural networks with a massive number of parameters, typically on the order of billions, that have been trained on vast amounts of unlabeled data. Before going into the AWS solution, the following section provides an overview of the history of LLMs and their historical use cases.

Tapping into the power of LLMs

LLMs have recently become the focal point for businesses looking for new applications of ML, ever since ChatGPT captured the public mindshare by being the fastest growing consumer application in history2, reaching 100 million active users by January 2023, just 2 months after its release. However, LLMs are not a new technology in the ML space. They have been used extensively to perform NLP tasks such as analyzing sentiment, summarizing corpuses, extracting keywords, translating speech, and classifying text.

Due to the sequential nature of text, recurrent neural networks (RNNs) had been the state of the art for NLP modeling. Specifically, the encoder-decoder network architecture was developed because it provides an RNN structure capable of taking an input of arbitrary length and generating an output of arbitrary length. This was ideal for NLP tasks like translation, where an output phrase in one language could be predicted from an input phrase in another language, typically with differing numbers of words between the input and output. The Transformer architecture3 (Vaswani, 2017) was a breakthrough improvement on the encoder-decoder; it introduced the concept of self-attention, which allowed the model to focus its attention on different words in the input and output phrases. In a typical encoder-decoder, each word is interpreted by the model in an identical fashion. As the model sequentially processes each word in an input phrase, the semantic information at the beginning may be lost by the end of the phrase. The self-attention mechanism changed this by adding an attention layer to both the encoder and decoder blocks, so that the model could put different weightings on certain words from the input phrase when generating a certain word in the output phrase. Thus the basis of the transformer model was born.
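
To make the mechanism concrete, the following is a minimal sketch, illustration only and not from the original post, of scaled dot-product self-attention using NumPy; the shapes and random weights are arbitrary:

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how strongly each word attends to every other word
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the input positions
    return weights @ V                               # weighted mix of the value vectors

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                         # 5 tokens with 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)           # (5, 8)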

The transformer architecture was the foundation for two of the most well-known and popular LLMs in use today, the Bidirectional Encoder Representations from Transformers (BERT)5 (Devlin, 2018) and the Generative Pretrained Transformer (GPT)4 (Radford, 2018). Later versions of the GPT model, namely GPT-3 and GPT-4, are the engines that power the ChatGPT application. The final piece of the recipe that makes LLMs so powerful is the ability to distill information from vast text corpora without extensive labeling or preprocessing, via self-supervised language-model pre-training, an approach popularized by ULMFiT. In this pre-training phase, general text is gathered and the model is trained on the task of predicting the next word based on the previous words; the benefit is that any input text used for training comes inherently prelabeled by the order of the text. LLMs are truly capable of learning from internet-scale data. For example, the original BERT model was pre-trained on the BookCorpus and the entire English Wikipedia text.

This new modeling paradigm has given rise to two new concepts: foundation models (FMs) and Generative AI. As opposed to training a model from scratch with task-specific data, which is the usual case for classical supervised learning, LLMs are pre-trained to extract general knowledge from a broad text dataset before being adapted to specific tasks or domains with a much smaller dataset (typically on the order of hundreds of samples). The new ML workflow now starts with a pre-trained model dubbed a foundation model. It’s important to build on the right foundation, and there are an increasing number of options, such as the new Amazon Titan FMs, to be released by AWS as part of Amazon Bedrock. These new models are also considered generative because their outputs are human interpretable and in the same data type as the input data. While past ML models were descriptive, such as classifying images of cats vs. dogs, LLMs are generative because their output is the next set of words based on input words. That allows them to power interactive applications such as ChatGPT that can be expressive in the content they generate.

Hugging Face has partnered with AWS to democratize FMs and make them easy to access and build with. Hugging Face has created a Transformers API that unifies more than 50 different transformer architectures on different ML frameworks, including access to pre-trained model weights in their Model Hub, which has grown to over 200,000 models as of writing this post. In the next sections, we explore the proof of concept, the solution, and the FMs that were tested and chosen as the basis for solving this toxic speech classification use case for the customer.

AWS GAIIC proof of concept

AWS GAIIC chose to experiment with LLM foundation models with the BERT architecture to fine-tune a toxic language classifier. A total of three models from Hugging Face's model hub were tested:

  • bertweet-base
  • bertweet-base-offensive
  • bertweet-base-hate

All three model architectures are based on the BERTweet architecture. BERTweet is trained based on the RoBERTa pre-training procedure. The RoBERTa pre-training procedure is an outcome of a replication study of BERT pre-training that evaluated the effects of hyperparameter tuning and training set size to improve the recipe for training BERT models6 (Liu 2019). The experiment sought to find a pre-training method that improved the performance results of BERT without changing the underlying architecture. The conclusion of the study found that the following pre-training modifications substantially improved the performance of BERT:

  • Training the model with bigger batches over more data
  • Removing the next sentence prediction objective
  • Training on longer sequences
  • Dynamically changing the masking pattern applied to the training data

The bertweet-base model uses the preceding pre-training procedure from the RoBERTa study to pre-train the original BERT architecture using 850 million English tweets. It is the first public large-scale language model pre-trained for English tweets.

Pre-trained FMs using tweets were thought to fit the use case for two main theoretical reasons:

  • The length of a tweet is very similar to the length of an inappropriate or toxic phrase found in online game chats
  • Tweets come from a population with a large variety of different users, similar to that of the population found in gaming platforms

AWS decided to first fine-tune BERTweet with the customer's labeled data to get a baseline. It then chose to fine-tune two other FMs, bertweet-base-offensive and bertweet-base-hate, which were further pre-trained specifically on more relevant toxic tweets to achieve potentially higher accuracy. The bertweet-base-offensive model uses the base BERTweet FM and is further pre-trained on 14,100 annotated tweets that were deemed offensive7 (Zampieri 2019). The bertweet-base-hate model also uses the base BERTweet FM but is further pre-trained on 19,600 tweets that were deemed hate speech8 (Basile 2019).

To further enhance the performance of the PoC model, AWS GAIIC made two design decisions:

  • Created a two-stage prediction flow where the first model acts as a binary classifier that determines whether a piece of text is toxic or not toxic. Only if the first model predicts the text as toxic does it get passed to the second model, a fine-grained classifier that assigns the text to one of the customer's defined toxic types (see the sketch after this list).
  • Augmented the training data by adding a subset of a third-party-labeled toxic text dataset from a public Kaggle competition (Jigsaw Toxicity) to the original 100 samples received from the customer. They mapped the Jigsaw labels to the associated customer-defined toxicity labels and split the data 80% for training and 20% for testing to validate the model.
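
The following is a minimal sketch of the two-stage prediction flow; the Hugging Face pipeline API is used for brevity, and the model paths and label names are hypothetical:

from transformers import pipeline

binary_clf = pipeline("text-classification", model="./binary-toxicity-model")       # hypothetical path
fine_grained_clf = pipeline("text-classification", model="./toxicity-types-model")  # hypothetical path

def classify(text: str) -> str:
    # Stage 1: toxic vs. non-toxic.
    first = binary_clf(text)[0]
    if first["label"] == "toxic":            # assumed label name
        # Stage 2: only toxic text is routed to the fine-grained classifier.
        return fine_grained_clf(text)[0]["label"]
    return "non-toxic"

print(classify("example chat message"))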

AWS GAIIC used Amazon SageMaker notebooks to run their fine-tuning experiments and found that the bertweet-base-offensive model achieved the best scores on the validation set. The following table summarizes the observed metric scores.

Model Precision Recall F1 AUC
Binary .92 .90 .91 .92
Fine-grained .81 .80 .81 .89

From this point, GAIIC handed off the PoC to the AWS ProServe ML Delivery Team to productionize the PoC.

AWS ProServe ML Delivery Team solution

To productionize the model architecture, the AWS ProServe ML Delivery Team (MLDT) was asked by the customer to create a solution that is scalable and easy to maintain. There were a few maintenance challenges of a two-stage model approach:

  • The models would require double the amount of model monitoring, which makes retraining timing inconsistent. There may be times that one model will have to be retrained more often than the other.
  • Increased costs of running two models as opposed to one.
  • Inference is slower because each request passes through two models.

To address these challenges, AWS ProServe MLDT had to figure out how to turn the two-stage model architecture into a single model architecture while still being able to maintain the accuracy of the two-stage architecture.

The solution was to first ask the customer for more training data, and then to fine-tune the bertweet-base-offensive model on all the labels, including non-toxic samples, as a single model. The idea was that fine-tuning one model with more data would achieve results similar to fine-tuning the two-stage model architecture on less data. To adapt the model for one-stage classification, AWS ProServe MLDT updated the pre-trained model's multi-label classification head to include one extra node representing the non-toxic class.

The following is a code sample of how you would fine-tune a pre-trained model from the Hugging Face model hub using their transformers platform and alter the model’s multi-label classification head to predict the desired number of classes. AWS ProServe MLDT used this blueprint as its basis for fine-tuning. It assumes that you have your train data and validation data ready and in the correct input format.

First, Python modules are imported as well as the desired pre-trained model from the Hugging Face model hub:

# Imports.
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    PreTrainedTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the tokenizer for the pre-trained model from the model hub.
model_checkpoint = "cardiffnlp/bertweet-base-offensive"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The pre-trained model then gets loaded and prepped for fine-tuning. This is the step where the number of toxic categories and all model parameters get defined:

# Load pretrained model into a sequence classifier to be fine-tuned and define the number of classes you want to classify in the num_labels parameter.

model = AutoModelForSequenceClassification.from_pretrained(
            model_checkpoint,
            num_labels=[number of classes]
        )

# Set your training parameter arguments. The below are some key parameters that AWS ProServe MLDT tuned:
training_args = TrainingArguments(
        num_train_epochs=[enter input],
        per_device_train_batch_size=[enter input],
        per_device_eval_batch_size=[enter input],
        evaluation_strategy="epoch",
        logging_strategy="epoch",
        save_strategy="epoch",
        learning_rate=[enter input],
        load_best_model_at_end=True,
        metric_for_best_model=[enter input],
        optim=[enter input],
    )

Model fine-tuning then starts by passing in the preprocessed training and validation datasets:

# Fine-tune the model using the model, tokenizer, and training_args defined above, assuming the train and validation datasets are correctly preprocessed.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=[enter input],
        eval_dataset=[enter input],
        tokenizer=tokenizer,
        data_collator=data_collator,
    )

# Finetune model command.
trainer.train()
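
The tables in this post report precision, recall, F1, and AUC. The original snippet does not show how these were computed, but the following is a minimal sketch, assuming scikit-learn, of a compute_metrics function that could be passed to the Trainer so that metric_for_best_model can reference one of the returned keys:

import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is a (predictions, label_ids) pair produced by the Trainer.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {"precision": precision, "recall": recall, "f1": f1}

# If used, pass it as Trainer(..., compute_metrics=compute_metrics).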

AWS ProServe MLDT received approximately 5,000 more labeled data samples, 3,000 being non-toxic and 2,000 being toxic, and fine-tuned all three bertweet-base models, combining all labels into one model. They used this data in addition to the 5,000 samples from the PoC to fine-tune new one-stage models using the same 80% train set, 20% test set method. The following table shows that the performance scores were comparable to that of the two-stage model.

Model Precision Recall F1 AUC
bertweet-base (1-Stage) .76 .72 .74 .83
bertweet-base-hate (1-Stage) .85 .82 .84 .87
bertweet-base-offensive (1-Stage) .88 .83 .86 .89
bertweet-base-offensive (2-Stage) .91 .90 .90 .92

The one-stage model approach delivered the cost and maintenance improvements while decreasing precision by only 3 percentage points (from .91 to .88). After weighing the trade-offs, the customer opted for AWS ProServe MLDT to productionize the one-stage model.

By fine-tuning one model with more labeled data, AWS ProServe MLDT was able to deliver a solution that met the customer’s threshold for model accuracy, as well as deliver on their ask for ease of maintenance, while lowering cost and increasing robustness.

Conclusion

A large gaming customer was looking for a way to detect toxic language within their communication channels to promote a socially responsible gaming environment. AWS GAIIC created a PoC of a toxic language detector by fine-tuning an LLM to detect toxic language. AWS ProServe MLDT then updated the model training flow from a two-stage approach to a one-stage approach and productionized the LLM for the customer to be used at scale.

In this post, AWS demonstrates the effectiveness and practicality of fine-tuning an LLM to solve this customer use case, shares context on the history of foundation models and LLMs, and introduces the workflow between the AWS Generative AI Innovation Center and the AWS ProServe ML Delivery Team. In the next post in this series, we will dive deeper into how AWS ProServe MLDT productionized the resulting one-stage model using SageMaker.

If you are interested in working with AWS to build a Generative AI solution, please reach out to the GAIIC. They will assess your use case, build out a Generative-AI-based proof of concept, and have options to extend collaboration with AWS to implement the resulting PoC into production.

References

  1. Gamer Demographics: Facts and Stats About the Most Popular Hobby in the World
  2. ChatGPT sets record for fastest-growing user base – analyst note
  3. Vaswani et al., “Attention is All You Need”
  4. Radford et al., “Improving Language Understanding by Generative Pre-Training”
  5. Devlin et al., “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding”
  6. Yinhan Liu et al., “RoBERTa: A Robustly Optimized BERT Pretraining Approach”
  7. Marcos Zampieri et al., “SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)”
  8. Valerio Basile et al., “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”

About the authors

James Poquiz is a Data Scientist with AWS Professional Services based in Orange County, California. He has a BS in Computer Science from the University of California, Irvine and has several years of experience working in the data domain having played many different roles. Today he works on implementing and deploying scalable ML solutions to achieve business outcomes for AWS clients.

Han Man is a Senior Data Science & Machine Learning Manager with AWS Professional Services based in San Diego, CA. He has a PhD in Engineering from Northwestern University and has several years of experience as a management consultant advising clients in manufacturing, financial services, and energy. Today, he is passionately working with key customers from a variety of industry verticals to develop and implement ML and GenAI solutions on AWS.

Safa Tinaztepe is a full-stack data scientist with AWS Professional Services. He has a BS in computer science from Emory University and has interests in MLOps, distributed systems, and web3.

Read More

INT8 Quantization for x86 CPU in PyTorch

Overview

INT8 quantization is a powerful technique for speeding up deep learning inference on x86 CPU platforms. By reducing the precision of the model’s weights and activations from 32-bit floating point (FP32) to 8-bit integer (INT8), INT8 quantization can significantly improve inference speed and reduce memory requirements with little to no loss of accuracy.
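
As a quick illustration of what quantization does (this is not the PyTorch kernel implementation), the following sketch maps an FP32 tensor to unsigned 8-bit integers with a scale and zero point, then dequantizes it to inspect the rounding error:

import torch

x = torch.randn(4, 4)                                # FP32 activations
qmin, qmax = 0, 255                                  # unsigned 8-bit range
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(qmin - torch.round(x.min() / scale))

x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
x_dq = (x_q.to(torch.float32) - zero_point) * scale  # dequantize to inspect the error
print((x - x_dq).abs().max())                        # the quantization error stays small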

In this blog, we will discuss the recent progress on INT8 quantization for x86 CPU in PyTorch, focusing on the new x86 quantization backend. We will also briefly look at the new quantization path with PyTorch 2.0 Export (PT2E) and TorchInductor.

X86 Quantization Backend

The current recommended way of quantization in PyTorch is FX. Before PyTorch 2.0, the default quantization backend (a.k.a. QEngine) on x86 CPUs was FBGEMM, which leveraged the FBGEMM performance library to achieve the performance speedup. In the PyTorch 2.0 release, a new quantization backend called X86 was introduced to replace FBGEMM. The x86 quantization backend offers improved INT8 inference performance when compared to the original FBGEMM backend by leveraging the strengths of both FBGEMM and the Intel® oneAPI Deep Neural Network Library (oneDNN) kernel libraries.

Performance Benefit from X86 Backend

To measure the performance benefits of the new X86 backend, we ran INT8 inference on 69 popular deep learning models (shown in Figures 1-3 below) using 4th Gen Intel® Xeon® Scalable processors. The results showed a 2.97X geomean performance speedup compared to FP32 inference performance, while the speedup was 1.43X with the FBGEMM backend. The charts below show the per-model performance speedup comparing the x86 backend and the FBGEMM backend.

Figure 1: Models with less than 2x performance boost with x86 backend1

Figure 2: Models with 2x-4x performance boost with x86 backend1

Figure 3: Models with larger than 4x performance boost with x86 backend1

Usage of x86 Backend

By default in 2.0, users on x86 platforms will use the x86 quantization backend, and their PyTorch programs will remain unchanged when using the default backend. Alternatively, users can specify x86 as the quantization backend explicitly.

Below is an example code snippet of PyTorch static post-training quantization with the x86 quantization backend.

import torch
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

qconfig_mapping = get_default_qconfig_mapping()
# Or explicitly specify the qengine
# qengine = 'x86'
# torch.backends.quantized.engine = qengine
# qconfig_mapping = get_default_qconfig_mapping(qengine)

model_fp32 = MyModel().eval()
x = torch.randn((1, 3, 224, 224), dtype=torch.float)
x = x.to(memory_format=torch.channels_last)

# Insert observers according to qconfig and backend config
prepared_model = prepare_fx(model_fp32, qconfig_mapping, example_inputs=x)

# Calibration code not shown

# Convert to quantized model
quantized_model = convert_fx(prepared_model)
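
The calibration step elided above takes place between prepare_fx and convert_fx: representative FP32 inputs are run through the prepared model so that the inserted observers record activation ranges. A minimal sketch, assuming a hypothetical list of example batches:

calibration_data = [
    torch.randn(1, 3, 224, 224).to(memory_format=torch.channels_last) for _ in range(8)
]
with torch.no_grad():
    for batch in calibration_data:
        prepared_model(batch)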

Technical Details of x86 Backend

We devised heuristic dispatching rules according to the performance numbers from the models we benchmarked to decide whether to invoke the oneDNN or FBGEMM performance library to execute the convolution or matrix multiplication operations. The rules are a combination of operation kinds, shapes, CPU architecture information, etc. Detailed logic is available here. For more design and technical discussion, please refer to the Request for Comments.

Next Steps With a New Quantization Path PyTorch 2.0 Export

Although still far from finalized, a new quantization path, PyTorch 2.0 Export (PT2E), is in early design and PoC stage. The new approach is slated to replace the FX quantization path in the future. It is built upon the capabilities of TorchDynamo Export, a feature introduced in the PyTorch 2.0 release for FX graph capturing. This graph is then quantized and lowered to different backends. TorchInductor, the new DL compiler of PyTorch, has shown promising results in terms of FP32 inference speedup on x86 CPU. We are working actively to enable it as one of the quantization backends of PT2E. We believe the new path will lead to further improvements in INT8 inference performance due to more flexibility of fusion at different levels.

Conclusion

The x86 backend introduced in the PyTorch 2.0 release has demonstrated a remarkable improvement in INT8 inference speed on x86 CPU platforms. It delivers a 2.97X geomean speedup over FP32 inference, compared with 1.43X for the original FBGEMM backend, while maintaining backward compatibility. This enhancement can benefit end users with minimal or no modifications to their programs. Furthermore, a new quantization path, PT2E, is currently in development and is expected to provide even more possibilities in the future.

Acknowledgement

Special thanks to Nikita Shulga, Vasiliy Kuznetsov, Supriya Rao, and Jongsoo Park. Together, we made one more step forward on the path of improving the PyTorch CPU ecosystem.

Configuration

1 AWS EC2 r7iz.metal-16xl instance (Intel(R) Xeon(R) Gold 6455B, 32-core/64-thread, Turbo Boost On, Hyper-Threading On, Memory: 8x64GB, Storage: 192GB); OS: Ubuntu 22.04.1 LTS; Kernel: 5.15.0-1028-aws; Batch Size: 1; Cores per Instance: 4; PyTorch 2.0 RC3; TorchVision 0.15.0+cpu; tested by Intel on 3/77/2023. May not reflect all publicly available security updates.

Read More