The Power of Two: VMware, NVIDIA Bring AI to the Virtual Data Center

Two key components of enterprise AI just snapped in place thanks to longtime partners who pioneered virtual desktops, virtual graphics workstations and more.

Taking their partnership to a new level, VMware and NVIDIA are uniting accelerated computing and virtualization to bring the power of AI to every company.

It’s a collaboration that will enable users to run data analytics and machine learning workloads in containers or virtual machines, secured and managed with familiar VMware tools. It will create a new sweet spot in hybrid cloud computing with greater control, lowered costs and expanded performance.

The partnership brings the power of AI that public clouds deliver from the world’s largest AI data centers behind the firewalls of private companies.

The two companies will demonstrate these capabilities this week at VMworld.

Welcome to the Modern, Accelerated Data Center

Thanks to this collaboration, users will be able to run AI and data science software from NGC Catalog, NVIDIA’s hub for GPU-optimized AI software, using containers or virtual machines in a hybrid cloud based on VMware Cloud Foundation. It’s the kind of accelerated computing that’s a hallmark of the modern data center.

NVIDIA and VMware also launched a related effort enabling users to build a more secure and powerful hybrid cloud accelerated by NVIDIA BlueField-2 DPUs. These data processing units are built to offload and accelerate software-defined storage, security and networking tasks, freeing up CPU resources for enterprise applications.

Enterprises Gear Up for AI

Machine learning lets computers write software humans never could. It’s a capability born in research labs that’s rapidly spreading to data centers across every industry from automotive and banking to healthcare, retail and more.

The partnership will let VMware users train and run neural networks across multiple GPUs in public and private clouds. It also will enable them to share a single GPU across multiple jobs or users thanks to the multi-instance capabilities in the latest NVIDIA A100 GPUs.

To achieve these goals, the two companies will bring GPU acceleration to VMware vSphere to run AI and data-science jobs at near bare-metal performance next to existing enterprise apps on standard enterprise servers. In addition, software and models in NGC will support VMware Tanzu.

With these links, AI workloads can be virtualized and virtual environments become AI-ready without sacrificing system performance. And users can create hybrid clouds that give them the choice to run jobs in private or public data centers.

Companies will no longer need standalone AI systems for machine learning or big data analytics that are separate from their IT resources. Now a single enterprise infrastructure can run AI and traditional workloads managed by VMware tools and administrators.

“We’re providing the best of both worlds by bringing mature management capabilities to bare-metal systems and great performance to virtualized AI workloads,” said Kit Colbert, vice president and CTO of VMware’s cloud platform group.

Demos Show the Power of Two

Demos at VMworld will show a platform that delivers AI results as fast as the public cloud and is robust enough to tackle critical jobs like fighting COVID-19. They will run containers from NVIDIA NGC, managed by Tanzu, on VMware Cloud Foundation.

We’ll show those same VMware environments also tapping into the power of BlueField-2 DPUs to secure and accelerate hybrid clouds that let remote designers collaborate in an immersive, real-time environment.

That’s just the beginning. NVIDIA is committed to giving VMware the support to be a first-class platform for everything we build. In the background, VMware and NVIDIA engineers are driving a multi-year effort to deliver game-changing capabilities.

Colbert of VMware agreed. “We view the two initiatives we’re announcing today as initial steps, and there is so much more we can do. We invite customers to tell us what they need most to help prioritize our work,” he said.

To learn more, register for the early-access program and tune in to VMware sessions at GTC 2020 next week.

 

 


Networks on Steroids: VMware, NVIDIA Power the Data Center with DPUs

The data center’s grid is about to plug in to a new source of power.

It rides a kind of network interface card called a SmartNIC. Its smarts and speed spring from an ASIC called a data processing unit.

In short, the DPU packs the power of data center infrastructure on a chip.

DPU-enabled SmartNICs will be available for millions of virtualized servers thanks to a collaboration between VMware and NVIDIA. They bring advances in security and storage as well as networking that will stretch from the core to the edge of the corporate network.

What’s more, the companies announced a related initiative that will put the power of the public AI cloud behind the corporate firewall. It enables enterprise AI managed with familiar VMware tools.

Lighting Up the Modern Data Center

Together, these efforts will give users the choice to run machine learning workloads in containers or virtual machines, secured and managed with familiar VMware tools. And they will create a new sweet spot in hybrid cloud computing with greater control, lowered costs and the highest performance.

Laying the foundation for these capabilities, the partnership will help users build more secure and powerful distributed networks inside VMware Cloud Foundation, powered by the NVIDIA BlueField-2 DPU. It’s the Swiss Army knife of data center infrastructure that can accelerate security, storage, networking, and management tasks, freeing up CPUs to focus on enterprise applications.

The DPU’s jobs include:

  • Blocking malware
  • Advanced encryption
  • Network virtualization
  • Load balancing
  • Intrusion detection and prevention
  • Data compression
  • Packet switching
  • Packet inspection
  • Managing pools of solid-state and hard-disk storage

Our DPUs can run these tasks today across two ports, each carrying traffic at 100 Gbit/second. That’s an order of magnitude faster than CPUs geared for enterprise apps. The DPU is taking on these jobs so CPU cores can run more apps, boosting vSphere and data center efficiency.

As a result, data centers can handle more apps and their networks will run faster, too.

“The BlueField-2 SmartNIC is a fundamental building block for us because we can take advantage of its DPU hardware for better network performance and dramatically reduced cost to operate data center infrastructure,” said Kit Colbert, vice president and CTO of VMware’s cloud platform group.

NVIDIA BlueField-2 DPU in VMware's Project Monterey
Running VMware Cloud Foundation on the NVIDIA BlueField-2 DPU provides security isolation and lets CPUs support more apps per server.

Securing the Data Center with DPUs

DPUs also will usher in a new era of advanced security.

Today, most companies run their security policies on the same CPUs that run their applications. That kind of multitasking leaves IT departments vulnerable to malware or attacks in the guise of a new app.

With the BlueField DPU, all apps and requests can be vetted on a processor isolated from the application domain, enforcing security and other policies. Many cloud computing services already use this approach to create so-called zero-trust environments where software authenticates everything.

VMware is embracing SmartNICs in its products as part of an initiative called Project Monterey. With SmartNICs, corporate data centers can take advantage of the same advances Web giants enjoy.

“These days the traditional security perimeter is gone. So, we believe you need to root security in the hardware of the SmartNIC to monitor servers and network traffic very fast and without performance impacts,” said Colbert.

BlueField-2 DPU demo with VMware
A demo shows an NVIDIA BlueField-2 DPU preventing a DDOS attack that swamps a CPU.

See DPUs in Action at VMworld

The companies are demonstrating these capabilities this week at VMworld. For example, the demo below shows how virtual servers running VMware ESXi clients can use BlueField-2 DPUs to stop a distributed denial-of-service attack in a server cluster.

Leading OEMs are already preparing to bring the capabilities of DPUs to market. NVIDIA also plans to support BlueField-2 SmartNICs across its portfolio of platforms including its EGX systems for enterprise and edge computing.

You wouldn’t hammer a nail with a monkey wrench or pound in a screw with a hammer — you need to use the right tool for the job. To build the modern data center network, that means using an NVIDIA DPU enabled by VMware.


Drug Discovery in the Age of COVID-19

Drug discovery is like searching for the right jigsaw tile — in a puzzle box with 10^60 molecular-size pieces. AI and HPC tools help researchers more quickly narrow down the options, like picking out a subset of correctly shaped and colored puzzle pieces to experiment with.

An effective small-molecule drug will bind to a target enzyme, receptor or other critical protein along the disease pathway. Like the perfect puzzle piece, a successful drug will be the ideal fit, possessing the right shape, flexibility and interaction energy to attach to its target.

But it’s not enough just to interact strongly with the target. An effective therapeutic must modify the function of the protein in just the right way, and also possess favorable absorption, distribution, metabolism, excretion and toxicity properties — creating a complex optimization problem for scientists.

Researchers worldwide are racing to find effective vaccine and drug candidates to inhibit infection with and replication of SARS-CoV-2, the virus that causes COVID-19. Using NVIDIA GPUs, they’re accelerating this lengthy discovery process — whether for structure-based drug design, molecular docking, generative AI models, virtual screening or high-throughput screening.

Identifying Protein Targets with Genomics

To develop an effective drug, researchers have to know where to start. A disease pathway — a chain of signals between molecules that trigger different cell functions — may involve thousands of interacting proteins. Genomic analyses can provide invaluable insights for researchers, helping them identify promising proteins to target with a specific drug.

With the NVIDIA Clara Parabricks genome analysis toolkit, researchers can sequence and analyze genomes up to 50x faster. Given the unprecedented spread of the COVID pandemic, getting results in hours versus days can have an extraordinary impact on understanding the virus and developing treatments.

To date, hundreds of institutions, including hospitals, universities and supercomputing centers, in 88 countries have downloaded the software to accelerate their work — to sequence the viral genome itself, as well as to sequence the DNA of COVID patients and investigate why some are more severely affected by the virus than others.

Another method, cryo-EM, uses electron microscopes to directly observe flash-frozen proteins — and can harness GPUs to shorten processing time for the complex, massive datasets involved.

Using CryoSPARC, a GPU-accelerated software built by Toronto startup Structura Biotechnology, researchers at the National Institutes of Health and the University of Texas at Austin created the first 3D, atomic-scale map of the coronavirus, providing a detailed view into the virus’ spike proteins, a key target for vaccines, therapeutic antibodies and diagnostics.

GPU-Accelerated Compound Screening

Once a target protein has been identified, researchers search for candidate compounds that have the right properties to bind with it. To evaluate how effective drug candidates will be, researchers can screen them virtually, as well as in real-world labs.

New York-based Schrödinger creates drug discovery software that can model the properties of potential drug molecules. Used by the world’s biggest biopharma companies, the Schrödinger platform allows its users to determine the binding affinity of a candidate molecule on NVIDIA Tensor Core GPUs in under an hour and with just a few dollars of compute cost — instead of many days and thousands of dollars using traditional methods.

Generative AI Models for Drug Discovery

Rather than evaluating a dataset of known drug candidates, a generative AI model starts from scratch. Tokyo-based startup Elix, Inc., a member of the NVIDIA Inception virtual accelerator program, uses generative models trained on NVIDIA DGX Station systems to come up with promising molecular structures. Some of the AI’s proposed molecules may be unstable or difficult to synthesize, so additional neural networks are used to determine the feasibility for these candidates to be tested in the lab.

With DGX Station, Elix achieves up to a 6x speedup on training the generative models, which would otherwise take a week or more to converge, or to reach the lowest possible error rate.

Molecular Docking for COVID-19 Research

With the inconceivable size of the chemical space, researchers couldn’t possibly test every possible molecule to figure out which will be effective to combat a specific disease. But based on what’s known about the target protein, GPU-accelerated molecular dynamics applications can be used to approximate molecular behavior and simulate target proteins at the atomic level.

Software like AutoDock-GPU, developed by the Center for Computational Structural Biology at the Scripps Research Institute, enables researchers to calculate the interaction energy between a candidate molecule and the protein target. Known as molecular docking, this computationally complex process simulates millions of different configurations to find the most favorable arrangement of each molecule for binding. Using the more than 27,000 NVIDIA GPUs on Oak Ridge National Laboratory’s Summit supercomputer, scientists were able to screen 1 billion drug candidates for COVID-19 in just 12 hours. Even using a single NVIDIA GPU provides more than 230x speedup over using a single CPU.

Argonne deployed one of the first DGX-A100 systems. Courtesy of Argonne National Laboratory.

In Illinois, Argonne National Laboratory is accelerating COVID-19 research using an NVIDIA A100 GPU-powered system based on the DGX SuperPOD reference architecture. Argonne researchers are combining AI and advanced molecular modelling methods to perform accelerated simulations of the viral proteins, and to screen billions of potential drug candidates, determining the most promising molecules to pursue for clinical trials.

Accelerating Biological Image Analysis

The drug discovery process involves significant high-throughput lab experiments as well. Phenotypic screening is one method of testing, in which a diseased cell is exposed to a candidate drug. With microscopes, researchers can observe and record subtle changes in the cell to determine if it starts to more closely resemble a healthy cell. Using AI to automate the process, thousands of possible drugs can be screened.

Digital biology company Recursion, based in Salt Lake City, uses AI and NVIDIA GPUs to observe these subtle changes in cell images, analyzing terabytes of data each week. The company has released an open-source COVID dataset, sharing human cellular morphological data with researchers working to create therapies for the virus.

Future Directions in AI for Drug Discovery

As AI and accelerated computing continue to speed up genomics and drug discovery pipelines, precision medicine — personalizing individual patients’ treatment plans based on insights about their genome and their phenotype — will become more attainable.

Increasingly powerful NLP models will be applied to organize and understand massive datasets of scientific literature, helping connect the dots between independent investigations. Generative models will learn the fundamental equations of quantum mechanics and be able to suggest the optimal molecular therapy for a given target.

To learn more about how NVIDIA GPUs are being used to accelerate drug discovery, check out talks by Schrödinger, Oak Ridge National Laboratory and Atomwise at the GPU Technology Conference next week.

For more on how AI and GPUs are advancing COVID research, read our blog stories and visit the COVID-19 research hub.

Subscribe to NVIDIA healthcare news here


AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers using models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to the EC2 G4 GPU-based instances.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. Existing models can be moved to Amazon EC2 Inf1 instances with minimal code changes by using the AWS Neuron SDK, which is integrated with these frameworks. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.
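
For example, with the PyTorch integration, compiling a trained model for Inferentia comes down to a single trace call. The following is a minimal sketch, assuming the torch-neuron package is installed and using a torchvision ResNet-50 purely as a stand-in model:

import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

# A trained model; ResNet-50 from torchvision is used here only as an example.
model = models.resnet50(pretrained=True)
model.eval()

# An example input with the shape the model will see at inference time.
example_input = torch.zeros([1, 3, 224, 224], dtype=torch.float32)

# Compile for AWS Inferentia; operators Neuron cannot handle fall back to CPU.
model_neuron = torch.neuron.trace(model, example_inputs=[example_input])

# Save the compiled TorchScript module; load it on an Inf1 instance with torch.jit.load().
model_neuron.save('resnet50_neuron.pt')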

Since its launch, the Neuron SDK has seen a dramatic improvement in the breadth of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput than GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or use Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years’ experience in engineering and business disciplines. He started his career as an embedded software engineer and later moved into product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration products such as the EC2 FPGA F1 instances and AWS Inferentia alongside its Neuron SDK, accelerating machine learning in the cloud.


Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added.

In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

  • Amazon SageMaker – A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
  • AWS Glue – A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use case

For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes.

This post doesn’t go into the details of the model, but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution overview

The following diagram summarizes the approach for the retraining pipeline.

The workflow contains the following elements:

  • AWS Glue crawler – You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
  • AWS Glue triggers – Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
  • AWS Glue job – An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location.
  • AWS Glue workflow – An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image.

The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validate sets in AWS Glue jobs. The training step runs in an AWS Glue Python shell job, which starts a training job in Amazon SageMaker based on a set of hyperparameters, as sketched below.
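
As a rough sketch of what that Glue Python shell step can look like, the snippet below starts a BlazingText training job with boto3. The job name, S3 paths, and role ARN are placeholders (the real pipeline reads these values from Parameter Store), and the actual TrainingJob.py in the repository may differ in its details:

import boto3

sm = boto3.client('sagemaker')

sm.create_training_job(
    TrainingJobName='dev-blazingtext-dbpedia-001',          # placeholder name
    AlgorithmSpecification={
        # BlazingText image for us-west-2, as set in deploy.sh
        'TrainingImage': '433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest',
        'TrainingInputMode': 'File',
    },
    RoleArn='arn:aws:iam::111122223333:role/dev-sagemaker-role',   # placeholder ARN
    HyperParameters={'mode': 'supervised', 'epochs': '10'},
    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {'S3DataSource': {
            'S3DataType': 'S3Prefix',
            'S3Uri': 's3://dev-ml-bucket/dbpedia/train/',          # placeholder bucket
            'S3DataDistributionType': 'FullyReplicated',
        }},
    }],
    OutputDataConfig={'S3OutputPath': 's3://dev-ml-bucket/dbpedia/output/'},
    ResourceConfig={'InstanceType': 'ml.c5.xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 30},
    StoppingCondition={'MaxRuntimeInSeconds': 3600},
)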

When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status.

At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.
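
These last two stages map to a couple of boto3 calls. The sketch below shows their general shape; the endpoint and queue names are stand-ins rather than the ones the CloudFormation templates actually create:

import boto3

sagemaker = boto3.client('sagemaker')
sqs = boto3.client('sqs')

endpoint_name = 'dev-dbpedia-endpoint'   # placeholder

# Block until the endpoint reaches InService (this is why the Glue job takes a few minutes).
sagemaker.get_waiter('endpoint_in_service').wait(EndpointName=endpoint_name)

# Notify downstream consumers that retraining and deployment are complete.
sqs.send_message(
    QueueUrl='https://sqs.us-west-2.amazonaws.com/111122223333/dev-ml-pipeline-queue',  # placeholder
    MessageBody='Endpoint ' + endpoint_name + ' is InService',
)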

Setting up the environment

To set up the environment, complete the following steps:

  1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
  2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
  3. Download the following code into your local directory.

Organization of code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
	--glue_scripts
		--DataExtractionJob.py
		--DataProcessingJob.py
		--MessagingQueueJob.py
		--TrainingJob.py
	--base_resources.template
	--deploy.sh
	--glue_resources.template

The code directory is divided into three parts:

  • AWS CloudFormation templates – The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to ARNs further in the workflow.
  • AWS Glue scripts – The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
  • Bash script – A wrapper script deploy.sh is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.

Implementing the solution

Complete the following steps:

  1. Go to the deploy.sh file and replace algorithm_image name with <ecr_path> based on your Region.

The following code example is a path for Region us-west-2:

algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

  2. Enter the following code in your terminal:
    sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.

  3. On the AWS CloudFormation console, check that the stacks created by the templates have the status CREATE_COMPLETE.
  4. On the AWS Glue console, manually start the pipeline.

In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.
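
If you automate it, kicking off the workflow is a single API call. For example, assuming the workflow created by the dev stage is named DevMLWorkflow, a scheduler such as EventBridge could run:

import boto3

glue = boto3.client('glue')

# Start the retraining workflow; the call returns a run ID you can use to track progress.
run = glue.start_workflow_run(Name='DevMLWorkflow')
print('Started workflow run:', run['RunId'])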

  5. To begin the workflow, in the Workflow section, select DevMLWorkflow.
  6. From the Actions drop-down menu, choose Run.
  7. View the progress of your workflow on the History tab and select the latest RUN ID.

The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

  8. After the workflow is successful, open the Amazon SageMaker console.
  9. Under Inference, choose Endpoints.

The following screenshot shows that the endpoint the workflow deployed is ready.

Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.
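
As a sketch of that evaluation step, the following invokes the deployed BlazingText endpoint with a couple of test sentences. The endpoint name is a placeholder, and the request body follows the BlazingText text-classification convention of a JSON document with an instances list:

import json
import boto3

runtime = boto3.client('sagemaker-runtime')

payload = {
    'instances': ['The company reported record quarterly revenue.',
                  'The striker scored twice in the second half.'],
    'configuration': {'k': 1},   # return the top predicted class for each sentence
}

response = runtime.invoke_endpoint(
    EndpointName='dev-dbpedia-endpoint',   # placeholder
    ContentType='application/json',
    Body=json.dumps(payload),
)
predictions = json.loads(response['Body'].read())
print(predictions)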

Cleaning up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

    def delete_resources(self):
        # 'sagemaker' is a boto3 SageMaker client, e.g. sagemaker = boto3.client('sagemaker')
        endpoint_name = self.endpoint

        try:
            sagemaker.delete_endpoint(EndpointName=endpoint_name)
            print("Deleted Test Endpoint ", endpoint_name)
        except Exception as e:
            print("Model endpoint deletion failed: ", e)

        try:
            sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
            print("Deleted Test Endpoint Configuration ", endpoint_name)
        except Exception as e:
            print("Endpoint config deletion failed: ", e)

        try:
            sagemaker.delete_model(ModelName=endpoint_name)
            print("Deleted Test Endpoint Model ", endpoint_name)
        except Exception as e:
            print("Model deletion failed: ", e)

Conclusion

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.


About the Authors

Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.

 

 

 

Inchara B Diwakar is a Data Scientist at AWS Professional Services. She designs and engineers ML solutions at scale, with experience across healthcare, manufacturing and retail verticals. Outside of work, she enjoys the outdoors, traveling and a good read.


BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents

Bidirectional Encoder Representations from Transformers (BERT) [1] has become one of the most popular models for natural language processing (NLP) applications. BERT can outperform other models in several NLP tasks, including question answering and sentence classification.

Training the BERT model on large datasets is expensive and time consuming, and achieving low latency when performing inference on this model is challenging. Latency and throughput are key factors to deploy a model in production. In this post, we focus on optimizing these factors for BERT inference tasks. We also compare the cost of deploying BERT on different Amazon Elastic Compute Cloud (Amazon EC2) instances.

When running inference on the BERT-base model, the g4dn.xlarge GPU instance achieves 2.6 to 5 times lower latency (3.8 times lower on average) than a c5.24xlarge CPU instance. The g4dn.xlarge instance also achieves the best cost-effectiveness ratio (cost per request) compared to the c5.xlarge, c5.24xlarge, and m5.xlarge CPU instances. Specifically, the cost of processing 1 million BERT-inference requests with sequence length 128 is $0.20 on g4dn.xlarge, whereas on c5.xlarge (the best of these CPU instances), the cost is $3.31—the GPU instance is 16.5 times more cost-efficient.

We achieved these results after a set of GPU optimizations on MXNet, described in the section Optimizing BERT model performance on MXNET 1.6 and 1.7 of this post.

Amazon EC2 G4 instances

G4 instances are optimized for machine learning application deployments. They’re equipped with NVIDIA T4 GPUs, powered by Tensor Cores, and deliver groundbreaking AI performance: up to 65 TFLOPS in FP16 precision and up to 130 TOPS in INT8 precision.

Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference below 5 milliseconds on a single T4 GPU with 16 GB, such as on a g4dn.xlarge instance (the cost of this instance at the time of writing is $0.526 per hour on demand in the US East (N. Virginia) Region).

For more information about G4 instances, see Amazon EC2 G4 Instances.

GluonNLP and MXNet

GluonNLP is a deep learning framework built on top of MXNet, which was specifically designed for NLP applications. It extends MXNet, providing NLP models, datasets, and examples.

GluonNLP includes an efficient implementation of the BERT model, scripts for training and performing inference, and several datasets (such as GLUE benchmark and SQuAD). For more information, see GluonNLP: NLP made easy.

For this post, we use the GluonNLP BERT implementation to perform inference on NLP tasks. Specifically, we use MXNet version 1.7 and GluonNLP version 0.10.0.

BERT-base inference results

We present results for two different BERT tasks: question answering and classification (sentiment analysis using the Stanford Sentiment Treebank (SST2) dataset). We achieved the results after a set of GPU optimizations on MXNet.

In the following graphs, we compare latency achieved by a single GPU on a g4dn.xlarge instance with FP16 precision vs. the most efficient CPU instance in terms of latency, c5.24xlarge with INT8 precision, MKL BLAS and 24 OpenMP threads.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a classification inference task (SST2 dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown; the values for sequence length 128 are included as labels.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a question answering inference task (SQuAD dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown; the values for sequence length 128 are included as labels.

 

In the following two graphs, we present a cost comparison between several instances based on the throughput (sentences/s) and the cost of each instance on demand (cost per hour) in the US East (N. Virginia) Region.

The following graph shows dollars per 1 million sequence classification requests, for different instances, batch size 128, and several sequence lengths (80, 128 and 384). The on-demand price of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

The following graph shows dollars per 1 million question answering requests, for different instances, batch size 128, and several sequence lengths (80, 128 and 384). The on-demand price of each instance per hour was based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

Deploying BERT on G4 instances

You can easily reproduce the results in the preceding section on a g4dn.xlarge instance. You can start from a pretrained model and fine-tune it for a specific task before running inference, or you can download one of the following fine-tuned models:

Then complete the following steps:

  1. To initialize a G4 instance, on the Amazon EC2 console, choose Deep Learning AMI (Ubuntu 18.04) Version 28.1 (or posterior) and a G4 instance.
  2. Connect to the instance and set up MXNet 1.7 and GluonNLP 0.10.x:
pip install mxnet-cu102==1.7.0
git clone --branch v0.10.x https://github.com/dmlc/gluon-nlp.git
cd gluon-nlp; pip install -e .; cd scripts/bert
python setup.py install

The command python setup.py install generates a custom graph pass (bertpass_lib.so) that optimizes the graph and therefore improves performance. It can be passed to the inference script as an argument.

  3. If you didn’t download any fine-tuned parameters, you can now fine-tune your model, specifying a sequence length and using a GPU.
    • For a question answering task, run the following script (approximately 180 minutes):
python3 finetune_squad.py --max_seq_length 128 --gpu
    • For a classification task, run the following script:
python3 finetune_classifier.py --task_name [task_name] --max_len 128 --gpu 0

In the preceding code, task choices include ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’ (refers to SST2), ‘XNLI’, ‘LCQMC’, and ‘ChnSentiCorp’. Computation time depends on the specific task. For SST, it should take less than 15 minutes.

By default, these scripts run 3 epochs (to achieve the published accuracy in [1]).

They generate an output file, output_dir/net.params, where the fine-tuned parameters are stored and from which they can be loaded at the inference step. The scripts also perform a prediction test to check accuracy.

You should get an F1 score of 85 or higher in question answering, and a validation metric higher than 0.92 in the SST classification task.

You can now perform inference using validation datasets.

  4. Force MXNet to use FP32 precision in Softmax and LayerNorm layers for better accuracy when using FP16.

These two layers are susceptible to overflow, so we recommend always using FP32. MXNet takes care of it if you set the following:

export MXNET_SAFE_ACCUMULATION=1
  5. Activate True FP16 computation for performance purposes.

General matrix multiply operations don’t present accuracy issues in this model. By default, they’re computed using FP32 accumulation (for more information, see the section Optimizing BERT model performance on MXNET 1.6 and 1.7 in this post), but you can activate the FP16 accumulation setting:

 export MXNET_FC_TRUE_FP16=1
  6. Run inference:
python3 deploy.py --model_parameters [path_to_finetuned_params] --task [_task_] --gpu 0 --dtype float16 --custom_pass=bertpass_lib.so

In the preceding code, the task can be one of ‘QA’, ‘embedding’, ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’, ‘XNLI’, ‘LCQMC’, or ‘ChnSentiCorp’ [1].

This command exports the model (JSON and parameter files) into the output directory (output_dir/[task_name]), and performs inference using the validation dataset corresponding to each task.

It reports the average latency and throughput.

The second time you run it, you can skip the export step by adding the tag --only_infer and specifying the exported model to use by adding --exported_model followed by the prefix name of the JSON or parameter files.

Optimal latency is achieved on G4 instances with FP16 precision. We recommend adding the flag --dtype float16 and activating MXNET_FC_TRUE_FP16 when performing inference. These flags shouldn’t reduce the final accuracy in your results.

By default, all these scripts use BERT-base (12 transformer-encoder layers). If you want to use BERT-large, use the flag --bert_model bert_24_1024_16 when calling the scripts.

Optimizing BERT model performance on MXNet 1.6 and 1.7

Computationally, the BERT model is mainly dominated by general matrix multiply operations (GEMMs). They represent up to 56% of time consumed when performing inference. The following chart shows the percentage of computational time spent on each operation type performing BERT-base inference (sequence length 128 and batch size 128).

MXNet uses the cuBLAS library to efficiently compute these GEMMs on the GPU. These GEMMs belong to the multi-head self-attention part of the model (4 GEMMs per transformer layer), and the feed-forward network (2 GEMMs per transformer layer).

In this section, we discuss optimizing the most computational-consuming operations.

The following table shows the improvement of each optimization. The performance improvements were achieved by the different GPU BERT optimizations implemented on MXNet and GluonNLP, performing a question answering inference task (SQuAD dataset), and using a sequence length of 128. Speedup achieved is shown for different batch sizes.

LayerNorm, Softmax and AddBias

Although LayerNorm was already optimized for GPUs on MXNet 1.5, the implementation of Softmax was optimized in MXNet 1.6. The new implementation improves inference performance on GPUs by optimizing the device memory accesses and using the CUDA registers and shared memory during reduction operations more efficiently. Additionally, you have the option to apply a max_length mask within the C++ Softmax operator, which removes the need to apply the mask at the Python level.

The addition of bias terms following GEMMs was also optimized. Instead of using an mshadow broadcast summation, a custom CUDA kernel is now attached to the FullyConnected layer, which includes efficient device memory accesses.

Multi-head self-attention

The following equation defines the attention mechanism used in the BERT model [2], where Q represents the query, K the key, V the value, and d_k the inner dimension of these three matrices:
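
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$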

Three different linear projections (FullyConnected: GEMMs and Bias-Addition) are performed to obtain Q, K, and V from the same input (when the same input is employed, the mechanism is denominated self-attention), but with different weights:

    • Q = input · Wq^T
    • K = input · Wk^T
    • V = input · Wv^T

The input size is (BatchSize, SeqLength, EmbeddingDim), and each weight tensor W size is (ProjectionDim, EmbeddingDim).

In multi-head attention, many projections and attention functions are applied to the input as the number of heads, augmenting the dimensions of the weights so that each W size is ((NumHeads x ProjectionDim), EmbeddingDim).

All these projections are independent, so we can compute them in parallel within the same operation, producing an output whose size is (BatchSize, SeqLength, 3 x NumHeads x ProjectionDim). That is, GluonNLP uses a single FullyConnected layer to compute Q, K, and V.

To compute the attention function (the preceding equation), we first need to compute the dot product QK^T. We need to perform this computation independently for each head, with m = SeqLength rows of Q, n = SeqLength columns of K, and k = ProjectionDim as the length of the vectors in the dot product. We can use a batched dot product operation, where the number of batches is (BatchSize x NumHeads), to compute all the dot products within the same operation.

However, to perform such an operation in cuBLAS, we need to have the batches and heads dimensions contiguous (in order to have a regular pattern to express distance between batches), but that isn’t the case by default (SeqLength dimension is between them). To avoid rearranging Q, K, and V, GluonNLP transposes the input so that its shape is (SeqLength, BatchSize, EmbeddingDim), and Q, K, and V are directly projected into a tensor with shape (SeqLength, BatchSize, 3 x NumHeads x ProjectionDim).

Moreover, to avoid splitting the joint QKV output, we can compute the projections in an interleaved fashion, laying out the applied weights Wq, Wk, Wv of each individual head contiguously. The following diagram depicts the interleaved projection operation, where P is the projection size, and we end with a joint QKV output with shape (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim).

This strategy allows us to compute QK^T from a unique joint input tensor with cuBLAS strided batched GEMM (cublasGemmStridedBatchedEx), setting the number of batches to (BatchSize x NumHeads) and the stride to (3 x ProjectionDim). We also use a strided batched GEMM to compute the dot product of V (same stride as before) with the output of the Softmax function. We implemented MXNet operators that deal with this cuBLAS configuration.
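
As a simplified illustration of the batched attention-score computation (not the interleaved, strided cuBLAS path described above), the shapes work out as follows with MXNet’s batch_dot; all dimension values below are arbitrary examples:

import mxnet as mx

batch_size, seq_len, num_heads, proj_dim = 32, 128, 12, 64

# Q, K, V with (batch, heads) flattened into the batch dimension of batch_dot.
q = mx.nd.random.normal(shape=(batch_size * num_heads, seq_len, proj_dim))
k = mx.nd.random.normal(shape=(batch_size * num_heads, seq_len, proj_dim))
v = mx.nd.random.normal(shape=(batch_size * num_heads, seq_len, proj_dim))

# QK^T for every (batch, head) pair in a single batched GEMM.
scores = mx.nd.batch_dot(q, k, transpose_b=True) / (proj_dim ** 0.5)

attn = mx.nd.softmax(scores, axis=-1)   # (BatchSize x NumHeads, SeqLength, SeqLength)
context = mx.nd.batch_dot(attn, v)      # (BatchSize x NumHeads, SeqLength, ProjectionDim)
print(context.shape)                    # (384, 128, 64)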

True FP16

Since MXNet 1.7, you can compute GEMMs entirely in FP16 precision. By default, when the data type is FP16, MXNet sets cuBLAS to internally use FP32 accumulation. You can now set the environment variable MXNET_FC_TRUE_FP16 to 1 to force MXNet to use FP16 as the cuBLAS internal computation type.

Pointwise fusion and prearrangement of MHA weights and bias using a custom graph pass

Finally, the feed-forward part of the model, which happens after each transformer layer, uses the Gaussian Error Linear Unit (GELU) as its activation function. This operation follows a feed-forward FullyConnected operation, which includes bias addition. We use the MXNet functionality of custom graph passes to detach the bias addition from the FullyConnected operation and fuse it with GELU through the pointwise fusion mechanism.

In our custom graph pass for BERT, we also prearrange the weights and bias terms for the multi-head self-attention computation so that we avoid any overhead at runtime. As explained earlier, weights need to be interleaved, and bias terms need to be joint into a unique tensor. We do this before exporting the model. This strategy shows benefits in small batch size cases.

Conclusion

In this post, we presented an efficient solution for performing BERT inference tasks on EC2 G4 GPU instances. We showed how a set of MXNet optimizations boost GPU performance, achieving speeds up to twice as fast in both question answering and classification tasks.

We have shown that g4dn.xlarge instances offer lower latency (below 4 milliseconds with batch size 1) than any EC2 CPU instance, and g4dn.xlarge is 3.8 times better than c5.24xlarge on average. Finally, g4dn.xlarge offers the best cost per million requests ratio—16 times better than CPU instances (c5.xlarge) on average.

 

 

Acknowledgments

We would like to thank Triston Cao, Murat Guney from NVIDIA, Sandeep Krishnamurthy from Amazon, the Amazon-MXNet team, and the NVIDIA MXNet team for their feedback and support.

Disclaimer

The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

References

  1. Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  2. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

About the Authors

Moises Hernandez Fernandez is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research into brain connectivity, optimizing the analysis of diffusion MRI using GPUs. Moises received a PhD in Neurosciences from Oxford University.

Haibin Lin is a former Applied Scientist at Amazon Web Services. He works on distributed systems, deep learning, and NLP. He is a PPMC member and committer of Apache MXNet, and a major contributor to the GluonNLP toolkit. He finished his M.S. in Computer Science at Carnegie Mellon University, advised by Andy Pavlo. Prior to that, he received a B.Eng. in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.

Przemyslaw Tredak is a senior developer technology engineer on the Deep Learning Frameworks team at NVIDIA. He is a committer of Apache MXNet and leads the MXNet team at NVIDIA.

Anish Mohan is a Machine Learning Architect at Nvidia and the technical lead for ML/DL engagements with key Nvidia customers in the greater Seattle region. Before Nvidia, he was at Microsoft’s AI Division, working to develop and deploy AI/ML algorithms and solutions.


AI in Schools: Sony Reimagines Remote Learning with Artificial Intelligence

Back to school was destined to look different this year.

With the world adapting to COVID-19, safety measures are preventing a return to in-person teaching in many places. Also, students learning through conventional video conferencing systems often find that content is difficult to read or that teachers block the words written on presentation boards.

Faced with these challenges, educators at Prefectural University of Hiroshima in Japan envisioned a high-quality remote learning system with additional features not possible with traditional video conferencing.

They chose a distance-learning solution from Sony that links lecturers and students across their three campuses. It uses AI to make it easy for presenters anywhere to engage their audiences and impart information using captivating video. Thanks to these innovations, lecturers at Prefectural University can now teach students simultaneously on three campuses linked by a secure virtual private network.

Sony remote learning solution
Sony’s remote learning solution in action, with Edge Analytics Appliance, remote cameras and projectors.

AI Helps Lecturers Get Smarter About Remote Learning

At the heart of Prefectural’s distance learning system is Sony’s REA-C1000 Edge Analytics Appliance, which was developed using the NVIDIA Jetson Edge AI platform. The appliance lets teachers and speakers quickly create dynamic video presentations without using expensive video production gear or learning sophisticated software applications.

Sony’s exclusive AI algorithms run inside the appliance. These deep learning models employ techniques such as automatic tracking, zooming and cropping to allow non-specialists to produce engaging, professional-quality video in real time.

Users simply connect the Edge Analytics Appliance to a camera that can pan, tilt and zoom; a PC; and a display or recording device. In Prefectural’s case, multiple cameras capture what a lecturer writes on the board as well as questions and contributions from students, delivering images at up to full HD resolution depending on the size of the lecture hall.

Managing all of this technology is made simple for the lecturers. A touchscreen panel facilitates intuitive operation of the system without the need for complex adjustment of camera settings.

Teachers Achieve New Levels of Transparency

One of the landmark applications in the Edge Analytics Appliance is handwriting extraction, which lets students experience lectures more fully, rather than having to jot down notes.

The application uses a camera to record text and figures as an instructor writes them by hand on a whiteboard or blackboard, and then immediately draws them as if they are floating in front of the instructor.

Students viewing the lecture live from a remote location or from a recording afterward can see and recognize the text and diagrams, even if the original handwriting is unclear or hidden by the instructor’s body. The combined processing power of the compact, energy-efficient Jetson TX2 and Sony’s moving/unmoving object detection technology makes the transformation from the board to the screen seamless.

Handwriting extraction is also customizable: the transparency of the floating text and figures can be adjusted, so that characters that are faint or hard to read can be highlighted in color, making them more legible — and even more so than the original content written on the board.

Create Engaging Content Without Specialist Resources

 

Another innovative application is Chroma key-less CG overlay, using state-of-the-art algorithms from Sony, like moving-object detection, to produce class content without the need for large-scale video editing equipment.

Like a personal greenscreen for presenters, the application seamlessly places the speaker in front of any animations, diagrams or graphs being presented.

Previously, moving-object detection algorithms required for this kind of compositing could only be run on professional workstations. With Jetson TX2, Sony was able to include this powerful deep learning-based feature within the compact, simple design of the Edge Analytics Appliance.

A Virtual Camera Operator

Numerous additional algorithms within the appliance include those for color-pattern matching, shape recognition, pose recognition and more. These enable features such as:

  • PTZ Auto Tracking — automatically tracks an instructor’s movements and ensures they stay in focus.
  • Focus Area Cropping — crops a specified portion from a video recorded on a single camera and creates effects as if the cropped portion were recorded on another camera. This can be used to generate, for example, a picture-in-picture effect, where an audience can simultaneously see a close-up of the presenter speaking against a wide shot of the rest of the stage.
  • Close Up by Gesture — automatically zooms in on and records students or audience members who stand up in preparation to ask a question.

With the high-performance Jetson platform, the Edge Analytics Appliance can easily handle a wide range of applications like these. The result is like a virtual camera operator that allows people to create engaging, professional-looking video presentations without the expertise or expense previously required to do so.

Officials at Prefectural University of Hiroshima say the new distance learning initiative has already led to greater student and teacher satisfaction with remote learning. Linking the university’s three campuses through the system is also fostering a sense of unity among the campuses.

“We chose Sony’s Edge Analytics Appliance for our new distance learning design because it helps us realize a realistic and comfortable learning environment for students by clearly showing the contents on the board and encouraging discussion. It was also appealing as a cost-effective solution as teachers can simply operate without additional staff,” said Kyousou Kurisu, director of public university corporation, Prefectural University of Hiroshima.

Sony plans to continually update applications available on the Edge Analytics Appliance. So, like any student, the system will only get better over time.


Whether It’s Rembrandt or Toilets, ‘Curiosity About How Things Work’ Is Key to Innovation, CGI Legend Pat Hanrahan Says

You may have never heard of Pat Hanrahan, but you have almost certainly seen his work.

His list of credits includes three Academy Awards, and his work on Pixar’s RenderMan rendering technology enabled Hollywood megahits Toy Story, Finding Nemo, Cars and Jurassic Park.

Hanrahan also founded Tableau Software — snatched up by Salesforce last year for nearly $16 billion — and has mentored countless technology companies as a Stanford professor.

Hanrahan is the most recent winner of the Turing Award, along with his longtime friend and collaborator Ed Catmull, a former president at Pixar and Disney Animation Studios. The award — a Nobel Prize, of sorts, in computer science — was for their work in 3D computer graphics and computer-generated imagery.

He spoke Thursday at NTECH, NVIDIA’s annual internal engineering conference. The digital event was followed by a virtual chat between NVIDIA CEO Jensen Huang and Hanrahan, who taught a computer graphics course at NVIDIA’s Silicon Valley campus during its early days.

While the theme of his address was “You Can Be an Innovator,” the main takeaway is that a “curiosity about how things work” is a prerequisite.

Hanrahan said his own curiosity about art, and his study of how Rembrandt painted flesh tones, led to a discovery. Artists of the Baroque period, he said, applied an oil-painting technique of building up layers, called impasto, to give depth to skin tones. This led to his own deeper study of light’s interaction with translucent surfaces.

“Artists, they sort of instinctively figured it out,” he said. “They don’t know about the physics of light transport. Inspired by this whole idea of Rembrandt’s, I came up with a mathematical model.”

Hanrahan said innovative people need to be instinctively curious. He tested that out himself when interviewing job candidates in the early days of Pixar. “I asked everybody that I wanted to hire into the engineering team, ‘How does a toilet work?’ To be honest, most people did not know how their toilet worked,” he said, “and these were engineers.”

At the age of seven, he’d already lifted the back cover of the toilet to find out what made it work.

Hanrahan worked with Steve Jobs at Pixar. Jobs’s curiosity and excitement about touch-capacitive sensors — technology that dated back to the 1970s — would eventually lead to the touch interface of the iPhone, he said.

After the talk, Huang joined the video feed from his increasingly familiar kitchen at home and interviewed Hanrahan. The wide-ranging conversation was like a time machine, with questions and reminiscences looking back 20 years and discussions peering forward to the next 20.
