Understanding the key capabilities of Amazon SageMaker Feature Store

One of the challenging parts of machine learning (ML) is feature engineering, the process of transforming data to create features for ML. Features are processed data signals used for training ML models and for deployed models to make accurate predictions. Data scientists and ML engineers can spend up to 60-70% of their time on feature engineering. It’s also typical for different teams within an organization to repeat this work when they use the same data to build ML models for different solutions, which further increases the overall feature engineering effort. Moreover, the generated features must be available for both training and real-time inference use cases to ensure consistency between model training and inference serving.

A purpose-built feature store for ML is needed to ensure both high-quality ML predictions with a consistent set of features, and cost reduction by eliminating duplicate feature engineering effort and storage overhead. Consistent features are needed between different parts of an organization, and between training and inference for any given ML model. There is also a need for feature stores to meet the high performance, scale, and availability requirements to serve features in near-real time for inferences. Because of this, organizations are often forced to do the heavy lifting of building and maintaining feature store systems, which can become expensive and difficult to maintain.

At AWS we are constantly listening to our customers and building solutions and services that delight them. We heard from many customers about the pain their data science and data engineering teams face when managing features, and used those inputs to build the Amazon SageMaker Feature Store, which was launched at re:Invent on December 1, 2020. Amazon SageMaker Feature Store is a fully managed, purpose-built repository to securely store, update, retrieve, and share ML features.

Although there is a lot to unpack in terms of the capabilities that SageMaker Feature Store brings to the table, in this post, we focus on key capabilities for data ingestion, data access, and security and access control.

Overview of SageMaker Feature Store

As a purpose-built feature store for ML, SageMaker Feature Store is designed to enable you to do the following:

  • Securely store and serve features for real-time and batch applications – SageMaker Feature Store serves features at very low latency for real-time use cases, enabling you to make near-real-time, ML-backed decisions with feature vector retrievals at low millisecond latencies (p95 latency lower than 10 milliseconds for a 15-kilobyte payload).
  • Accelerate model development by sharing and reusing features – Feature engineering is a long and tedious process that often gets repeated by multiple teams within an organization working on the same data. SageMaker Feature Store enables data scientists to spend less time on data preparation and feature computation, and more time on ML innovation, by letting them discover and reuse existing engineered features across the organization.
  • Provide historical data access – Features are used for training purposes, and a good feature store should provide easy and quick access to historical feature values to recreate training datasets at a given point in time in the past. Amazon SageMaker Feature Store enables this with support for time-travel queries—querying data at a point in time—which enables you to re-create features at specific points of time in the past.
  • Reduce training-serving skew – Data science teams often struggle with training-serving skew caused by data discrepancy between model training and inference serving, which can cause models to perform worse than expected in production. SageMaker Feature Store reduces training-serving skew by keeping feature consistency between training and inference.
  • Enable data encryption and access control – As with other data assets, ML feature data security is paramount in all organizations. At AWS, security and operational performance are our top priorities, and SageMaker Feature Store provides a suite of capabilities for enterprise-grade data security and access control, including encryption at rest and in transit, and role-based access control using AWS Identity and Access Management (IAM).
  • Guarantee a robust service level – Managed feature store production use cases need service-level guarantees, ensuring that you get the desired performance and availability, and you can rely on expert help should something go wrong. SageMaker Feature Store is backed by AWS’s unmatched reliability, scale and operational efficiency.

SageMaker Feature Store is designed to play a central role in ML architectures, helping you streamline the ML lifecycle and integrating seamlessly with many other services. For example, you can use tools like AWS Glue DataBrew and SageMaker Data Wrangler for feature authoring. You can use Amazon EMR, AWS Glue, and SageMaker Processing in conjunction with SageMaker Feature Store for performing feature transformation tasks. You can use a suite of tools, including SageMaker Pipelines, AWS Step Functions, or Apache Airflow, for scheduling and orchestrating feature pipelines to automate the feature engineering process flow. When you have features in the feature store, you can pull them with low latency from the online store to feed models hosted with services like SageMaker Hosting. You can use existing tools like Amazon Athena, Presto, Spark, and Amazon EMR to extract datasets from the offline store for use with SageMaker training and batch transforms. Lastly, you can use Amazon Kinesis, Apache Kafka, and AWS Lambda for streaming feature engineering and ingestion. The following diagram illustrates some of the services that can be integrated with SageMaker Feature Store.


Before we go into more detail, we briefly introduce some SageMaker Feature Store concepts:

  • Feature group – a logical grouping of ML features
  • Record – a set of values for features in a feature group
  • Online store – the low latency, high availability store that enables real-time lookup of records
  • Offline store – the store that manages historical data in your Amazon Simple Storage Service (Amazon S3) bucket, and is used for exploration and model training use cases

For more information, see Get started with Amazon SageMaker Feature Store.
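To make these concepts concrete, the following is a minimal sketch of creating a feature group with both an online and an offline store using the SageMaker Python SDK. The bucket, IAM role, feature group name, and column names are placeholders, and the DataFrame is assumed to already contain a record identifier column and an event time column:

    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    session = sagemaker.Session()
    role_arn = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role

    # Example feature data; your own pipeline would produce this DataFrame.
    customers_df = pd.DataFrame(
        {
            "customer_id": ["c-1", "c-2"],
            "avg_basket_size": [3.2, 5.7],
            "event_time": [1609459200.0, 1609459200.0],  # Unix epoch seconds
        }
    )
    # Cast object columns to the pandas string dtype so types can be inferred.
    customers_df["customer_id"] = customers_df["customer_id"].astype("string")

    feature_group = FeatureGroup(name="customers-feature-group",
                                 sagemaker_session=session)
    feature_group.load_feature_definitions(data_frame=customers_df)
    feature_group.create(
        s3_uri="s3://my-bucket/feature-store/",  # offline store location
        record_identifier_name="customer_id",
        event_time_feature_name="event_time",
        role_arn=role_arn,
        enable_online_store=True,
    )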

Data ingestion

SageMaker Feature Store provides multiple ways to ingest data, including batch ingestion, ingestion from streaming data sources, and a combination of both. SageMaker Feature Store is built in a modular fashion and is designed to ingest data from a variety of sources, including directly from SageMaker Data Wrangler, or from sources like Kinesis or Apache Kafka. The following diagram shows the various data ingestion mechanisms supported by SageMaker Feature Store.


Streaming ingestion

SageMaker Feature Store provides the low-latency PutRecord API, which is designed for high-throughput, cost-optimized data ingestion with millisecond-level latency. The API can be called from multiple streams in parallel: you can use streaming sources such as Apache Kafka, Kinesis, or Spark Streaming to extract features in real time and feed them directly into the online store, or into both the online and offline stores.
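For example, a stream consumer (such as an AWS Lambda function reading from a Kinesis stream) could write each computed feature record with a single PutRecord call. The following boto3 sketch uses a hypothetical feature group and hypothetical feature names:

    import time
    import boto3

    featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

    # Every feature value is sent as a string, along with the record's event time.
    featurestore_runtime.put_record(
        FeatureGroupName="transactions-feature-group",  # placeholder name
        Record=[
            {"FeatureName": "transaction_id", "ValueAsString": "tx-12345"},
            {"FeatureName": "amount", "ValueAsString": "42.50"},
            {"FeatureName": "event_time", "ValueAsString": str(int(time.time()))},
        ],
    )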

For even faster ingestion, the PutRecord API can be parallelized to support higher throughput writes. The data from all these PUT requests is synchronously written to the online store, and buffered and written to an offline store (Amazon S3) if that option is selected. The data is written to the offline store within a few minutes of ingestion. SageMaker Feature Store provides data and schema validations at ingestion time to ensure data quality is maintained. Validations are done to make sure that input data conforms to the defined data types and that the input record contains all features. If you have configured an offline store, SageMaker Feature Store provides automatic replication of the ingested data into the offline store for future training and historical record access use cases.

Batch ingestion

You can perform batch ingestion to SageMaker Feature Store by integrating it with your feature generation and processing pipelines. You have the flexibility to build feature pipelines with your choice of technology. After performing any data transformations and batch aggregations, the processing pipelines can ingest feature data into the SageMaker Feature Store via batch ingestion.

You can perform batch ingestion in the following three modes (a short code sketch follows the list):

  • Batch ingest into the online store – This can be done by calling the synchronous PutRecord API. SageMaker Feature Store gives you the flexibility to set up an online-only feature store for use cases that don’t require offline feature access, keeping your costs low by avoiding any unnecessary storage. If you have configured your feature group as online-only, the latest values of a record override older values.
  • Batch ingest into the offline store – You can choose to ingest data directly into your offline store. This is useful when you want to backfill historical records for training use cases. This can be done from SageMaker Data Wrangler or directly through a SageMaker Processing job Spark container. The offline store resides in your account and uses Amazon S3 to store data. This gives you the benefits of Amazon S3, including low cost of storage, durability, reliability and flexible access control. In addition, the feature group created in the offline store can be registered with appropriate metadata to provide support for search, discovery, and automatic creation of an AWS Glue Data Catalog.
  • Batch ingest into both the online and offline store – If your feature group is configured to have both online and offline stores, you can do batch ingestion by calling the PutRecord API. In this case, only the latest values are stored in the online store, and the offline store maintains both your older records and the latest record. The offline store is an append-only data store.
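As a minimal sketch of the batch path, the SageMaker Python SDK’s FeatureGroup.ingest method fans PutRecord calls out over a pandas DataFrame produced by your feature pipeline. The feature group name and columns below are placeholders:

    import pandas as pd
    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    # Output of a batch transformation job (Spark, AWS Glue, SageMaker Processing, etc.).
    features_df = pd.DataFrame(
        {
            "customer_id": ["c-1", "c-2"],
            "avg_basket_size": [3.2, 5.7],
            "event_time": [1609459200.0, 1609459200.0],
        }
    )

    feature_group = FeatureGroup(name="customers-feature-group",
                                 sagemaker_session=sagemaker.Session())

    # ingest() parallelizes PutRecord calls across worker threads.
    feature_group.ingest(data_frame=features_df, max_workers=4, wait=True)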

To see an example of how you can couple streaming and batch feature engineering for an ML use case, see Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.

Data access

In this section, we discuss the details of real-time data access, access from the offline store, and advanced query patterns.

Real-time data access

SageMaker Feature Store provides a low latency GetRecord API, which is designed to serve real-time inference use cases. This is a synchronous API that provides strong read consistency and can be parallelized to support high-throughput applications. The GetRecord API lets you retrieve an entire record with all its features or a specific subset of features, which helps optimize access for shared feature groups with hundreds or thousands of features.
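For real-time lookups, a minimal boto3 sketch might look like the following; the feature group name, record identifier, and feature names are placeholders:

    import boto3

    featurestore_runtime = boto3.client("sagemaker-featurestore-runtime")

    response = featurestore_runtime.get_record(
        FeatureGroupName="customers-feature-group",
        RecordIdentifierValueAsString="c-1",
        FeatureNames=["avg_basket_size", "days_since_last_order"],  # optional subset
    )

    # Convert the response into a simple feature dictionary for the model.
    feature_vector = {item["FeatureName"]: item["ValueAsString"]
                      for item in response.get("Record", [])}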

Data access from the offline store

SageMaker Feature Store uses an S3 bucket in your account to store offline data. You can use query engines like Athena against the offline data store in Amazon S3 to analyze feature data or to join more than one feature group in a single query. SageMaker Feature Store automatically builds the AWS Glue Data Catalog for feature groups during feature group creation, and you can then use this catalog to access and query the data from the offline store using Athena or even open-source tools like Presto. You can set up an AWS Glue crawler to run on a schedule to make sure your catalog is always up to date. Because the offline store is in Amazon S3 in your account, you can use any of the capabilities that Amazon S3 provides, like replication.
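As a rough sketch, the SageMaker Python SDK exposes the automatically built Glue table so you can query the offline store with Athena and load the results into a DataFrame. The feature group name and output location are placeholders:

    import sagemaker
    from sagemaker.feature_store.feature_group import FeatureGroup

    feature_group = FeatureGroup(name="customers-feature-group",
                                 sagemaker_session=sagemaker.Session())

    query = feature_group.athena_query()  # wraps the Glue Data Catalog table
    query.run(
        query_string=f'SELECT * FROM "{query.table_name}" LIMIT 100',
        output_location="s3://my-bucket/athena-query-results/",
    )
    query.wait()
    training_df = query.as_dataframe()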

For an example showing how you can run an Athena query on a dataset containing two feature groups using a Data Catalog that was built automatically, see Build Training Dataset. For detailed sample queries, see Athena and AWS Glue in Feature Store. These queries are also available in SageMaker Studio.

Advanced query patterns

The SageMaker Feature Store design allows you to access your data using advanced query patterns. For example, it’s easy to run a query against your offline store to see what your data looked like a month ago (time-travel). SageMaker Feature Store requires a parameter called EventTimeFeatureName in your feature group to store an event time for each record. This, combined with the append-only capability of the offline store, allows you to easily use query engines to get a snapshot of your data based on the event time feature. Other patterns include querying data after removing duplicates, reconstructing a dataset based on past events for training models, and gathering data needed for ensuring compliance with regulations.
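As a rough illustration of a time-travel style query (not an official template), the following sketch selects, for each record identifier, the most recent value whose event time falls at or before a chosen point in time. It assumes the event time feature is stored as an ISO-8601 string and relies on the api_invocation_time, write_time, and is_deleted columns that the offline store appends to each record; verify the exact table and column names against your own Data Catalog:

    point_in_time = "2020-12-01T00:00:00Z"  # the historical moment to reconstruct

    time_travel_query = f"""
    SELECT *
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id
                   ORDER BY event_time DESC, api_invocation_time DESC, write_time DESC
               ) AS row_rank
        FROM "customers_feature_group_table"
        WHERE event_time <= '{point_in_time}'
    ) AS deduplicated
    WHERE row_rank = 1
      AND NOT is_deleted
    """
    # Run this string with the Athena query helper shown in the previous section.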

We plan to publish a detailed post on how to use advanced query patterns (including time-travel) very soon.

Security: Encryption and access control

At AWS, we take data security very seriously, and as such, SageMaker Feature Store is architected to provide end-to-end encryption, fine-grained access control mechanisms, and the ability to set up private access via VPC.

Encryption at rest and in transit

After you ingest data, your data is always encrypted at rest and in transit. When you create a feature group for online or offline access, you can provide an AWS Key Management Service (AWS KMS) customer master key (CMK) to encrypt all your data at rest. If you don’t provide a CMK, we ensure that your data is encrypted on the server side using an AWS-managed CMK. We also support having different CMKs for online and offline stores.

Access control

SageMaker Feature Store lets you set up fine-grained access control to your data and APIs by using IAM user roles and policies to allow or deny specific actions. You can set up access control at the API or account level to enforce policies across all feature groups, or for individual feature groups. Creating, deleting, describing, and listing feature groups are all operations that can be managed by IAM policies. You can also set up private access to all operations in your app from your VPC via AWS PrivateLink.

Summary

At Amazon, customer obsession is in our DNA. We have spent countless hours listening to many customers and understanding their key pain points with managing features at an enterprise level for ML, and have used those requirements to develop SageMaker Feature Store.

SageMaker Feature Store is a purpose-built store that lets you define features one time for both large-scale offline model building and batch inference use cases, and also get up to single-digit millisecond retrievals for real-time inference. You can easily name, organize, find, and share feature groups among teams of developers and data scientists, all from Amazon SageMaker Studio. SageMaker Feature Store offers feature consistency between training and inference by automatically replicating feature values from the online store to the historical offline store for model building. It’s tightly integrated with SageMaker Data Wrangler and SageMaker Pipelines to build repeatable feature engineering pipelines, but is also modular enough to easily integrate with your existing data processing and inference workflows. SageMaker Feature Store provides end-to-end encryption, secure data access, and API-level controls to ensure that your data is adequately protected. For more information, see New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store.

We understand how crucial it is for you to get the right service guarantees when running your mission-critical applications on Amazon SageMaker Feature Store. SageMaker Feature Store is therefore backed by the same service assurances that customers rely on AWS to provide.


About the Authors

Lakshmi Ramakrishnan is a Principal Engineer on the Amazon SageMaker machine learning (ML) platform team at AWS, providing technical leadership for the product. He has worked in several engineering roles at Amazon for over 9 years. He has a Bachelor of Engineering degree in Information Technology from the National Institute of Technology Karnataka, India, and a Master of Science degree in Computer Science from the University of Minnesota Twin Cities.

 

 

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

 

Ravi Khandelwal is a Software Dev Manager on the Amazon SageMaker team, leading engineering for SageMaker Feature Store. Prior to joining AWS, he held engineering leadership roles at Amazon.com, FICO, and Thomson Reuters. He has an MBA from the Carlson School of Management and an engineering degree from the Indian Institute of Technology, Varanasi. He enjoys backpacking in the Pacific Northwest and is working toward a goal of hiking in all US national parks.

 

 

Dr. Romi Datta is a Principal Product Manager on the Amazon SageMaker team, responsible for training and feature store. He has been at AWS for over 2 years, holding several product management leadership roles in S3 and IoT. Prior to AWS, he worked in various product management, engineering, and operational leadership roles at IBM, Texas Instruments, and NVIDIA. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.


Saving time with personalized videos using AWS machine learning

CLIPr aspires to help save 1 billion hours of people’s time. We organize video into a first-class, searchable data source that unlocks the content most relevant to your interests using AWS machine learning (ML) services. CLIPr simplifies the extraction of information in videos, saving you hours by eliminating the need to skim through them manually to find the most relevant information. CLIPr provides simple AI-enabled tools to find, interact, and share content across videos, uncovering your buried treasure by converting unstructured information into actionable data and insights.

How CLIPr uses AWS ML services

At CLIPr, we’re leveraging the best of what AWS and the ML stack have to offer to delight our customers. At its core, CLIPr uses the latest ML, serverless, and infrastructure as code (IaC) design principles. AWS allows us to consume cloud resources just when we need them, and we can deploy a completely new customer environment in a couple of minutes with a single script. The second benefit is scale. Processing video requires an architecture that can scale vertically and horizontally by running many jobs in parallel.

As an early-stage startup, time to market is critical. Building models from the ground up for key CLIPr features like entity extraction, topic extraction, and classification would have taken us a long time to develop and train. We quickly delivered advanced capabilities by using AWS AI services for our applications and workflows. We used Amazon Transcribe to convert audio into searchable transcripts, Amazon Comprehend for text classification and organizing by relevant topics, Amazon Comprehend Medical to extract medical ontologies for a health care customer, and Amazon Rekognition to detect people’s names, faces, and meeting types for our first MVP. We were able to iterate fairly quickly and deliver quick wins that helped us close our pre-seed round with our investors.

Since then, we have started to upgrade our workflows and data pipelines to build in-house proprietary ML models, using the data we gathered in our training process. Amazon SageMaker has become an essential part of our solution. It’s a fabric that enables us to provide ML in a serverless model with unlimited scaling. The ease of use and flexibility to use any ML and deep learning framework of choice was an influencing factor. We’re using TensorFlow, Apache MXNet, and SageMaker notebooks.

Because we used open-source frameworks, we were able to attract and onboard data scientists to our team who are familiar with these platforms and quickly scale it in a cost-effective way. In just a few months, we integrated our in-house ML algorithms and workflows with SageMaker to improve customer engagement.

The following diagram shows our architecture of AWS services.

The more complex user experience is our Trainer UI, which allows human review of data collected via CLIPr’s AI processing engine in a timeline view. Humans can augment the AI-generated data and also fix potential issues. Human oversight helps us ensure accuracy and continuously improve and retrain models with updated predictions. An excellent example of this is speaker identification. We construct spectrograms from samples of the meeting speakers’ voices and video frames, and can identify and correlate the names and faces (if there is video) of meeting participants. The Trainer UI also includes the ability to inspect the process workflow, and issues are flagged to help our data scientists understand what additional training may be required. A typical example of this is the visual cues used to identify when speaker names differ across meeting platforms.

Using CLIPr to create a personalized re:Invent video

We used CLIPr to process all the AWS re:Invent 2020 keynotes and leadership sessions to create a searchable video collection so you can easily find, interact, and share the moments you care about most across hundreds of re:Invent sessions. CLIPr became generally available in December 2020, and today we launched the ability for customers to upload their own content.

The following is an example of a CLIPr processed video of Andy’s keynote. You get to apply filters to the entire video to match topics that are auto-generated by CLIPr ML algorithms.

CLIPr dynamically creates a custom video from the keynote by aggregating the topics and moments that you select. Upon choosing Watch now, you can view your video composed of the topics and moments you selected. In this way, CLIPr is a video enrichment platform.

Our commenting and reaction features provide a co-viewing experience where you can see and interact with other users’ reactions and comments, adding collaborative value to the content. Back in the early days of AWS, low-flying-hawk was a huge contributor to the AWS user forums. The AWS team often sought low-flying-hawk’s thoughts on new features, pricing, and issues we were experiencing. Low-flying-hawk was like having a customer in our meetings without actually being there. Imagine what it would be like to have customers, AWS service owners, and presenters chime in and add context to the re:Invent presentations at scale.

Our customers very much appreciate the Smart Skip feature, where CLIPr gives you the option to skip to the beginning of the next topic of interest.

We built a natural language query and search capability so our customers can find moments easily and quickly. For instance, you can search for “SageMaker” in CLIPr search. We do a deep search across our entire media assets, including keywords, video transcripts, topics, and moments, to present instant results. In such a search (see the following screenshot), CLIPr highlights Andy’s keynote sessions and also includes specific moments when SageMaker is mentioned in Swami Sivasubramanian’s and Matt Wood’s sessions.

CLIPr also enables advanced analytics capabilities using knowledge graphs, allowing you to understand the most important moments, including correlations across your entire video assets. The following is an example of the knowledge graph correlations from all the re:Invent 2020 videos filtered by topics, speakers, or specific organizations.

We provide a content library of re:Invent sessions, with all the keynotes and leadership sessions, to save you time and help you make the most out of re:Invent. Try CLIPr in action with re:Invent videos and see how CLIPr uses AWS to make it all happen.

Conclusion

Create an account at www.clipr.ai and create a personalized view of re:Invent content. You can also upload your own videos, so you can spend more time building and less time watching!

About the Authors

Humphrey Chen’s experience spans from product management at AWS and Microsoft to advisory roles with Noom, Dialpad, and GrayMeta. At AWS, he was Head of Product and then Key Initiatives for Amazon’s Computer Vision. Humphrey knows how to take an idea and make it real. His first startup was the equivalent of Shazam for FM radio and launched in 20 cities with AT&T and Sprint in 1999. Humphrey holds a Bachelor of Science degree from MIT and an MBA from Harvard.

Aaron Sloman is a Microsoft alum who launched several startups before joining CLIPr, with ventures including Nimble Software Systems, Inc., CrossFit Chalk, and speakTECH. Aaron was recently the architect and CTO for OWNZONES, a media supply chain and collaboration company, using advanced cloud and AI technologies for video processing.


Deepset achieves a 3.9x speedup and 12.8x cost reduction for training NLP models by working with AWS and NVIDIA

This is a guest post from deepset (creators of the open source frameworks FARM and Haystack), and was contributed to by authors from NVIDIA and AWS. 

At deepset, we’re building the next-level search engine for business documents. Our core product, Haystack, is an open-source framework that enables developers to utilize the latest NLP models for semantic search and question answering at scale. Our software as a service (SaaS) platform, Haystack Hub, is used by developers from various industries, including finance, legal, and automotive, to find answers in all kinds of text documents. You can use these answers to improve the search experience, cover the long-tail of chat bot queries, extract structured data from documents, or automate invoicing processes.

Pretrained language models like BERT, RoBERTa, and ELECTRA form the core for this latest type of semantic search and many other NLP applications. Although plenty of English models are available, the availability for other languages and more industry-specific terms (such as finance or automotive) is usually very limited and often complicates applications in the industry. Therefore, we regularly train language models for languages not covered by existing models (such as German BERT and German ELECTRA), models for special domains (such as finance and aerospace), or even models for client-specific jargon.

Challenge

Pretraining language models from scratch typically involves two major challenges: cost and development effort.

Training a language model is an extremely compute-intensive task and requires multiple GPUs running for multiple days. To give you a rough idea, training the original RoBERTa model took about 1 day on 1024 NVIDIA V100 GPUs.

Computation costs aren’t the only thing that can stress your budget. A considerable amount of manual development is required to create the training data and vocabulary, configure hyperparameters, start and monitor training jobs, and run periodical evaluation of different model checkpoints. In our first training runs, we also found several bugs only after multiple hours of training, resulting in a slow development cycle. In summary, language model training can be a painful job for a developer and easily consumes multiple days of work.

Solution

In a collaborative effort, AWS, NVIDIA, and deepset were able to complete training 3.9 times faster while lowering cost by 12.8 times and reducing developer effort from days to hours. We optimized the GPU utilization during training via PyTorch’s DistributedDataParallel (DDP) and enabled larger batch sizes by switching to Automatic Mixed Precision (AMP). Furthermore, we introduced a StreamingDataSilo that allows us to load the training data lazily from disk and to do the preprocessing on the fly, leading to a lower memory footprint and no initial preprocessing time. Last but not least, we integrated the training with Amazon SageMaker to reduce manual development effort and benefit from around a 70% cost reduction by using Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.  

In this post, we explore each of these technologies and their impact on improving BERT training performance. 

DistributedDataParallel

DistributedDataParallel (DDP) implements distributed data parallelism in PyTorch. This is a key module that’s essential for running training jobs at scale, on multiple machines or on multiple GPUs in a single machine. DDP parallelizes a given network module by splitting the input across specified devices (GPUs). The input is split in the batch dimension. The network module is replicated on each device, and each such replica handles a slice of the input. The gradients from each device are averaged during a backward pass. DDP is used in conjunction with the torch.distributed framework, which handles all communication and synchronization in distributed training jobs with PyTorch.

Before DDP was introduced, torch.nn.DataParallel (DP) was the standard module for doing single-machine multi-GPU training in PyTorch. The system in DP works as follows:

  1. The entire batch of input is loaded on the main thread.
  2. The batch is split and scattered across all the GPUs in the network.
  3. Each GPU runs the forward pass on its split of the batch on a separate thread.
  4. The network outputs are gathered on the master GPU, the loss value is computed, and this loss value is then scattered across the other GPUs.
  5. Each GPU uses the loss value to run a backward pass and compute the gradients.
  6. The gradients are reduced on the master GPU and the model parameters on the master GPU are updated. This completes one iteration of training.
  7. To ensure all GPUs have the latest model parameters, they are broadcasted to the other GPUs at the start of the next iteration.

This design of DP has several inefficiencies:

  • Uneven GPU utilization – Only the master GPU handles loss calculation, gradient reduction, and parameter updates, which leads to higher GPU memory consumption compared to the rest of the GPUs
  • Unnecessary broadcast at the beginning of each iteration – Because the model parameters are only updated in the master GPU, they need to be broadcast to the other GPUs before every iteration
  • Unnecessary gathering step – The model outputs are needlessly gathered on the master GPU
  • Redundant data copies – Data is first copied to the master GPU, which is then split and copied over to the other GPUs
  • Multithreading overhead – Performance overhead caused by the Global Interpreter Lock (GIL) of the Python interpreter

DDP eliminates all these inefficiencies of DP. DDP uses a multiprocessing architecture, unlike the multithreaded one in DP. This means each GPU has its own dedicated process, which runs independently, and there is no master GPU anymore. Each process starts by loading its own split of data from disk. Then the forward pass and loss computation run independently on each GPU, which eliminates the need to gather network outputs. During the backward pass, the gradients are AllReduced across the GPUs. Averaging the gradients with AllReduce ensures that the gradients in each GPU are identical, so the model updates in each GPU are identical as well, which eliminates the need for a parameter broadcast at the start of the next iteration.
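For readers who haven’t used it, the following is a minimal, generic PyTorch DDP sketch for a single machine with multiple GPUs (a toy model and random data, not FARM’s actual training code): one process per GPU, the model wrapped in DistributedDataParallel, and gradient AllReduce happening during backward():

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    from torch.nn.parallel import DistributedDataParallel as DDP

    def train(rank, world_size):
        # One process per GPU; each process owns exactly one device.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)

        model = torch.nn.Linear(128, 2).to(rank)  # stand-in for the real network
        ddp_model = DDP(model, device_ids=[rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
        loss_fn = torch.nn.CrossEntropyLoss()

        for _ in range(10):  # stand-in training loop with random data
            inputs = torch.randn(32, 128, device=rank)
            labels = torch.randint(0, 2, (32,), device=rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(inputs), labels)
            loss.backward()  # gradients are AllReduced across processes here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = torch.cuda.device_count()
        mp.spawn(train, args=(world_size,), nprocs=world_size)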

Results with DDP

With DP, the GPU memory utilization was skewed towards the master GPU, which consumed 15 GB, whereas all other GPUs consumed 6 GB. With DDP, the memory consumption was split equally across 4 GPUs, or about 9 GB on each. This also allowed us to increase the per GPU batch size, which further contributed to the speedup in throughput. We reduced the number of gradient accumulation steps to keep the effective batch size constant, to ensure that there was no impact in convergence.

The following table shows the results on a p3.8xlarge instance with 4 NVIDIA V100 GPUs. With DDP, the training time was reduced from 616 hours to 347 hours.

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
BERT Training with DP | 105 | 9 | 945 | 811 | 616
BERT Training with DDP | 240 | 4 | 960 | 1415 | 347

The following screenshots show the GPU performance profiles captured by NVIDIA Nsight Systems. The first screenshot shows the profile while running with DP. Light blue bars in the box marked “GPU Utilization” show when the GPUs are busy; gaps between the blue areas show that GPU utilization is zero. The red blocks in between are CPU-to-GPU memory copy operations. The additional blue block between the memory copies is the aggregation operation, which is computed on a single GPU.


The following screenshot shows high GPU utilization with DDP, which effectively eliminates all those inefficiencies.


Training with Automatic Mixed Precision

Automatic Mixed Precision (AMP) speeds up deep learning training with minimal impact on final accuracy. Traditionally, 32-bit precision floating point (FP32) variables are commonly used in deep learning training. You can improve training speed with 16-bit precision floating point (FP16) variables because they require less storage and less memory bandwidth. However, training with lower precision can decrease the accuracy of the results. Mixed precision training is a balanced approach that achieves the computational speedup of lower-precision training while maintaining accuracy close to FP32 precision training. Training with mixed precision provides an additional speedup by using NVIDIA Tensor Cores, specialized hardware available on NVIDIA GPUs for accelerated computation.

To maintain accuracy, training with mixed precision involves the following steps:

  1. Port the model to use the FP16 datatype where appropriate.
  2. Handle specific functions or operations that must be done in FP32 to maintain accuracy.
  3. Add loss scaling to preserve small gradient values.

The AMP feature handles all these steps for deep learning training. As of this writing, all popular deep learning frameworks like PyTorch, TensorFlow, and Apache MXNet support AMP.

AMP in PyTorch was initially supported via the NVIDIA APEX library. AMP support has since moved to PyTorch core with the 1.6 release.
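Our results below use the APEX opt levels; for reference, the equivalent native torch.cuda.amp API in PyTorch 1.6 and later looks roughly like the following sketch (a toy model and random data, not FARM’s actual training code):

    import torch
    from torch.cuda.amp import GradScaler, autocast

    model = torch.nn.Linear(128, 2).cuda()  # stand-in for the real network
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()  # handles loss scaling to preserve small gradients

    for _ in range(10):  # stand-in training loop with random data
        inputs = torch.randn(32, 128, device="cuda")
        labels = torch.randint(0, 2, (32,), device="cuda")
        optimizer.zero_grad()
        with autocast():  # runs eligible ops in FP16, keeps sensitive ops in FP32
            loss = loss_fn(model(inputs), labels)
        scaler.scale(loss).backward()  # backprop on the scaled loss
        scaler.step(optimizer)         # unscales gradients, then steps
        scaler.update()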

AMP in the APEX library provides four levels of optimization for different application usage. Optimization levels O1 and O2 are both mixed precision modes with slight differences: O1 is the recommended way for typical use cases, and O2 more aggressively converts most layers into FP16 mode. The O0 and O3 opt levels are actually the full FP32 mode and FP16 mode, designed for reference only.

The following table shows the impact of applying AMP along with DDP. Without AMP, a batch size of up to 240 could be run on the GPU. With AMP, a larger batch size of 320 could be supported, reducing the total training time from 347 hours to 223 hours.

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
BERT Training with DDP | 240 | 4 | 960 | 1415 | 347
BERT Training with DDP & AMP O1 | 304 | 3 | 912 | 2025 | 243
BERT Training with DDP & AMP O2 | 320 | 3 | 960 | 2210 | 223

As mentioned earlier, AMP O2 converts more layers into FP16 mode, so we can run DDP and AMP O2 with a larger batch size and get better throughput compared to DDP and AMP O1. When selecting between these two opt levels, you should validate the prediction results to make sure AMP O2 meets your accuracy requirements.

The following screenshot shows the GPU performance profile after applying DDP but running the deep learning training with FP32 variables. In this profile, we have added custom markers called NVTX markers, which show the time taken for each epoch, each step, and the time for the forward and backward pass.


The following screenshot shows the profile after enabling AMP with opt level O2. The time to run a forward and backward pass decreased significantly, even though we increased the training batch size when using AMP.


Earlier, we mentioned that AMP utilizes the Tensor Cores available on NVIDIA GPU hardware to significantly speed up deep learning training. GPU performance profiles show when operations are utilizing Tensor Cores. The following screenshot shows a sample GPU kernel that is run in FP32 mode. The GPU operation is marked here as the volta_sgemm kernel.


The following screenshot shows similar operations run in FP16 mode, which utilizes Tensor Cores. Kernels running with Tensor Cores are marked as volta_fp16_s884gemm.


Data pipeline

The datasets used for training language models typically contain 10–200 GB of raw text data. Loading the whole dataset into RAM can be challenging. Furthermore, the typical pipeline of first running the preprocessing for all data and then pulling batches during training isn’t optimal, because the up-front preprocessing can take multiple hours during which the GPUs on the server sit idle.

Therefore, we introduced a StreamingDataSilo, which loads data lazily from disk just in time when it’s needed in the training loop. The whole preprocessing happens on the fly. Our implementation builds upon PyTorch’s IterableDataset and DistributedSampler, but requires some custom parts to ensure that enough preprocessed batches are always in the queue for our trainer, so that the GPU never has to wait for the next batch to be ready. For implementation steps, see the GitHub repo. Together with an increased number of workers to fill the queue, we ended up with another 28% throughput improvement, as shown in the following table (a minimal sketch of the lazy-loading idea follows the table).

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
DDP with 8 workers | 320 | 3 | 960 | 2210 | 223
DDP with 16 workers | 320 | 3 | 960 | 3077 | 160
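To make the lazy-loading idea concrete, the following is a minimal sketch of a PyTorch IterableDataset that reads and tokenizes a text file on the fly and shards lines across DataLoader workers. This is a simplified illustration, not deepset’s actual StreamingDataSilo; the file path and tokenize function are placeholders:

    from torch.utils.data import DataLoader, IterableDataset, get_worker_info

    class StreamingTextDataset(IterableDataset):
        """Lazily reads a corpus from disk and preprocesses it on the fly."""

        def __init__(self, file_path, tokenize_fn):
            self.file_path = file_path      # e.g. one document per line
            self.tokenize_fn = tokenize_fn  # your tokenization / featurization

        def __iter__(self):
            worker = get_worker_info()
            num_workers = worker.num_workers if worker else 1
            worker_id = worker.id if worker else 0
            with open(self.file_path, encoding="utf-8") as f:
                for i, line in enumerate(f):
                    if i % num_workers == worker_id:  # shard lines across workers
                        yield self.tokenize_fn(line)

    # dataset = StreamingTextDataset("corpus.txt", my_tokenizer)  # placeholders
    # loader = DataLoader(dataset, batch_size=32, num_workers=16)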

A second tricky case that we had to handle was related to the unknown number of batches in our dataset and the distributed training via DDP. If the batches in your dataset can’t be evenly distributed across the number of workers, some workers don’t get any batches in the last step of the epoch while others do. This asynchronicity can crash your whole training run or result in a deadlock (for a related PyTorch issue, see [RFC] Join-based API to support uneven inputs in DDP). We handled this by adding a small synchronization step where all workers communicate if they still have data left. For implementation details, see the GitHub repo.

Spot Instances

Besides reducing the total training time, using EC2 Spot Instances is another compelling approach to reduce training costs. This is pretty straightforward to configure in SageMaker; just set the parameter EnableManagedSpotTraining to True when launching your training job. SageMaker launches your training job and saves checkpoints periodically. When your Spot Instance ends, SageMaker takes care of spinning up a new instance and loading the latest checkpoint to continue training from there.

In your code, you need to make sure to save regular checkpoints containing the states of all the objects that are relevant for your training session. This includes not only your model weights, but also the states of your optimizer, the data loader, and all random number generators to replicate the results from your continuous runs without Spot Instances. For implementation details, see the GitHub repo.
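As an illustration only, a SageMaker training job with managed Spot Instances and checkpointing might be configured as follows with the SageMaker Python SDK; the script name, role, S3 paths, instance type, and framework version are placeholders, and use_spot_instances corresponds to the EnableManagedSpotTraining API parameter:

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",  # must save and restore its own checkpoints
        role="arn:aws:iam::123456789012:role/MySageMakerRole",
        instance_count=1,
        instance_type="ml.p3.8xlarge",
        framework_version="1.6.0",
        py_version="py3",
        use_spot_instances=True,  # managed Spot training
        max_run=5 * 24 * 3600,    # max training time, in seconds
        max_wait=6 * 24 * 3600,   # must be >= max_run; includes time waiting for capacity
        checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # synced with /opt/ml/checkpoints
    )

    estimator.fit({"training": "s3://my-bucket/training-data/"})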

In our test runs, we achieved around 70% cost savings in comparison to regular On-Demand Instances.

Conclusion

Language models have become the backbone of modern NLP. Although using existing public models works well in many cases, many domains with specialized language can benefit from training a new model from scratch. Having a fast, simple, and cheap training pipeline is essential for these big training jobs. In addition, the increased efficiency of training jobs reduces our energy usage and lowers our carbon footprint. By tackling different areas of FARM’s training pipeline, we were able to significantly optimize the resource utilization. In the end, we achieved a 3.9 times speedup in training time, a 12.8 times reduction in training cost, and a reduction in the required developer effort from days to hours.

If you’re interested in training your own BERT model, you can look at the open-source code in FARM or try our free SageMaker algorithm on the AWS Marketplace.


About the Authors

Abhinav Sharma is a Software Engineer at AWS Deep Learning. He works on bringing state-of-the-art deep learning research to customers, building products that help customers use deep learning engines. Outside of work, he enjoys playing tennis, noodling on his guitar and watching thriller movies.

Malte Pietsch is Co-Founder & CTO at deepset, where he builds the next-level enterprise search engine fueled by open source and NLP. He holds an M.Sc. with honors from TU Munich and conducted research at Carnegie Mellon University. He is an open-source lover, likes reading papers before breakfast, and is obsessed with automating the boring parts of our work.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for AWS machine learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Jiahong Liu is a Solution Architect on the NVIDIA Cloud Service Provider team, where he helps customers adopt ML and AI solutions that make better use of NVIDIA GPUs to solve their business challenges.

Anish Mohan is a Machine Learning Architect at NVIDIA and the technical lead for ML and DL engagements with key NVIDIA customers in the greater Seattle region. Before NVIDIA, he was at Microsoft’s AI Division, working to develop and deploy AI and ML algorithms and solutions.


How to deliver natural conversational experiences using Amazon Lex Streaming APIs

Natural conversations often include pauses and interruptions. During customer service calls, a caller may ask to pause the conversation or hold the line while they look up the necessary information before continuing to answer a question. For example, callers often need time to retrieve credit card details when making bill payments. Interruptions are also common. Callers may interrupt a human agent with an answer before the agent finishes asking the entire question (for example, “What’s the CVV code for your credit card? It’s the three-digit code in the top right corner…”). Just like conversing with human agents, a caller interacting with a bot may interrupt or instruct the bot to hold the line. Previously, you had to orchestrate such dialog on Amazon Lex by managing client attributes and writing code via an AWS Lambda function. Implementing a hold pattern required code to keep track of the previous intent so that the bot could continue the conversation. The orchestration of these conversations was complex to build and maintain, and impacted the time to market for conversational interfaces. Moreover, the user experience was disjointed because the properties of prompts, such as the ability to interrupt, were defined in the session attributes on the client.

Amazon Lex’s new streaming conversation APIs allow you to deliver sophisticated natural conversations across different communication channels. You can now easily configure pauses, interruptions and dialog constructs while building a bot with the Wait and Continue and Interrupt features. This simplifies the overall design and implementation of the conversation and makes it easier to manage. By using these features, the bot builder can quickly enhance the conversational capability of virtual agents or IVR systems.

In the new Wait and Continue feature, the ability to put the conversation into a waiting state is surfaced during slot elicitation. You can configure the slot to respond with a “Wait” message such as “Sure, let me know when you’re ready” when a caller asks for more time to retrieve information. You can also configure the bot to continue the conversation with a “Continue” response based on defined cues such as “I’m ready for the policy ID. Go ahead.” Optionally, you can set a “Still waiting” prompt to play messages like “I’m still here” or “Let me know if you need more time.” You can set the frequency of these messages to play and configure a maximum wait time for user input. If the caller doesn’t provide any input within the maximum wait duration, Amazon Lex resumes the dialog by prompting for the slot. The following screenshot shows the wait and continue configuration options on the Amazon Lex console.


The Interrupt feature enables callers to barge-in while a prompt is played by the bot. A caller may interrupt the bot and answer a question before the prompt is completed. This capability is surfaced at the prompt level and provided as a default setting. On the Amazon Lex console, navigate to the Advanced Settings and under Slot prompts, enable the setting to allow users to interrupt the prompt.


After configuring these features, you can initiate a streaming interaction with the Lex bot by using the StartConversation API. The streaming capability enables you to capture user input, manage state transitions, handle events, and deliver a response required as part of a conversation. The input can be one of three types: audio, text, or DTMF, whereas the response can be either audio or text. The dialog progresses by eliciting an intent, populating any slots, confirming the intent, and finally closing the intent. Streaming allows intents to be defined based on different conversation states, such as InProgress, Waiting, Confirmed, Denied, Fulfilled, ReadyForFulfillment, or Failed. For a detailed list of the different dialog and intent states, see the documentation for Amazon Lex intents.

The StartConversation API uses HTTP/2 for bidirectional communication between the client and Amazon Lex. Clients are expected to maintain a single long-running stream throughout the conversation and exchange required information with the server (and vice versa) via different events. To learn more about streaming for Amazon Lex bots, please refer to the Lex streaming documentation.

Amazon Lex Streaming APIs in action

Let’s put all of this together in a bot. I first use the Amazon Lex console to configure a bot and then use the following sample telephony application to interact with the bot. For the purposes of this post, we use a simple bot to complete a bill payment with a credit card. Because callers often need time to retrieve their credit card details, we enable the wait and continue construct when eliciting the credit card number. We also set up a service-side interrupt to manage interruption.

The following is a sample conversation with our bot:

Caller:    I’d like to make a payment on my account
Bot:        Sure. Can we start off with your account ID?
Caller:    Yes the account ID is 1234
Bot:        Great. The balance due is $100. How much would you like to pay today?
Caller:    $100
Bot:        Ok. Can you share the 16-digit number on your credit card?
Caller:    Can you wait a minute?
Bot:        Sure

—————- a few seconds later —————-

Bot:        Let me know when ready. No rush.

—————- a few seconds later —————-

Caller:    Ok I am back
Bot:        Ok please go ahead
Caller:    The credit card number is 1234 5678 1234 5678
Bot:        Got it. Let me repeat that for you 1234 5678 1234 5678. Is that correct?
Caller:    Yes
Bot:        Ok. What’s the CVV code? You can find it in the top right on the signature panel on the back of the card.
Caller:    123
Bot:        Awesome. I will go ahead and process the $100 payment on your account using the card ending in 5678.
Caller:    Ok
Bot:        The payment went through. Your confirmation code is 1234.

The first step is to build an Amazon Lex bot with intents to process payment and get balance on the account. The ProcessPayment intent elicits the information required to process the payment, such as the payment amount, credit card number, CVV code, and expiration date. The GetBalanceAmount intent provides the balance on the account. The FallbackIntent is triggered when the user input can’t be processed by either of the two configured intents.

Deploying the sample bot

To create the sample bot, complete the following steps. This creates an Amazon Lex bot called PaymentsBot.

  1. On the Amazon Lex console, choose Create Bot.
  2. In the Bot configuration section, give the bot the name PaymentsBot.
  3. Specify AWS Identity and Access Management (IAM) permissions and the COPPA flag.
  4. Choose Next.
  5. Under Languages, choose English (US).
  6. Choose Done.
  7. Add the ProcessPayment and GetBalanceAmount intents to your bot.
  8. For the ProcessPayment intent, add the following slots:
    1. PaymentAmount slot using the built-in AMAZON.Number slot type
    2. CreditCardNumber slot using the built-in AMAZON.AlphaNumeric slot type
    3. CVV slot using the built-in AMAZON.Number slot type
    4. ExpirationDate using the built-in AMAZON.Date built-in slot type
  9. Configure slot elicitation prompts for each slot.
  10. Configure a closing response for the ProcessPayment intent.
  11. Similarly, add and configure slots and prompts for GetBalanceAmount intents.
  12. Choose Build to test your bot.

For more information about creating a bot, see the Lex V2 documentation.

Configuring Wait and Continue

  1. Choose the ProcessPayment intent and navigate to the CreditCardNumber slot.
  2. Choose Advanced Settings to open the slot editor.
  3. Enable Wait and Continue for the slot.
  4. Provide the Wait, Still Waiting, and Continue responses.
  5. Save the intent and choose Build.

The bot is now configured to support the Wait and Continue dialog construct. Now let’s configure the client code. You can use a telephony application to interact with your Lex bot. You can download the code for setting up a telephony IVR interface via Twilio from the GitHub project. The link contains information to set up a telephony interface as well as client application code to communicate between the telephony interface and Amazon Lex.

Now, let us review the client-side setup to use the bot configuration that we just enabled on the Amazon Lex console. The client application uses the Java SDK to capture payment information. At the beginning, you use the ConfigurationEvent to set up the conversation parameters. Then you start sending an input event (AudioInputEvent, TextInputEvent, or DTMFInputEvent) to send user input to the bot, depending on the input type. When sending audio data, you need to send multiple AudioInputEvent events, with each event containing a slice of the data.

The service first responds with TranscriptEvent to give transcription, then sends the IntentResultEvent to surface the intent classification results. Subsequently, Amazon Lex sends a response event (TextResponseEvent or AudioResponseEvent) that contains the response to play back to caller. If the caller requests the bot to hold the line, the intent is moved to the Waiting state and Amazon Lex sends another set of TranscriptEvent, IntentResultEvent and a response event. When the caller requests to continue the conversation, the intent is set to the InProgress state and the service sends another set of TranscriptEvent, IntentResultEvent and a response event. While the dialog is in the Waiting state, Amazon Lex responds with a set of IntentResultEvent and response event for every “Still waiting” message (there is no transcript event for server-initiated responses). If the caller interrupts the bot prompt at any time, Amazon Lex returns a PlaybackInterruptionEvent.

Let’s walk through the main elements of the client code:

  1. Create the Amazon Lex client:
    AwsCredentialsProvider awsCredentialsProvider = StaticCredentialsProvider
            .create(AwsBasicCredentials.create(accessKey, secretKey));
    
    LexRuntimeV2AsyncClient lexRuntimeServiceClient = LexRuntimeV2AsyncClient.builder()
            .region(region)
            .credentialsProvider(awsCredentialsProvider)
            .build();

  2. Create a handler to publish data to the server:
    EventsPublisher eventsPublisher = new EventsPublisher();
    

  3. Create a handler to process bot responses:
    public class BotResponseHandler implements StartConversationResponseHandler {
    
        private static final Logger LOG = Logger.getLogger(BotResponseHandler.class);
    
    
        @Override
        public void responseReceived(StartConversationResponse startConversationResponse) {
            LOG.info("successfully established the connection with server. request id:" + startConversationResponse.responseMetadata().requestId()); // would have 2XX, request id.
        }
    
        @Override
        public void onEventStream(SdkPublisher<StartConversationResponseEventStream> sdkPublisher) {
    
            sdkPublisher.subscribe(event -> {
                if (event instanceof PlaybackInterruptionEvent) {
                    handle((PlaybackInterruptionEvent) event);
                } else if (event instanceof TranscriptEvent) {
                    handle((TranscriptEvent) event);
                } else if (event instanceof IntentResultEvent) {
                    handle((IntentResultEvent) event);
                } else if (event instanceof TextResponseEvent) {
                    handle((TextResponseEvent) event);
                } else if (event instanceof AudioResponseEvent) {
                    handle((AudioResponseEvent) event);
                }
            });
        }
    
        @Override
        public void exceptionOccurred(Throwable throwable) {
            LOG.error(throwable);
            System.err.println("got an exception:" + throwable);
        }
    
        @Override
        public void complete() {
            LOG.info("on complete");
        }
    
        private void handle(PlaybackInterruptionEvent event) {
            LOG.info("Got a PlaybackInterruptionEvent: " + event);
    
            LOG.info("Done with a  PlaybackInterruptionEvent: " + event);
        }
    
        private void handle(TranscriptEvent event) {
            LOG.info("Got a TranscriptEvent: " + event);
        }
    
    
        private void handle(IntentResultEvent event) {
            LOG.info("Got an IntentResultEvent: " + event);
    
        }
    
        private void handle(TextResponseEvent event) {
            LOG.info("Got an TextResponseEvent: " + event);
    
        }
    
        private void handle(AudioResponseEvent event) {//synthesize speech
            LOG.info("Got a AudioResponseEvent: " + event);
        }
    
    }
    

 

  4. Initiate the connection with the bot:
    StartConversationRequest.Builder startConversationRequestBuilder = StartConversationRequest.builder()
            .botId(botId)
            .botAliasId(botAliasId)
            .localeId(localeId);
    
    // configure the conversation mode with bot (defaults to audio)
    startConversationRequestBuilder = startConversationRequestBuilder.conversationMode(ConversationMode.AUDIO);
    
    // assign a unique identifier for the conversation
    startConversationRequestBuilder = startConversationRequestBuilder.sessionId(sessionId);
    
    // build the initial request
    StartConversationRequest startConversationRequest = startConversationRequestBuilder.build();
    
    CompletableFuture<Void> conversation = lexRuntimeServiceClient.startConversation(
            startConversationRequest,
            eventsPublisher,
            botResponseHandler);

  5. Establish the configurable parameters via ConfigurationEvent:
    public void configureConversation() {
        String eventId = "ConfigurationEvent-" + eventIdGenerator.incrementAndGet();
    
        ConfigurationEvent configurationEvent = StartConversationRequestEventStream
                .configurationEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .responseContentType(RESPONSE_TYPE)
                .build();
    
        eventWriter.writeConfigurationEvent(configurationEvent);
        LOG.info("sending a ConfigurationEvent to server:" + configurationEvent);
    }

  6. Send audio data to the server (a microphone capture sketch that feeds this method follows the walkthrough):
    public void writeAudioEvent(ByteBuffer byteBuffer) {
        String eventId = "AudioInputEvent-" + eventIdGenerator.incrementAndGet();
    
        AudioInputEvent audioInputEvent = StartConversationRequestEventStream
                .audioInputEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .audioChunk(SdkBytes.fromByteBuffer(byteBuffer))
                .contentType(AUDIO_CONTENT_TYPE)
                .build();
    
        eventWriter.writeAudioInputEvent(audioInputEvent);
    }

  7. Manage interruptions on the client side:
    private void handle(PlaybackInterruptionEvent event) {
        LOG.info("Got a PlaybackInterruptionEvent: " + event);
    
        callOperator.pausePlayback();
    
        LOG.info("Done with a  PlaybackInterruptionEvent: " + event);
    }

  8. Disconnect the connection:
    public void disconnect() {
    
        String eventId = "DisconnectionEvent-" + eventIdGenerator.incrementAndGet();
    
        DisconnectionEvent disconnectionEvent = StartConversationRequestEventStream
                .disconnectionEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .build();
    
        eventWriter.writeDisconnectEvent(disconnectionEvent);
    
        LOG.info("sending a DisconnectionEvent to server:" + disconnectionEvent);
    }

You can now deploy the bot on your desktop to test it out.

Things to know

The following are a few important things to keep in mind when you’re using the Amazon Lex V2 console and APIs:

  • Regions and languages – The Streaming APIs are available in all existing Regions and support all current languages.
  • Interoperability with Lex V1 console – Streaming APIs are only available in the Lex V2 console and APIs.
  • Integration with Amazon Connect – As of this writing, Lex V2 APIs are not supported on Amazon Connect. We plan to provide this integration as part of our near-term roadmap.
  • Pricing – Please see the details on the Lex pricing page.

Try it out

The Amazon Lex streaming API is available now, and you can start using it today. Give it a try: design a bot, launch it, and let us know what you think! To learn more, see the Lex streaming API documentation.


About the Authors

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

 

 

 

Swapandeep Singh is an engineer with the Amazon Lex team. He works on making interactions with bots smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.

Read More

Model serving in Java with AWS Elastic Beanstalk made easy with Deep Java Library

Deploying your machine learning (ML) models to run on a REST endpoint has never been easier. Using AWS Elastic Beanstalk and Amazon Elastic Compute Cloud (Amazon EC2) to host your endpoint and Deep Java Library (DJL) to load your deep learning models for inference makes the model deployment process extremely easy to set up. Setting up a model on Elastic Beanstalk is great if you require fast response times on all your inference calls. In this post, we cover deploying a model on Elastic Beanstalk using DJL and sending an image through a post call to get inference results on what the image contains.

About DJL

DJL is a deep learning framework written in Java that supports training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.

Benefits

The primary benefit of hosting your model using Elastic Beanstalk and DJL is that it’s very easy to set up and provides consistent sub-second responses to POST requests. With DJL, you don’t need to download any other libraries or worry about importing dependencies for your chosen deep learning framework. Using Elastic Beanstalk has two advantages:

  • No cold startup – Compared to an AWS Lambda solution, the EC2 instance is running all the time, so any call to your endpoint runs instantly and there isn’t any overhead when starting up new containers.
  • Scalable – Compared to a single-server solution, you can allow Elastic Beanstalk to scale horizontally as traffic grows.

Configurations

You need the following Gradle dependencies set up to run our PyTorch model:

plugins {
    id 'org.springframework.boot' version '2.3.0.RELEASE'
    id 'io.spring.dependency-management' version '1.0.9.RELEASE'
    id 'java'
}

dependencies {
    implementation platform("ai.djl:bom:0.8.0")
    implementation "ai.djl.pytorch:pytorch-model-zoo"
    implementation "ai.djl.pytorch:pytorch-native-auto"
    
    implementation "org.springframework.boot:spring-boot-starter"
    implementation "org.springframework.boot:spring-boot-starter-web"
}

The code

We first create a RESTful endpoint using Java Spring Boot and have it accept an image request. We decode the image and turn it into an Image object to pass to our model. The model is autowired by the Spring framework by calling the model() method. For simplicity, we create the predictor object on each request and use it to run inference on the image (you can optimize this by using an object pool). When inference is complete, we return the results to the requester. See the following code:

    @Autowired ZooModel<Image, Classifications> model;

    /**
     * This method is the REST endpoint where the user can post their images
     * to run inference against a model of their choice using DJL.
     *
     * @param input the request body containing the image
     * @return returns the top 3 probable items from the model output
     * @throws IOException if failed read HTTP request
     */
    @PostMapping(value = "/doodle")
    public String handleRequest(InputStream input) throws IOException {
        Image img = ImageFactory.getInstance().fromInputStream(input);
        try (Predictor<Image, Classifications> predictor = model.newPredictor()) {
            Classifications classifications = predictor.predict(img);
            return GSON.toJson(classifications.topK(3)) + System.lineSeparator();
        } catch (RuntimeException | TranslateException e) {
            logger.error("", e);
            Map<String, String> error = new ConcurrentHashMap<>();
            error.put("status", "Invoke failed: " + e.toString());
            return GSON.toJson(error) + System.lineSeparator();
        }
    }

    @Bean
    public ZooModel<Image, Classifications> model() throws ModelException, IOException {
        Translator<Image, Classifications> translator =
                ImageClassificationTranslator.builder()
                        .optFlag(Image.Flag.GRAYSCALE)
                        .setPipeline(new Pipeline(new ToTensor()))
                        .optApplySoftmax(true)
                        .build();
        Criteria<Image, Classifications> criteria = Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optModelUrls(MODEL_URL)
                .optTranslator(translator)
                .build();
        return ModelZoo.loadModel(criteria);
    }
    

A full copy of the code is available on the GitHub repo.
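
The controller and the model bean need a standard Spring Boot entry point to start the embedded web server. The following is a minimal sketch of such an entry point; the package and class names are illustrative, not necessarily the ones used in the repo:

package com.example.modelserving; // illustrative package name

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

// Minimal Spring Boot entry point. Component scanning picks up the REST
// controller and the model() bean defined under the same package hierarchy.
@SpringBootApplication
public class ModelServingApplication {
    public static void main(String[] args) {
        SpringApplication.run(ModelServingApplication.class, args);
    }
}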

Building your JAR file

Go into the beanstalk-model-serving directory and run the following commands:

cd beanstalk-model-serving
./gradlew build

This creates a JAR file at build/libs/beanstalk-model-serving-0.0.1-SNAPSHOT.jar.

Deploying to Elastic Beanstalk

To deploy this model, complete the following steps:

  1. On the Elastic Beanstalk console, create a new environment.
  2. For our use case, we name the environment DJL-Demo.
  3. For Platform, select Managed platform.
  4. For Platform settings, choose Java 8 and the appropriate branch and version.

  5. When selecting your application code, choose Choose file and upload the beanstalk-model-serving-0.0.1-SNAPSHOT.jar that was created in your build.
  6. Choose Create environment.

After Elastic Beanstalk creates the environment, we need to update the Software and Capacity boxes in our configuration, located on the Configuration overview page.

  1. For the Software configuration, we add a setting in the Environment Properties section with the name SERVER_PORT and the value 5000. (The Elastic Beanstalk Java platform’s reverse proxy forwards requests to port 5000 by default, so this tells the Spring Boot application to listen on the port the proxy expects.)
  2. For the Capacity configuration, we change the instance type to t2.small to give our endpoint a little more compute and memory.
  3. Choose Apply configuration and wait for your endpoint to update.

 

Calling your endpoint

Now we can call our Elastic Beanstalk endpoint with our image of a smiley face.

See the following code:

curl -X POST -T smiley.png <endpoint>.elasticbeanstalk.com/doodle

We get the following response:

[
  {
    "className": "smiley_face",
    "probability": 0.9874626994132996
  },
  {
    "className": "face",
    "probability": 0.004804758355021477
  },
  {
    "className": "mouth",
    "probability": 0.0015588520327582955
  }
]

The output predicts that a smiley face is the most probable item in our image. Success!
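
If you prefer calling the endpoint from Java instead of curl, the JDK’s built-in HTTP client (Java 11 and later) can send the same request. This is a sketch; the URL is a placeholder that you replace with your own environment’s address and mapping:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class EndpointClient {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; substitute your Elastic Beanstalk environment URL.
        URI uri = URI.create("http://<endpoint>.elasticbeanstalk.com/doodle");

        // POST the image file as the request body, mirroring the curl call above.
        HttpRequest request = HttpRequest.newBuilder(uri)
                .POST(HttpRequest.BodyPublishers.ofFile(Path.of("smiley.png")))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON with the top 3 classifications
    }
}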

Limitations

If your model isn’t called often and you don’t require fast inference, we recommend deploying your models on a serverless service such as Lambda, although this adds overhead due to the cold startup nature of the service. Hosting your models through Elastic Beanstalk may be slightly more expensive because the EC2 instance runs 24 hours a day, so you pay for the service even when you’re not using it. That said, if you expect a lot of inference requests each month, we have found that the cost of model serving on Lambda equals the cost of Elastic Beanstalk on a t3.small at about 2.57 million inference requests to the endpoint.

Conclusion

In this post, we demonstrated how to start deploying and serving your deep learning models using Elastic Beanstalk and DJL. You just need to set up your endpoint with Java Spring, build your JAR file, upload that file to Elastic Beanstalk, update some configurations, and it’s deployed!

We also discussed some of the pros and cons of this deployment process, namely that it’s ideal if you need fast inference calls, but the cost is higher when compared to hosting it on a serverless endpoint with lower utilization.

This demo is available in full in the DJL demo GitHub repo. You can also find other examples of serving models with DJL across different JVM tools like Spark and AWS products like Lambda. Whatever your requirements, there is an option for you.

Follow our GitHub, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!

 


About the Author

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Read More

Building your own brand detection and visibility using Amazon SageMaker Ground Truth and Amazon Rekognition Custom Labels – Part 1: End-to-end solution

According to Gartner, 58% of marketing leaders believe brand is a critical driver of buyer behavior for prospects, and 65% believe it’s a critical driver of buyer behavior for existing customers. Companies spend huge amounts of money on advertising to raise brand visibility and awareness. In fact, per Gartner, CMOs spend over 21% of their marketing budgets on advertising. Brands have to continuously maintain and improve their image, understand their presence on the web and in media content, and measure their marketing effort; all of these are top priorities for every marketer. However, calculating the ROI of such advertising is difficult, and artificial intelligence (AI)-powered tools can augment the process to deliver more accurate results.

Nowadays brand owners are naturally more than a little interested in finding out how effectively their outlays are working for them. However, it’s difficult to assess quantitatively just how good the brand exposure is in a given campaign or event. The current approach to computing such statistics has involved manually annotating broadcast material, which is time-consuming and expensive.

In this post, we show you how to mitigate these challenges by using Amazon Rekognition Custom Labels to train a custom computer vision model to detect brand logos without requiring machine learning (ML) expertise, and Amazon SageMaker Ground Truth to quickly build a training dataset from unlabeled video samples that can be used for training.

For this use case, we want to build a company brand detection and brand visibility application that allows you to submit a video sample for a given marketing event to evaluate how long your logo was displayed in the entire video and where in the video frame the logo was detected.

Solution overview

Amazon Rekognition Custom Labels is an automated ML (AutoML) feature that enables you to train custom ML models for image analysis without requiring ML expertise. Upload a small dataset of labeled images specific to your business use case, and Amazon Rekognition Custom Labels takes care of the heavy lifting of inspecting the data, selecting an ML algorithm, training a model, and calculating performance metrics.

No ML expertise is required to build your own model. The ease of use and intuitive setup of Amazon Rekognition Custom Labels allows any user to bring their own dataset for their use case, label them into separate folders, and launch the Amazon Rekognition Custom Labels training and validation.

The solution is built on a serverless architecture, which means you don’t have to provision your own servers. You pay for what you use. As demand grows or decreases, the compute power adapts accordingly.

This solution demonstrates an end-to-end workflow from preparing a training dataset using Ground Truth to training a model using Amazon Rekognition Custom Labels to identify and detect brand logos in video files. The solution has three main components: data labeling, model training, and running inference.

Data labeling

Three types of labels are available with Ground Truth:

  • Amazon Mechanical Turk – An option to engage a team of global, on-demand workers
  • Vendors – Third-party data labeling services listed on AWS Marketplace
  • Private labelers – Your own teams of private labelers to label the brand logo impressions frame by frame from the video

For this post, we use the private workforce option.

Training the model using Amazon Rekognition Custom Labels

After the labeling job is complete, we train our brand logo detection model using these labeled images. The solution in this post creates an Amazon Rekognition Custom Labels project and custom model. Amazon Rekognition Custom Labels automatically inspects the labeled data provided, selects the right ML algorithms and techniques, trains a model, and provides model performance metrics.

Running inference

When our model is trained, Amazon Rekognition Custom Labels provides an inference endpoint. We can then upload and analyze images or video files using the inference endpoint. The web user interface presents a bar chart that shows the distribution of detected custom labels per minute in the analyzed video.

Architecture overview

The solution uses a serverless architecture. The following architectural diagram illustrates an overview of the solution.


The solution is composed of two AWS Step Functions state machines:

  • Training – Manages extracting image frames from uploaded videos, creating and waiting on a Ground Truth labeling job, and creating and training an Amazon Rekognition Custom Labels model. You can then use the model to run brand logo detection analysis.
  • Analysis – Handles analyzing video or image files. It manages extracting image frames from the video files, starting the custom label model, running the inference, and shutting down the custom label model.

The solution provides a built-in mechanism to manage your custom label model’s runtime, ensuring the model is shut down when it’s not in use to keep your cost at a minimum.

The web application communicates with the backend state machines through an Amazon API Gateway RESTful endpoint, which is protected with a valid AWS Identity and Access Management (IAM) credential. Authentication to the web application is handled by an Amazon Cognito user pool: an authenticated user is issued a secure, time-bounded, temporary credential that can then be used to access “scoped” resources, such as uploading video and image files to an Amazon Simple Storage Service (Amazon S3) bucket, invoking the API Gateway RESTful API endpoint to create a new training project, or running inference with the Amazon Rekognition Custom Labels model you trained and built. We use Amazon CloudFront to host the static content residing in an S3 bucket (web), which is protected through an origin access identity (OAI).

Prerequisites

For this walkthrough, you should have an AWS account with appropriate IAM permissions to launch the provided AWS CloudFormation template.

Deploying the solution

You can deploy the solution using a CloudFormation template with AWS Lambda-backed custom resources. To deploy the solution, use one of the following CloudFormation templates and follow the instructions:

The CloudFormation template is available to launch in the following Regions:

  • US East (N. Virginia)
  • US East (Ohio)
  • US West (Oregon)
  • EU (Ireland)
  1. Sign in to the AWS Management Console with your IAM user name and password.
  2. On the Create stack page, choose Next.


  3. On the Specify stack details page, for FFmpeg Component, choose AGREE AND INSTALL.
  4. For Email, enter a valid email address to use for administrative purposes.
  5. For Price Class, choose Use Only U.S., Canada and Europe [PriceClass_100].
  6. Choose Next.


  7. On the Review stack page, under Capabilities, select both check boxes.
  8. Choose Create stack.


The stack creation takes roughly 25 minutes to complete; we’re using AWS CodeBuild to dynamically build the FFmpeg component, and the Amazon CloudFront distribution takes about 15 minutes to propagate to the edge locations.

After the stack is created, you should receive an invitation email from no-reply@verificationmail.com. The email contains a CloudFront URL link to access the demo portal, your login username, and a temporary password.


Choose the URL link to open the web portal with Mozilla Firefox or Google Chrome. After you enter your username and temporary password, you’re prompted to create a new password.

Solution walkthrough

In this section, we walk you through the following high-level steps:

  1. Setting up a labeling team.
  2. Creating and training your model.
  3. Completing a video object detection labeling job.

Setting up a labeling team

The first time you log in to the web portal, you’re prompted to create your labeling workforce. The labeling workforce defines members of your labelers who are given labeling tasks to work on when you start a new training project. Choose Yes to configure your labeling team members.


You can also navigate to the Labeling Team tab to manage members from your labeling team at any time.

Follow the instructions to add team members’ email addresses and choose Confirm and add members. The following animation walks you through the steps.

The newly added member receives two email notifications. The first email contains the credential for the labeler to access the labeling portal. It’s important to note that the labeler is only given access to consume a labeling job created by Ground Truth. They don’t have access to any AWS resources other than working on the labeling job.


A second email “AWS Notification – Subscription Confirmation” contains instructions to confirm your subscription to an Amazon Simple Notification Service (Amazon SNS) topic so the labeler gets notified whenever a new labeling task is ready to consume.


Creating and training your first model

Let’s start to train our first model to identify logos for AWS and AWS DeepRacer. For this post, we use the video file AWS DeepRacer TV – Ep 1 Amsterdam.

  1. On the navigation menu, choose Training.
  2. Choose Option 1 to train a model to identify the logos with bounding boxes.
  3. For Project name, enter DemoProject.
  4. Choose Add label.
  5. Add the labels AWS and DeepRacer.
  6. Drag and drop a video file to the drop area.

You can drop multiple video files, or JPEG and PNG image files.

  7. Choose Create project.

The following GIF animation illustrates the process.

At this point, Ground Truth creates a labeling job, and the labeler receives an email notification when the job is ready to work on.

Completing the video object detection labeling job

Ground Truth recently launched a new set of pre-built templates that help label video files. For our post, we use the video object detection task template. For more information, see New – Label Videos with Amazon SageMaker Ground Truth.

The training workflow is currently paused, waiting for the labelers to work on the labeling job.

  1. After the labeler receives an email notification that a job is ready for them, they can log in to the labeling portal and start the job by choosing Start working.
  2. For Label category, choose a label.
  3. Draw bounding boxes around the AWS or AWS DeepRacer logos.

You can use the Predict next button to predict the bounding box in subsequent frames.

The following GIF animation demonstrates the labeling flow.

After the labeler completes the job, the backend training workflow resumes, collects the labeled images from the Ground Truth labeling job, and starts model training by creating an Amazon Rekognition Custom Labels project. The time to train a model varies from an hour to a few hours depending on the complexity of the objects (labels) and the size of your training dataset. Amazon Rekognition Custom Labels automatically splits the dataset 80/20 to create the training dataset and test dataset, respectively.

Running inference to detect brand logos

After the model is trained, let’s upload a video file and run predictions with the model we trained.

  1. On the navigation menu, choose Analysis.
  2. Choose Start new analysis.
  3. Specify the following:
    1. Project name – The project we created in Amazon Rekognition Custom Labels.
    2. Project version – The specific version of the trained model.
    3. Inference units – Your desired inference units, so you can dial the inference endpoint up or down. For example, if you require higher transactions per second (TPS), use a larger number of inference units.
  4. Drag and drop image files (JPEG, PNG) or video files (MP4, MOV) to the drop area.
  5. When the upload is complete, choose Done and wait for the analysis process to finish.

The analysis workflow starts the trained Amazon Rekognition Custom Labels model and waits for it to become available, runs inference frame by frame, and shuts the model down when it’s no longer in use.

The following GIF animation demonstrates the analysis flow.

Viewing prediction results

The solution provides an overall statistic of the detected brands distributed across the video. The following screenshot shows that the AWS DeepRacer logo is detected about 25% overall and is detected approximately 60% in the 00:01:00–00:02:00 timespan. In contrast, the AWS logo is detected at a much lower rate. For this post, we only used one video to train the model, which had relatively few AWS logos. We can improve the accuracy by retraining the model with more video files.


You can expand the shot element view to see how the brand logos are detected frame by frame.


If you choose a frame to view, it shows the logo with a confidence score. The images that are grayed out are the ones that don’t detect any logo. The following image shows that the AWS DeepRacer logo is detected at frame #10237 with a confidence score of 82%.


Another image shows that the AWS logo is detected with a confidence score of 60%.


Cleaning up

To delete the demo solution, simply delete the CloudFormation stack that you deployed earlier. However, deleting the CloudFormation stack doesn’t remove the following resources, which you must clean up manually to avoid potential recurring costs:

  • S3 buckets (web, source, and logs)
  • Amazon Rekognition Custom Labels project (trained model)

Conclusion

This post demonstrated how to use Amazon Rekognition Custom Labels to detect brand logos in images and videos. No ML expertise is required to build your own model. The ease of use and intuitive setup of Amazon Rekognition Custom Labels allows you to bring your own dataset, label it into separate folders, and launch the Amazon Rekognition Custom Labels training and validation. We created the required infrastructure, demonstrated installing and running the UI, and discussed the security and cost of the infrastructure.

In the second post in this series, we deep dive into data labeling from a video file using Ground Truth to prepare the data for the training phase. We also explain the technical details of how we use Amazon Rekognition Custom Labels to train the model. In the third post in this series, we dive into the inference phase and show you the rich set of statistics for your brand visibility in a given video file.

For more information about the code sample in this post, see the GitHub repo.


About the Authors

Ken Shek is a Global Vertical Solutions Architect for Media and Entertainment in the EMEA Region. He helps media customers design, develop, and deploy workloads onto the AWS Cloud using AWS best practices. Ken graduated from the University of California, Berkeley, and received his master’s degree in Computer Science from Northwestern Polytechnical University.

 

 

Amit Mukherjee is a Sr. Partner Solutions Architect with a focus on data analytics and AI/ML. He works with AWS partners and customers to provide them with architectural guidance for building highly secure, scalable data analytics platforms and adopting machine learning at large scale.

 

 

Sameer Goel is a Solutions Architect in Seattle who drives customers’ success by building prototypes for cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from NEU Boston, with a concentration in data science. He enjoys building and experimenting with creative projects and applications.

Read More

Model serving made easier with Deep Java Library and AWS Lambda

Developing and deploying a deep learning model involves many steps: gathering and cleansing data, designing the model, fine-tuning model parameters, evaluating the results, and going through it again until a desirable result is achieved. Then comes the final step: deploying the model.

AWS Lambda is one of the most cost-effective services that let you run code without provisioning or managing servers. It offers many advantages when working with serverless infrastructure. When you break down the logic of your deep learning service into a single Lambda function for a single request, things become much simpler and easier to scale. You can forget all about the resource handling needed for the parallel requests coming into your model. If your usage is sparse and tolerant of higher latency, Lambda is a great choice among the various solutions.

Now, let’s say you’ve decided to use Lambda to deploy your model. You go through the process, but it becomes confusing or complex with the various setup steps needed to run your models. Namely, you face issues with Lambda’s size limits and with managing the model dependencies within it.

Deep Java Library (DJL) is a deep learning framework designed to make your life easier. DJL uses various deep learning backends (such as Apache MXNet, PyTorch, and TensorFlow) for your use case and is easy to set up and integrate within your Java application! Thanks to its excellent dependency management design, DJL makes it extremely simple to create a project that you can deploy on Lambda. DJL helps alleviate some of the problems we mentioned by downloading the prepackaged framework dependencies so you don’t have to package them yourself, and loads your models from a specified location such as Amazon Simple Storage Service (Amazon S3) so you don’t need to figure out how to push your models to Lambda.

This post covers how to get your models running on Lambda with DJL in 5 minutes.

About DJL

Deep Java Library (DJL) is a deep learning framework written in Java, supporting both training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.

Prerequisites

You need the following items to proceed:

  • An AWS account
  • The AWS Command Line Interface (AWS CLI), configured with your credentials
  • A Java Development Kit (JDK) to build the project with Gradle

In this post, we follow along with the steps from the following GitHub repo.

Building and deploying on AWS

First, make sure you’re in the correct code directory. Then create an S3 bucket for storage, an AWS CloudFormation stack, and the Lambda function with the following code:

cd lambda-model-serving
./gradlew deploy

This creates the following:

  • An S3 bucket with the name stored in bucket-name.txt
  • A CloudFormation stack named djl-lambda and a template file named out.yml
  • A Lambda function named DJL-Lambda

Now we have our model deployed on a serverless API. The next section invokes the Lambda function.

Invoking the Lambda function

We can invoke the Lambda function with the following code:

aws lambda invoke --function-name DJL-Lambda --payload '{"inputImageUrl":"https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}' build/output.json

The output is stored in build/output.json:

cat build/output.json
[
  {
    "className": "n02123045 tabby, tabby cat",
    "probability": 0.48384541273117065
  },
  {
    "className": "n02123159 tiger cat",
    "probability": 0.20599405467510223
  },
  {
    "className": "n02124075 Egyptian cat",
    "probability": 0.18810519576072693
  },
  {
    "className": "n02123394 Persian cat",
    "probability": 0.06411759555339813
  },
  {
    "className": "n02127052 lynx, catamount",
    "probability": 0.01021555159240961
  }
]
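
You can also invoke the function programmatically. The following is a minimal sketch using the AWS SDK for Java 2.x Lambda client with the same payload as the CLI example; it assumes your default credentials and Region are already configured:

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.lambda.LambdaClient;
import software.amazon.awssdk.services.lambda.model.InvokeRequest;
import software.amazon.awssdk.services.lambda.model.InvokeResponse;

public class InvokeDjlLambda {
    public static void main(String[] args) {
        String payload = "{\"inputImageUrl\":\"https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg\"}";

        try (LambdaClient lambda = LambdaClient.create()) {
            InvokeRequest request = InvokeRequest.builder()
                    .functionName("DJL-Lambda")
                    .payload(SdkBytes.fromUtf8String(payload))
                    .build();

            // Synchronous invocation; the response payload is the JSON classification result.
            InvokeResponse response = lambda.invoke(request);
            System.out.println(response.payload().asUtf8String());
        }
    }
}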

Cleaning up

Use the cleanup scripts to clean up the resources and tear down the services created in your AWS account:

./cleanup.sh

Cost analysis

What happens if we try to set this up on an Amazon Elastic Compute Cloud (Amazon EC2) instance and compare the cost to Lambda? An EC2 instance needs to run continuously so it can receive requests at any time, which means you’re paying for all the time it sits idle. If we use a cheap t3.micro instance with 2 vCPUs and 1 GB of memory (knowing that some of this memory is used by the operating system and other tasks), the cost comes out to $7.48 a month, which buys about 1.6 million Lambda requests. A more powerful instance such as a t3.small with 2 vCPUs and 4 GB of memory comes out to $29.95 a month, or about 2.57 million Lambda requests.

There are pros and cons to using either Lambda or Amazon EC2 for hosting, and it comes down to requirements and cost. Lambda is the ideal choice if your usage is sparse and your requirements allow for higher latency, because an infrequently used function incurs a cold start (roughly 5 seconds), so the first call can be slow. Subsequent requests are faster, but if the function sits idle for 30–45 minutes, it goes back to cold-start mode. In exchange, Lambda is cheaper than Amazon EC2 when you aren’t using it much.

Amazon EC2, on the other hand, is better if you require low-latency calls all the time or are making enough requests that Lambda becomes the more expensive option (shown in the following chart).

Minimal package size

DJL automatically downloads the deep learning framework at runtime, allowing for a smaller package size. We use the following dependency:

runtimeOnly "ai.djl.mxnet:mxnet-native-auto:1.7.0-backport"

This auto-detection dependency results in a .zip file less than 3 MB. The downloaded MXNet native library file is stored in a /tmp folder that takes up about 155 MB of space. We can further reduce this to 50 MB if we use a custom build of MXNet without MKL support.

The MXNet native library is stored in an S3 bucket, and the framework download latency is negligible when compared to the Lambda startup time.

Model loading

The DJL model zoo offers many easy options to deploy models:

  • Bundling the model in a .zip file
  • Loading models from a custom model zoo
  • Loading models from an S3 bucket (supports Amazon SageMaker trained model .tar.gz format)

We use the MXNet model zoo to load the model. By default, because we didn’t specify any model, it uses the resnet-18 model, but you can change this by passing in an artifactId parameter in the request:

aws lambda invoke --function-name DJL-Lambda --payload '{"artifactId": "squeezenet", "inputImageUrl":"https://djl-ai.s3.amazonaws.com/resources/images/kitten.jpg"}' build/output.json
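
Under the hood, an artifactId like this maps onto a DJL model zoo lookup. The following is a sketch of how a handler might resolve the requested artifact with DJL’s Criteria API; it assumes the standard ai.djl imports and is not necessarily the exact code used in the repo:

// Hypothetical helper: resolve an image classification model by artifactId.
ZooModel<Image, Classifications> loadModel(String artifactId)
        throws ModelException, IOException {
    Criteria<Image, Classifications> criteria = Criteria.builder()
            .optApplication(Application.CV.IMAGE_CLASSIFICATION)
            .setTypes(Image.class, Classifications.class)
            .optArtifactId(artifactId) // for example, "squeezenet" from the request payload
            .build();
    return ModelZoo.loadModel(criteria);
}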

Limitations

There are certain limitations when using serverless APIs, specifically in AWS Lambda:

  • GPU instances are not yet available, as of this writing
  • Lambda has a 512 MB limit for the /tmp folder
  • If the endpoint isn’t frequently used, cold startup can be slow

As mentioned earlier, this way of hosting your models on Lambda is ideal when requests are sparse and the requirement allows for higher latency calls due to the Lambda cold startup. If your requirements require low latency for all requests, we recommend using AWS Elastic Beanstalk with EC2 instances.

Conclusion

In this post, we demonstrated how to easily launch serverless APIs using DJL. To do so, we just need to run the gradle deployment command, which creates the S3 bucket, CloudFormation stack, and Lambda function. This creates an endpoint to accept parameters to run your own deep learning models.

Deploying your models with DJL on Lambda is a great and cost-effective method if Lambda has sparse usage and allows for high latency, due to its cold startup nature. Using DJL allows your team to focus more on designing, building, and improving your ML models, while keeping costs low and keeping the deployment process easy and scalable.

For more information on DJL and its other features, see Deep Java Library.

Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!


About the Author

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

Read More