
New sound detection approach improves on state of the art

October 28, 2020

by admin Amazon AWS

Knowledge distillation technique for shrinking neural networks yields relative performance increases of up to 122%.

Optimizing costs for machine learning with Amazon SageMaker

October 27, 2020

by BK Chaurasiya Amazon AWS

Applications based on machine learning (ML) can provide tremendous business value. Using ML, we can solve some of the most complex engineering problems that were previously infeasible. One of the advantages of running ML on the AWS Cloud is that you can continually optimize your workloads and reduce your costs. In this post, we discuss how to apply such optimization to ML workloads. We consider available options such as elasticity, different pricing models in the cloud, automation, advantages of scale, and more.

Developing, training, maintaining, and performance tuning ML models is an iterative process that requires continuous improvement. Determining the optimum state in the model while going through the permutations and combinations of model parameters and data dependencies to adjust is just one leg of the journey. There is more to optimizing the cost of ML than just algorithm performance and model tuning. There is also some effort required to integrate developed models into applications and realize their benefits. Throughout this process, you can keep the cost down in numerous ways. Amazon SageMaker has made most of this journey smooth so developers and data scientists can spend most of their time focusing on what matters the most—delivering business value.

Amazon SageMaker notebook instances

An Amazon SageMaker notebook instance is an ML compute instance running the Jupyter Notebook app. This notebook instance comes with sample notebooks, several optimized algorithms, and complete code walkthroughs. Amazon SageMaker manages the creation of this instance and related resources. Consider using Amazon SageMaker Studio notebooks for collaborative workloads and when you don’t need to set up compute instances and file storage beforehand.

You can follow these best practices to help reduce the cost of notebook instances.

GPU or CPU?

CPUs are best at handling single, more complex calculations sequentially, whereas GPUs are better at handling multiple but simple calculations in parallel. For many use cases, a standard current generation instance type from an instance family such as ml.m* provides enough computing power, memory, and network performance for many Jupyter notebooks to perform well. GPUs provide a great price/performance ratio if you take advantage of them effectively. However, GPUs also cost more, and you should choose GPU-based notebooks only when you really need them.

Ask yourself: Is my neural network relatively small scale? Is my network performing tons of calculations involving hundreds of thousands of parameters? Can my model take advantage of the hardware parallelism offered by instance families such as P3 and P3dn?

Depending on the model, the GPU communication overhead might even degrade performance. So, take a step back and start with what you think is the minimum requirement in terms of ml instance specification and work your way up to identifying the best instance type and family for your model.

If you’re using your notebook instance to train multiple jobs, decide when you need a GPU-enabled instance and when you don’t. If you need accelerated computing in your notebook environment, you can stop your m* family notebook instance, switch to a GPU-enabled P* family instance, and start it again. Don’t forget to switch it back when you no longer need that extra boost in your development environment.

If you’re using massive datasets for training and don’t want to wait for days or weeks to finish your training job, you can speed up the process by distributing training on multiple machines or processes in a cluster.

It’s recommended to use a small subset of your data for development in your notebook instance. You can use the full dataset for a training job that is distributed across optimized instances such as P2 or P3 GPU instances or an instance with powerful CPU, such as c5.

Maximize instance utilization

You can optimize your Amazon SageMaker notebook utilization in many different ways. One simple way is to stop your notebook instance when you’re not using it and start it again when you need it. Consider auto-detecting idle notebook instances and managing their lifecycle using a lifecycle configuration script. For detailed implementation, see Right-sizing resources and avoiding unnecessary costs in Amazon SageMaker. Remember that the instance is only useful when you’re using the Jupyter notebook. If you’re not working on a notebook overnight or over the weekend, it’s a good idea to schedule a stop and start. Another way to save instance cost is by scheduling an AWS Lambda function. For example, you can stop all instances at 7:00 PM and start them at 7:00 AM.
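As a rough illustration of the scheduled stop described above, the following sketch shows what such a Lambda function might look like. The function names are ours, and the schedule itself (for example, a CloudWatch Events rule with a cron expression such as cron(0 19 * * ? *), which is evaluated in UTC) is assumed to be configured separately. The client is passed in as a parameter so the logic can be exercised without AWS credentials.

```python
def stop_in_service_notebooks(sm_client):
    """Stop every notebook instance that is currently InService.

    sm_client is expected to behave like boto3.client("sagemaker").
    Returns the names of the instances that were asked to stop.
    """
    stopped = []
    kwargs = {"StatusEquals": "InService"}
    while True:
        page = sm_client.list_notebook_instances(**kwargs)
        for notebook in page.get("NotebookInstances", []):
            name = notebook["NotebookInstanceName"]
            sm_client.stop_notebook_instance(NotebookInstanceName=name)
            stopped.append(name)
        # Follow pagination until there are no more pages.
        token = page.get("NextToken")
        if not token:
            return stopped
        kwargs["NextToken"] = token


def lambda_handler(event, context):
    # boto3 is available in the AWS Lambda Python runtime.
    import boto3

    return {"stopped": stop_in_service_notebooks(boto3.client("sagemaker"))}
```

A matching start function would call start_notebook_instance for the instances you want available in the morning. The Lambda execution role needs permission for the SageMaker list, stop, and start actions.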

You can also use Amazon CloudWatch Events to start and stop the instance based on an event. If you’re feeling geeky, connect it to your Amazon Rekognition based system to start a data scientist’s notebook instance when they step into the office or have Amazon Alexa do it as you grab a coffee.

Training jobs

The following are some best practices for saving costs on training jobs.

Use pre-trained models or even APIs

Pre-trained models eliminate the time spent gathering data and training models with that data. Consider using higher-level APIs such as those provided by Amazon Rekognition or Amazon Comprehend to help you avoid spending on tasks that are already done for you. As an example, Amazon Comprehend simplifies topic modeling on a large corpus of documents. You can also use the Neural Topic Model (NTM) algorithm in Amazon SageMaker to get similar results with more effort. Although you have more control over hyperparameters when training your own model, your use case may not need it. A lot of engineering work and experience goes into creating ready-to-consume and highly optimized models; therefore, an upfront ROI analysis is highly recommended if you’re embarking on a journey to develop similar models.

Use Pipe mode (where applicable) to reduce training time

Certain algorithms in Amazon SageMaker, like BlazingText, work on a large corpus of data. When these jobs are launched, significant time goes into downloading the data from Amazon Simple Storage Service (Amazon S3) into the local Amazon Elastic Block Store (Amazon EBS) volume. Your training jobs don’t start until this download finishes. These algorithms can take advantage of Pipe mode, in which training data is streamed from Amazon S3 directly to the training algorithm, so your training jobs start without waiting for the full download. For example, training BlazingText on Common Crawl (3 TB) can take a few days, a significant number of hours of which are lost to downloading. Using Pipe mode for this process can significantly reduce training time.

Managed spot training in Amazon SageMaker

Managed spot training can reduce the cost of training models by up to 90% compared with On-Demand Instances. Amazon SageMaker manages the Spot interruptions on your behalf. If your training job can be interrupted, use managed spot training. You can specify which training jobs use Spot Instances and a stopping condition that specifies how long Amazon SageMaker waits for a job to run using EC2 Spot Instances.

You may also consider using EC2 Spot Instances if you’re willing to do some extra work and if your algorithm is resilient enough to interruptions. For more information, see Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs.
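To make the stopping condition more concrete, here is a minimal sketch of the relevant fragment of a CreateTrainingJob request, together with the savings arithmetic. The helper names are hypothetical. The field names follow the CreateTrainingJob API, in which the maximum wait time must be at least as long as the maximum runtime. The savings figure is computed from billable versus total training seconds.

```python
def spot_training_settings(max_run_seconds, max_wait_seconds):
    """Build the managed spot training portion of a CreateTrainingJob request."""
    if max_wait_seconds < max_run_seconds:
        raise ValueError("max_wait_seconds must be >= max_run_seconds")
    return {
        "EnableManagedSpotTraining": True,
        "StoppingCondition": {
            # Upper bound on the time the job itself may run.
            "MaxRuntimeInSeconds": max_run_seconds,
            # Upper bound on run time plus time spent waiting for Spot capacity.
            "MaxWaitTimeInSeconds": max_wait_seconds,
        },
    }


def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved relative to paying for every training second."""
    return (1 - billable_seconds / training_seconds) * 100


settings = spot_training_settings(max_run_seconds=3600, max_wait_seconds=7200)
savings = spot_savings_percent(training_seconds=1000, billable_seconds=250)
```

With version 1.x of the SageMaker Python SDK, the equivalent estimator arguments are train_use_spot_instances, train_max_run, and train_max_wait.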

Test your code locally

Resolve issues with code and data so you don’t need to pay to run training clusters for failed training jobs. This also saves you time spent initializing the training cluster. Before you submit a training job, try to run the fit function in local mode to fetch some early feedback:

from sagemaker.mxnet import MXNet
mxnet_estimator = MXNet('train.py', train_instance_type='local', train_instance_count=1)

Monitor the performance of your training jobs to identify waste

Amazon SageMaker is integrated with CloudWatch out of the box and publishes instance metrics of the training cluster in CloudWatch. You can use these metrics to see if you should make adjustments to your cluster, such as CPUs, memory, number of instances, and more. To view the CloudWatch metric for your training jobs, navigate to the Jobs page on the Amazon SageMaker console and choose View Instance metrics in the Monitor section.

Also, use Amazon SageMaker Debugger, which provides full visibility into model training by monitoring, recording, analyzing, and visualizing training process tensors. Debugger can dramatically reduce the time, resources, and cost needed to train models.

Find the right balance: Performance vs. accuracy

Compare the throughput of 16-bit floating point and 32-bit floating point calculations and determine what is right for your model. 32-bit (single precision or FP32) and even 64-bit (double precision or FP64) floating point variables are popular for many applications that require high precision. These are workloads like engineering simulations that simulate real-world behavior and need the mathematical model to be as exact as possible. In many cases, however, reducing memory usage and increasing speed gained by moving to half or mixed precision (16-bit or FP16) is worth the minor tradeoffs in accuracy. For more information, see Accelerating GPU computation through mixed-precision methods.
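The memory side of this trade-off is simple arithmetic, and the precision side can be seen by round-tripping a value through each format. The following sketch uses only the Python standard library. The function names are ours, and the parameter count is an arbitrary illustration rather than a measurement of any particular model.

```python
import struct

# Bytes needed to store one value at each precision.
BYTES_PER_VALUE = {"fp16": 2, "fp32": 4, "fp64": 8}


def parameter_memory_mb(num_parameters, precision):
    """Memory needed to hold the parameters alone, in MiB."""
    return num_parameters * BYTES_PER_VALUE[precision] / (1024 ** 2)


def round_trip(value, fmt):
    """Store a Python float in the given struct format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


num_parameters = 25 * 1024 ** 2  # hypothetical model size
fp16_mb = parameter_memory_mb(num_parameters, "fp16")
fp32_mb = parameter_memory_mb(num_parameters, "fp32")

# 'e' is IEEE 754 half precision, 'f' is single precision.
fp16_error = abs(round_trip(0.1, "e") - 0.1)
fp32_error = abs(round_trip(0.1, "f") - 0.1)
```

Halving the bytes per value halves the parameter memory, while the representation error for the same value grows. Whether that error matters depends on your model, which is why it is worth measuring accuracy at both precisions.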

A similar trade-off also applies when deciding on the number of layers in your neural network for your classification algorithms, such as image classification.

Tuning (hyperparameter optimization) jobs

Use hyperparameter optimization (HPO) when needed and choose the hyperparameters and their ranges to tune on wisely.

Some API calls can result in a bill of hundreds or even thousands of dollars, and tuning jobs are one of those. A good tuning job can save you many working days of expensive data scientists’ time and provide a significant lift in model performance, which is highly beneficial. HPO in Amazon SageMaker finds good hyperparameters quicker if the search space is narrow (for example, a learning rate of 0.01–0.05 rather than 0.001–0.9). If you have some relevant prior knowledge about the hyperparameter range, start with that. For wide hyperparameter ranges, you may want to consider logarithmic transformations.
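To see why a logarithmic transformation helps with a wide range, consider the 0.001–0.9 learning rate range mentioned above. The sketch below (our own helper names, standard library only) computes what share of the search effort lands in the lowest decade, 0.001–0.01, under linear and under logarithmic sampling, and shows one way to draw a log-uniform sample.

```python
import math
import random


def linear_share(low, high, sub_low, sub_high):
    """Share of uniform samples over [low, high] that fall in [sub_low, sub_high]."""
    return (sub_high - sub_low) / (high - low)


def log_share(low, high, sub_low, sub_high):
    """Same share when samples are uniform in log space instead."""
    return math.log(sub_high / sub_low) / math.log(high / low)


def sample_log_uniform(low, high, rng):
    """Draw one value whose logarithm is uniform over [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


linear = linear_share(0.001, 0.9, 0.001, 0.01)
logarithmic = log_share(0.001, 0.9, 0.001, 0.01)
value = sample_log_uniform(0.001, 0.9, random.Random(0))
```

Under linear sampling, roughly 1% of trials explore learning rates below 0.01. In log space, about a third do. In the SageMaker Python SDK, ContinuousParameter accepts scaling_type='Logarithmic' for this purpose.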

Amazon SageMaker also reduces the amount of time spent tuning models using built-in HPO. This technology automatically adjusts hundreds of different combinations of parameters to quickly arrive at the best solution for your ML problem. With high-performance algorithms, distributed computing, managed infrastructure, and HPO, Amazon SageMaker drastically decreases the training time and overall cost of building production grade systems. You can see examples of HPO in some of the Amazon SageMaker built-in algorithms.

As the training time for each training job gets longer, you may also want to consider early stopping of training jobs.

Hosting endpoints

The following section discusses how to save cost when hosting endpoints using Amazon SageMaker hosting services.

Delete endpoints that aren’t in use

Amazon SageMaker is great for testing new models because you can easily deploy them into an A/B testing environment. When you’re done with your tests and not using the endpoint extensively anymore, you should delete it. You can always recreate it when you need it again because the model is stored in Amazon S3.

Use Automatic Scaling

Auto Scaling your Amazon SageMaker endpoint doesn’t just provide high availability, better throughput, and better performance; it also optimizes the cost of your endpoint. Make sure that you configure Auto Scaling for your endpoint, monitor your model endpoint, and adjust the scaling policy based on the CloudWatch metrics. For more information, see Load test and optimize an Amazon SageMaker endpoint using automatic scaling.
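As a sketch of what this configuration can look like, the following helpers build the two Application Auto Scaling requests for an endpoint variant: one registering the variant as a scalable target, and one attaching a target tracking policy on invocations per instance. The helper names, policy name, and cooldown values are illustrative assumptions. The right target value should come from load testing your own model.

```python
def endpoint_scaling_requests(endpoint_name, variant_name, min_instances,
                              max_instances, invocations_per_instance):
    """Build register_scalable_target and put_scaling_policy arguments."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-{variant_name}-invocations",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            # Illustrative cooldowns, in seconds.
            "ScaleInCooldown": 300,
            "ScaleOutCooldown": 60,
        },
    }
    return target, policy


def apply_scaling(autoscaling_client, target, policy):
    """autoscaling_client is expected to be boto3.client('application-autoscaling')."""
    autoscaling_client.register_scalable_target(**target)
    autoscaling_client.put_scaling_policy(**policy)


target, policy = endpoint_scaling_requests("my-endpoint", "AllTraffic", 1, 4, 70)
```

Keeping the request construction separate from the API calls makes it easy to review the policy before applying it.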

Amazon Elastic Inference for deep learning

Selecting a GPU instance type that is big enough to satisfy the requirements of the most demanding resource for inference may not be a smart move. Even at peak load, a deep learning application may not fully utilize the capacity offered by a GPU. Consider using Amazon Elastic Inference, which allows you to attach low-cost GPU-powered acceleration to Amazon EC2 and Amazon SageMaker instances to reduce the cost of running deep learning inference by up to 75%.

Host multiple models with multi-model endpoints

You can create an endpoint that can host multiple models. Multi-model endpoints reduce hosting costs by improving endpoint utilization and provide a scalable and cost-effective solution to deploying a large number of models. Multi-model endpoints enable time-sharing of memory resources across models. They also reduce deployment overhead because Amazon SageMaker manages loading models in memory and scaling them based on traffic patterns to models.
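From the caller’s side, the main difference with a multi-model endpoint is that each request names the model artifact it wants. The following is a minimal sketch, assuming a runtime client that behaves like boto3.client("sagemaker-runtime"). The endpoint name, artifact name, and CSV payload are placeholders.

```python
def invoke_model(runtime_client, endpoint_name, model_artifact, payload):
    """Send one request to a specific model behind a multi-model endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        # Path of the model artifact relative to the endpoint's S3 model prefix.
        TargetModel=model_artifact,
        ContentType="text/csv",
        Body=payload,
    )
    # The response body is a stream; read it to get the raw prediction bytes.
    return response["Body"].read()
```

Two calls with different TargetModel values share the same instances, which is where the utilization gain comes from.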

Reducing labeling time with Amazon SageMaker Ground Truth

Data labeling is a key process of identifying raw data (such as images, text files, and videos) and adding one or more meaningful and informative labels to provide context so that an ML model can learn from it. This process is essential because the accuracy of a trained model depends on the accuracy of the properly labeled dataset, or ground truth.

Amazon SageMaker Ground Truth uses a combination of ML and a human workforce (vetted by AWS) to label images and text. Many ML projects are delayed because of insufficient labeled data. You can use Ground Truth to accelerate the ML cycle and reduce overall costs.

Tagging your resources

Consider tagging your Amazon SageMaker notebook instances and the hosting endpoints. Tags such as the name of the project, business unit, and environment (such as development, testing, or production) are useful for cost optimization and can provide clear visibility into where the money is spent. Cost allocation tags can help track and categorize your cost of ML. They can answer questions such as “Can I delete this resource to save cost?”
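A small helper can keep tag keys consistent across teams. The sketch below builds the tag list in the format the SageMaker AddTags API expects and applies it to a resource. The tag keys and the set of allowed environments are our own conventions, taken from the examples above.

```python
ALLOWED_ENVIRONMENTS = {"development", "testing", "production"}


def cost_allocation_tags(project, business_unit, environment):
    """Build a consistent tag set for a notebook instance or endpoint."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    return [
        {"Key": "project", "Value": project},
        {"Key": "business-unit", "Value": business_unit},
        {"Key": "environment", "Value": environment},
    ]


def tag_resource(sm_client, resource_arn, tags):
    """sm_client is expected to behave like boto3.client('sagemaker')."""
    sm_client.add_tags(ResourceArn=resource_arn, Tags=tags)


tags = cost_allocation_tags("menu-digitization", "search", "development")
```

For the tags to appear in cost reports, they also need to be activated as cost allocation tags in the billing console.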

Keeping track of cost

If you need visibility of your ML cost on AWS, use AWS Budgets. This helps you track your Amazon SageMaker cost, including development, training, and hosting. You can also set alerts and get a notification when your cost or usage exceeds (or is forecasted to exceed) your budgeted amount. After you create your budget, you can track the progress on the AWS Budgets console.

Conclusion

In this post, I highlighted a few approaches and techniques to optimize cost without compromising on the implementation flexibility so you can deliver best-in-class ML-based business applications.

For more information about optimizing costs, consider the following:

Refer to more ways of optimizing your cost on the cloud by right-sizing your infrastructure. Also take a look at best practices.
For an in-depth cost saving analysis when using an Elastic Inference accelerator, see Serving deep learning at Curalate with Apache MXNet, AWS Lambda, and Amazon Elastic Inference.
Give Amazon SageMaker a try with any of the several sample Jupyter notebooks. For more information about getting started, see Amazon SageMaker – Accelerated Machine Learning.
Learn more about managing ML projects in the whitepaper Managing Machine Learning Projects.

About the Author

BK Chaurasiya is a Principal Product Manager on the Amazon Web Services R&D and Innovation team. He provides technical guidance, design advice, and thought leadership to some of the largest and most successful AWS customers and partners. A technologist at heart, BK specializes in driving DevOps, continuous delivery, and large-scale cloud transformation initiatives to success.

zomato digitizes menus using Amazon Textract and Amazon SageMaker

October 27, 2020

by Chiranjeev Ghai Amazon AWS

This post is co-written by Chiranjeev Ghai, ML Engineer at zomato. zomato is a global food-tech company based in India.

Are you the kind of person who has very specific cravings? Maybe when the mood hits, you don’t want just any kind of Indian food—you want Chicken Chettinad with a side of paratha, and nothing else will hit the spot! To help picky eaters satisfy their cravings, we at zomato have recently added enhanced search engine capabilities to our restaurant aggregation and food delivery platform. These capabilities enable us to recommend restaurants to zomato users based on searches for specific dishes.

We power this functionality with machine learning (ML), using it to extract and structure text data from menu images. To develop this menu digitization technology, we partnered with Amazon ML Solutions Lab to explore the capabilities of the AWS ML Stack. This post summarizes how we used Amazon Textract and Amazon SageMaker to develop a customized menu digitization solution.

Extracting raw text from menus with Amazon Textract

The first component of this solution was to accurately extract all the text in the menu image. This process is known as optical character recognition (OCR). For our use case, we experimented with both in-house and commercial OCR solutions.

We first created an in-house OCR solution by stacking a pre-trained text detection model and a pre-trained text recognition model. The challenge with these models was that they were trained on a standard text dataset that didn’t match the eclectic fonts found in restaurant menus. To improve system performance, we fine-tuned these models by generating a dataset of 1.5 million synthetic text images that were more representative of text in menus.

After evaluating our in-house solution and several commercial OCR solutions, we found that Amazon Textract offers the best text recognition precision and recall. Restaurants often get creative when designing their menus, so OCR robustness was crucial for this use case. Amazon Textract particularly differentiated itself when processing menus with unique fonts, background images, and low image resolutions. Using it is as simple as making an API call:

#Python 3.6
import boto3
textract_client = boto3.client(
    'textract',
    region_name = '' #insert the AWS region you're working in
)
textract_response = textract_client.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': '', #insert the name of the S3 bucket containing your image
            'Name': '' #insert the S3 key of your image
        }
    }
)

print(textract_response)

The following code is the Amazon Textract output for a sample image:

{'DocumentMetadata': {'Pages': 1},
 'Blocks': [{'BlockType': 'PAGE',
   'Geometry': {'BoundingBox': {'Width': 1.0,
     'Height': 1.0,
     'Left': 0.0,
     'Top': 0.0},
  ...
  {'BlockType': 'WORD',
   'Text': 'Dim',
   'Geometry': {'BoundingBox': {'Width': 0.10242128372192383,
     'Height': 0.048968635499477386,
     'Left': 0.24052166938781738,
     'Top': 0.02556285448372364},
...
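Downstream steps typically need only the text and bounding box of each detection. The following sketch flattens a response of the shape shown above into a list of simple dictionaries. The function name and output keys are our own, and the sample response is a trimmed stand-in built from the values in the preceding output.

```python
def extract_detections(textract_response, block_type="WORD"):
    """Return text and bounding box for every block of the given type."""
    detections = []
    for block in textract_response.get("Blocks", []):
        if block.get("BlockType") != block_type:
            continue
        box = block["Geometry"]["BoundingBox"]
        detections.append({
            "text": block["Text"],
            # Coordinates are ratios of the page width and height.
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    return detections


sample_response = {
    "DocumentMetadata": {"Pages": 1},
    "Blocks": [
        {"BlockType": "PAGE",
         "Geometry": {"BoundingBox": {"Width": 1.0, "Height": 1.0,
                                      "Left": 0.0, "Top": 0.0}}},
        {"BlockType": "WORD", "Text": "Dim",
         "Geometry": {"BoundingBox": {"Width": 0.10242128372192383,
                                      "Height": 0.048968635499477386,
                                      "Left": 0.24052166938781738,
                                      "Top": 0.02556285448372364}}},
    ],
}
words = extract_detections(sample_response)
```

Passing block_type="LINE" instead returns whole lines of text, which can be more convenient when dish names span several words.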

The raw outputs are visualized by overlaying them on top of the image. The following image visualizes the preceding raw output. The black boxes are the text-detection bounding boxes provided by Amazon Textract. Extracted text is displayed on the right. Note the unconventional fonts, colors, and images on this menu.

The following image visualizes Amazon Textract outputs for a menu with a different design. Black boxes are the text-detection bounding boxes provided by Amazon Textract. Extracted text is displayed on the right. Again, this menu has unconventional fonts, colors, and images.

Using Amazon SageMaker to build a menu structure detector

The next component of this solution was to group the detections from Amazon Textract by menu section. This enabled our search engine to distinguish between entrees, desserts, beverages, and so on. We framed this as a computer vision problem—object detection, to be precise—and used Amazon SageMaker Ground Truth to collect training data. Ground Truth accelerated this process by providing a fully managed annotation tool that we customized to ask human annotators to draw bounding boxes around every menu section in the image. We used an annotation workforce from AWS Marketplace because this was a niche labeling task, and public labelers from Amazon Mechanical Turk didn’t perform well. With Ground Truth, it took just a few days and approximately $1,400 to label 4,086 images with triplicate redundancy.

With labeled data in hand, we faced a paradox of choice when selecting model-building approaches because object detection is such a thoroughly studied problem. Our choices included:

Removing low-confidence labels from the labeled dataset – Because even human annotators can make mistakes, Ground Truth calculates confidence scores for labels by having multiple annotators (for this use case, three) label the same image. Setting a higher confidence threshold for labels can decrease the noise in the training data at the expense of having less training data.
Data augmentation – Techniques for image data augmentation include horizontal flipping, cropping, shearing, and rotation. Data augmentation can make models more robust by increasing the amount of training data. However, excessive data augmentation may result in poor model convergence.
Feature engineering – From our experience in applying computer vision to processing menus, we had a variety of techniques in mind to emphasize or de-emphasize various aspects of the input images. For example, see the following images.

The following is the original image of a menu.

The following image shows the redacted image (white boxes overlaid on a black background where text detections were found).

The following is a text cropped image. On a black background, crops from the original image are overlaid where text detections were found.

The following is a single channel and text cropped image. The image is encoded as a single RGB channel (for this image, green). You can apply this with other transformations, in this case text cropping.
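To illustrate the redaction transformation, the following sketch builds a black image with white boxes wherever a text detection was found. It is a simplified stand-in rather than our production code: it uses plain nested lists instead of an image library, and it assumes bounding boxes expressed as page ratios, as in the Amazon Textract output.

```python
def redacted_mask(height, width, boxes):
    """Return a height x width grid: 255 inside text boxes, 0 elsewhere."""
    mask = [[0] * width for _ in range(height)]
    for box in boxes:
        # Convert ratio coordinates to pixel indices, clamped to the image.
        x0 = max(0, int(box["Left"] * width))
        y0 = max(0, int(box["Top"] * height))
        x1 = min(width, int((box["Left"] + box["Width"]) * width))
        y1 = min(height, int((box["Top"] + box["Height"]) * height))
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 255
    return mask


mask = redacted_mask(
    height=4, width=4,
    boxes=[{"Left": 0.5, "Top": 0.0, "Width": 0.5, "Height": 0.5}],
)
```

The text cropped variant follows the same pattern, copying the original pixel values into the box instead of writing 255.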

We also had the following additional model-building methods to choose from:

Model architectures like YOLO, SSD, and RCNN, with VGG or ResNet backbones – Each architecture has different trade-offs of model accuracy, inference time, model size, and more. For this use case, model accuracy was the most important metric because menu images were batch processed.
Using a model pre-trained on a general object detection task or starting from scratch – Transfer learning can be helpful when training complex models on small datasets. However, the task of detecting menu sections is very different from a general object detection task (for example, PASCAL VOC), so the pre-training may not be relevant.
Optimizer parameters – These include learning rate, momentum, regularization coefficients, and early stopping configuration.

With so many hyperparameters to consider, we turned to the automatic tuning feature of Amazon SageMaker to coordinate a massive tuning job across all these variables. The following code is an example of tuning a single model architecture and input data configuration:

import sagemaker
import boto3
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, CategoricalParameter, ContinuousParameter
import itertools
from time import sleep

#set to the region you're working in
REGION_NAME = ''
#set a S3 path for SageMaker to store the outputs of the training jobs 
S3_OUTPUT_PATH = ''
#set a S3 location for your training dataset, 
#assumed to be an augmented manifest file
#see: https://docs.aws.amazon.com/sagemaker/latest/dg/augmented-manifest.html
TRAIN_DATA_LOCATION = ''
#set a S3 location for your validation data, 
#assumed to be an augmented manifest file
VAL_DATA_LOCATION = ''
#specify which fields in the augmented manifest file are relevant for training
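DATA_ATTRIBUTE_NAMES = [] #placeholder: list the attribute names before running
#specify image shape
IMAGE_SHAPE = None #placeholder: set before running
#specify label width
LABEL_WIDTH = None #placeholder: set before running
#specify number of samples in the training dataset
NUM_TRAINING_SAMPLES = None #placeholder: set before running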

sgm_role = sagemaker.get_execution_role()
boto_session = boto3.session.Session(
    region_name = REGION_NAME
)
sgm_session = sagemaker.Session(
    boto_session = boto_session
)
training_image = get_image_uri(
    region_name = REGION_NAME, 
    repo_name = 'object-detection', 
    repo_version = 'latest'
)

#set training job configuration
object_detection_estimator = Estimator(
    image_name = training_image,
    role = sgm_role,
    train_instance_count = 1,
    train_instance_type = 'ml.p3.2xlarge',
    train_volume_size = 50,
    train_max_run = 360000,
    input_mode = 'Pipe',
    output_path = S3_OUTPUT_PATH,
    sagemaker_session = sgm_session
)

#set input data configuration
train_data = sagemaker.session.s3_input(
    s3_data = TRAIN_DATA_LOCATION,
    distribution = 'FullyReplicated',
    record_wrapping = 'RecordIO',
    s3_data_type = 'AugmentedManifestFile',
    attribute_names = DATA_ATTRIBUTE_NAMES
) 

val_data = sagemaker.session.s3_input(
    s3_data = VAL_DATA_LOCATION,
    distribution = 'FullyReplicated',
    record_wrapping = 'RecordIO',
    s3_data_type = 'AugmentedManifestFile',
    attribute_names = DATA_ATTRIBUTE_NAMES
)

data_channels = {
    'train': train_data, 
    'validation' : val_data
}

#set static hyperparameters
#see: https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection-api-config.html
static_hyperparameters = {
    'num_classes' : 1,
    'epochs' : 100,               
    'lr_scheduler_step' : '15,30',      
    'lr_scheduler_factor' : 0.1,
    'overlap_threshold' : 0.5,
    'nms_threshold' : 0.45,
    'image_shape' : IMAGE_SHAPE,
    'label_width' : LABEL_WIDTH,
    'num_training_samples' : NUM_TRAINING_SAMPLES,
    'early_stopping' : True,
    'early_stopping_min_epochs' : 5,
    'early_stopping_patience' : 1,
    'early_stopping_tolerance' : 0.05,
}

#set ranges for tunable hyperparameters
hyperparameter_ranges = {
    'learning_rate': ContinuousParameter(
        min_value = 1e-5, 
        max_value = 1e-2, 
        scaling_type = 'Auto'
    ),
    'mini_batch_size': IntegerParameter(
        min_value = 8, 
        max_value = 64, 
        scaling_type = 'Auto'
    )
}

#Not all hyperparameters are feasible to tune directly
#see: https://docs.aws.amazon.com/sagemaker/latest/dg/object-detection-tuning.html
#For these we run model tuning jobs in parallel using a for loop
#We take this approach for tuning over different model architectures 
#and different feature engineering configurations
use_pretrained_options = [0, 1]
base_network_options = ['resnet-50', 'vgg-16']

for use_pretrained, base_network in itertools.product(use_pretrained_options, base_network_options):
    static_hyperparameter_configuration = {
        **static_hyperparameters, 
        'use_pretrained_model' : use_pretrained, 
        'base_network' : base_network
    }
    
    object_detection_estimator.set_hyperparameters(
        **static_hyperparameter_configuration
    )
    
    tuner = HyperparameterTuner(
        estimator = object_detection_estimator,
        objective_metric_name = 'validation:mAP',
        strategy = 'Bayesian',
        hyperparameter_ranges = hyperparameter_ranges,
        max_jobs = 24,
        max_parallel_jobs = 2,
        early_stopping_type = 'Auto',
    )
    
    tuner.fit(
        inputs = data_channels
    )
    
    print(f'Started tuning job: {tuner.latest_tuning_job.name}')
    
    #wait a bit before starting next job so auto generated names don't conflict
    sleep(60)

This code uses version 1.72.0 of the Amazon SageMaker Python SDK, which is the default version installed in Amazon SageMaker notebook instances. Version 2.X introduces breaking changes. For more information, see Use Version 2.x of the SageMaker Python SDK.

We used powerful GPU hardware (p3.2xlarge instances), and it took us just 1 week and approximately $1,500 to explore 455 unique parameter configurations. Of these configurations, Amazon SageMaker found that a fine-tuned Faster R-CNN model with text cropping performed the best, with a mean average precision score of 0.93. This aligned with results from our prior work in this space, which found that two-stage detectors generally outperform single-stage detectors in processing menus.

The following is an example of how the object detection model processed a menu. In this image, the purple boxes are the predicted bounding boxes from the menu section detection model. Black boxes are the text detection bounding boxes provided by Amazon Textract.

Using Amazon SageMaker to build rule- and ML-based text classifiers

The final component in the solution was a layer of text classification. To enable our enhanced search functionality, we had to know if each detection within a menu section was the menu section title, name of a dish, price of a dish, or something else (such as a description of a dish or the name of the restaurant). To this end, we developed a hybrid rule- and ML-based text classification system.

The first step of the classification was to use a rule to determine if a detection was a price or not. This rule simply calculated the proportion of numeric characters in the detection. If the proportion was greater than 40%, the detection was classified as a price. Although simple, this classifier worked well in practice. We used Amazon SageMaker notebook instances as a convenient interactive environment to develop this and other rules.
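The price rule can be sketched in a few lines. This version is an approximation of the rule described above: it assumes whitespace is ignored when computing the proportion, which the description does not specify, and the function names are illustrative.

```python
def numeric_proportion(text):
    """Fraction of non-whitespace characters that are digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def is_price(text, threshold=0.4):
    """Classify a detection as a price if more than 40% of it is numeric."""
    return numeric_proportion(text) > threshold


labels = {text: is_price(text) for text in ["250.00", "Rs. 120", "Chicken 65", ""]}
```

Here "250.00" and "Rs. 120" are classified as prices, while "Chicken 65" is not, because only two of its nine non-whitespace characters are digits.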

After the prices were filtered out, the remaining detections were classified as dish or not dish. From our experience in processing menus, we intuitively knew that in many cases, the location of prices was sufficient to do this classification. For these menus, dishes and prices are listed side by side, so simply classifying detections located to the left of prices as dishes worked well.
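A simplified version of the price location rule might look like the following. The specifics here are assumptions for illustration: a detection counts as being on the same row as a price when their vertical extents overlap by more than half of the shorter box, and as being to the left when its right edge does not extend past the price’s left edge. Detections use ratio coordinates with left, top, width, and height keys.

```python
def vertical_overlap(a, b):
    """Length of the vertical overlap between two boxes (0 if none)."""
    top = max(a["top"], b["top"])
    bottom = min(a["top"] + a["height"], b["top"] + b["height"])
    return max(0.0, bottom - top)


def classify_by_price_location(detections, prices):
    """Label each detection 'dish' if it sits to the left of a price on the same row."""
    labels = []
    for det in detections:
        is_dish = any(
            vertical_overlap(det, price) > 0.5 * min(det["height"], price["height"])
            and det["left"] + det["width"] <= price["left"]
            for price in prices
        )
        labels.append("dish" if is_dish else "not dish")
    return labels


price = {"left": 0.80, "top": 0.10, "width": 0.10, "height": 0.02}
dish = {"left": 0.10, "top": 0.10, "width": 0.30, "height": 0.02}
heading = {"left": 0.10, "top": 0.50, "width": 0.30, "height": 0.04}
labels = classify_by_price_location([dish, heading], [price])
```

The heading has no price on its row, so it is labeled "not dish" and passed on to the later classification steps.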

The following example shows how the rules-based text classification system processed a menu. Green boxes are detections classified as dishes (by the price location rule). Red boxes are detections classified as not dishes (by the price location rule). Blue boxes are detections classified as prices. Final dish detections are on the right.

Some menus might include lengthy dish descriptions or may not list prices next to individual dishes. These menus violate the assumptions of the price location rules, so we turned to model-based text classification. We used Amazon SageMaker training jobs to experiment with many modeling approaches in parallel, including an XGBoost model trained on hashed word count vectors. In the end, we found that a fine-tuned BERT model from GluonNLP achieved the best performance with an AUROC score of 0.86.

The following image is an example of how the model-based text classification system processed a menu. Green boxes are detections classified as dishes (by the BERT model). Red boxes are detections classified as not dishes (by the BERT model). Blue boxes are detections classified as prices. The final dish detections are on the right.

Of the remaining detections (those not classified as prices or dishes), a final round of classification identified menu section titles. We created features that captured the font size of the detection, the location of the detection on the menu, and the length of the words within the detection. We used these features as inputs to a logistic regression model that predicted if a detection is a menu section title or not.

Key features of Amazon SageMaker

In the end, we found that doing OCR was as simple as making an API call to Amazon Textract. However, our use case required additional customization. We selected Amazon SageMaker as an ML platform to develop this customization because it offered several key features:

Amazon SageMaker Notebooks made it easy to spin up Jupyter notebook environments for prototyping and testing rules and models.
Ground Truth helped us build and deploy a custom image annotation tool with no front-end experience required.
Amazon SageMaker automatic tuning enabled us to run massive hyperparameter tuning jobs on powerful hardware, and included an intuitive interface for tracking the results of hundreds of experiments. You can implement tuning jobs with early stopping conditions, which makes experimentation cost-effective.

Amazon SageMaker offers additional integration benefits from including all the preceding features in a single platform:

Amazon SageMaker Notebooks come pre-installed with all the dependencies needed to build models that can be optimized with automatic tuning.
Ground Truth offers easy access to labelers from Mechanical Turk or AWS Marketplace.
Automatic tuning can directly ingest the manifest files created by Amazon SageMaker Ground Truth.

Putting it all together

Our menu digitization system can extract text from images of menus, group it by menu section, extract the title of the section, extract the dishes within each section, and pair each dish with its price. The following is a visualization of the end-to-end solution.

The workflow contains the following steps:

The input is an image of a menu.
Amazon Textract performs OCR on the input image.
An ML-based computer vision model predicts bounding boxes for menu sections in the menu image.
A rules-based classifier classifies Amazon Textract detections as price or not price.
A rules-based classifier (5a) attempts to use the location of price detections to classify the not price detections as dish or not dish. If this rule doesn’t successfully classify most of the detections on the page, an ML-based classifier is used instead (5b).
The ML-based classifier uses hand-crafted features to classify not dish detections as menu section title or not menu section title.
The menu text is structured by combining the menu section detections and the text classification results.
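Steps 4 and 5 of the workflow above can be sketched in Python as follows. The price pattern, the 0.5 coverage threshold, and the convention that the rule returns None when it cannot decide are all assumptions made for illustration; the two classifiers are passed in as plain functions rather than real models.

```python
import re

# Hypothetical pattern: an optional currency symbol followed by a number.
PRICE_PATTERN = re.compile(r"^[₹$€£]?\s*\d+(?:[.,]\d{1,2})?$")


def is_price(text):
    """Step 4: rules-based price / not price decision."""
    return bool(PRICE_PATTERN.match(text.strip()))


def classify_dishes(texts, rule_classifier, ml_classifier, min_coverage=0.5):
    """Step 5: apply the rule (5a) and fall back to the model (5b).

    Both classifiers are caller-supplied functions. The rule returns
    True (dish), False (not dish), or None when it cannot decide.
    """
    candidates = [t for t in texts if not is_price(t)]
    labels = [rule_classifier(t) for t in candidates]
    decided = sum(label is not None for label in labels)
    # Fall back when the rule did not decide most of the detections.
    if candidates and decided / len(candidates) <= min_coverage:
        labels = [ml_classifier(t) for t in candidates]
    return list(zip(candidates, labels))
```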

The following image visualizes a sample output of the system. Green boxes are detections classified as dishes. Blue boxes are detections classified as prices. Yellow boxes are detections classified as menu section titles. Purple boxes are predicted menu section bounding boxes.

The following code is the structured output:

[
   {
      "title":{
         "text":"Shrimp Dishes"
      },
      "dishes":[
         {
            "text":"Shrimp Masala",
            "price":{
               "text":"140"
            }
         },
         {
            "text":"Shrimp Biryani",
            "price":{
               "text":"170"
            }
         },
         {
            "text":"Shrimp Pulav",
            "price":{
               "text":"160"
            }
         }
      ]
   },
   ...
]
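A downstream consumer might flatten this output into rows for search or display. The following Python sketch assumes the structure shown above; the function name is hypothetical, and treating the price as optional is an assumption for menus that do not list one.

```python
import json

structured_output = json.loads("""
[
  {
    "title": {"text": "Shrimp Dishes"},
    "dishes": [
      {"text": "Shrimp Masala", "price": {"text": "140"}},
      {"text": "Shrimp Biryani", "price": {"text": "170"}},
      {"text": "Shrimp Pulav", "price": {"text": "160"}}
    ]
  }
]
""")


def flatten_menu(sections):
    """Return (section title, dish, price) rows from the structured output."""
    rows = []
    for section in sections:
        title = section["title"]["text"]
        for dish in section["dishes"]:
            # Assumption: a dish without a detected price has no "price" key.
            price = dish.get("price", {}).get("text")
            rows.append((title, dish["text"], price))
    return rows
```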

Conclusion

We built a system that uses ML to digitize menus without any human input required. This system will improve user experience by powering new features such as advanced dish search and review highlight verification. Our content team will also use it to accelerate creating menus for online ordering.

To explore these capabilities of Amazon Textract and Amazon SageMaker in more depth, see Automatically extract text and structured data from documents with Amazon Textract and Amazon SageMaker Automatic Model Tuning: Using Machine Learning for Machine Learning.

The Amazon ML Solutions Lab helped us accelerate our use of ML by pairing our team with ML experts. The ML Solutions Lab brings to every customer engagement learnings from more than 20 years of Amazon’s ML innovations in areas such as fulfillment and logistics, personalization and recommendations, computer vision and translation, fraud prevention, forecasting, and supply chain optimization. To learn more about the AWS ML Solutions Lab, contact your account manager or visit Amazon Machine Learning Solutions Lab.

About the Authors

Chiranjeev Ghai is a Machine Learning Engineer. In his current role, he has been aiding automation at Zomato by leveraging a wide variety of ML optimisations, ranging from image classification and product recommendation to text detection. When not building models, he likes to spend his time playing video games at home.

Ryan Cheng is a Deep Learning Architect in the Amazon ML Solutions Lab. He has worked on a wide range of ML use cases from sports analytics to optical character recognition. In his spare time, Ryan enjoys cooking.

Andrew Ang is a Deep Learning Architect at the Amazon ML Solutions Lab, where he helps AWS customers identify and build AI/ML solutions to address their business problems.

Vinayak Arannil is a Data Scientist at the Amazon Machine Learning Solutions Lab. He has worked on various domains of data science like computer vision, natural language processing, recommendation systems, etc.

Video streaming and deep learning: Using Amazon Kinesis Video Streams with Deep Java Library

October 27, 2020

by Zach Kimberg Amazon AWS

Amazon Kinesis Video Streams allows you to easily ingest video data from connected devices for processing. One of the most effective ways to process this video data is using the power of deep learning. You can create an efficient service infrastructure to run these computations with a Java server, but Java support for deep learning has traditionally been difficult to come by.

Deep Java Library (DJL) is a new open-source deep learning framework for Java built by AWS. It sits on top of native engines, so you can train entirely in DJL while using different engines on the backend, such as PyTorch and Apache MXNet. It can also import and run models built using TensorFlow, Keras, and PyTorch. DJL can bridge the ease of Kinesis Video Streams with the power of deep learning for your own video analytics application.

In this tutorial, we walk through running an object detection model against a Kinesis video stream. In object detection, the computer finds different types of objects in an image and draws a bounding box, describing their locations inside the image. For example, you can use detection to recognize objects like dogs or people to avoid false alarms in a home security camera.

The full project and instructions to run it are available in the DJL demo repository.

Setting up

To begin, create a new Java project with the following dependencies, shown here in gradle format:

dependencies {
    implementation platform("ai.djl:bom:0.8.0")
    implementation "ai.djl:api"
    
    runtimeOnly "ai.djl.mxnet:mxnet-model-zoo"
    runtimeOnly "ai.djl.mxnet:mxnet-native-auto"
    
    implementation "software.amazon.awssdk:kinesisvideo:2.10.75"
    implementation "software.amazon.kinesis:amazon-kinesis-client:2.2.9"
    implementation "com.amazonaws:amazon-kinesis-video-streams-parser-library:1.0.13"
}

The DjlImageVisitor

Because the model works on images, you can create a DJL FrameVisitor that visits and runs your model on each frame in the video. In real applications, it might help to only run your model on a fraction of the frames in the video. See the following code:

FrameVisitor frameVisitor = FrameVisitor.create(new DjlImageVisitor());

The DjlImageVisitor class extends the H264FrameDecoder to provide the capability to convert the frame into a standard Java BufferedImage. Because DJL natively supports this class, you can run it directly from the BufferedImage.

In DJL, the Predictor is used to run the trained model against live data. This is often referred to as inference or prediction. It fully encapsulates the inference experience by taking your input through preprocessing to prepare it into the model’s data structure, running the model itself, and postprocessing the data into an easy-to-use output class. In the following code block, the Predictor converts an Image to the set of outputs, DetectedObjects. An ImageFactory converts a standard Java BufferedImage into the DJL Image class:

public class DjlImageVisitor extends H264FrameDecoder {

    Predictor<Image, DetectedObjects> predictor;
    ImageFactory factory = ImageFactory.getInstance();

    ...

}

DJL also provides a model zoo where you can find many models trained on different tasks, datasets, and engines. For now, create a Predictor using the basic SSD object detection model. You can also use the default preprocessing and postprocessing defined within the model zoo to directly create a Predictor. For your own applications, you can define custom processing in a Translator and pass it in when creating a new Predictor:

Criteria<Image, DetectedObjects> criteria = Criteria.builder()
    .setTypes(Image.class, DetectedObjects.class)
    .optArtifactId("ai.djl.mxnet:ssd")
    .build();
predictor = ModelZoo.loadModel(criteria).newPredictor();

Then, you just need to define the FrameVisitor’s process method, which is called to handle each frame, as follows. You convert the Frame into a BufferedImage using the decodeH264Frame method defined within the H264FrameDecoder. You wrap that into an Image using the ImageFactory you created earlier. Then, you use your Predictor to run prediction using the SSD model. See the following code:

    @Override
    public void process(
            Frame frame,
            MkvTrackMetadata trackMetadata,
            Optional<FragmentMetadata> fragmentMetadata)
            throws FrameProcessException {

        Image image = factory.fromImage(decodeH264Frame(frame, trackMetadata));
        DetectedObjects prediction = predictor.predict(image);
    }

Using the prediction

At this point, you have the detected objects and can use them for whatever your application requires. For a simple application, you could just print out all the class names that you detected to standard out as follows:

        String classStr =
                prediction
                        .items()
                        .stream()
                        .map(Classification::getClassName)
                        .collect(Collectors.joining(", "));
        System.out.println("Found objects: " + classStr);

You could also find out if there is a high probability that a person was in the image using the following code:

        boolean hasPerson =
                prediction
                        .items()
                        .stream()
                        .anyMatch(
                                c ->
                                        "person".equals(c.getClassName())
                                                && c.getProbability() > 0.5);

Another option is to use the image visualization methods in the Image class to draw the bounding boxes on top of the original image. Then, you can get a visual representation of the detected objects. See the following code:

        image.drawBoundingBoxes(prediction);
        Path outputFile = Paths.get("out/annotatedImage.png");
        try (OutputStream os = Files.newOutputStream(outputFile)) {
            image.save(os, "png");
        }

Running the stream

You’re now ready to set up your video stream. For instructions, see Create a Kinesis Video Stream. Make sure to record the REGION and STREAM_NAME that you used so you can pass them into your application.

Then, create a new thread pool to run your application. You also need to build a GetMediaWorker with all the data for your video stream and run it on the thread pool. For your GetMediaWorker, you need to pass in the data you pulled from the Kinesis Video Streams console describing your video stream. You also need to provide the AWS credentials for accessing the stream. Use the SystemPropertiesCredentialsProvider, which finds the credentials in the JVM system properties. You can find more details about providing these credentials in the demo repository. Lastly, you need to pass in StartSelectorType.NOW to start using the stream immediately. See the following code:

ExecutorService executorService = Executors.newFixedThreadPool(1);

AmazonKinesisVideoClientBuilder amazonKinesisVideoBuilder =
        AmazonKinesisVideoClientBuilder.standard();
amazonKinesisVideoBuilder.setRegion(REGION.getName());
amazonKinesisVideoBuilder.setCredentials(new SystemPropertiesCredentialsProvider());
AmazonKinesisVideo amazonKinesisVideo = amazonKinesisVideoBuilder.build();



GetMediaWorker getMediaWorker =
        GetMediaWorker.create(
                REGION,
                new SystemPropertiesCredentialsProvider(),
                STREAM_NAME,
                new StartSelector().withStartSelectorType(StartSelectorType.NOW),
                amazonKinesisVideo,
                frameVisitor);
executorService.submit(getMediaWorker);

Conclusion

That’s it! You’re ready to begin sending data to your stream and detecting the objects in the video. You can find more information about the Kinesis Video Streams API in the Amazon Kinesis Video Streams Producer SDK Java GitHub repo. The full Kinesis Video Streams DJL demo is available with the rest of the DJL demo applications and integrations with many other AWS and Java tools in the demo repository.

Now that you have integrated Kinesis Video Streams and DJL, you can improve your application in many different ways. You can choose additional object detection and image-based models from the more than 70 pre-trained and ready-to-use models in our model zoo from GluonCV, TorchHub, and Keras. You can run these or custom models across any of the engines supported by DJL, including TensorFlow, PyTorch, MXNet, and ONNX Runtime. DJL even has full training support so you can build your own model to add to your video streaming application instead of relying on a pre-trained one.

Don’t forget to follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!

About the Authors

Zach Kimberg is a Software Engineer with AWS Deep Learning working mainly on Apache MXNet for Java and Scala. Outside of work he enjoys reading, especially Fantasy.

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.

For Neha Rungta, it’s the journey that matters

October 27, 2020

by admin Amazon AWS

Rungta had a promising career with NASA, but decided the stars aligned for her at Amazon.

Bringing real-time machine learning-powered insights to rugby using Amazon SageMaker

October 26, 2020

by Mehdi Noori Amazon AWS

The Guinness Six Nations Championship began in 1883 as the Home Nations Championship among England, Ireland, Scotland, and Wales, with the inclusion of France in 1910 and Italy in 2000. It is among the oldest surviving rugby traditions and one of the best-attended sporting events in the world. The COVID-19 outbreak disrupted the end of the 2020 Championship and four games were postponed. The remaining rounds resumed on October 24. With the increasing application of artificial intelligence and machine learning (ML) in sports analytics, AWS and Stats Perform partnered to bring ML-powered, real-time stats to the game of rugby, to enhance fan engagement and provide valuable insights into the game.

This post summarizes the collaborative effort between the Guinness Six Nations Rugby Championship, Stats Perform, and AWS to develop an ML-driven approach with Amazon SageMaker and other AWS services that predicts the probability of a successful penalty kick, computed in real time and broadcast live during the game. AWS infrastructure enables single-digit millisecond latency for kick predictions during inference. The Kick Predictor stat is one of the many new AWS-powered, on-screen dynamic Matchstats that provide fans with a greater understanding of key in-game events, including scrum analysis, play patterns, rucks and tackles, and power game analysis. For more information about other stats developed for rugby using AWS services, see the Six Nations Rugby website.

Rugby is a form of football with a 23-player match day squad. Fifteen players on each team are on the field, with substitutes waiting to get involved in the full-contact sport. The objective of the game is to outscore the opposing team, and one way of scoring is to kick a goal. The ability to kick accurately is one of the most critical elements of rugby, and there are two ways to score with a kick: through a conversion (worth two points) and a penalty (worth three points).

Predicting the likelihood of a successful kick is important because it enhances fan engagement during the game by showing the success probability before the player kicks the ball. There are usually 40–60 seconds of stoppage time while the player sets up for the kick, during which the Kick Predictor stat can appear on-screen to fans. Commentators also have time to predict the outcome, quantify the difficulty of each kick, and compare kickers in similar situations. Moreover, teams may start to use kicking probability models in the future to determine which player should kick given the position of the penalty on the pitch.

Developing an ML solution

To calculate the penalty success probability, the Amazon Machine Learning Solutions Lab used Amazon SageMaker to train, test, and deploy an ML model from historical in-game events data, which calculates the kick predictions from anywhere in the field. The following sections explain the dataset and preprocessing steps, the model training, and model deployment procedures.

Dataset and preprocessing

Stats Perform provided the dataset for training the goal kick model. It contained millions of events from historical rugby matches from 46 leagues from 2007–2019. The raw JSON events data that was collected during live rugby matches was ingested and stored on Amazon Simple Storage Service (Amazon S3). It was then parsed and preprocessed in an Amazon SageMaker notebook instance. After selecting the kick-related events, the training data comprised approximately 67,000 kicks, with approximately 50,000 (75%) successful kicks and 17,000 misses (25%).

The following graph shows a summary of kicks taken during a sample game. The athletes kicked from different angles and various distances.

Rugby experts contributed valuable insights to the data preprocessing, which included detecting and removing anomalies, such as unreasonable kicks. The clean CSV data went back to an S3 bucket for ML training.

The following graph depicts the heatmap of the kicks after preprocessing. The left-side kicks are mirrored. Brighter colors indicate a higher chance of scoring, standardized between 0 and 1.

Feature engineering

To better capture the real-world event, the ML Solutions Lab engineered several features using exploratory data analysis and insights from rugby experts. The features that went into the modeling fell into three main categories:

Location-based features – The zone in which the athlete takes the kick and the distance and angle of the kick to the goal. The x-coordinates of the kicks are mirrored along the center of the rugby pitch to eliminate the left or right bias in the model.
Player performance features – The mean success rates of the kicker in a given field zone, in the Championship, and in the kicker’s entire career.
In-game situational features – The kicker’s team (home or away), the scoring situation before they take the kick, and the period of the game in which they take the kick.

The location-based and player performance features are the most important features in the model.
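The location-based features can be illustrated with a short Python sketch. The coordinate system (x across the pitch in meters, y as the distance from the goal line), the 70 m pitch width, and the definition of the angle relative to the straight-on direction are assumptions; the post does not give the exact formulas.

```python
import math

PITCH_WIDTH = 70.0  # assumed pitch width in meters


def location_features(x, y):
    """Compute mirrored position, distance, and angle for one kick.

    x runs across the pitch (0 to PITCH_WIDTH); y is the distance from
    the goal line. The goal is assumed to be centered on the goal line.
    """
    center = PITCH_WIDTH / 2
    # Mirror left-side kicks onto the right side to remove left/right bias.
    x_mirrored = PITCH_WIDTH - x if x < center else x
    lateral = x_mirrored - center
    distance = math.hypot(lateral, y)
    # 0 degrees means the kick is taken from straight in front of the goal.
    angle_deg = math.degrees(math.atan2(lateral, y))
    return {"x_mirrored": x_mirrored, "distance": distance, "angle_deg": angle_deg}
```

With this mirroring, a kick taken 5 m from the left touchline and one taken 5 m from the right touchline at the same depth produce identical features.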

After feature engineering, the categorical variables were one-hot encoded, and to avoid the bias of the model towards large-value variables, the numerical predictors were standardized. During the model training phase, a player’s historical performance features were pushed to Amazon DynamoDB tables. DynamoDB helped provide single-digit millisecond latency for kick predictions during inference.
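One-hot encoding and standardization can be sketched in plain Python as follows. The function names are hypothetical, and a production pipeline would more likely use a library such as scikit-learn; the point is only to show what the two transformations do.

```python
import statistics


def one_hot(values):
    """Encode categorical values as 0/1 columns, one per category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return rows, categories


def standardize(values):
    """Rescale numeric values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]
```

At inference time, the mean and standard deviation computed on the training data would need to be reused so that new kicks are scaled the same way.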

Training and deploying models

To explore a wide range of classification algorithms (such as logistic regression, random forests, XGBoost, and neural networks), a 10-fold stratified cross-validation approach was used for model training. After exploring different algorithms, the built-in XGBoost in Amazon SageMaker was used due to its better prediction performance and inference speed. Additionally, its implementation has a smaller memory footprint, better logging, and improved hyperparameter optimization (HPO) compared to the original code base.
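Stratified cross-validation keeps the ratio of successful kicks to misses roughly the same in every fold. The following Python sketch shows one simple way to build such folds; it is an illustration with hypothetical names, not the code used in the project.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold while preserving class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds
```

Each fold then serves once as the validation set while the remaining folds are used for training.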

HPO, or tuning, is the process of choosing a set of optimal hyperparameters for a learning algorithm, and is a challenging element in any ML problem. HPO in Amazon SageMaker uses an implementation of Bayesian optimization to choose the best hyperparameters for the next training job. Amazon SageMaker HPO automatically launches multiple training jobs with different hyperparameter settings, evaluates the results of those training jobs based on a predefined objective metric, and selects improved hyperparameter settings for future attempts based on previous results.

The following diagram illustrates the model training workflow.

Optimizing hyperparameters in Amazon SageMaker

You configure the training jobs that the hyperparameter tuning job launches by initializing an estimator, which includes the container image for the algorithm (for this use case, XGBoost), configuration for the output of the training jobs, the values of static algorithm hyperparameters, and the type and number of instances to use for the training jobs. For more information, see Train a Model.

To create the XGBoost estimator for this use case, enter the following code:

import boto3
import sagemaker
from sagemaker.tuner import IntegerParameter, CategoricalParameter, ContinuousParameter, HyperparameterTuner
from sagemaker.amazon.amazon_estimator import get_image_uri

BUCKET = '<bucket name>'  # replace with the name of your S3 bucket
PREFIX = 'kicker/xgboost'  # no trailing slash; the format strings add the separator
region = boto3.Session().region_name
role = sagemaker.get_execution_role()
smclient = boto3.Session().client('sagemaker')
sess = sagemaker.Session()
s3_output_path = 's3://{}/{}/output'.format(BUCKET, PREFIX)

# Container image for the built-in XGBoost algorithm
container = get_image_uri(region, 'xgboost', repo_version='0.90-1')

# Estimator that defines the training jobs launched by the tuning job
xgb = sagemaker.estimator.Estimator(container,
                                    role,
                                    train_instance_count=4,
                                    train_instance_type='ml.m4.xlarge',
                                    output_path=s3_output_path,
                                    sagemaker_session=sess)

After you create the XGBoost estimator object, set its initial hyperparameter values as shown in the following code:

xgb.set_hyperparameters(eval_metric='auc',
                        objective='binary:logistic',
                        num_round=200,
                        rate_drop=0.3,
                        max_depth=5,
                        subsample=0.8,
                        gamma=2,
                        eta=0.2,
                        scale_pos_weight=2.85)  # For class imbalance weights

# Specifying the objective metric (AUC on the validation set)
OBJECTIVE_METRIC_NAME = 'validation:auc'

# Specifying the hyperparameters and their ranges
HYPERPARAMETER_RANGES = {'eta': ContinuousParameter(0, 1),
                         'alpha': ContinuousParameter(0, 2),
                         'max_depth': IntegerParameter(1, 10)}

For this post, AUC (area under the ROC curve) is the evaluation metric. This enables the tuning job to measure the performance of the different training jobs. The kick prediction is also a binary classification problem, which is specified in the objective argument as binary:logistic. There is also a set of XGBoost-specific hyperparameters that you can tune. For more information, see Tune an XGBoost model.
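For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen successful kick receives a higher predicted score than a randomly chosen miss. The following Python sketch computes it directly from that definition; it is a small illustration and not how XGBoost computes the metric internally.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 1 (positive) or 0 (negative); ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

This pairwise form is quadratic in the number of samples, so it is suitable only for small examples.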

Next, create a HyperparameterTuner object by passing the XGBoost estimator, the hyperparameter ranges, the objective metric name and definition, and tuning resource configurations, such as the number of training jobs to run in total and how many training jobs can run in parallel. Amazon SageMaker extracts the metric from Amazon CloudWatch Logs with a regular expression. See the following code:

tuner = HyperparameterTuner(xgb,
                            OBJECTIVE_METRIC_NAME,
                            HYPERPARAMETER_RANGES,
                            max_jobs=20,
                            max_parallel_jobs=4)
s3_input_train = sagemaker.s3_input(s3_data='s3://{}/{}/train'.format(BUCKET, PREFIX), content_type='csv')
s3_input_validation = sagemaker.s3_input(s3_data='s3://{}/{}/validation/'.format(BUCKET, PREFIX), content_type='csv')
tuner.fit({'train': s3_input_train, 'validation': s3_input_validation})

Finally, launch a hyperparameter tuning job by calling the fit() function. This function takes the paths of the training and validation datasets in the S3 bucket. After you create the hyperparameter tuning job, you can track its progress via the Amazon SageMaker console. The training time depends on the instance type and number of instances you selected during tuning setup.

Deploying the model on Amazon SageMaker

When the training jobs are complete, you can deploy the best performing model. If you’d like to compare models for A/B testing, Amazon SageMaker supports hosting representational state transfer (REST) endpoints for multiple models. To set this up, create an endpoint configuration that describes the distribution of traffic across the models. In addition, the endpoint configuration describes the instance type required for model deployment. The first step is to get the name of the best performing training job and create the model name.
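As a rough sketch of these two steps, the following Python helpers look up the best training job of a tuning job and build a traffic split for A/B testing. The helper names, variant naming scheme, and default instance type are assumptions; the client argument is expected to be a boto3 SageMaker client, and the returned list is in the shape that the ProductionVariants parameter of create_endpoint_config takes.

```python
def best_training_job_name(sm_client, tuning_job_name):
    """Return the name of the best training job found by a tuning job."""
    response = sm_client.describe_hyper_parameter_tuning_job(
        HyperParameterTuningJobName=tuning_job_name)
    return response["BestTrainingJob"]["TrainingJobName"]


def ab_production_variants(model_weights, instance_type="ml.t2.medium"):
    """Build one production variant per model for an A/B test.

    model_weights maps a (hypothetical) model name to its relative
    share of the endpoint traffic.
    """
    return [
        {
            "VariantName": "variant-{}".format(model_name),
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }
        for model_name, weight in model_weights.items()
    ]
```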

After you create the endpoint configuration, you’re ready to deploy the actual endpoint for serving inference requests. The result is an endpoint that you can validate and incorporate into production applications. For more information about deploying models, see Deploy the Model to Amazon SageMaker Hosting Services. To create the endpoint configuration and deploy it, enter the following code:

endpoint_name = 'Kicker-XGBoostEndpoint'
xgb_predictor = tuner.deploy(initial_instance_count=1, 
                             instance_type='ml.t2.medium', 
                             endpoint_name=endpoint_name)

After you create the endpoint, you can request a prediction in real time.

Building a RESTful API for real-time model inference

You can create a secure and scalable RESTful API that enables you to request the model prediction based on the input values. It’s easy and convenient to develop different APIs using AWS services.

The following diagram illustrates the model inference workflow.

First, you request the probability of the kick conversion by passing parameters through Amazon API Gateway, such as the location and zone of the kick, kicker ID, league and Championship ID, the game’s period, if the kicker’s team is playing home or away, and the team score status.

The API Gateway passes the values to the AWS Lambda function, which parses the values and requests additional features related to the player’s performance from DynamoDB lookup tables. These include the mean success rates of the kicking player in a given field zone, in the Championship, and in the kicker’s entire career. If the player doesn’t exist in the database, the model uses the average performance in the database in the given kicking location. After the function combines all the values, it standardizes the data and sends it to the Amazon SageMaker model endpoint for prediction.

The model performs the prediction and returns the predicted probability to the Lambda function. The function parses the returned value and sends it back to API Gateway. API Gateway responds with the output prediction. The end-to-end process latency is less than a second.
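The lookup-with-fallback logic in the Lambda function can be sketched as follows. Plain dictionaries stand in for the DynamoDB lookup tables, and all names, keys, and the use of a single zone average for all three rates are assumptions made for illustration.

```python
def kicker_rates(player_id, zone, player_table, zone_average_table):
    """Return the kicker's success rates, or zone averages if unknown.

    player_table maps (player_id, zone) to a dict of success rates;
    zone_average_table maps a zone to the average success rate there.
    """
    key = (player_id, zone)
    if key in player_table:
        return player_table[key]
    # Unknown player: fall back to the average for this kicking location.
    average = zone_average_table[zone]
    return {"zone_rate": average,
            "championship_rate": average,
            "career_rate": average}


def standardize(features, means, stds):
    """Scale features with the statistics saved from training."""
    return {name: (value - means[name]) / stds[name]
            for name, value in features.items()}
```

The standardized values would then be serialized and sent to the model endpoint for prediction.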

The following screenshot shows example input and output of the API. The RESTful API also outputs the average success rate of all the players in the given location and zone to get the comparison of the player’s performance with the overall average.

For instructions on creating a RESTful API, see Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda.

Bringing design principles into sports analytics

To create the first real-time prediction model for the tournament with a millisecond latency requirement, the ML Solutions Lab team worked backwards to identify areas in which design thinking could save time and resources. The team worked on an end-to-end notebook within an Amazon SageMaker environment, which enabled data access, raw data parsing, data preprocessing and visualization, feature engineering, model training and evaluation, and model deployment in one place. This helped in automating the modeling process.

Moreover, the ML Solutions Lab team implemented an incremental update procedure: when the model is updated with newly generated data, only the additional data is parsed and processed. This brings computational and time efficiencies to the modeling.

In terms of next steps, the Stats Perform AI team has been looking at the next stage of rugby analysis by breaking down other strategic facets, such as line-outs, scrums and teams, and continuous phases of play, using the fine-grained spatio-temporal data captured. The state-of-the-art feature representations and latent factor modelling (which have been utilized so effectively in Stats Perform’s “Edge” match-analysis and recruitment products in soccer) mean that there is plenty of fertile space for innovation to explore in rugby.

Conclusion

Six Nations Rugby, Stats Perform, and AWS came together to bring the first real-time prediction model to the 2020 Guinness Six Nations Rugby Championship. The model determined a penalty or conversion kick success probability from anywhere in the field. They used Amazon SageMaker to build, train, and deploy the ML model with variables grouped into three main categories: location-based features, player performance features, and in-game situational features. The Amazon SageMaker endpoint provided prediction results with subsecond latency. The model was used by broadcasters during the live games in the Six Nations 2020 Championship, bringing a new metric to millions of rugby fans.

You can find full, end-to-end examples of creating custom training jobs, training state-of-the-art object detection models, and model deployment on Amazon SageMaker on the AWS Labs GitHub repo. To learn more about the ML Solutions Lab, see Amazon Machine Learning Solutions Lab.

About the Authors

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals, and helps them to accelerate their cloud migration journey, and to solve their ML problems using state-of-the-art solutions and technologies.

Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across different verticals to accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. Outside of work, he enjoys spending time with his family and reading books.

Patrick Lucey is the Chief Scientist at Stats Perform. Patrick started the Artificial Intelligence group at Stats Perform in 2015, with the group focusing on both computer vision and predictive modelling capabilities in sport. Previously, he was at Disney Research for 5 years, where he conducted research into automatic sports broadcasting using large amounts of spatiotemporal tracking data. He received his BEng(EE) from USQ and PhD from QUT, Australia in 2003 and 2008 respectively. He was also co-author of the best paper at the 2016 MIT Sloan Sports Analytics Conference and in 2017 & 2018 was co-author of best-paper runner-up at the same conference.

Xavier Ragot is a Data Scientist with the Amazon ML Solutions Lab team, where he helps design creative ML solutions to address customers’ business problems in various industries.

Building an NLU-powered search application with Amazon SageMaker and the Amazon ES KNN feature

October 26, 2020

by Amit Mukherjee Amazon AWS

The rise of semantic search engines has made search easier for the customers of ecommerce and retail businesses. Search engines powered by natural language understanding (NLU) allow you to speak or type into a device using your preferred conversational language rather than hunting for the right keywords to fetch the best results. You can query using words or sentences in your native language, leaving it to the search engine to deliver the best results.

Amazon SageMaker is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly. Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy for you to deploy, secure, and run Elasticsearch cost-effectively at scale. Amazon ES offers KNN search, which can enhance search in use cases such as product recommendations, fraud detection, image and video search, and some specific semantic scenarios like document and query similarity. Alternatively, you can also choose Amazon Kendra, a highly accurate and easy to use enterprise search service that’s powered by machine learning, with no machine learning experience required. In this post, we explain how you can implement an NLU-based product search for certain types of applications using Amazon SageMaker and the Amazon ES k-nearest neighbor (KNN) feature.

In the post Building a visual search application with Amazon SageMaker and Amazon ES, we shared how to build a visual search application using Amazon SageMaker and the Amazon ES KNN’s Euclidean distance metric. Amazon ES now supports open-source Elasticsearch version 7.7 and includes the cosine similarity metric for KNN indexes. Cosine similarity measures the cosine of the angle between two vectors, where a smaller angle (a cosine closer to 1) denotes higher similarity between the vectors. Because cosine similarity measures the orientation of two vectors rather than their magnitude, it is a good choice for some specific semantic search applications. The highly distributed architecture of Amazon ES enables you to implement an enterprise-grade search engine with enhanced KNN ranking, with high recall and performance.
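To make the difference between the two metrics concrete, here is a minimal sketch in plain Python. The three 2-dimensional vectors are made up for illustration and are not taken from the dataset; real sentence embeddings in this post have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 1.0 means they point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between a and b; sensitive to vector length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical vectors chosen to show where the two metrics disagree
short = [1.0, 2.0]
long_same_direction = [10.0, 20.0]   # same orientation, much larger magnitude
different_direction = [2.0, 1.0]     # similar magnitude, different orientation

print(cosine_similarity(short, long_same_direction))   # highest possible similarity
print(cosine_similarity(short, different_direction))   # lower similarity
print(euclidean_distance(short, long_same_direction))  # large distance
print(euclidean_distance(short, different_direction))  # small distance
```

Cosine similarity ranks the vector with the same orientation as the best match, while Euclidean distance prefers the vector that is merely close in space.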

In this post, you build a very simple search application that demonstrates the potential of using KNN with Amazon ES compared to the traditional Amazon ES ranking method, including a web application for testing the KNN-based search queries in your browser. The application also compares the search results with Elasticsearch match queries to demonstrate the difference between KNN search and full-text search.

Overview of solution

Regular Elasticsearch text-matching search is useful when you want to do text-based search, but KNN-based search is a more natural way to search for something. For example, when you search for a wedding dress using a KNN-based search application, it gives you similar results whether you type “wedding dress” or “marriage dress.” Implementing this KNN-based search application consists of two phases:

KNN reference index – In this phase, you pass a set of corpus documents through a deep learning model to extract their features, or embeddings. Text embeddings are a numerical representation of the corpus. You save those features into a KNN index on Amazon ES. The concept underpinning KNN is that similar data points exist in close proximity in the vector space. As an example, “summer dress” and “summer flowery dress” are both similar, so these text embeddings are collocated, as opposed to “summer dress” vs. “wedding dress.”
KNN index query – This is the inference phase of the application. In this phase, you submit a text search query through the deep learning model to extract the features. Then, you use those embeddings to query the reference KNN index. The KNN index returns similar text embeddings from the KNN vector space. For example, if you pass a feature vector of “marriage dress” text, it returns “wedding dress” embeddings as a similar item.
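The two phases can be sketched with a brute-force, in-memory stand-in for the KNN index. This is only an illustration of the idea: the 3-dimensional vectors below are invented, whereas the real application gets 768-dimensional embeddings from the model and stores them in Amazon ES.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Phase 1: the reference index maps each document to its embedding.
# These toy embeddings are hypothetical values chosen by hand.
reference_index = {
    "summer dress": [0.9, 0.1, 0.0],
    "summer flowery dress": [0.8, 0.2, 0.1],
    "wedding dress": [0.1, 0.9, 0.2],
}

def knn_query(query_vector, index, k=2):
    """Phase 2: return the k documents whose embeddings are most similar."""
    ranked = sorted(
        index.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical embedding for the query text "marriage dress"
marriage_dress = [0.2, 0.8, 0.3]
nearest = knn_query(marriage_dress, reference_index, k=1)
print(nearest)
```

A real KNN index avoids comparing the query against every stored vector, but the inputs and outputs have the same shape as in this sketch.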

Next, let’s take a closer look at each phase in detail, with the associated AWS architecture.

KNN reference index creation

For this use case, you use dress images and their visual descriptions from the Feidegger dataset. This dataset is a multi-modal corpus that focuses specifically on the domain of fashion items and their visual descriptions in German. The dataset was created as part of ongoing research at Zalando into text-image multi-modality in the area of fashion.

In this step, you translate each dress description from German to English using Amazon Translate. From each English description, you extract the feature vector, which is an n-dimensional vector of numerical features that represent the dress. You use a pre-trained BERT model hosted in Amazon SageMaker to extract a 768-dimensional feature vector from each visual description of the dress, and store the vectors in a KNN index in an Amazon ES domain.

The following screenshot illustrates the workflow for creating the KNN index.

The process includes the following steps:

Users interact with a Jupyter notebook on an Amazon SageMaker notebook instance. An Amazon SageMaker notebook instance is an ML compute instance running the Jupyter Notebook app. Amazon SageMaker manages creating the instance and related resources.
Each item description, originally open-sourced in German, is translated to English using Amazon Translate.
A pre-trained BERT model is downloaded, and the model artifact is serialized and stored in Amazon Simple Storage Service (Amazon S3). The model is served from a PyTorch model server on an Amazon SageMaker real-time endpoint.
Translated descriptions are pushed through the SageMaker endpoint to extract fixed-length features (embeddings).
The notebook code writes the text embeddings to the KNN index along with product Amazon S3 URI in an Amazon ES domain.

KNN search from a query text

In this step, you present a search query text string from the application, which passes through the Amazon SageMaker hosted model to extract a 768-dimensional feature vector. You use this vector to query the KNN index in Amazon ES. KNN for Amazon ES lets you search for points in a vector space and find the nearest neighbors for those points by cosine similarity (the default is Euclidean distance). When it finds the nearest neighbor vectors (for example, k = 3 nearest neighbors) for a given query text, it returns the associated Amazon S3 images to the application. The following diagram illustrates the KNN search full-stack application architecture.

The process includes the following steps:

The end-user accesses the web application from their browser or mobile device.
A user-provided search query string is sent to Amazon API Gateway and AWS Lambda.
The Lambda function invokes the Amazon SageMaker real-time endpoint, and the model returns a vector of the search query embeddings. Amazon SageMaker hosting provides a managed HTTPS endpoint for predictions and automatically scales to the performance needed for your application using Application Auto Scaling.
The function passes the search query embedding vector as the search value for a KNN search in the index in the Amazon ES domain. A list of k similar items and their respective Amazon S3 URIs are returned.
The function generates pre-signed Amazon S3 URLs to return back to the client web application, used to display similar items in the browser.
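The query step in the Lambda function can be sketched as follows. This is not the code from the post's repository; it is a minimal illustration that assumes the field names `zalando_nlu_vector` and `description` used later in the notebook. The helper names (`build_knn_query`, `build_match_query`, `split_s3_uri`) are hypothetical.

```python
def build_knn_query(embedding, k=3, field="zalando_nlu_vector"):
    """Request body for a KNN search: find the k vectors nearest to the embedding."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(embedding), "k": k}}},
    }

def build_match_query(text, k=3, field="description"):
    """Request body for a standard full-text match query, for comparison."""
    return {"size": k, "query": {"match": {field: text}}}

def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key), as needed to pre-sign a URL."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri}")
    bucket, _, key = uri[len(prefix):].partition("/")
    return bucket, key

knn_body = build_knn_query([0.1, 0.2, 0.3], k=3)
match_body = build_match_query("wedding dress")
print(split_s3_uri("s3://my-bucket/data/feidegger/fashion/example.jpg"))
```

Either body could be passed to the Elasticsearch client's `search` call against the index, and the bucket and key pair is the input that a pre-signed URL request needs.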

Prerequisites

For this walkthrough, you should have an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to launch the AWS CloudFormation template.

Deploying your solution

You use a CloudFormation stack to deploy the solution. The stack creates all the necessary resources, including the following:

An Amazon SageMaker notebook instance to run Python code in a Jupyter notebook
An IAM role associated with the notebook instance
An Amazon ES domain to store and retrieve sentence embedding vectors into a KNN index
Two S3 buckets: one for storing the source fashion images and another for hosting a static website

From the Jupyter notebook, you also deploy the following:

An Amazon SageMaker endpoint for getting fixed-length sentence embedding vectors in real time.
An AWS Serverless Application Model (AWS SAM) template for a serverless backend using API Gateway and Lambda.
A static front-end website hosted on an S3 bucket to demonstrate a real-world, end-to-end ML application. The front-end code uses ReactJS and the AWS Amplify JavaScript library.

To get started, complete the following steps:

Sign in to the AWS Management Console with your IAM user name and password.
Choose Launch Stack and open it in a new tab:

On the Quick create stack page, select the check-box to acknowledge the creation of IAM resources.
Choose Create stack.

Wait for the stack to complete.

You can examine various events from the stack creation process on the Events tab. When the stack creation is complete, you see the status CREATE_COMPLETE.

You can look on the Resources tab to see all the resources the CloudFormation template created.

On the Outputs tab, choose the SageMakerNotebookURL link.

This hyperlink opens the Jupyter notebook on your Amazon SageMaker notebook instance that you use to complete the rest of the lab.

You should be on the Jupyter notebook landing page.

Choose nlu-based-item-search.ipynb.

Building a KNN index on Amazon ES

For this step, you should be at the beginning of the notebook with the title NLU based Item Search. Follow the steps in the notebook and run each cell in order.

You use a pre-trained BERT model (distilbert-base-nli-stsb-mean-tokens) from sentence-transformers and host it on an Amazon SageMaker PyTorch model server endpoint to generate fixed-length sentence embeddings. The embeddings are saved to the Amazon ES domain created in the CloudFormation stack. For more information, see the markdown cells in the notebook.

Continue when you reach the cell Deploying a full-stack NLU search application in your notebook.

The notebook contains several important cells; we walk you through a few of them.

Download the multi-modal corpus dataset from Feidegger, which contains fashion images and descriptions in German. See the following code:

## Data Preparation

import os 
import shutil
import json
import tqdm
import urllib.request
from tqdm import notebook
from multiprocessing import cpu_count
from tqdm.contrib.concurrent import process_map

images_path = 'data/feidegger/fashion'
filename = 'metadata.json'

my_bucket = s3_resource.Bucket(bucket)

if not os.path.isdir(images_path):
    os.makedirs(images_path)

def download_metadata(url):
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)
        
#download metadata.json to local notebook
download_metadata('https://raw.githubusercontent.com/zalandoresearch/feidegger/master/data/FEIDEGGER_release_1.1.json')

def generate_image_list(filename):
    # Read the metadata file and collect the image URL of every record
    with open(filename, 'r') as metadata:
        data = json.load(metadata)
    return [record['url'] for record in data]


def download_image(url):
    urllib.request.urlretrieve(url, images_path + '/' + url.split("/")[-1])
                    
#generate image list            
url_lst = generate_image_list(filename)     

workers = 2 * cpu_count()

#downloading images to local disk
process_map(download_image, url_lst, max_workers=workers)

Upload the dataset to Amazon S3:

# Uploading dataset to S3

files_to_upload = []
dirName = 'data'
for path, subdirs, files in os.walk('./' + dirName):
    path = path.replace("\\","/")
    directory_name = path.replace('./',"")
    for file in files:
        files_to_upload.append({
            "filename": os.path.join(path, file),
            "key": directory_name+'/'+file
        })
        

def upload_to_s3(file):
        my_bucket.upload_file(file['filename'], file['key'])
        
#uploading images to s3
process_map(upload_to_s3, files_to_upload, max_workers=workers)

This dataset has product descriptions in German, so you use Amazon Translate for the English translation for each German sentence:

with open(filename) as json_file:
    data = json.load(json_file)

#Define translator function
def translate_txt(data):
    results = {}
    results['filename'] = f's3://{bucket}/data/feidegger/fashion/' + data['url'].split("/")[-1]
    results['descriptions'] = []
    translate = boto3.client(service_name='translate', use_ssl=True)
    for i in data['descriptions']:
        result = translate.translate_text(Text=str(i), 
            SourceLanguageCode="de", TargetLanguageCode="en")
        results['descriptions'].append(result['TranslatedText'])
    return results

Save the sentence transformers model to the notebook instance:

!pip install sentence-transformers

#Save the model to disk which we will host at sagemaker
from sentence_transformers import models, SentenceTransformer
saved_model_dir = 'transformer'
if not os.path.isdir(saved_model_dir):
    os.makedirs(saved_model_dir)

model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
model.save(saved_model_dir)

Upload the model artifact (model.tar.gz) to Amazon S3 with the following code:

# Package the saved model as a .tar.gz archive
import tarfile
export_dir = 'transformer'
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add(export_dir, recursive=True)

#Upload the model to S3

inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
inputs

Deploy the model into an Amazon SageMaker PyTorch model server using the Amazon SageMaker Python SDK. See the following code:

from sagemaker.pytorch import PyTorch, PyTorchModel
from sagemaker.predictor import RealTimePredictor
from sagemaker import get_execution_role

class StringPredictor(RealTimePredictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(StringPredictor, self).__init__(endpoint_name, sagemaker_session, content_type='text/plain')

pytorch_model = PyTorchModel(model_data = inputs, 
                             role=role, 
                             entry_point ='inference.py',
                             source_dir = './code', 
                             framework_version = '1.3.1',
                             predictor_cls=StringPredictor)

predictor = pytorch_model.deploy(instance_type='ml.m5.large', initial_instance_count=3)
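The deployment code references an `inference.py` entry point that is not reproduced in this post. The following is a hedged sketch of what such a script could look like, using the handler functions that the SageMaker PyTorch model server looks for (`model_fn`, `input_fn`, `predict_fn`, `output_fn`). It is an assumption for illustration, not the script from the repository. In particular, the path passed to `SentenceTransformer` may need adjusting to wherever the archive extracts the saved model.

```python
import json

def model_fn(model_dir):
    """Load the saved model once, when the endpoint container starts."""
    # Imported lazily so the other handlers can be exercised without the library.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_dir)

def input_fn(request_body, request_content_type):
    """Accept the text/plain payload sent by the StringPredictor class."""
    if request_content_type != "text/plain":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    if isinstance(request_body, (bytes, bytearray)):
        return request_body.decode("utf-8")
    return request_body

def predict_fn(input_data, model):
    """Encode a single sentence into one fixed-length embedding."""
    return model.encode([input_data])[0]

def output_fn(prediction, accept):
    """Serialize the embedding as a JSON list of floats."""
    return json.dumps([float(value) for value in prediction])
```

Returning JSON keeps the endpoint response compatible with a caller that parses the result with `json.loads`.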

Define a cosine similarity Amazon ES KNN index mapping with the following code (to define cosine similarity KNN index mapping, you need Amazon ES 7.7 and above):

#KNN index mapping
knn_index = {
    "settings": {
        "index.knn": True,
        "index.knn.space_type": "cosinesimil",
        "analysis": {
          "analyzer": {
            "default": {
              "type": "standard",
              "stopwords": "_english_"
            }
          }
        }
    },
    "mappings": {
        "properties": {
           "zalando_nlu_vector": {
                "type": "knn_vector",
                "dimension": 768
            } 
        }
    }
}
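A mapping like this is typically applied by creating the index before any documents are written. The sketch below assumes an `elasticsearch-py` 7.x client object is passed in as `es`. The helper names and the trimmed `example_body` are hypothetical and only illustrate the shape of the calls.

```python
def create_knn_index(es, index_name, index_body):
    """Create the index; ignore=400 suppresses the error raised if it already exists."""
    return es.indices.create(index=index_name, body=index_body, ignore=400)

def vector_fits_mapping(vector, index_body, field="zalando_nlu_vector"):
    """Check that an embedding has the dimension declared in the mapping."""
    expected = index_body["mappings"]["properties"][field]["dimension"]
    return len(vector) == expected

# Trimmed copy of the mapping, kept small for the example
example_body = {
    "settings": {"index.knn": True, "index.knn.space_type": "cosinesimil"},
    "mappings": {
        "properties": {
            "zalando_nlu_vector": {"type": "knn_vector", "dimension": 768}
        }
    },
}

print(vector_fits_mapping([0.0] * 768, example_body))
```

Checking the vector length before indexing gives a clearer failure than a rejected write, because a `knn_vector` field only accepts vectors of the declared dimension.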

Each product has five visual descriptions, so you combine all five descriptions and get one fixed-length sentence embedding. See the following code:

# For each product, we are concatenating all the 
# product descriptions into a single sentence,
# so that we will have one embedding for each product

def concat_desc(results):
    obj = {
        'filename': results['filename'],
    }
    obj['descriptions'] = ' '.join(results['descriptions'])
    return obj

concat_results = map(concat_desc, results)
concat_results = list(concat_results)
concat_results[0]

Import the sentence embeddings and associated Amazon S3 image URI into the Amazon ES KNN index with the following code. You also load the translated descriptions in full text, so that later you can compare the difference between KNN search and standard match text queries in Elasticsearch.

# Define a function that imports the feature vector for each S3 URI into the Elasticsearch KNN index
# This process takes about 10 minutes.

def es_import(concat_result):
    vector = json.loads(predictor.predict(concat_result['descriptions']))
    es.index(index='idx_zalando',
             body={"zalando_nlu_vector": vector,
                   "image": concat_result['filename'],
                   "description": concat_result['descriptions']}
            )
        
workers = 8 * cpu_count()
    
process_map(es_import, concat_results, max_workers=workers)
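As an alternative to indexing one document per request, the same documents could be sent in batches with the bulk helper from `elasticsearch-py`. This is not what the notebook does. The sketch below only shows how the bulk actions might be generated, and it assumes the embeddings have already been computed and attached to each record under a hypothetical `vector` key.

```python
def bulk_actions(records, index_name="idx_zalando"):
    """Yield one bulk-index action per record, mirroring the es_import document."""
    for record in records:
        yield {
            "_index": index_name,
            "_source": {
                "zalando_nlu_vector": record["vector"],
                "image": record["filename"],
                "description": record["descriptions"],
            },
        }

# Made-up record; a real vector would have 768 values
sample_records = [
    {
        "vector": [0.1, 0.2],
        "filename": "s3://my-bucket/data/feidegger/fashion/example.jpg",
        "descriptions": "a long white dress",
    }
]
actions = list(bulk_actions(sample_records))
print(actions[0]["_index"])

# With a configured client, the actions could be sent like this:
#   from elasticsearch import helpers
#   helpers.bulk(es, bulk_actions(sample_records))
```

Batching reduces the number of HTTP round trips to the domain, at the cost of having to compute the embeddings in a separate step first.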

Building a full-stack KNN search application

Now that you have a working Amazon SageMaker endpoint for extracting text features and a KNN index on Amazon ES, you’re ready to build a real-world, full-stack ML-powered web app. You use an AWS SAM template to deploy a serverless REST API with API Gateway and Lambda. The REST API accepts new search strings, generates the embeddings, and returns similar relevant items to the client. Then you upload a front-end website that interacts with your new REST API to Amazon S3. The front-end code uses Amplify to integrate with your REST API.

In the following cell, prepopulate a CloudFormation template that creates necessary resources such as Lambda and API Gateway for the full-stack application:

s3_resource.Object(bucket, 'backend/template.yaml').upload_file('./backend/template.yaml', ExtraArgs={'ACL':'public-read'})


sam_template_url = f'https://{bucket}.s3.amazonaws.com/backend/template.yaml'

# Generate the CloudFormation Quick Create Link

print("Click the URL below to create the backend API for NLU search:\n")
print((
    'https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review'
    f'?templateURL={sam_template_url}'
    '&stackName=nlu-search-api'
    f'&param_BucketName={outputs["s3BucketTraining"]}'
    f'&param_DomainName={outputs["esDomainName"]}'
    f'&param_ElasticSearchURL={outputs["esHostName"]}'
    f'&param_SagemakerEndpoint={predictor.endpoint}'
))

The following screenshot shows the output: a pre-generated CloudFormation template link.
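If any parameter value could contain characters such as spaces or ampersands, the same kind of link can be assembled with `urllib.parse.urlencode` so that values are escaped. The sketch below is an illustration under that assumption. The function name and the sample values are hypothetical, and the `:` and `/` characters are left unescaped so the template URL looks the same as in the cell above.

```python
from urllib.parse import urlencode

def quick_create_url(region, template_url, stack_name, parameters):
    """Build a CloudFormation quick-create link with escaped parameter values."""
    query = {"templateURL": template_url, "stackName": stack_name}
    # Stack parameters are passed with a param_ prefix, as in the original cell
    query.update({f"param_{name}": value for name, value in parameters.items()})
    return (
        f"https://console.aws.amazon.com/cloudformation/home?region={region}"
        f"#/stacks/create/review?{urlencode(query, safe=':/')}"
    )

link = quick_create_url(
    "us-east-1",
    "https://my-bucket.s3.amazonaws.com/backend/template.yaml",
    "nlu-search-api",
    {"BucketName": "my-bucket", "DomainName": "my domain"},
)
print(link)
```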

Choose the link.

You are sent to the Quick create stack page.

Select the check-boxes to acknowledge the creation of IAM resources, IAM resources with custom names, and CAPABILITY_AUTO_EXPAND.
Choose Create stack.

When the stack creation is complete, you see the status CREATE_COMPLETE. You can look on the Resources tab to see all the resources the CloudFormation template created.

After the stack is created, proceed through the cells.

The following cell indicates that your full-stack application, including front-end and backend code, is successfully deployed:

print('Click the URL below:\n')
print(outputs['S3BucketSecureURL'] + '/index.html')

The following screenshot shows the URL output.

Choose the link.

You are sent to the application page, where you can provide your own search text to find products using both the KNN approach and regular full-text search approaches.

When you’re done testing and experimenting with your KNN search application, run the last two cells at the bottom of the notebook:

# Delete the endpoint
predictor.delete_endpoint()

# Empty S3 Contents
training_bucket_resource = s3_resource.Bucket(bucket)
training_bucket_resource.objects.all().delete()

hosting_bucket_resource = s3_resource.Bucket(outputs['s3BucketHostingBucketName'])
hosting_bucket_resource.objects.all().delete()

These cells delete your Amazon SageMaker endpoint and empty your S3 buckets to prepare you for cleaning up your resources.

Cleaning up

To delete the rest of your AWS resources, go to the AWS CloudFormation console and delete the nlu-search-api and nlu-search stacks.

Conclusion

In this post, we showed you how to create a KNN-based search application using Amazon SageMaker and Amazon ES KNN index features. You used a pre-trained BERT model from the sentence-transformers Python library. You can also fine-tune your BERT model using your own dataset. For more information, see Fine-tuning a PyTorch BERT model and deploying it with Amazon Elastic Inference on Amazon SageMaker.

A GPU instance is recommended for most deep learning purposes. In many cases, training new models is faster on GPU instances than CPU instances. You can scale sub-linearly when you have multi-GPU instances or if you use distributed training across many instances with GPUs. However, we used CPU instances for this use case so you can complete the walkthrough under the AWS Free Tier.

For more information about the code sample in the post, see the GitHub repo.

About the Authors

Amit Mukherjee is a Sr. Partner Solutions Architect with a focus on data analytics and AI/ML. He works with AWS partners and customers to provide them with architectural guidance for building highly secure and scalable data analytics platforms and adopting machine learning at a large scale.

Laith Al-Saadoon is a Principal Solutions Architect with a focus on data analytics at AWS. He spends his days obsessing over designing customer architectures to process enormous amounts of data at scale. In his free time, he follows the latest in machine learning and artificial intelligence.

Ceres: Harvesting knowledge from the semi-structured web

October 26, 2020

by admin Amazon AWS

Watch Amazon senior principal scientist Xin Luna Dong’s keynote CIKM 2020 talk on building a comprehensive product knowledge graph.

Arcanum makes Hungarian heritage accessible with Amazon Rekognition

October 23, 2020

by Sinisa Mikasinovic Amazon AWS

Arcanum specializes in digitizing Hungarian language content, including newspapers, books, maps, and art. With over 30 years of experience, Arcanum serves more than 30,000 global subscribers with access to Hungarian culture, history, and heritage.

Amazon Rekognition Solutions Architects worked with Arcanum to add highly scalable image analysis to Hungaricana, a free service provided by Arcanum that enables you to search and explore Hungarian cultural heritage, including 600,000 faces across 500,000 images. For example, you can find historical works by author Mór Jókai or photos on topics like weddings. The Arcanum team chose Amazon Rekognition to free valuable staff from time- and cost-intensive manual labeling and to improve label accuracy, making 200,000 previously unsearchable images (approximately 40% of the image inventory) available to users.

Amazon Rekognition makes it easy to add image and video analysis to your applications using highly scalable machine learning (ML) technology that requires no previous ML expertise to use. Amazon Rekognition also provides highly accurate facial recognition and facial search capabilities to detect, analyze, and compare faces.

Arcanum uses this facial recognition feature in their image database services to help you find particular people in Arcanum’s articles. This post discusses their challenges and why they chose Amazon Rekognition as their solution.

Automated image labeling challenges

Arcanum dedicated a team of three people to start tagging and labeling content for Hungaricana. The team quickly learned that they would need to invest more than 3 months of time-consuming and repetitive human labor to provide accurate search capabilities to their customers. Considering the size of the team and scope of the existing project, Arcanum needed a better solution that would automate image and object labeling at scale.

Automated image labeling solutions

To speed up and automate image labeling, Arcanum turned to Amazon Rekognition to enable users to search photos by keywords (for example, type of historic event, place name, or a person relevant to Hungarian history).

For the Hungaricana project, preprocessing all the images was challenging. Arcanum ran a TensorFlow face search across all 28 million pages on a machine with 8 GPUs in their own offices to extract only faces from images.

The following screenshot shows what an extract looks like (image provided by Arcanum Database Ltd).

The images containing only faces are sent to Amazon Rekognition, invoking the IndexFaces operation to add a face to the collection. For each face that is detected in the specified face collection, Amazon Rekognition extracts facial features into a feature vector and stores it in an Amazon Aurora database. Amazon Rekognition uses feature vectors when it performs face match and search operations using the SearchFaces and SearchFacesByImage operations.
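The indexing and search calls described above can be sketched with the `boto3` Rekognition client. This is an illustration, not Arcanum's code. The function names, the collection ID, and the threshold are hypothetical, and the client is passed in as a parameter (for example, `boto3.client("rekognition")`) so the sketch does not depend on credentials.

```python
def index_face(client, collection_id, bucket, key, external_id):
    """Add the face found in an S3 image to a collection and return its face IDs."""
    response = client.index_faces(
        CollectionId=collection_id,
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        ExternalImageId=external_id,  # caller-chosen label stored with the face
        MaxFaces=1,                   # the pre-cropped images contain a single face
    )
    return [record["Face"]["FaceId"] for record in response["FaceRecords"]]

def find_similar_faces(client, collection_id, face_id, threshold=90.0, max_faces=24):
    """Return (face ID, similarity) pairs for faces that match an indexed face."""
    response = client.search_faces(
        CollectionId=collection_id,
        FaceId=face_id,
        FaceMatchThreshold=threshold,
        MaxFaces=max_faces,
    )
    return [
        (match["Face"]["FaceId"], match["Similarity"])
        for match in response["FaceMatches"]
    ]
```

`SearchFacesByImage` follows the same pattern but takes an `Image` argument instead of a `FaceId`, so a face can be searched without being indexed first.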

The image preprocessing helped create a very efficient and cost-effective way to index faces. The following diagram summarizes the preprocessing workflow.

As for the web application, the workflow starts with a Hungaricana user making a face search request. The following diagram illustrates the application workflow.

The workflow includes the following steps:

The user requests a facial match by uploading the image. The web request is automatically distributed by the Elastic Load Balancer to the webserver fleet.
Amazon Elastic Compute Cloud (Amazon EC2) powers application servers that handle the user request.
The uploaded image is stored in Amazon Simple Storage Service (Amazon S3).
Amazon Rekognition indexes the face and runs SearchFaces to look for a face similar to the new face ID.
The output of the search face by image operation is stored in Amazon ElastiCache, a fully managed in-memory data store.
The metadata of the indexed faces is stored in an Aurora relational database built for the cloud.
The resulting face thumbnails are served to the customer via the fast content-delivery network (CDN) service Amazon CloudFront.

Experimenting and live testing Hungaricana

During our test of Hungaricana, the application performed extremely well. The searches not only correctly identified people, but also provided links to all publications and sources in Arcanum’s privately owned database where the found faces are present. For example, the following screenshot shows the result for the famous composer and pianist Franz Liszt.

The application provided 42 pages of 6×4 results. The results are capped at 1,000. The 100% scores are the confidence scores returned by Amazon Rekognition and are rounded up to whole numbers.
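The page count follows from the cap, assuming each page shows a full 6×4 grid of 24 results. A quick check of the arithmetic:

```python
import math

results_per_page = 6 * 4   # one 6x4 grid per page
result_cap = 1000          # maximum number of results returned

pages = math.ceil(result_cap / results_per_page)
results_on_last_page = result_cap - (pages - 1) * results_per_page

print(pages, results_on_last_page)
```

Under that assumption, 41 full pages hold 984 results and the last page holds the remaining 16.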

Throughout our testing, the Hungaricana application presented results promptly and with a high degree of certainty, along with links to all corresponding publications.

Business results

By introducing Amazon Rekognition into their workflow, Arcanum enabled a better customer experience, including building family trees, searching for historical figures, and researching historical places and events.

The concept of face searching using artificial intelligence certainly isn’t new. But Hungaricana uses it in a very creative, unique way.

Amazon Rekognition allowed Arcanum to realize three distinct advantages:

Time savings – Time to market improved dramatically. Instead of spending several months of intense manual labor to label all the images, the company can now do this job in a few days. Before, basic labeling of 150,000 images took three people months to complete.
Cost savings – Arcanum saved around $15,000 on the Hungaricana project. Before using Amazon Rekognition, there was no automation, so a human workforce had to scan all the images. Now, employees can shift their focus to other high-value tasks.
Improved accuracy – Users now have a much better experience regarding hit rates. Since Arcanum started using Amazon Rekognition, the number of hits has doubled. Before, out of 500,000 images, about 200,000 weren’t searchable. But with Amazon Rekognition, search is now possible for all 500,000 images.

“Amazon Rekognition made Hungarian culture, history, and heritage more accessible to the world,” says Előd Biszak, Arcanum CEO. “It has made research a lot easier for customers building family trees, searching for historical figures, and researching historical places and events. We cannot wait to see what the future of artificial intelligence has to offer to enrich our content further.”

Conclusion

In this post, you learned how to add highly scalable face and image analysis to an enterprise-level image gallery to improve label accuracy, reduce costs, and save time.

You can test Amazon Rekognition features such as facial analysis, face comparison, or celebrity recognition on images specific to your use case on the Amazon Rekognition console.

For video presentations and tutorials, see Getting Started with Amazon Rekognition. For more information about Amazon Rekognition, see Amazon Rekognition Documentation.

About the Authors

Siniša Mikašinović is a Senior Solutions Architect at AWS Luxembourg, covering Central and Eastern Europe—a region full of opportunities, talented and innovative developers, ISVs, and startups. He helps customers adopt AWS services as well as acquire new skills, learn best practices, and succeed globally with the power of AWS. His areas of expertise are Game Tech and Microsoft on AWS. Siniša is a PowerShell enthusiast, a gamer, and a father of a small and very loud boy. He flies under the flags of Croatia and Serbia.

Cameron Peron is Senior Marketing Manager for AWS Amazon Rekognition and the AWS AI/ML community. He evangelizes how AI/ML innovation solves complex challenges facing community, enterprise, and startups alike. Out of the office, he enjoys staying active with kettlebell-sport and spending time with his family and friends, and he is an avid fan of Euro-league basketball.

Securing Amazon SageMaker Studio connectivity using a private VPC

October 22, 2020

by Rafael Suguiura Amazon AWS

Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). With a single click, data scientists and developers can quickly spin up Amazon SageMaker Studio Notebooks for exploring datasets and building models. With the new ability to launch Amazon SageMaker Studio in your Amazon Virtual Private Cloud (Amazon VPC), you can control the data flow from your Amazon SageMaker Studio notebooks. This allows you to restrict internet access, monitor and inspect traffic using standard AWS networking and security capabilities, and connect to other AWS resources through AWS PrivateLink or VPC endpoints.

In this post, we explore how the Amazon SageMaker Studio VPC connectivity works, implement a sample architecture, and demonstrate some security controls in action.

Solution overview

When experimenting with and deploying ML workflows, you need access to multiple resources, such as libraries, packages, and datasets. If you’re in a highly regulated industry, controlling access to these resources is a paramount requirement. Amazon SageMaker Studio allows you to implement security in depth, with features such as data encryption, AWS Identity and Access Management (IAM), and AWS Single Sign-On (AWS SSO) integration. The ability to launch Amazon SageMaker Studio in your own private VPC adds another layer of security.

Amazon SageMaker Studio runs on an environment managed by AWS. When launching a new Studio domain, the parameter AppNetworkAccessType defines the external connectivity for that domain. Previously, the only option available for this parameter was PublicInternetOnly, meaning the traffic from the notebook flowed through an AWS managed internet gateway, as described in the following diagram.

The Amazon Elastic File System (Amazon EFS) volumes that store the Studio users’ home directories reside in the customer VPC, even when AppNetworkAccessType=PublicInternetOnly. You can optionally specify which VPC and subnet to use.

With the newly introduced feature to launch Studio in your VPC, you can set the AppNetworkAccessType parameter to VpcOnly. This launches Studio inside the specified VPC, communicating with the domain through an elastic network interface (ENI). You can apply security groups to that ENI to enforce a first layer of security control.
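As an illustration of where this parameter fits, the following sketch assembles the arguments for the SageMaker `CreateDomain` API. The helper name and all IDs are placeholders. In the API itself, the internet-facing mode is spelled `PublicInternetOnly` and the VPC mode is `VpcOnly`.

```python
def studio_domain_request(domain_name, vpc_id, subnet_ids, security_group_ids,
                          execution_role_arn, vpc_only=True):
    """Build keyword arguments for a CreateDomain call."""
    return {
        "DomainName": domain_name,
        "AuthMode": "IAM",
        "DefaultUserSettings": {
            "ExecutionRole": execution_role_arn,
            # Security groups applied to the ENI that Studio uses inside the VPC
            "SecurityGroups": list(security_group_ids),
        },
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        "AppNetworkAccessType": "VpcOnly" if vpc_only else "PublicInternetOnly",
    }

# Placeholder identifiers for illustration only
request = studio_domain_request(
    "studio-vpc-demo",
    "vpc-0123456789abcdef0",
    ["subnet-0123456789abcdef0"],
    ["sg-0123456789abcdef0"],
    "arn:aws:iam::111122223333:role/studio-execution-role",
)
print(request["AppNetworkAccessType"])

# With credentials configured, the request could be sent like this:
#   import boto3
#   boto3.client("sagemaker").create_domain(**request)
```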

You can also use VPC endpoints to establish a private connection between the Studio domain and other AWS services, such as Amazon Simple Storage Service (Amazon S3) for data storage and Amazon CloudWatch for logging and monitoring, without requiring internet connectivity. VPC endpoints can impose additional networking controls such as VPC endpoint IAM policies that may, for example, only allow traffic to certain S3 buckets. The following diagram illustrates this architecture.

Prerequisites

Before getting started, make sure you have the following prerequisites:

An AWS account
An IAM user or role with administrative access
Curiosity

Setting up your environment

To better understand how the feature works, we provide an AWS CloudFormation template to set up a basic environment where you can experiment with Amazon SageMaker Studio running inside a VPC. After deployment, the environment looks like the following diagram.

This template deploys the following resources in your account:

A new VPC, with a private subnet and security group. Because communication occurs across multiple Studio resources, the security group applied to the Studio ENI should allow inbound traffic from itself.
An encrypted S3 bucket, with bucket policies restricting access to our S3 endpoint.
VPC endpoints with policies for access control:
- We use an Amazon S3 endpoint to demonstrate the ability to limit traffic to specific S3 buckets.
- Because Studio has its traffic routed through the VPC, access to supporting services needs to be provisioned through VPC endpoints. Amazon CloudWatch Logs allows Studio to push logs generated by the service. We need an Amazon SageMaker API endpoint to launch Studio notebooks, training jobs, processing jobs, and deploy endpoints, and an Amazon SageMaker RunTime endpoint for services to call the Amazon SageMaker inference endpoint.
An IAM execution role. This role is assigned to Amazon SageMaker and defines which access permissions Studio has.
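As a rough illustration of two items in the list above, the following Python sketch derives the VPC endpoint service names for the four services mentioned and builds a self-referencing ingress rule for the security group. It assumes the standard `com.amazonaws.<region>.<service>` naming convention; the helper names are hypothetical and are not taken from the CloudFormation template.

```python
# Sketch only: helper names are illustrative, not part of the template.

# Service suffixes for the endpoints described above (S3, CloudWatch Logs,
# SageMaker API, SageMaker Runtime).
SERVICE_SUFFIXES = {
    "Amazon S3": "s3",
    "CloudWatch Logs": "logs",
    "SageMaker API": "sagemaker.api",
    "SageMaker Runtime": "sagemaker.runtime",
}


def endpoint_service_names(region):
    """Return the VPC endpoint service name for each supporting service."""
    return {
        label: f"com.amazonaws.{region}.{suffix}"
        for label, suffix in SERVICE_SUFFIXES.items()
    }


def self_referencing_ingress(security_group_id):
    """Build an IpPermissions list that allows all traffic from the group itself.

    The structure matches what the EC2 AuthorizeSecurityGroupIngress API
    accepts; "-1" means all protocols.
    """
    return [
        {
            "IpProtocol": "-1",
            "UserIdGroupPairs": [{"GroupId": security_group_id}],
        }
    ]
```

The ingress list could be passed as the `IpPermissions` argument when authorizing ingress on the same group, which is one way to express "allow inbound traffic from itself."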

To set up your environment, click on the link below. The template is also available at this GitHub repo.

Creating an Amazon SageMaker Studio domain inside a VPC

With the infrastructure in place, you’re ready to create an Amazon SageMaker Studio domain and assign it to a VPC.

For more information about the options available to set up Studio, see Onboard to Amazon SageMaker Studio. If you have an existing domain, you might want to delete it and recreate it, or create a separate one.

To create the domain, you can use the following:

The AWS Command Line Interface (AWS CLI). For instructions, see create-domain.
The AWS SDK. For instructions, see CreateDomain.
The AWS Management Console.

To use the console to create a Studio domain and tie it to the VPC infrastructure deployed by the template, complete the following steps:

On the Amazon SageMaker console, choose SageMaker Studio.

If you haven’t created a domain yet, the setup screen appears.

For Get Started, select Standard setup.
For Authentication method, select AWS Identity and Access Management (IAM).
For Execution role for all users, choose your notebook IAM role (the default is studiovpc-notebook-role).
In the Network section, for VPC, choose your VPC (the default is studiovpc-vpc).
For Subnet, choose your subnet (the default is studiovpc-private-subnet).

Make sure not to choose studiovpc-endpoint-private-subnet.

For Network Access for Studio, select VPC Only.

Choose Submit.

To create and link the domain with the AWS CLI, enter the following code. The option --app-network-access-type VpcOnly links the domain to our VPC. The VPC and subnets are set by the --vpc-id and --subnet-ids options, and the execution role and security group are set by the --default-user-settings option.

# Replace the variables below according to your environment
REGION= # AWS Region where the domain will be created
AWS_ACCOUNT_ID= # AWS account ID
VPC_DOMAIN_NAME= # Name for your domain

# The values below are listed on the Outputs tab of the CloudFormation stack from the previous step
VPC_ID=
PRIVATE_SUBNET_IDS=
SECURITY_GROUP=
EXECUTION_ROLE_ARN=

# Create the domain (each line ends with a backslash so the command continues)
aws sagemaker create-domain \
    --region $REGION \
    --domain-name $VPC_DOMAIN_NAME \
    --vpc-id $VPC_ID \
    --subnet-ids $PRIVATE_SUBNET_IDS \
    --app-network-access-type VpcOnly \
    --auth-mode IAM \
    --default-user-settings "ExecutionRole=${EXECUTION_ROLE_ARN},SecurityGroups=${SECURITY_GROUP}"

# Note the DomainArn in the output - we use it in the next step
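For readers who prefer the AWS SDK, here is a minimal Python sketch of the same request, assuming boto3 is installed and credentials are configured. The function names are hypothetical; the request keys mirror the CLI options above, and the actual call is kept in a separate function so the request can be inspected before anything is sent.

```python
def build_create_domain_request(domain_name, vpc_id, subnet_ids,
                                security_group_id, execution_role_arn):
    """Build the CreateDomain parameters that mirror the CLI options above."""
    return {
        "DomainName": domain_name,
        "AuthMode": "IAM",
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        # VpcOnly routes Studio traffic through the specified VPC.
        "AppNetworkAccessType": "VpcOnly",
        "DefaultUserSettings": {
            "ExecutionRole": execution_role_arn,
            "SecurityGroups": [security_group_id],
        },
    }


def create_domain(region, request):
    """Send the request and return the DomainArn (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("sagemaker", region_name=region)
    return client.create_domain(**request)["DomainArn"]
```

Separating the builder from the call is a design choice for this sketch: it makes it easy to review that AppNetworkAccessType is VpcOnly before creating the domain.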

Creating a user profile

Now that the domain is created, we need to create a user profile. You can create multiple user profiles associated with a single domain.

To create your user profile on the console, complete the following steps:

On the Amazon SageMaker Studio console, choose Control Panel.
Choose Add user profile.
For User name, enter a name (for example, demo-user).
For Execution role, choose your IAM role (the default is studiovpc-notebook-role).

To create your user profile with the AWS CLI, enter the following code:

# Replace the variables below according to your environment
DOMAIN_ID= # From the previous step
USER_PROFILE_NAME= # Name for your user profile

# Create the profile
aws sagemaker create-user-profile \
    --region $REGION \
    --domain-id $DOMAIN_ID \
    --user-profile-name $USER_PROFILE_NAME
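The domain creation step prints a DomainArn, while the profile step expects a domain ID. The following sketch extracts the ID, assuming the ARN's resource part has the form `domain/<domain-id>`; both that shape and the helper name are assumptions for illustration, so verify against your own output.

```python
def domain_id_from_arn(domain_arn):
    """Extract the domain ID from an ARN shaped like
    arn:aws:sagemaker:<region>:<account-id>:domain/<domain-id> (assumed shape)."""
    parts = domain_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {domain_arn!r}")
    kind, _, domain_id = parts[5].partition("/")
    if kind != "domain" or not domain_id:
        raise ValueError(f"Not a Studio domain ARN: {domain_arn!r}")
    return domain_id
```

The returned value is what you would assign to DOMAIN_ID in the CLI block above.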

Accessing Amazon SageMaker Studio

We now have a Studio domain associated with our VPC and a user profile in this domain. Now we need to give access to the user. To do so, we create a pre-signed URL.

To use the console, on the Studio Control Panel, locate your user name and choose Open Studio.

To use the AWS CLI, enter the following code:

# Create the pre-signed URL
aws sagemaker create-presigned-domain-url \
    --region $REGION \
    --domain-id $DOMAIN_ID \
    --user-profile-name $USER_PROFILE_NAME

# Take note of the returned URL and paste it into a browser that has VPC connectivity
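The same step can be sketched with the AWS SDK for Python. This is a minimal example assuming boto3 and configured credentials; the function names and the one-hour session duration are illustrative choices, not values from the post.

```python
def build_presigned_url_request(domain_id, user_profile_name,
                                session_seconds=3600):
    """Build the CreatePresignedDomainUrl parameters.

    session_seconds is an illustrative value for how long the Studio
    session should last once the URL has been opened.
    """
    return {
        "DomainId": domain_id,
        "UserProfileName": user_profile_name,
        "SessionExpirationDurationInSeconds": session_seconds,
    }


def create_presigned_url(region, request):
    """Return the pre-signed URL (requires boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    client = boto3.client("sagemaker", region_name=region)
    return client.create_presigned_domain_url(**request)["AuthorizedUrl"]
```

Because the URL grants access to the user profile, treat it like a short-lived credential and avoid writing it to logs.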

At this point, our deployment looks like the following diagram.

We made it! Now you can use your browser to connect to the Amazon SageMaker Studio domain. After a few minutes, Studio finishes creating your environment and you’re greeted with the launcher screen (see the following screenshot).

Security controls

Some examples of security best practices are Amazon S3 access control and limiting internet ingress and egress. In this section, we see how to implement them in combination with running Amazon SageMaker Studio in a private VPC.

Amazon S3 access control

Developing ML models requires access to sensitive data stored on specific S3 buckets. You might want to implement controls to guarantee that:

Only specific Studio domains can access these buckets
Each Studio domain has access only to the defined S3 buckets

We can achieve this using the sample architecture provided in the CloudFormation template.

Our CloudFormation template created an S3 bucket with the following S3 bucket policy attached to it. The condition StringNotEquals evaluates the VPC endpoint ID with the effect set to deny, meaning that access to the S3 bucket is denied if the access doesn’t come from the designated VPC endpoint. You can find your specific bucket name on the AWS CloudFormation console, on the Outputs tab for the stack.

{
    "Version": "2008-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>/*",
                "arn:aws:s3:::<s3-bucket-name>"
            ],
            "Condition": {
                "StringNotEquals": {
                    "aws:sourceVpce": "<s3-vpc-endpoint-id>"
                }
            }
        }
    ]
}

The Amazon S3 VPC endpoint also has a policy attached to it. This policy only allows access to the S3 bucket created by AWS CloudFormation:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": "*",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::<s3-bucket-name>",
                "arn:aws:s3:::<s3-bucket-name>/*"
            ]
        }
    ]
}

This combination of S3 bucket policy and VPC endpoint policy, together with Studio VPC connectivity, establishes that Studio can only access the referenced S3 bucket, and this S3 bucket can only be accessed from the VPC endpoint.
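To make the interaction of the two policies concrete, the following Python sketch models them in a deliberately simplified way. It is not the IAM policy evaluation engine: it assumes the caller's IAM identity policy already grants the action and models only the endpoint policy's Allow and the bucket policy's Deny. All names and IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Actions listed in both policies above.
POLICY_ACTIONS = {"s3:GetObject", "s3:PutObject", "s3:ListBucket"}


@dataclass(frozen=True)
class S3Request:
    bucket: str
    action: str
    source_vpce: Optional[str]  # None models a request that bypasses the endpoint


def endpoint_policy_allows(req, protected_bucket):
    """Endpoint policy: Allow only the listed actions on the protected bucket."""
    return req.bucket == protected_bucket and req.action in POLICY_ACTIONS


def bucket_policy_denies(req, protected_bucket, designated_vpce):
    """Bucket policy: Deny when the request does not come from the designated
    endpoint. A missing aws:sourceVpce also satisfies StringNotEquals."""
    return (
        req.bucket == protected_bucket
        and req.action in POLICY_ACTIONS
        and req.source_vpce != designated_vpce
    )


def is_allowed(req, protected_bucket, designated_vpce):
    via_endpoint = req.source_vpce == designated_vpce
    if via_endpoint and not endpoint_policy_allows(req, protected_bucket):
        return False  # no matching Allow in the endpoint policy
    if bucket_policy_denies(req, protected_bucket, designated_vpce):
        return False  # explicit Deny in the bucket policy
    return True  # assumes the IAM identity policy grants the action
```

Under this model, a request from Studio to the protected bucket passes, a request from Studio to any other bucket fails at the endpoint, and a request to the protected bucket from outside the VPC fails at the bucket policy.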

To test it, open a notebook in Studio and try to copy a file into your S3 bucket. The following screenshot shows that it works as expected.

If you try the same with a different S3 bucket, you should get a permission denied error.

If you try to access the bucket from outside Studio, you should also get a permission error.

Limiting internet ingress and egress

To develop ML models, data scientists often need access to public code repos or Python packages (for example, from PyPI) to explore data and train models. If you need to restrict access to only approved datasets and libraries, you need to restrict internet access. In our sample architecture, we achieve this by using a private subnet on our VPC, without an internet gateway or NAT gateway deployed.

We can test this by trying to clone a public repository containing Amazon SageMaker example notebooks.

In your Studio environment, open a notebook and enter the following code:

! git clone https://github.com/awslabs/amazon-sagemaker-examples.git

You can also run it in your notebook directly.

As expected, the connection times out.
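If you prefer a quicker check than waiting for git to time out, a small connectivity probe can be run from a notebook cell. This sketch uses only the Python standard library; the function name and the five-second timeout are illustrative choices.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS resolution failures.
        return False


# Example usage from a notebook cell (not executed here):
# can_connect("github.com", 443)
```

In the sample architecture, which has no internet gateway or NAT gateway, we would expect the probe against a public host to return False.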

If you want to provide internet access through your VPC, add an internet gateway, a NAT gateway for the private subnet, and the proper routing entries. The internet traffic flows through your VPC, and you can implement other security controls such as inline inspections with a firewall or internet proxy. For more information, see Understanding Amazon SageMaker notebook instance networking configurations and advanced routing options.

Cleaning up

To avoid incurring future charges, delete the resources you created:

Shut down the notebooks you started with Studio.
If desired, delete the Studio domain.
On the AWS CloudFormation console, delete the CloudFormation stack.

Conclusion

You can use Amazon SageMaker Studio to streamline developing, experimenting with, training, and deploying ML models. With the new ability to launch Studio inside a VPC, regulated industries such as financial services, healthcare, and others with strict security requirements can use Studio while meeting their enterprise security needs.

Go test this new feature and let us know what you think. For more information, see the security section of the Amazon SageMaker documentation.

About the Authors

Rafael Suguiura is a Principal Solutions Architect at Amazon Web Services. He guides some of the world’s largest financial services companies in their cloud journey. When the weather is nice, he enjoys cycling and finding new hiking trails, and when it’s not, he catches up on sci-fi books, TV series, and video games.

Stefan Natu is a Sr. Machine Learning Specialist at Amazon Web Services. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.

Han Zhang is a Software Development Engineer at Amazon Web Services. She is part of the launch team for Amazon SageMaker Notebooks and Amazon SageMaker Studio, and has been focusing on building secure machine learning environments for customers. In her spare time, she enjoys hiking and skiing in the Pacific Northwest.

Copyright © 2019-2025 Vedere AI. All Rights Reserved.