How Sportradar used the Deep Java Library to build production-scale ML platforms for increased performance and efficiency

This is a guest post co-written with Fred Wu from Sportradar.

Sportradar is the world’s leading sports technology company, operating at the intersection of sports, media, and betting. More than 1,700 sports federations, media outlets, betting operators, and consumer platforms across 120 countries rely on Sportradar’s know-how and technology to boost their business.

Sportradar uses data and technology to:

  • Keep betting operators ahead of the curve with the products and services they need to manage their sportsbook
  • Give media companies the tools to engage more with fans
  • Give teams, leagues, and federations the data they need to thrive
  • Keep the industry clean by detecting and preventing fraud, doping, and match fixing

This post demonstrates how Sportradar used Amazon’s Deep Java Library (DJL) on AWS alongside Amazon Elastic Kubernetes Service (Amazon EKS) and Amazon Simple Storage Service (Amazon S3) to build a production-ready machine learning (ML) inference solution that preserves essential tooling in Java, optimizes operational efficiency, and increases the team’s productivity by providing better performance and accessibility to logs and system metrics.

The DJL is a deep learning framework built from the ground up to support users of Java and JVM languages like Scala, Kotlin, and Clojure. Most deep learning frameworks today are built for Python, which leaves behind the large number of Java developers and teams with existing Java code bases who want to integrate the increasingly powerful capabilities of deep learning. With the DJL, that integration is simple.

In this post, the Sportradar team discusses the challenges they encountered and the solutions they created to build their model inference platform using the DJL.

Business requirements

We are the US squad of the Sportradar AI department. Since 2018, our team has been developing a variety of ML models to enable betting products for NFL and NCAA football. We recently developed four new models.

The fourth down decision models for the NFL and NCAA predict the probabilities of the outcome of a fourth down play. The outcome could be a field goal attempt, a regular offensive play (going for it), or a punt.

The drive outcome models for the NFL and NCAA predict the probabilities of the outcome of the current drive. A drive outcome could be an end of half, field goal attempt, touchdown, turnover, turnover on downs, or punt.

Our models are the building blocks of other models with which we generate a list of live betting markets, including spread, total, win probability, next score type, next team to score, and more.

The business requirements for our models are as follows:

  • The model predictor should be able to load the pre-trained model file one time, then make predictions on many plays
  • We have to generate the probabilities for each play with under 50-millisecond latency
  • The model predictor (feature extraction and model inference) has to be written in Java, so that the other team can import it as a Maven dependency

Challenges with the existing system

The main challenge was how to bridge the gap between model training in Python and model inference in Java. Our data scientists train the models in Python using tools like PyTorch and save them as PyTorch scripts. Our original plan was to also host the models in Python and use gRPC to communicate with another service, which would use the Java gRPC client to send requests.

However, this solution came with a few issues. Mainly, we saw network overhead between the two services running in separate runtime environments or pods, which resulted in higher latency. But the maintenance overhead was the main reason we abandoned this solution. We had to build both the gRPC server and the client program separately and keep the protocol buffer files consistent and up to date. Then we needed to Dockerize the application, write a deployment YAML file, deploy the gRPC server to our Kubernetes cluster, and make sure it was reliable and able to auto scale.

Another problem was whenever an error occurred on the gRPC server side, the application client only got a vague error message instead of a detailed error traceback. The client had to reach out to the gRPC server maintainer to learn exactly which part of the code caused the error.

Ideally, we instead wanted to load the model PyTorch scripts, extract the features from the model input, and run model inference entirely in Java. We could then build and publish it as a Maven library, hosted on our internal registry, which our service team could import into their own Java projects. When we did our research online, the Deep Java Library showed up at the top of the results. After reading a few blog posts and DJL’s official documentation, we were sure DJL would provide the best solution to our problem.

Solution overview

The following diagram compares the previous and updated architecture.

The following diagram outlines the workflow of the DJL solution.


The steps are as follows:

  1. Training the models – Our data scientists train the models using PyTorch and save the models as torch scripts. These models are then pushed to an Amazon Simple Storage Service (Amazon S3) bucket using DVC, a version control tool for ML models.
  2. Implementing feature extraction and feeding ML features – The framework team pulls the models from Amazon S3 into a Java repository where they implement feature extraction and feed ML features into the predictor. They use the DJL PyTorch engine to initialize the model predictor.
  3. Packaging and publishing the inference code and models – The GitLab CI/CD pipeline packages and publishes the JAR file that contains the inference code and models to an internal Apache Archiva registry.
  4. Importing the inference library and making calls – The Java client imports the inference library as a Maven dependency. All inference calls are made via Java function calls within the same Kubernetes pod, with no gRPC hop in between.

We have seen a stable inferencing runtime and reliable prediction results. The DJL solution offers several advantages over gRPC-based solutions:

  • Improved response time – With no gRPC calls, the inferencing response time is improved
  • Easy rollbacks and upgrades – The Java client can easily roll back the inference library to a previous version or upgrade to a new version
  • Transparent error tracking – In the DJL solution, the client receives detailed error traceback messages in case of inferencing errors

Deep Java Library overview

The DJL is a full deep learning framework that supports the deep learning lifecycle from building a model and training it on a dataset to deploying it in production. It has intuitive helpers and utilities for modalities like computer vision, natural language processing, audio, time series, and tabular data. The DJL also features a model zoo of hundreds of pre-trained models that can be used out of the box and integrated into existing systems.

It is also a fully Apache 2.0-licensed open-source project and can be found on GitHub. The DJL was created at Amazon and open-sourced in 2019. Today, DJL’s open-source community is led by Amazon and has grown to include contributors from many countries, companies, and educational institutions. The DJL continues to grow in its ability to support different hardware, models, and engines. It also includes support for newer hardware like ARM (both in servers like AWS Graviton and laptops with Apple M1) and AWS Inferentia.

The architecture of DJL is engine agnostic. It aims to be an interface describing what deep learning could look like in the Java language, but leaves room for multiple different implementations that could provide different capabilities or hardware support. Most popular frameworks today such as PyTorch and TensorFlow are built using a Python front end that connects to a high-performance C++ native backend. The DJL can use this to connect to these same native backends to take advantage of their work on hardware support and performance.

For this reason, many DJL users also use it for inference only. That is, they train a model using Python and then load it using the DJL for deployment as part of their existing Java production system. Because the DJL utilizes the same native engine that powers Python, it’s able to run without any decrease in performance or loss in accuracy. This is exactly the strategy we adopted to support the new models.

The following diagram illustrates the workflow under the hood.


When the DJL loads, it finds all the engine implementations available in the class path using Java’s ServiceLoader. In this case, it detects the DJL PyTorch engine implementation, which will act as the bridge between the DJL API and the PyTorch Native.

The engine then works to load the PyTorch Native. By default, it downloads the appropriate native binary based on your OS, CPU architecture, and CUDA version, making it almost effortless to use. You can also provide the binary using one of the many available native JAR files, which are more reliable for production environments that often have limited network access for security.

Once loaded, the DJL uses the Java Native Interface to translate the high-level DJL API calls into the equivalent low-level native calls. Every operation in the DJL API is hand-crafted to best fit Java conventions and make it easily accessible. This also includes dealing with native memory, which is not managed by the Java garbage collector.
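
For example, NDArray data lives in native (off-heap) memory and is scoped to an NDManager, so closing the manager releases that memory deterministically instead of waiting on the garbage collector. The following minimal sketch (not taken from the Sportradar code base) illustrates the pattern:

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDManager;

public class NativeMemoryExample {

    public static void main(String[] args) {
        // NDArray data is allocated off-heap, so it is tied to the lifetime of an NDManager
        try (NDManager manager = NDManager.newBaseManager()) {
            NDArray features = manager.create(new float[] {0.2f, 0.5f, 0.3f});
            NDArray doubled = features.mul(2);
            System.out.println(doubled);
        } // closing the manager frees every native array it owns
    }
}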

Although all these details are within the library, calling it from a user standpoint couldn’t be easier. In the following section, we walk through this process.

How Sportradar implemented DJL

Because we train our models using PyTorch, we use the DJL’s PyTorch engine for the model inference.

Loading the model is incredibly easy. All it takes is building a Criteria object describing the model to load and where it is from. Then, we load it and use the model to create a new predictor session.

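A minimal sketch of this step, assuming a hypothetical local model directory and the MyTranslator shown in the next snippet, looks like the following:

import java.nio.file.Paths;

import ai.djl.inference.Predictor;
import ai.djl.modality.Classifications;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class FourthDownInference {

    public static void main(String[] args) throws Exception {
        // Describe what to load: input/output types, engine, model location, and translator
        Criteria<float[], Classifications> criteria = Criteria.builder()
                .setTypes(float[].class, Classifications.class)
                .optEngine("PyTorch")
                .optModelPath(Paths.get("models/"))   // hypothetical directory holding the TorchScript file
                .optModelName("fourth_down")          // hypothetical model name
                .optTranslator(new MyTranslator())
                .build();

        // Load the TorchScript model once, then reuse the predictor for many plays
        try (ZooModel<float[], Classifications> model = criteria.loadModel();
             Predictor<float[], Classifications> predictor = model.newPredictor()) {

            float[] playFeatures = {0.1f, 0.7f, 0.2f}; // placeholder feature vector
            Classifications outcome = predictor.predict(playFeatures);
            System.out.println(outcome);
        }
    }
}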

For our model, we also have a custom translator, which we call MyTranslator. We use the translator to encapsulate the preprocessing code that converts from a convenient Java type into the input expected by the model, and the postprocessing code that converts the model output into a convenient output. In our case, we chose to use a float[] as the input type and the built-in DJL Classifications as the output type.
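
A representative sketch of such a translator (with hypothetical outcome labels standing in for the real ones) follows:

import java.util.Arrays;
import java.util.List;

import ai.djl.modality.Classifications;
import ai.djl.ndarray.NDList;
import ai.djl.translate.Batchifier;
import ai.djl.translate.Translator;
import ai.djl.translate.TranslatorContext;

public class MyTranslator implements Translator<float[], Classifications> {

    // Hypothetical outcome labels; the real model defines its own classes
    private static final List<String> OUTCOMES =
            Arrays.asList("field_goal_attempt", "play", "punt");

    @Override
    public NDList processInput(TranslatorContext ctx, float[] input) {
        // Wrap the extracted features in an NDArray owned by the context's NDManager
        return new NDList(ctx.getNDManager().create(input));
    }

    @Override
    public Classifications processOutput(TranslatorContext ctx, NDList list) {
        // Pair the model's probability vector with the outcome labels
        return new Classifications(OUTCOMES, list.singletonOrThrow());
    }

    @Override
    public Batchifier getBatchifier() {
        return Batchifier.STACK;
    }
}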

It’s pretty amazing that with just a few lines of code, the DJL loads the PyTorch scripts and our custom translator, and then the predictor is ready to make the predictions.

Conclusion

Sportradar’s product built on the DJL solution went live before the 2022–23 NFL regular season started, and it has been running smoothly since then. In the future, Sportradar plans to re-platform existing models hosted on gRPC servers to the DJL solution.

The DJL continues to grow in many different ways. The most recent release, v0.21.0, has many improvements, including updated engine support, improvements on Spark, Hugging Face batch tokenizers, an NDScope for easier memory management, and enhancements to the time series API. It also has the first major release of DJL Zero, a new API aiming to allow support for both using pre-trained models and training your own custom deep learning models even with zero knowledge of deep learning.

The DJL also features a model server called DJL Serving. It makes it simple to host a model on an HTTP server from any of the 10 supported engines, including the Python engine to support Python code. The v0.21.0 release of DJL Serving includes faster transformer support, Amazon SageMaker multi-model endpoint support, updates for Stable Diffusion, improvements for DeepSpeed, and updates to the management console. You can now use it to deploy large models with model parallel inference using DeepSpeed and SageMaker.

There is also much more coming in the DJL. The largest area under development is large language model support for models like ChatGPT or Stable Diffusion. There is also work to support streaming inference requests in DJL Serving, along with improvements to the demos and the Spark extension. And of course, standard ongoing work continues, including new features, fixes, engine updates, and more.

For more information on the DJL and its other features, see Deep Java Library.

Follow our GitHub repo, demo repository, Slack channel, and Twitter for more documentation and examples of the DJL!


About the authors

Fred Wu is a Senior Data Engineer at Sportradar, where he leads infrastructure, DevOps, and data engineering efforts for various NBA and NFL products. With extensive experience in the field, Fred is dedicated to building robust and efficient data pipelines and systems to support cutting-edge sports analytics.

Zach Kimberg is a Software Developer in the Amazon AI org. He works to enable the development, training, and production inference of deep learning. There, he helped found and continues to develop the DeepJavaLibrary project.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Financial text generation using a domain-adapted fine-tuned large language model in Amazon SageMaker JumpStart

Large language models (LLMs) with billions of parameters are currently at the forefront of natural language processing (NLP). These models are shaking up the field with their incredible abilities to generate text, analyze sentiment, translate languages, and much more. With access to massive amounts of data, LLMs have the potential to revolutionize the way we interact with language. Although LLMs are capable of performing various NLP tasks, they are considered generalists and not specialists. In order to train an LLM to become an expert in a particular domain, fine-tuning is usually required.

One of the major challenges in training and deploying LLMs with billions of parameters is their size, which can make it difficult to fit them into single GPUs, the hardware commonly used for deep learning. The sheer scale of these models requires high-performance computing resources, such as specialized GPUs with large amounts of memory. Additionally, the size of these models can make them computationally expensive, which can significantly increase training and inference times.

In this post, we demonstrate how we can use Amazon SageMaker JumpStart to easily fine-tune a large language text generation model on a domain-specific dataset in the same way you would train and deploy any model on Amazon SageMaker. In particular, we show how you can fine-tune the GPT-J 6B language model for financial text generation using both the JumpStart SDK and Amazon SageMaker Studio UI on a publicly available dataset of SEC filings.

JumpStart helps you quickly and easily get started with machine learning (ML) and provides a set of solutions for the most common use cases that can be trained and deployed readily with just a few steps. All the steps in this demo are available in the accompanying notebook Fine-tuning text generation GPT-J 6B model on a domain specific dataset.

Solution overview

In the following sections, we provide a step-by-step demonstration for fine-tuning an LLM for text generation tasks via both the JumpStart Studio UI and Python SDK. In particular, we discuss the following topics:

  • An overview of the SEC filing data in the financial domain that the model is fine-tuned on
  • An overview of the LLM GPT-J 6B model we have chosen to fine-tune
  • A demonstration of two different ways we can fine-tune the LLM using JumpStart:
    • Use JumpStart programmatically with the SageMaker Python SDK
    • Access JumpStart using the Studio UI
  • An evaluation of the fine-tuned model by comparing it with the pre-trained model without fine-tuning

Fine-tuning refers to the process of taking a pre-trained language model and training it for a different but related task using specific data. This approach is also known as transfer learning, which involves transferring the knowledge learned from one task to another. LLMs like GPT-J 6B are trained on massive amounts of unlabeled data and can be fine-tuned on smaller datasets, making the model perform better in a specific domain.

As an example of how performance improves when the model is fine-tuned, consider asking it the following question:

“What drives sales growth at Amazon?”

Without fine-tuning, the response would be:

“Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world”

With fine-tuning, the response is:

“Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.”

The improvement from fine-tuning is evident.

We use financial text from SEC filings to fine-tune a GPT-J 6B LLM for financial applications. In the next sections, we introduce the data and the LLM that will be fine-tuned.

SEC filing dataset

SEC filings are critical for regulation and disclosure in finance. Filings notify the investor community about companies’ business conditions and the future outlook of the companies. The text in SEC filings covers the entire gamut of a company’s operations and business conditions. Because of their potential predictive value, these filings are good sources of information for investors. Although these SEC filings are publicly available to anyone, downloading parsed filings and constructing a clean dataset with added features is a time-consuming exercise. We make this possible in a few API calls in the JumpStart Industry SDK.

Using the SageMaker API, we downloaded annual reports (10-K filings; see How to Read a 10-K for more information) for a large number of companies. We selected Amazon’s SEC filing reports for the years 2021–2022 as the training data to fine-tune the GPT-J 6B model. In particular, we concatenated the company’s SEC filing reports from different years into a single text file, except for the “Management Discussion and Analysis” section, which contains forward-looking statements by the company’s management and is used as the validation data.

The expectation is that after fine-tuning the GPT-J 6B text generation model on the financial SEC documents, the model is able to generate insightful, finance-related textual output, and can therefore be used to solve multiple domain-specific NLP tasks.

GPT-J 6B large language model

GPT-J 6B is an open-source, 6-billion-parameter model released by EleutherAI. GPT-J 6B has been trained on a large corpus of text data and is capable of performing various NLP tasks such as text generation, text classification, and text summarization. Although this model is impressive on a number of NLP tasks without any fine-tuning, in many cases you will need to fine-tune the model on a specific dataset for the NLP task you are trying to solve. Use cases include custom chatbots, idea generation, entity extraction, classification, and sentiment analysis.

Access LLMs on SageMaker

Now that we have identified the dataset and the model we are going to fine-tune on, JumpStart provides two avenues to get started using text generation fine-tuning: the SageMaker SDK and Studio.

Use JumpStart programmatically with the SageMaker SDK

We now go over an example of how you can use the SageMaker JumpStart SDK to access an LLM (GPT-J 6B) and fine-tune it on the SEC filing dataset. Upon completion of fine-tuning, we deploy the fine-tuned model and run inference against it. All the steps in this post are available in the accompanying notebook: Fine-tuning text generation GPT-J 6B model on domain specific dataset.

In this example, JumpStart uses the SageMaker Hugging Face Deep Learning Container (DLC) and DeepSpeed library to fine-tune the model. The DeepSpeed library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware. It supports single node distributed training, utilizing gradient checkpointing and model parallelism to train large models on a single SageMaker training instance with multiple GPUs. With JumpStart, we integrate the DeepSpeed library with the SageMaker Hugging Face DLC for you and take care of everything under the hood. You can easily fine-tune the model on your domain-specific dataset without manually setting it up.

Fine-tune the pre-trained model on domain-specific data

To fine-tune a selected model, we need to get that model’s URI, as well as the training script and the container image used for training. To make things easy, these three inputs depend solely on the model name, version (for a list of the available models, see Built-in Algorithms with pre-trained Model Table), and the type of instance you want to train on. This is demonstrated in the following code snippet:

from sagemaker import image_uris, model_uris, script_uris, hyperparameters

model_id, model_version = "huggingface-textgeneration1-gpt-j-6b", "*"
training_instance_type = "ml.g5.12xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    model_id=model_id,
    model_version=model_version,
    image_scope="training",
    instance_type=training_instance_type,
)

# Retrieve the training script
train_source_uri = script_uris.retrieve(
    model_id=model_id, model_version=model_version, script_scope="training"
)

# Retrieve the pre-trained model tarball to further fine-tune
train_model_uri = model_uris.retrieve(
    model_id=model_id, model_version=model_version, model_scope="training"
)

We retrieve the model_id corresponding to the model we want to use. In this case, we fine-tune huggingface-textgeneration1-gpt-j-6b.

Defining hyperparameters involves setting the values for various parameters used during the training process of an ML model. These parameters can affect the model’s performance and accuracy. In the following step, we establish the hyperparameters by utilizing the default settings and specifying custom values for parameters such as epochs and learning_rate:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=model_id, model_version=model_version)

# [Optional] Override default hyperparameters with custom values
hyperparameters["epochs"] = "6"

hyperparameters["learning_rate"] = "2e-04"
print(hyperparameters)

JumpStart provides an extensive list of hyperparameters available to tune. The following list provides an overview of part of the key hyperparameters utilized in fine-tuning the model. For a full list of hyperparameters, see the notebook Fine-tuning text generation GPT-J 6B model on domain specific dataset.

  • epochs – Specifies the maximum number of epochs (full passes over the original dataset) to train for.
  • learning_rate – Controls the step size or learning rate of the optimization algorithm during training.
  • eval_steps – Specifies how many steps to run before evaluating the model on the validation set during training. The validation set is a subset of the data that is not used for training, but instead is used to check the performance of the model on unseen data.
  • weight_decay – Controls the regularization strength during model training. Regularization is a technique that helps prevent the model from overfitting the training data, which can result in better performance on unseen data.
  • fp16 – Controls whether to use fp16 16-bit (mixed) precision training instead of 32-bit training.
  • evaluation_strategy – The evaluation strategy used during training.
  • gradient_accumulation_steps – The number of update steps to accumulate gradients for before performing a backward/update pass.

For further details regarding hyperparameters, refer to the official Hugging Face Trainer documentation.

You can now fine-tune this JumpStart model on your own custom dataset using the SageMaker SDK. We use the SEC filing data we described earlier. The train and validation data is hosted under train_dataset_s3_path and validation_dataset_s3_path. The supported format of the data includes CSV, JSON, and TXT. For the CSV and JSON data, the text data is used from the column called text or the first column if no column called text is found. Because this is for text generation fine-tuning, no ground truth labels are required. The following code is an SDK example of how to fine-tune the model:

from sagemaker.estimator import Estimator
from sagemaker.utils import name_from_base
from sagemaker.tuner import HyperparameterTuner
from sagemaker.huggingface import HuggingFace

train_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"
validation_dataset_s3_path = "s3://jumpstart-cache-prod-us-west-2/training-datasets/tc/data.csv"

training_job_name = name_from_base(f"jumpstart-example-{model_id}")

metric_definitions=[
    {'Name': 'train:loss', 'Regex': "'loss': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:loss', 'Regex': "'eval_loss': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:runtime', 'Regex': "'eval_runtime': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+.[0-9]+)"},
    {'Name': 'eval:eval_steps_per_second', 'Regex': "'eval_steps_per_second': ([0-9]+.[0-9]+)"},
]

# Create SageMaker Estimator instance
tg_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    hyperparameters=hyperparameters,
    output_path=s3_output_location,
    base_job_name=training_job_name,
    enable_network_isolation=True,
    metric_definitions=metric_definitions
)

# Launch a SageMaker Training job by passing s3 path of the training data
tg_estimator.fit({"train": train_dataset_s3_path, "validation": validation_dataset_s3_path}, logs=True)

After we have set up the SageMaker Estimator with the required hyperparameters, we call the .fit method to start fine-tuning our model, passing it the Amazon Simple Storage Service (Amazon S3) URI for our training data. As you can see, the entry_point script provided is named transfer_learning.py (the same for other tasks and models), and the input data channels passed to .fit must be named train and validation.

JumpStart also supports hyperparameter optimization with SageMaker automatic model tuning. For details, see the example notebook.

Deploy the fine-tuned model

When training is complete, you can deploy your fine-tuned model. To do so, all we need to obtain is the inference script URI (the code that determines how the model is used for inference once deployed) and the inference container image URI, which includes an appropriate model server to host the model we chose. See the following code:

import boto3
import sagemaker
from sagemaker import image_uris
from sagemaker.utils import name_from_base

sagemaker_session = sagemaker.Session(boto_session=boto3.Session(region_name="us-west-2"))

# Instance type to host the fine-tuned model
inference_instance_type = "ml.g5.12xlarge"

# Retrieve the inference docker container uri
deploy_image_uri = image_uris.retrieve(
    region=None,
    framework=None,
    image_scope="inference",
    model_id=model_id,
    model_version=model_version,
    instance_type=inference_instance_type,
)

endpoint_name = name_from_base(f"jumpstart-example-{model_id}")

# Use the estimator from the previous step to deploy to a SageMaker endpoint
finetuned_predictor = tg_estimator.deploy(
    initial_instance_count=1,
    instance_type=inference_instance_type,
    image_uri=deploy_image_uri,
    endpoint_name=endpoint_name,
)

After a few minutes, our model is deployed and we can get predictions from it in real time!

Access JumpStart through the Studio UI

Another way to fine-tune and deploy JumpStart models is through the Studio UI. This UI provides a low-code/no-code solution to fine-tuning LLMs.

On the Studio console, choose Models, notebooks, solutions under SageMaker JumpStart in the navigation pane.

In the search bar, search for the model you want to fine-tune and deploy.

In our case, we chose the GPT-J 6B model card. Here we can directly fine-tune or deploy the LLM.

Model evaluation

When evaluating an LLM, we can use perplexity (PPL). PPL is a common measure of how well a language model is able to predict the next word in a sequence. In simpler terms, it’s a way to measure how well the model can understand and generate human-like language.

A lower perplexity score means that the model is shown to perform better at predicting the next word. In practical terms, we can use perplexity to compare different language models and determine which one performs better on a given task. We can also use it to track the performance of a single model over time. For more details, refer to Perplexity of fixed-length models.
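
Concretely, for a tokenized sequence x_1, …, x_N, perplexity is the exponentiated average negative log-likelihood the model assigns to each token:

\mathrm{PPL}(X) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\!\left(x_i \mid x_{<i}\right) \right)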

We evaluate the model’s performance through a comparison of its pre- and post-fine-tuning performance. PPL is emitted in the training job’s Amazon CloudWatch logs. In addition, we look at the output generated by the model in response to specific test prompts.

Evaluation metric on the validation data | Before fine-tuning | After fine-tuning
Perplexity (PPL) | 8.147 | 1.437

The model’s enhanced performance is demonstrated by two experiments:

  • Text generation – We show that the fine-tuned GPT-J generates text that closely resembles the SEC dataset, indicating that it has learned to emulate the language patterns and conventions of the financial regulatory domain
  • Zero-shot learning – We demonstrate that the fine-tuned GPT-J shows significantly better performance than the non-fine-tuned GPT-J on zero-shot question answering

For the text generation experiment, we use the following inference parameters to achieve our results: max_length of 400 (number of tokens returned), num_return_sequences of 1 (number of response text examples), top_k of 250 (pick from one of the top 250 possible continuations), top_p of 0.8 (sample only from the smallest set of continuations whose cumulative probability is at least 80%), do_sample set to True (sampling), and temperature of 1 (a higher temperature yields greater randomness in responses). For more details about the parameters, refer to GenerationConfig. The following examples compare the fine-tuned GPT-J and the non-fine-tuned model on a text completion task.

Inference example: “This Form 10-K report shows that”

Output from GPT-J 6B before fine-tuning:

This Form 10-K report shows that:

The Companys net income attributable to the Company for the year ended December 31, 2016 was $3,923,000, or $0.21 per diluted share, compared to $3,818,000, or $0.21 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $4,816,000, or $0.28 per diluted share, compared to $4,726,000, or $0.27 per diluted share, for the year ended December 31, 2015.

The Companys net income attributable to the Company for the year ended December 31, 2016 was $5,828,000, or $0.34 per diluted share, compared to $5,716,000, or $0.33 per diluted share, for the year ended December 31, 2015.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended December 31, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended September 30, 2008.

On August 21, 2009, the Company filed a Form 10-Q for the quarter ended March 31, 2009.

Output from GPT-J 6B after fine-tuning:
This Form 10-K report shows that:(1) The registrant’s name, address, and telephone number, are as follows:Amazon.com, Inc. c/o Ernst & Young LLPSeattle, Washington Attention: Corporate Secretary (2) For the purpose of Article III of the Securities Exchange Act of 1934, the registrant’s name and address are as follows:(3) The registrant’s Exchange Act reportable time period is from and includingJanuary 1, 2020 to the present.(4) The registrant’s subjective assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report.(5) The registrant’s management’s assessment is that the financial statements, and other financial information included in this report, fairlypresent in all material respects the financial condition, results of operations and cashflows of the registrant as of, and for, the periods presented in this report./s/ Ernst & Young LLPSeattle, WashingtonJanuary 29, 2020EX-31.1 5 amzn-20201231xex311.htm EX-31.1 DocumentExhibit 31.1CERTIFICATIONSI, Andrew R. Jassy, certify that:1. I have reviewed this Form 10-K of Amazon.com, Inc.;2. Based on my knowledge, this report does not contain any untrue statementof a material fact or omit to state a material fact necessary to make thestatements made, in light of the circumstances under which such statementswere made, not misleading with respect to the period covered by this report;
Inference example: “We serve consumers through”

Output from GPT-J 6B before fine-tuning:

We serve consumers through:

Our Blog

Are you a business owner looking to grow your business? Do you want to create a more effective and efficient business? Do you want to learn how to grow your business? If you answered “yes” to any of these questions, then you need to read this article.

The first step to growing your business is to understand your business. You must be able to answer the following questions about your business:

Who are your customers?

What is your product or service?

What is your unique selling proposition?

What is your marketing strategy?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

What is your sales forecast?

Output from GPT-J 6B after fine-tuning:
We serve consumers through: Amazon.com, a retailer of physical and digital products; Amazon Web Services, which provides technology infrastructure services that enable virtually any type of business; Amazon Enterprise Solutions, which provides companies with custom technology and services to enable better business processes and improve efficiency and productivity; and Audible, a digital audiobook publisher. We serve developers and enterprises of all sizes through AWS, which offers a broad set of global compute, storage, database, and other service offerings. We also manufacture and sell electronic devices. We have organized our operations into two principal segments: North America and International. See Item 8 of Part II, “”Financial Statements and Supplementary Data-Note 12-Segment Information.”” See Item 7 of Part II, “”Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplemental informationabout our net sales.ConsumersWe serve consumers through our online and physical stores and focus on selection,price, and convenience. We design our stores to enable hundreds of millions ofunique products to be sold by us and by third parties across dozens of productcategories. Customers access our offerings through our websites, mobile apps,Alexa, and physically visiting our stores. We also manufacture and sell Kindledevices. We strive to offer our customers the lowest prices possible throughlow everyday product pricing and shipping offers, and to improve ouroperating efficiencies so that we can continue to lower prices for ourcustomers. We also provide easy-to-use functionality, fast and reliablefulfillment, and timely customer service.In addition, we offer Amazon Prime, a membership program that includes unlimitedfree shipping on over 100 million items, access to unlimited streaming of tens ofthousands of movies and TV episodes, and other benefits.
Inference example: “This year we particularly focus on”

Output from GPT-J 6B before fine-tuning:

This year we particularly focus on:

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

The role of the health care system in the prevention of obesity and diabetes

Output from GPT-J 6B after fine-tuning:
This year we particularly focus on: -Continued international expansion, including the development of newbusinesses in existing international market segments, as well as thedevelopment of new international market segments; -Expanding our customer base, including increasing sales to existingcustomers and attracting new customers; -Developing new business opportunities and entering into commercialagreements, including through licensing and other arrangements, with othercompanies, or through the acquisition of other companies; -Developing technology that provides for enhanced customer service andefficiently delivers our products and services; -Developing new product and service offerings and incorporating these intoexisting and new product and service offerings; -Developing efficient and effective systems for supporting our business;and -Other business and strategic initiatives. We believe that offering low prices to our customers is fundamental to ourfuture success. One way we offer lower prices is through free-shipping offersthat result in a net cost to us in delivering products, and through membershipin Amazon Prime, which provides free-shipping on millions of items andaccess to movies and other content. We also offer other promotions that enableus to turn a lower net cost of sales.We have organized our operations into two principal segments: North Americaand International. See Item 8 of Part II, “”Financial Statements andSupplementary Data-Note 12-Segment Information.”” See Item 7 of Part II,””Management’s Discussion and Analysis of Financial Condition and Results ofOperations-Results of Operations-Supplemental Information”” for supplementalinformation about our net sales.ConsumersWe serve consumers through our retail websites and physical stores and focuson selection, price, and convenience.

For the second experiment, a zero-shot question answering task, we set max_length to a different value (40, 200, and 40, respectively) based on the output we expect from each question. The parameters top_k, top_p, do_sample, and temperature are set to 250, 0.8, False, and 0, respectively. We set temperature close to 0 to control the randomness of the output, whereas in the text generation task we set it close to 1 to make the model more creative. The following examples compare the results. For each question, the output from the fine-tuned GPT-J is more accurate than that from the non-fine-tuned model.

Input: “What is Item 7 of the 10-K SEC filing about?:”

Output from GPT-J 6B before fine-tuning:

What is Item 7 of the 10-K SEC filing about?:

Item 7 of the 10-K SEC filing is the following:

“The Company has not received any material adverse

Output from GPT-J 6B after fine-tuning:
Item 7. Management’s Discussion and Analysis of Financial Condition and Results ofOperations Forward-Looking Statements
Input: “How did the COVID-19 pandemic impact Amazon’s business?:”

Output from GPT-J 6B before fine-tuning:

How did the COVID-19 pandemic impact Amazon’s business?:

Amazon’s business is being impacted by the COVID-19 pandemic in a number of ways.

The company is seeing a decline in the number of Prime memberships, which is a big part of its business.

The company is also seeing a decline in the number of Prime memberships.

Amazon is also seeing a decline in the number of orders placed on its website.

The company is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping on its website.

Amazon is also seeing a decline in the number of people who are shopping

Output from GPT-J 6B after fine-tuning:
How did the COVID-19 pandemic impact Amazon’s business?:The COVID-19 pandemic has impacted our business in several ways. First, it has disrupted our supply chain, and it has affected customer demand, which in turn has affected our sales and operating results. Second, it has increased our spending on advertising and marketing, which may not be effective in the long run. Third, it has increased our spending on technology infrastructure and computing capacity, which may not be effective in the long run. Fourth, it has increased our spending on fulfillment and customer service, which may not be effective in the long run. Finally, it has increased our spending on content, which may not be effective in the long run. See Item 8 of Part II, “Financial Statements and Supplementary Data — Note 1 — Description of Business and Accounting Policies.
Input: “What drives sales growth at Amazon?:”

Output from GPT-J 6B before fine-tuning:

Amazon is the world’s largest online retailer. It is also the world’s largest online marketplace. It is also the world’

Output from GPT-J 6B after fine-tuning:

Sales growth at Amazon is driven primarily by increased customer usage, including increased selection, lower prices, and increased convenience, and increased sales by other sellers on our websites.

Clean up

To avoid ongoing charges, delete the SageMaker inference endpoints. You can delete the endpoints via the SageMaker console or from the notebook using the following commands:

# Delete the SageMaker endpoint and the attached resources
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

Conclusion

JumpStart is a capability in SageMaker that allows you to quickly get started with ML. JumpStart uses open-source, pre-trained models to solve common ML problems like image classification, object detection, text classification, sentence pair classification, and question answering.

In this post, we showed you how to fine-tune and deploy a pre-trained LLM (GPT-J 6B) for text generation based on the SEC filing dataset. We demonstrated how the model transformed into a finance domain expert by undergoing the fine-tuning process on just two annual reports of the company. This fine-tuning enabled the model to generate content with an understanding of financial topics and greater precision. Try out the solution on your own and let us know how it goes in the comments.

Important: This post is for demonstrative purposes only. It is not financial advice and should not be relied on as financial or investment advice. The post used models pre-trained on data obtained from the SEC EDGAR database. You are responsible for complying with EDGAR’s access terms and conditions if you use SEC data.


About the Authors

Dr. Xin Huang is a Senior Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Dr. Sanjiv Das is an Amazon Scholar and the Terry Professor of Finance and Data Science at Santa Clara University. He holds post-graduate degrees in Finance (M.Phil and PhD from New York University) and Computer Science (MS from UC Berkeley), and an MBA from the Indian Institute of Management, Ahmedabad. Prior to being an academic, he worked in the derivatives business in the Asia-Pacific region as a Vice President at Citibank. He works on multimodal machine learning in the area of financial applications.

Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, train, and migrate ML production workloads to SageMaker at scale. He specializes in deep learning, especially in the area of NLP and CV. Outside of work, he enjoys running and hiking.

Announcing the updated Microsoft OneDrive connector (V2) for Amazon Kendra

Amazon Kendra is an intelligent search service powered by machine learning (ML), enabling organizations to provide relevant information to customers and employees, when they need it.

Amazon Kendra uses ML algorithms to enable users to use natural language queries to search for information scattered across multiple data sources in an enterprise, including commonly used document storage systems like Microsoft OneDrive.

OneDrive is an online cloud storage service that allows you to host your content and have it automatically sync across multiple devices. Amazon Kendra can index document formats like Microsoft OneNote, HTML, PDF, Microsoft Word, Microsoft PowerPoint, Microsoft Excel, Rich Text, JSON, XML, CSV, XSLT, and plain text.

We’re excited to announce that we have updated the OneDrive connector for Amazon Kendra to add even more capabilities. For example, we have added support to search OneNote documents. Additionally, you can now choose to use identity or ACL information to make your searches more granular.

The connector indexes documents along with their access control information so that search results are limited to the documents a user is allowed to access. To support filtering results by user access rights, the connector provides an identity crawler that loads principal information, such as user and group mappings, into a principal store.

In this post, we demonstrate how to configure multiple data sources in Amazon Kendra to provide a central place to search across your document repository.

Solution overview

For our solution, we demonstrate how to index a OneDrive repository or folder using the Amazon Kendra connector for OneDrive. The solution consists of the following steps:

  1. Create and configure an app on Microsoft Azure Portal and get the authentication credentials.
  2. Create a OneDrive data source via the Amazon Kendra console.
  3. Index the data in the OneDrive repository.
  4. Run a sample query to get the information.
  5. Filter the query by users or groups.

Prerequisites

To try out the Amazon Kendra connector for OneDrive, you need the following:

Configure an Azure application and assign connection permissions

Before we set up the OneDrive data source, we need a few details about the OneDrive repository. Complete the following steps:

  1. Log in to Azure.
  2. After logging in with your account credentials, choose App registrations, then choose New registration.
  3. Give an appropriate name to your application and register the application.
  4. Collect the information about the client ID, tenant ID, and other details of the application.
  5. To get a client secret, choose Add a certificate or secret under Client credentials.
  6. Choose New client secret and provide the proper description and expiry.
  7. Note the client ID, tenant ID, and client secret values. We use these for authenticating the OAuth2 application.
  8. Navigate to the app, choose API permissions in the navigation pane, and choose Add a permission.
  9. Choose Microsoft Graph.
  10. Under Application permissions, enter File in the search bar and under Files, select Files.Read.All.
  11. Choose Add permissions.
  12. Similarly, add the following permissions on the Microsoft Graph option for the application you created:
    1. Group.Read.All
    2. Notes.Read.All

On completion, the API permissions will look like the following screenshot.

Configure the Amazon Kendra connector for OneDrive

To configure the Amazon Kendra connector, complete the following steps:

  1. On the Amazon Kendra console, choose Create an Index.
  2. For Index name, enter a name for the index (for example, my-onedrive-index).
  3. Enter an optional description.
  4. Choose Create a new role.
  5. For Role name, enter an IAM role name.
  6. Configure optional encryption settings and tags.
  7. Choose Next.
  8. In the Configure user access control section, select Yes under Access control settings.
  9. For Token type, choose JSON from the drop-down menu.
  10. Leave the remaining values as their default values.
  11. Choose Next.

Before we move to the next configuration step, we need to provide Amazon Kendra with a role that has the permissions necessary for connecting to the site. These include permission to get and decrypt the AWS Secrets Manager secret that contains the application ID and secret key necessary to connect to the OneDrive site.

  1. Open another tab for the AWS account, and on the IAM console, navigate to the role that you created earlier (for example, AmazonKendra-us-west-2-onedrive).
  2. Choose Add permissions and Create inline policy.
  3. For Service, choose Kendra.
  4. For Actions, choose Write and specify BatchPutDocument.
  5. For Resources, choose All resources.
  6. Choose Review policy.
  7. For Name, enter a name (for example, BatchPutPolicy).
  8. Choose Create policy.
  9. Add this policy to the role you created.
  10. Additionally, attach the SecretsManagerReadWrite AWS managed policy to the role.
  11. Return to the Amazon Kendra tab.
  12. Select Developer edition and choose Create.

This creates and propagates the IAM role and then creates the Amazon Kendra index, which can take up to 30 minutes.

  1. Return to the Amazon Kendra console, choose Data sources in the navigation pane, and choose Add data source.
  2. Under OneDrive connector V2.0, choose Add connector.
  3. For Data source name, enter a name (for example, my-onedrive).
  4. Enter an optional description.
  5. Choose Next.
  6. For OneDrive Tenant ID, enter the tenant ID you gathered earlier.
  7. For Configure VPC and security group, leave the default (No VPC).
  8. Keep Identity crawler is on selected. This imports identity information into the index.
  9. For IAM role, choose Create a new role.
  10. Enter a role name, such as AmazonKendra-us-west-2-onedrive, then choose Next.
  11. In the Authentication section, choose Create and add a secret.
  12. Create a secret with clientId and clientSecret as keys.
  13. Add their respective values with the information you collected earlier.
  14. Choose Next.
  15. In the Configure sync settings section, add the OneDrive users whose documents you want to index.
  16. Select the sync mode for the index. For this post, we select New, modified or deleted content sync.
  17. Choose the frequency of indexing as Run on demand, then choose Next.

Field mappings allow you to set the searchability and relevance of fields. For example, the lastUpdatedAt field can be used to sort or boost the ranking of documents based on how recently they were updated.

  1. Keep all the defaults in the Set field mappings section and choose Next.
  2. On the review page, choose Add data source.
  3. Choose Sync now.

The sync can take up to 30 minutes to complete.

Test the solution

Now that you have indexed the content from OneDrive, you can test it by querying the index.

  1. Go to your index on the Amazon Kendra console and choose Search indexed content in the navigation pane.
  2. Enter a search term and press Enter.

Notice that without a token, the ACLs prevent a search result from being returned.

  1. Expand Test query with an access token and choose Apply token.
  2. Enter the appropriate token with a user who has permissions to read the file and choose Apply.
  3. Search for information present in OneDrive again.

You can verify that Amazon Kendra presents the ranked results as expected.
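
You can also issue the same tokenized query programmatically through the Kendra Query API. The following sketch uses the AWS SDK for Java v2; the index ID, query text, and access token are placeholders you would replace with your own values:

import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.kendra.KendraClient;
import software.amazon.awssdk.services.kendra.model.QueryRequest;
import software.amazon.awssdk.services.kendra.model.QueryResponse;
import software.amazon.awssdk.services.kendra.model.QueryResultItem;
import software.amazon.awssdk.services.kendra.model.UserContext;

public class OneDriveSearchExample {

    public static void main(String[] args) {
        try (KendraClient kendra = KendraClient.builder().region(Region.US_WEST_2).build()) {
            QueryRequest request = QueryRequest.builder()
                    .indexId("your-index-id")              // placeholder index ID
                    .queryText("quarterly planning notes") // placeholder query
                    .userContext(UserContext.builder()
                            .token("your-access-token")    // placeholder token for an authorized user
                            .build())
                    .build();

            // Only documents the token's user is allowed to access are returned
            QueryResponse response = kendra.query(request);
            for (QueryResultItem item : response.resultItems()) {
                System.out.println(item.documentTitle().text() + " | " + item.documentExcerpt().text());
            }
        }
    }
}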

Congratulations, you have configured Amazon Kendra to index and search documents in OneDrive and to control access to them using ACLs.

Conclusion

With the Microsoft OneDrive V2 connector for Amazon Kendra, organizations can tap into commonly used enterprise document stores, securely using intelligent search powered by Amazon Kendra. You can enhance the search experience by integrating the data source with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion.


About the authors

Pravinchandra Varma is a Senior Customer Delivery Architect with the AWS Professional Services team and is passionate about applications of machine learning and artificial intelligence services.

Supratim Barat is a Software Development Engineer with the AWS Kendra Yellowbadge team and is a blockchain and cybersecurity enthusiast.


How RallyPoint and AWS are personalizing job recommendations to help military veterans and service members transition back into civilian life using Amazon Personalize

This post was co-written with Dave Gowel, CEO of RallyPoint. In his own words, “RallyPoint is an online social and professional network for veterans, service members, family members, caregivers, and other civilian supporters of the US armed forces. With two million members on the platform, the company provides a comfortable place for this deserving population to connect with each other and programs designed to support them.”

All those who serve – and those who support them – often face a variety of employment challenges when a servicemember transitions back into civilian life. RallyPoint has identified the transition period to a civilian career as a major opportunity to improve the quality of life for this population by creating automated and compelling job recommendations. However, the team historically employed a rule-based curation method to recommend jobs throughout its user experience, which doesn’t allow members to get job recommendations personalized to their individual experience, expertise, and interests.

“To improve this experience for its members, we at RallyPoint wanted to explore how machine learning (ML) could help. We don’t want our servicemembers, veterans, and their loved ones to waste time searching for a fulfilling civilian career path when they decide to leave the military. It should be an easy process. We want our members to tell us about their military experiences, any schools they’ve attended, and their personal preferences. Then by leveraging what we know from our millions of military and veteran members, relevant open jobs should be easily surfaced instead of laboriously searched. This free service for our members is also expected to drive revenue by at least seven figures from employers seeking the right military and veteran talent, allowing us to build more free capabilities for our members.”

This blog post summarizes how the Amazon Machine Learning Solutions Lab (MLSL) partnered with RallyPoint to drive a 35% improvement in personalized career recommendations and a 66x increase in coverage over the existing rule-based implementation, among other improvements for RallyPoint members.

“MLSL helped RallyPoint save and improve the lives of the US military community. Fortunate to work on multiple complex and impactful projects with MLSL to support the most deserving of populations, RallyPoint accelerated growth in multiple core organizational metrics in the process. MLSL’s high caliber talent, culture, and focus on aiding our realization of measurable and compelling results from machine learning investments enabled us to reduce suicide risk, improve career transition, and speed up important connections for our service members, veterans, and their families.”

Screenshot of the RallyPoint Website

*Photo provided by the RallyPoint team.

The following sections cover the business and technical challenges, the approach taken by the AWS and RallyPoint teams, and the performance of the implemented solution, which leverages Amazon Personalize.

Amazon Personalize makes it easy for developers to build applications capable of delivering a wide array of personalization experiences, including specific product recommendations, personalized product re-ranking, and customized direct marketing. Amazon Personalize is a fully managed ML service that goes beyond rigid, static rule-based recommendation systems by training, tuning, and deploying custom ML models to deliver highly customized recommendations to customers across industries such as retail and media and entertainment.

Business and technical challenges

Multiple business challenges inspired this partnership. The most pertinent was the clickthrough rate on the top 10 recommended jobs on the RallyPoint website. RallyPoint analyzed user engagement on its platform and found that it needed to surface more relevant jobs that users would actually click. The more relevant a recommended job is, the higher the likelihood that members apply to it, leading to improved employment outcomes.

The next challenge was to increase member engagement with the job services offered on the site. RallyPoint offers organizations the opportunity to "Build your brand and engage the military community, advertise your products and services, run recruitment marketing campaigns, post jobs, and search veteran talent." The team again identified an opportunity to apply Amazon Personalize to help more people transition to civilian life, and sought to improve click-to-customer conversion numbers, leading to better outcomes for RallyPoint's direct customers.

From a technical perspective, like many traditional recommender system problems, data sparsity and a long tail were challenges to overcome. The sample set of de-identified, already publicly shared data included thousands of anonymized user profiles with more than fifty user metadata points each, but many profiles had inconsistent or missing metadata. To tackle this, the team leveraged the Amazon Personalize cold start recommendation functionality for the relevant users.

Solution overview

To solve the problem, MLSL collaborated with RallyPoint to construct a custom Amazon Personalize pipeline for RallyPoint. Some of the services used include Amazon Simple Storage Service (Amazon S3), Amazon SageMaker Notebook Instances, and Amazon Personalize. The following diagram illustrates the solution architecture.

The anonymized raw data used for the solution consisted of a history of interactions with job postings, along with metadata on user profiles and job positions. This data was stored in Amazon S3. The MLSL team used Amazon SageMaker Notebook Instances to prepare the data as input to Amazon Personalize. This step included data preprocessing, feature engineering, and creating the dataset groups and schemas required by Amazon Personalize. For more information, refer to Creating a Custom dataset group.

The next step was to create a solution in Amazon Personalize. A solution refers to the combination of an Amazon Personalize recipe, customized parameters, and one or more solution versions. For more information, refer to Creating a solution. The team used the User-Personalization recipe to generate user-specific job recommendations for users in a validation set. The Amazon Personalize outputs, including the job recommendations and performance metrics, are stored in an Amazon S3 bucket for further analysis.
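
For readers who want to see what this looks like in code, the following is a minimal boto3 sketch of creating a dataset group and training a solution with the User-Personalization recipe. It is illustrative only: the resource names are placeholders, and the dataset creation, schema definition, and interactions data import steps that must happen before training are omitted.

import boto3

personalize = boto3.client("personalize")

# Placeholder name; schema creation and interactions data import are omitted here
dataset_group_arn = personalize.create_dataset_group(name="rallypoint-jobs")["datasetGroupArn"]

# Train a solution with the User-Personalization recipe used in this post
solution_arn = personalize.create_solution(
    name="job-recommendations",
    datasetGroupArn=dataset_group_arn,
    recipeArn="arn:aws:personalize:::recipe/aws-user-personalization",
)["solutionArn"]

solution_version_arn = personalize.create_solution_version(
    solutionArn=solution_arn, trainingMode="FULL"
)["solutionVersionArn"]

# After training completes, retrieve the offline metrics for the solution version
metrics = personalize.get_solution_metrics(solutionVersionArn=solution_version_arn)
print(metrics["metrics"])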

In the final step, the team used a notebook instance to prepare the output recommendations for external evaluation by human annotators, as described in the Using Domain Experts section.

Evaluation of Amazon Personalize results

The performance of an Amazon Personalize solution version can be evaluated using offline metrics, online metrics, and A/B testing. Offline metrics allow you to view the effects of modifying hyperparameters and algorithms used to train your models, calculated against historical data. Online metrics are the empirical results observed in your user’s interactions with real-time recommendations provided in a live environment (such as clickthrough rate). A/B testing is an online method of comparing the performance of multiple solution versions to a default solution. Users are randomly assigned to either the control (default) group or one of the treatment (test) groups. The control group users receive recommendations from the default solution (baseline), whereas each of the treatment groups interact with a different solution version. Statistical significance tests are used to compare the performance metrics (such as clickthrough rate or latency) and business metrics (such as revenue) to that of the default solution.

Amazon Personalize measures offline metrics while training a solution version. The team used offline metrics such as mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG@k), precision@k, and coverage. For the definitions of all available offline metrics, refer to Metric definitions.
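
For intuition on what two of these metrics measure, the following is a small, self-contained sketch of reciprocal rank and precision@k computed against a set of held-out relevant items. Amazon Personalize computes its offline metrics for you during training; the job IDs below are made up purely for illustration.

from typing import List, Set

def reciprocal_rank(recommended: List[str], relevant: Set[str]) -> float:
    # 1 / rank of the first relevant item, or 0 if none appears
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(recommended: List[str], relevant: Set[str], k: int = 10) -> float:
    # Fraction of the top-k recommendations that are relevant
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

recs = ["job-42", "job-7", "job-13"]   # hypothetical model output
held_out = {"job-7"}                   # hypothetical held-out interaction
print(reciprocal_rank(recs, held_out))    # 0.5
print(precision_at_k(recs, held_out, 3))  # 0.33...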

Although Amazon Personalize provides an extensive list of offline metrics that the team can use to objectively measure the performance of solutions during training, online metrics and A/B testing are recommended to track and validate model performance. One caveat to these tests is that they require users to interact with Amazon Personalize recommendations in real time. Because the RallyPoint Amazon Personalize model wasn’t deployed prior to this publication, the team didn’t have results to report for these tests.

Using Domain Experts

A/B testing is the preferred method of analyzing the quality of a recommendation system; however, using domain experts to annotate recommendations is a viable precursor. Because online testing was not an option, to test the robustness of the recommendations, the team asked domain experts at RallyPoint to annotate the recommendations generated by the models and count the number of job positions the experts agreed should be recommended (given a user’s information and indicated preferences) as the number of “correct” recommendations. This metric was used to compare solution versions. A popularity solution (the current rule-based criteria), which consisted of recommending the top five most popular job positions to every user, was used as a baseline. In addition, a solution with default settings was used as another baseline, referred to as the Amazon Personalize baseline solution.

Results

Using the best performing model resulted in a 35% improvement in the number of “correct” recommendations over the Amazon Personalize baseline solution and a 54% improvement over the popularity solution. The team also achieved a 66x improvement in coverage, a 30x improvement in MRR, and a 2x improvement in precision@10 compared to the popularity solution. Compared to the Amazon Personalize baseline solution, the team observed up to a 2x increase in MRR and precision@10.

Summary

RallyPoint recognized an opportunity to better serve its customers with more personalized career recommendations. The company began its user personalization journey with customer obsession in mind, partnering with the Amazon ML Solutions Lab. Through this solution, RallyPoint can now give its users more valuable career recommendations. Incorporating the improved recommendation system into the website means RallyPoint users will see more relevant jobs in their career feed, easing the path to more fulfilling careers and an improved quality of life for members.

Use Amazon Personalize to provide an individualized experience for your users today! If you’d like to collaborate with experts to bring ML solutions to your organization, contact the Amazon ML Solutions Lab.

About the Authors

Dave Gowel is an Army veteran and the CEO of RallyPoint. Dave is a graduate of West Point and the US Army Ranger School, served in Iraq as a tank platoon leader, and taught as an assistant professor at the Massachusetts Institute of Technology ROTC program. RallyPoint is the third technology company for which Dave has been CEO.

Matthew Rhodes is a Data Scientist working in the Amazon ML Solutions Lab. He specializes in building machine learning pipelines that involve concepts such as natural language processing and computer vision.

Amin Tajgardoon is an Applied Scientist at the Amazon ML Solutions Lab. He has an extensive background in computer science and machine learning. In particular, Amin’s focus has been on deep learning and forecasting, prediction explanation methods, model drift detection, probabilistic generative models, and applications of AI in the healthcare domain.

Yash Shah is a Science Manager in the Amazon ML Solutions Lab. He and his team of applied scientists and machine learning engineers work on a range of machine learning use cases from healthcare, sports, automotive and manufacturing.

Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.

Greg Tolmie is an Account Manager on the AWS Public Sector ISV partners team. Greg supports a portfolio of AWS public sector ISV partners to help them grow and mature their adoption of AWS services while maximizing benefits of the AWS partner network.

Read More

Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

Generate actionable insights for predictive maintenance management with Amazon Monitron and Amazon Kinesis

Reliability managers and technicians in industrial environments such as manufacturing production lines, warehouses, and industrial plants are keen to improve equipment health and uptime to maximize product output and quality. Machine and process failures are often addressed by reactive activity after incidents happen or by costly preventive maintenance, where you run the risk of over-maintaining equipment or missing issues that arise between periodic maintenance cycles. Predictive, condition-based maintenance is a proactive strategy that improves on both: it combines continuous monitoring, predictive analytics, and just-in-time action, enabling maintenance and reliability teams to service equipment only when necessary, based on the actual equipment condition.

There have been common challenges with condition-based monitoring when generating actionable insights for large industrial asset fleets. These challenges include, but are not limited to: building and maintaining a complex infrastructure of sensors collecting data from the field, obtaining a reliable high-level summary of industrial asset fleets, efficiently managing failure alerts, identifying possible root causes of anomalies, and effectively visualizing the state of industrial assets at scale.

Amazon Monitron is an end-to-end condition monitoring solution that enables you to start monitoring equipment health with the aid of machine learning (ML) in minutes, so you can implement predictive maintenance and reduce unplanned downtime. It includes sensor devices to capture vibration and temperature data, a gateway device to securely transfer data to the AWS Cloud, the Amazon Monitron service that analyzes the data for anomalies with ML, and a companion mobile app to track potential failures in your machinery. Your field engineers and operators can directly use the app to diagnose and plan maintenance for industrial assets.

From the operational technology (OT) team standpoint, using the Amazon Monitron data also opens up new ways to improve how they operate large industrial asset fleets thanks to AI. OT teams can reinforce their organization’s predictive maintenance practice by building a consolidated view across multiple hierarchies (assets, sites, and plants). They can combine actual measurements and ML inference results with unacknowledged alarms, sensor or gateway connectivity status, or asset state transitions to build a high-level summary for the scope (asset, site, project) they are focused on.

With the recently launched Amazon Monitron Kinesis data export v2 feature, your OT team can stream incoming measurement data and inference results from Amazon Monitron via Amazon Kinesis to Amazon Simple Storage Service (Amazon S3) to build an Internet of Things (IoT) data lake. By using the latest data export schema, you can obtain sensor connectivity status, gateway connectivity status, measurement classification results, closure reason codes, and details of asset state transition events.

Use cases overview

The enriched data stream Amazon Monitron now exposes enables you to implement several key use cases, such as automated work order creation, enrichment of an operational single pane of glass, and automated failure reporting. Let’s dive into these use cases.

You can use the Amazon Monitron Kinesis data export v2 to create work orders in Enterprise Asset Management (EAM) systems such as Infor EAM, SAP Asset Management, or IBM Maximo. For example, in the video avoiding mechanical issues with predictive maintenance & Amazon Monitron, you can discover how Amazon Fulfillment Centers are avoiding mechanical issues on conveyor belts with Amazon Monitron sensors integrated with third-party software, such as the EAM used at Amazon, as well as with the chat rooms technicians use. This shows how you can naturally integrate Amazon Monitron insights into your existing workflows. Stay tuned in the coming months for the next installment of this series, which walks through an actual implementation of this integration.

You can also use the data stream to ingest Amazon Monitron insights back into a shop floor system such as a Supervisory Control and Data Acquisition (SCADA) or a Historian. Shop floor operators are more efficient when all the insights about their assets and processes are provided in a single pane of glass. In this concept, Amazon Monitron doesn’t become yet another tool technicians have to monitor, but another data source with insights provided in the single view they are already used to. Later this year, we will also describe an architecture you can use to perform this task and send Amazon Monitron feedback to major third-party SCADA systems and Historians.

Last but not least, the new data stream from Amazon Monitron includes the asset state transitions and the closure codes provided by users when acknowledging alarms (which trigger the transition to a new state). With this data, you can automatically build visualizations that provide real-time reporting of failures and the actions taken while operating your assets.

Your team can then build a broader data analytics dashboard to support your industrial fleet management practice by combining this asset state data with Amazon Monitron measurement data and other IoT data across large industrial asset fleets by using key AWS services, which we describe in this post. We explain how to build an IoT data lake, the workflow to produce and consume the data, as well as a summary dashboard to visualize Amazon Monitron sensors data and inference results. We use an Amazon Monitron dataset coming from about 780 sensors installed in an industrial warehouse, which has been running for more than 1 year. For the detailed Amazon Monitron installation guide, refer to Getting started with Amazon Monitron.

Solution overview

Amazon Monitron provides ML inference of asset health status after 21 days of the ML model training period for each asset. In this solution, the measurement data and ML inference from these sensors are exported to Amazon S3 via Amazon Kinesis Data Streams by using the latest Amazon Monitron data export feature. As soon as Amazon Monitron IoT data is available in Amazon S3, a database and table are created in Amazon Athena by using an AWS Glue crawler. You can query Amazon Monitron data via AWS Glue tables with Athena, and visualize the measurement data and ML inference with Amazon Managed Grafana. With Amazon Managed Grafana, you can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure. In this post, you connect Amazon Managed Grafana to Athena, and learn how to build a data analytics dashboard with Amazon Monitron data to help you plan industrial asset operations at scale.

The following screenshot is an example of what you can achieve at the end of this post. This dashboard is divided into three sections:

  • Plant View – Analytical information from all sensors across plants; for example, the overall counts of various states of sensors (Healthy, Warning, or Alarm), number of unacknowledged and acknowledged alarms, gateway connectivity, and average time for maintenance
  • Site View – Site-level statistics, such as asset status statistics at each site, total number of days that an alarm remains unacknowledged, top/bottom performing assets at each site, and more
  • Asset View – Summary information for the Amazon Monitron project at the asset level, such as the alarm type for an unacknowledged alarm (ISO or ML), the timeline for an alarm, and more

These panels are examples that can help strategic operational planning, but they are not exclusive. You can use a similar workflow to customize the dashboard according to your targeted KPI.



Architecture overview

The solution you will build in this post combines Amazon Monitron, Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, Athena, and Amazon Managed Grafana.

The following diagram illustrates the solution architecture. Amazon Monitron sensors measure and detect anomalies from equipment. Both measurement data and ML inference outputs are exported at a frequency of once per hour to a Kinesis data stream, and they are delivered to Amazon S3 via Kinesis Data Firehose with a 1-minute buffer. The exported Amazon Monitron data is in JSON format. An AWS Glue crawler analyzes the Amazon Monitron data in Amazon S3 at a chosen frequency of once per hour, builds a metadata schema, and creates tables in Athena. Finally, Amazon Managed Grafana uses Athena to query the Amazon S3 data, allowing dashboards to be built to visualize both measurement data and device health status.

To build this solution, you complete the following high-level steps:

  1. Enable a Kinesis Data Stream export from Amazon Monitron and create a data stream.
  2. Configure Kinesis Data Firehose to deliver data from the data stream to an S3 bucket.
  3. Build the AWS Glue crawler to create a table of Amazon S3 data in Athena.
  4. Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Additionally, make sure that all the resources you deploy are in the same Region.

Enable a Kinesis data stream export from Amazon Monitron and create a data stream

To configure your data stream export, complete the following steps:

  1. On the Amazon Monitron console, from your project’s main page, choose Start live data export.
  2. Under Select Amazon Kinesis data stream, choose Create a new data stream.
  3. Under Data stream configuration, enter your data stream name.
  4. For Data stream capacity, choose On-demand.
  5. Choose Create data stream.
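
If you prefer to pre-create the data stream programmatically instead of letting the console create it, the following boto3 sketch creates an equivalent on-demand stream; the stream name is a placeholder, and you still point Amazon Monitron at it with Start live data export as described above.

import boto3

kinesis = boto3.client("kinesis")

# Create an on-demand data stream for the Amazon Monitron live data export
kinesis.create_stream(
    StreamName="monitron-live-data",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)

# Wait until the stream is ACTIVE before selecting it in Amazon Monitron
kinesis.get_waiter("stream_exists").wait(StreamName="monitron-live-data")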

Note that any live data export enabled after April 4th, 2023 will stream data following the Kinesis data export v2 schema. If you have an existing data export that was enabled before this date, the schema will follow the v1 format.

You can now see live data export information on the Amazon Monitron console with your specified Kinesis data stream.

Configure Kinesis Data Firehose to deliver data to an S3 bucket

To configure your Firehose delivery stream, complete the following steps:

  1. On the Kinesis console, choose Delivery streams in the navigation pane.
  2. Choose Create delivery stream.
  3. For Source, select Amazon Kinesis Data Streams.
  4. For Destination, select Amazon S3.
  5. Under Source settings, for Kinesis data stream, enter the ARN of your Kinesis data stream.
  6. Under Delivery stream name, enter a name for your delivery stream.
  7. Under Destination settings, choose an S3 bucket or enter a bucket URI. You can either use an existing S3 bucket to store Amazon Monitron data, or you can create a new S3 bucket.
  8. Enable dynamic partitioning using inline parsing for JSON:
    • Choose Enabled for Dynamic partitioning.
    • Choose Enabled for Inline parsing for JSON.
    • Under Dynamic partitioning keys, add the following partition keys:
      • project – .projectName | "project=\(.)"
      • site – .eventPayload.siteName | "site=\(.)"
      • asset – .eventPayload.assetName | "asset=\(.)"
      • position – .eventPayload.positionName | "position=\(.)"
      • time – .timestamp | sub(" [0-9]{2}:[0-9]{2}:[0-9]{2}.[0-9]{3}$"; "") | "time=\(.)"
  9. Choose Apply dynamic partitioning keys and confirm the generated S3 bucket prefix is:
!{partitionKeyFromQuery:project}/!{partitionKeyFromQuery:site}/!{partitionKeyFromQuery:asset}/!{partitionKeyFromQuery:position}/!{partitionKeyFromQuery:time}/.
  10. Enter an S3 bucket error output prefix. Any JSON payload that doesn’t contain the keys described earlier will be delivered to this prefix. For instance, the gatewayConnected and gatewayDisconnected events are not linked to a given asset or position, so they won’t contain the assetName and positionName fields. Specifying this optional prefix allows you to monitor this location and process these events accordingly.
  11. Choose Create delivery stream.
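
If you prefer to script this setup rather than use the console, the following boto3 sketch creates an equivalent delivery stream with dynamic partitioning enabled. It is a minimal sketch: the role ARNs, bucket, and stream names are placeholders, only four of the five partition keys above are extracted (the time key is omitted for brevity), and dynamic partitioning requires a buffer size of at least 64 MB.

import boto3

firehose = boto3.client("firehose")

# Placeholder ARNs and names for illustration only
stream_arn = "arn:aws:kinesis:us-east-1:111122223333:stream/monitron-live-data"
role_arn = "arn:aws:iam::111122223333:role/FirehoseMonitronRole"
bucket_arn = "arn:aws:s3:::your-monitron-bucket"

firehose.create_delivery_stream(
    DeliveryStreamName="monitron-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={"KinesisStreamARN": stream_arn, "RoleARN": role_arn},
    ExtendedS3DestinationConfiguration={
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Prefix built from the partition keys extracted below
        "Prefix": "project=!{partitionKeyFromQuery:project}/site=!{partitionKeyFromQuery:site}/"
                  "asset=!{partitionKeyFromQuery:asset}/position=!{partitionKeyFromQuery:position}/",
        "ErrorOutputPrefix": "monitron-errors/",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        {
                            "ParameterName": "MetadataExtractionQuery",
                            "ParameterValue": "{project: .projectName, site: .eventPayload.siteName, "
                                              "asset: .eventPayload.assetName, position: .eventPayload.positionName}",
                        },
                        {"ParameterName": "JsonParsingEngine", "ParameterValue": "JQ-1.6"},
                    ],
                }
            ],
        },
    },
)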

You can inspect the Amazon Monitron data in the S3 bucket. Note that Amazon Monitron exports live data once per hour, so wait for about an hour before inspecting the data.

This Kinesis Data Firehose setup enables dynamic partitioning, and the S3 objects delivered will use the following key format:

/project={projectName}/site={siteDisplayName}/asset={assetDisplayName}/position={sensorPositionDisplayName}/time={yyyy-mm-dd 00:00:00}/{filename}.

Build the AWS Glue crawler to create a table of Amazon S3 data in Athena

After the live data has been exported to Amazon S3, we use an AWS Glue crawler to generate the metadata tables. In this post, we use AWS Glue crawlers to automatically infer database and table schema from Amazon Monitron data exported in Amazon S3, and store the associated metadata in the AWS Glue Data Catalog. Athena then uses the table metadata from the Data Catalog to find, read, and process the data in Amazon S3. Complete the following steps to create your database and table schema:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Choose Create crawler.
  3. Enter a name for the crawler (for example, XXX_xxxx_monitron).
  4. Choose Next.
  5. For Is your data already mapped to Glue tables, choose Not yet.
  6. For Data Source, choose S3.
  7. For Location of S3 data, choose In this Account, and enter the path of your S3 bucket directory you set up in the previous section (s3://YourBucketName).
  8. For Repeat crawls of S3 data stores, select Crawl all sub-folders.
  9. Finally, choose Next.
  10. Select Create new IAM role and enter a name for the role.
  11. Choose Next.
  12. Select Add Database, and enter a name for the database. This creates the Athena database where your metadata tables are located after the crawler is complete.
  13. For Crawler Schedule, select a preferred time-based scheduler (for example, hourly) to refresh the Amazon Monitron data in the database, and choose Next.
  14. Review the crawler details and choose Create.
  15. On the Crawlers page of the AWS Glue console, select the crawler you created and choose Run crawler.
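
The same crawler can be created and started with boto3 if you prefer to script it. The following is a minimal sketch; the crawler name, database name, IAM role ARN, and bucket path are placeholders, and the role needs AWS Glue and Amazon S3 read permissions.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="monitron_crawler",
    Role="arn:aws:iam::111122223333:role/GlueMonitronCrawlerRole",
    DatabaseName="monitron_db",
    Targets={"S3Targets": [{"Path": "s3://YourBucketName/"}]},
    # Hourly schedule to pick up newly exported Amazon Monitron data
    Schedule="cron(0 * * * ? *)",
)

glue.start_crawler(Name="monitron_crawler")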

You may need to wait a few minutes, depending on the size of the data. When it’s complete, the crawler’s status shows as Ready. To see the metadata tables, navigate to your database on the Databases page and choose Tables in the navigation pane.

You can also view data by choosing Table data on the console.

You’re redirected to the Athena console to view the top 10 records of the Amazon Monitron data in Amazon S3.

Create a dashboard of Amazon Monitron devices with Amazon Managed Grafana

In this section, we build a customized dashboard with Amazon Managed Grafana to visualize the Amazon Monitron data in Amazon S3, so that the OT team gets streamlined access to assets in alarm across their whole Amazon Monitron sensor fleet. This enables the OT team to plan next actions based on the possible root cause of the anomalies.

To create a Grafana workspace, complete the following steps:

  1. Ensure that your user role is admin or editor.
  2. On the Amazon Managed Grafana console, choose Create workspace.
  3. For Workspace name, enter a name for the workspace.
  4. Choose Next.
  5. For Authentication access, select AWS IAM Identity Center (successor to AWS Single Sign-On). You can use the same AWS IAM Identity Center user that you used to set up your Amazon Monitron project.
  6. Choose Next.
  7. For this first workspace, confirm that Service managed is selected for Permission type. This selection enables Amazon Managed Grafana to automatically provision the permissions you need for the AWS data sources that you use for this workspace.
  8. Choose Current account.
  9. Choose Next.
  10. Confirm the workspace details, and choose Create workspace. The workspace details page appears. Initially, the status is CREATING.
  11. Wait until the status is ACTIVE to proceed to the next step.

To configure your Athena data source, complete the following steps:

  1. On the Amazon Managed Grafana console, choose the workspace you want to work on.
  2. On the Data sources tab, select Amazon Athena, and choose Actions, Enable service-managed policy.
  3. Choose Configure in Grafana in the Amazon Athena row.
  4. Sign in to the Grafana workspace console using IAM Identity Center if necessary. The user should have the Athena access policy attached to the user or role to have access to the Athena data source. See AWS managed policy: AmazonGrafanaAthenaAccess for more info.
  5. On the Grafana workspace console, in the navigation pane, choose the lower AWS icon (there are two) and then choose Athena on the Data sources menu.
  6. Select the default Region that you want the Athena data source to query from, select the accounts that you want, then choose Add data source.
  7. Follow the steps to configure Athena details.

If your workgroup in Athena doesn’t have an output location configured already, you need to specify an S3 bucket and folder to use for query results. After setting up the data source, you can view or edit it in the Configuration pane.

In the following subsections, we demonstrate several panels in the Amazon Monitron dashboard built in Amazon Managed Grafana to gain operational insights. The Athena data source provides a standard SQL query editor that we’ll use to analyze the Amazon Monitron data to generate desired analytics.

First, if there are many sensors in the Amazon Monitron project and they are in different states (healthy, warning, alarm, and needs maintenance), the OT team wants a visual count of sensor positions in each state. You can obtain this information as a pie chart widget in Grafana via the following Athena query:

SELECT * FROM (
    SELECT latest_status,
           COUNT(assetdisplayname) OVER (PARTITION BY latest_status) AS asset_health_count
    FROM (
        SELECT timestamp, sitedisplayname, assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1
)
GROUP BY latest_status, asset_health_count;

The following screenshot shows a panel with the latest distribution of Amazon Monitron sensor status.

To format your SQL query for Amazon Monitron data, refer to Understanding the data export schema.
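
If you want to sanity-check the same table outside Grafana (for example, from a notebook), you can run a query against it with the Athena API. The following is a minimal boto3 sketch with placeholder database, table, and output location names, using a simplified count-by-state query rather than the exact panel query above.

import time
import boto3

athena = boto3.client("athena")

query = (
    'SELECT assetState.newState AS status, COUNT(*) AS cnt '
    'FROM "monitron_db"."monitron_table" GROUP BY assetState.newState'
)

execution_id = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "monitron_db"},
    ResultConfiguration={"OutputLocation": "s3://YourBucketName/athena-results/"},
)["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=execution_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    print(rows)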

Next, your Operations Technology team may want to plan predictive maintenance based on assets that are in alarm status, and therefore they want to quickly know the total number of acknowledged alarms vs. unacknowledged alarms. You can show the summary information of alarm state as simple stats panels in Grafana:

SELECT COUNT(*) FROM (
    SELECT timestamp, sitedisplayname, assetdisplayname,
           assetState.newState AS latest_status,
           RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
) tt
WHERE tt.rnk = 1 AND tt.latest_status = 'Alarm';

The following panel shows acknowledged and unacknowledged alarms.

The OT team can also query the amount of time the sensors remain in alarm status, so that they can decide their maintenance priority:

SELECT c.assetdisplayname, b.sensorpositiondisplayname, b.number_of_days_in_alarm_state
FROM (
    SELECT a.assetdisplayname, a.sensorpositiondisplayname,
           COUNT(*)/24 + 1 AS number_of_days_in_alarm_state
    FROM (
        SELECT *
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
        WHERE (assetState.newState = 'ALARM' AND assetState.newState = assetState.previousState)
        ORDER BY timestamp DESC
    ) a
    GROUP BY a.assetdisplayname, a.sensorpositiondisplayname
) b
INNER JOIN (
    SELECT *
    FROM (
        SELECT timestamp, sitedisplayname, assetdisplayname,
               assetState.newState AS latest_status,
               RANK() OVER (PARTITION BY assetdisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND tt.latest_status = 'ALARM'
) c
ON b.assetdisplayname = c.assetdisplayname;

The output of this analysis can be visualized with a bar chart in Grafana, and the assets in an alarm state can be easily identified, as shown in the following screenshot.

To analyze top/bottom asset performance based on the total amount of time the assets are in an alarm or need maintenance state, use the following query:

SELECT s.sitedisplayname, s.assetdisplayname, COUNT(s.timestamp)/24 AS trouble_time
FROM (
    SELECT timestamp, sitedisplayname, assetdisplayname, sensorpositiondisplayname, assetState.newState
    FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    WHERE assetState.newState = 'ALARM' OR assetState.newState = 'NEEDS_MAINTENANCE'
) AS s
GROUP BY s.assetdisplayname, s.sitedisplayname
ORDER BY trouble_time, s.assetdisplayname ASC
LIMIT 5;

The following bar gauge is used to visualize the preceding query output, with the top performing assets showing 0 days of alarm states, and the bottom performing assets showing accumulated alarming states over the past year.

To help the OT team understand the possible root cause of an anomaly, the alarm types can be displayed for these assets still in alarm state with the following query:

SELECT a.assetdisplayname, a.sensorpositiondisplayname, a.latest_status,
       CASE WHEN a.temperatureML != 'HEALTHY' THEN 'TEMP'
            WHEN a.vibrationISO != 'HEALTHY' THEN 'VIBRATION_ISO'
            ELSE 'VIBRATION_ML'
       END AS alarm_type
FROM (
    SELECT sitedisplayname, assetdisplayname, sensorpositiondisplayname,
           models.temperatureML.persistentClassificationOutput AS temperatureML,
           models.vibrationISO.persistentClassificationOutput AS vibrationISO,
           models.vibrationML.persistentClassificationOutput AS vibrationML,
           assetState.newState AS latest_status
    FROM (
        SELECT *,
               RANK() OVER (PARTITION BY assetdisplayname, sensorpositiondisplayname ORDER BY timestamp DESC) AS rnk
        FROM "AwsDataCatalog"."Replace with your Athena database name"."Replace with your Athena table name"
    ) tt
    WHERE tt.rnk = 1 AND assetState.newState = 'ALARM'
) a
WHERE (a.temperatureML != 'HEALTHY' OR a.vibrationISO != 'HEALTHY' OR a.vibrationML != 'HEALTHY');

You can visualize this analysis as a table in Grafana. In this Amazon Monitron project, two alarms were triggered by ML models for vibration measurement.

The Amazon Managed Grafana dashboard is shown here for illustration purposes. You can adapt the dashboard design according to your own business needs.

Failure Reports

When a user acknowledges an alarm in the Amazon Monitron app, the associated assets transition to a new state. The user also has the opportunity to provide some details about this alarm:

  • Failure cause – This can be one of the following: ADMINISTRATION, DESIGN, FABRICATION, MAINTENANCE, OPERATION, OTHER, QUALITY, WEAR, or UNDETERMINED
  • Failure mode – This can be one of the following: NO_ISSUE, BLOCKAGE, CAVITATION, CORROSION, DEPOSIT, IMBALANCE, LUBRICATION, MISALIGNMENT, OTHER, RESONANCE, ROTATING_LOOSENESS, STRUCTURAL_LOOSENESS, TRANSMITTED_FAULT, or UNDETERMINED
  • Action taken – This can be ADJUST, CLEAN, LUBRICATE, MODIFY, OVERHAUL, REPLACE, NO_ACTION, or OTHER

The event payload associated with the asset state transition contains all this information, the previous state of the asset, and the new state of the asset. Stay tuned for an update of this post with more details on how you can use this information in an additional Grafana panel to build Pareto charts of the most common failures and actions taken across your assets.
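
As a preview of what that analysis can look like, the following is a minimal pandas sketch of the aggregation behind a Pareto chart, assuming the asset state transition events have already been flattened into a DataFrame; the field names and values shown are illustrative and should be adapted to the actual export schema.

import pandas as pd

# Hypothetical flattened asset state transition events
events = pd.DataFrame(
    [
        {"failureMode": "LUBRICATION", "actionTaken": "LUBRICATE"},
        {"failureMode": "IMBALANCE", "actionTaken": "ADJUST"},
        {"failureMode": "LUBRICATION", "actionTaken": "LUBRICATE"},
        {"failureMode": "MISALIGNMENT", "actionTaken": "ADJUST"},
    ]
)

# Rank failure modes by frequency and compute the cumulative share:
# the two ingredients of a Pareto chart
counts = events["failureMode"].value_counts()
pareto = pd.DataFrame(
    {"count": counts, "cumulative_pct": counts.cumsum() / counts.sum() * 100}
)
print(pareto)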

Conclusion

Enterprise customers of Amazon Monitron want to build an IoT data lake with Amazon Monitron’s live data so they can manage multiple Amazon Monitron projects and assets and generate analytics reports across them. This post provided a detailed walkthrough of a solution to build such an IoT data lake with the latest Amazon Monitron Kinesis data export v2 feature. The solution also showed how to use other AWS services, such as AWS Glue and Athena, to query the data and generate analytics outputs, and how to visualize those outputs in a frequently refreshed Amazon Managed Grafana dashboard.

As a next step, you can expand this solution by sending ML inference results to other EAM systems that you might use for work order management. This will allow your operation team to integrate Amazon Monitron with other enterprise applications, and improve their operation efficiency. You can also start building more in-depth insights into your failure modes and actions taken by processing the asset state transitions and the closure codes that are now part of the Kinesis data stream payload.


About the authors

Julia Hu is a Sr. AI/ML Solutions Architect at Amazon Web Services. She has extensive experience in IoT architecture and Applied Data Science, and is part of both the Machine Learning and IoT Technical Field Community. She works with customers, ranging from start-ups to enterprises, to develop AWSome IoT machine learning (ML) solutions, at the Edge and in the Cloud. She enjoys leveraging latest IoT and big data technology to scale up her ML solution, reduce latency, and accelerate industry adoption.

Bishr Tabbaa is a solutions architect at Amazon Web Services. Bishr specializes in helping customers with machine learning, security, and observability applications. Outside of work, he enjoys playing tennis, cooking, and spending time with family.

Shalika Pargal is a Product Manager at Amazon Web Services. Shalika focuses on building AI products and services for Industrial customers. She brings significant experience at the intersection of Product, Industrial and Business Development. She recently shared Monitron’s success story at Reinvent 2022.

Garry Galinsky is a Principal Solutions Architect supporting Amazon on AWS. He has been involved with Monitron since its debut and has helped integrate and deploy the solution into Amazon’s worldwide fulfillment network. He recently shared Amazon’s Monitron success story at re:Invent 2022.

Michaël Hoarau is an AI/ML Specialist Solutions Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the AI/ML power to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. He published a book on time series analysis in 2022 and regularly writes about this topic on LinkedIn and Medium. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.

Read More

Deploy large models at high performance using FasterTransformer on Amazon SageMaker

Deploy large models at high performance using FasterTransformer on Amazon SageMaker

Sparked by the release of large AI models like AlexaTM, GPT, OpenChatKit, BLOOM, GPT-J, GPT-NeoX, FLAN-T5, OPT, Stable Diffusion, and ControlNet, the popularity of generative AI has seen a recent boom. Businesses are beginning to evaluate new cutting-edge applications of the technology in text, image, audio, and video generation that have the potential to revolutionize the services they provide and the ways they interact with customers. However, as the size and complexity of the deep learning models that power generative AI continue to grow, deployment can be a challenging task. Advanced techniques such as model parallelism and quantization become necessary to achieve latency and throughput requirements. Without expertise in using these techniques, many customers struggle to get started with hosting large models for generative AI applications.

This post can help! We begin by discussing different types of model optimizations that can be used to boost performance before you deploy your model. Then, we highlight how Amazon SageMaker large model inference deep learning containers (LMI DLCs) can help with optimization and deployment. Finally, we include code examples using LMI DLCs and FasterTransformer model parallelism to deploy models like flan-t5-xxl and flan-ul2. You can find an accompanying example notebook in the SageMaker examples repository.

Large model deployment pipeline

Major steps in any model inference workflow include loading a model into memory and handling inference requests on this in-memory model through a model server. Large models complicate this process because loading a 350 GB model such as BLOOM-176B can take tens of minutes, which materially impacts endpoint startup time. Furthermore, because these models can’t fit within the memory of a single accelerator, the model must be organized and partitioned such that it can be spread across the memory of multiple accelerators; then, model servers must handle processes and communication across multiple accelerators. Beyond model loading, partitioning, and serving, compression techniques are increasingly necessary to achieve performance goals (such as subsecond latency) for customers working with large models. Quantization and compression can reduce model size and serving cost by reducing the precision of weights or reducing the number of parameters via pruning or distillation. Compilation can optimize the computation graph and fuse operators to reduce memory and compute requirements of a model. Achieving low latency for large language models (LLMs) requires improvements in all the steps in the inference workflow: compilation, model loading, compression (runtime quantization), partitioning (tensor or pipeline parallelism), and model serving. At a high level, partitioning (with kernel optimization) brings down inference latency up to 66% (for example, BLOOM-176B from 30 seconds to 10 seconds), compilation by 20%, and compression by 50% (fp32 to fp16). An example pipeline for large model hosting with runtime partitioning is illustrated in the following diagram.

Overview of large model inference optimization techniques

With the large model deployment pipeline in mind, we now explore the optimizations. Optimizations can be critical to achieve latency and throughput goals. However, you need to be thoughtful about which optimizations you use and to what degree, because the accuracy of your model can be affected.

The following diagram is a high-level overview of different inference optimization techniques. Optimization approaches can be at the hardware or software level. We focus only on software optimization techniques in this post.

Optimized kernels and compilation

Today, optimized kernels are the greatest source of performance improvement for LMI (for example, DeepSpeed’s kernels reduced BLOOM-176B latency by three times). Fused kernel operators are model specific, and different model parallel libraries take different approaches. DeepSpeed created an injection policy for each model family and provides handwritten PyTorch modules and CUDA kernels that can partially speed up a model. FasterTransformer, in contrast, rewrites the model in pure C++ and CUDA to speed up the model as a whole. PyTorch 2.0 offers an open portal (via torch.compile) to allow easy compilation onto different platforms. To bring cost/performance-optimized LLM serving to SageMaker, we offer SageMaker LMI containers that provide the best open-source compilation stack on a per-model basis, such as T5 with FasterTransformer and GPT-J with DeepSpeed.
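
To make the torch.compile portal concrete, here is a tiny PyTorch 2.0 sketch on a toy module (not an LLM); torch.compile traces the module and generates optimized, fused kernels for the chosen backend.

import torch
import torch.nn as nn

# Toy module standing in for a transformer block
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()
compiled_model = torch.compile(model)

with torch.inference_mode():
    out = compiled_model(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 1024])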

Compilation or integration to optimized runtime

ML compilers, such as Amazon SageMaker Neo, apply techniques such as operator fusion, memory planning, graph optimizations, and automatic integration to optimized inference libraries. Because inference includes only a forward propagation, intermediate tensors between layers are discarded instead of stored for reuse in back-propagation. The graph optimization techniques improve the inference throughput and have a small impact on model memory footprints. Relative to other optimization techniques, compilation for inference provides a limited benefit for reducing a model’s memory requirements. Several runtime libraries for GPU are available today, such as FasterTransformer, TensorRT, and ONNX Runtime.

Model compression

Model compression is a collection of approaches that researchers and practitioners can use to reduce the size of their model, realize faster speed, and reduce hosting cost. Model compression techniques primarily include knowledge distillation, pruning, and quantization. Most compression technologies are challenging for LLMs due to requiring additional training cycles to improve the accuracy of compressed models.

Quantization

Quantization is the process of mapping values from a larger or continuous set of numbers to a smaller set of numbers (for example, INT8 {-128:127}, uINT8 {0:255}). Using a smaller set of numbers reduces memory use and complexity of computations, but the decreased precision can degrade the accuracy of the model. The level of quantization can be adjusted to fit size constraints and accuracy needs. For example, a model quantized to FP8 will be about half the size of a model in FP16 but at the expense of reduced accuracy.

Quantization has shown great and consistent success for inference tasks by reducing the size of the model up to 75%, offering 2–4 times throughput improvements and cost savings.

The success of quantization is because it’s broadly applicable across a range of models and use cases with approximately 1% accuracy/score loss, if a proper technique is used. It doesn’t require changing model architecture. Typically, it starts with an existing floating-point model and quantizes it to obtain a fixed-point quantized model. Quantizing from FP32 to INT8 reduces the model size by 75%, but the accuracy/score loss impact is often less than a point.
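
The following PyTorch sketch illustrates this size reduction on a toy model using post-training dynamic quantization of the Linear layers to INT8; it is illustrative only and not the quantization path used by the LMI containers.

import os
import torch
import torch.nn as nn

# Toy FP32 model standing in for a much larger network
fp32_model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()

# Post-training dynamic quantization: Linear weights are stored in INT8
int8_model = torch.quantization.quantize_dynamic(fp32_model, {nn.Linear}, dtype=torch.qint8)

def size_mb(model, path="/tmp/model.pt"):
    torch.save(model.state_dict(), path)
    return os.path.getsize(path) / 1e6

# Roughly a 4x reduction in weight storage (FP32 to INT8)
print(f"FP32: {size_mb(fp32_model):.1f} MB, INT8: {size_mb(int8_model):.1f} MB")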

Distillation

With distillation, a larger teacher model transfers knowledge to a smaller student model. The model size can be reduced until the student model can fit on an edge device or smaller cloud-based hardware, but accuracy decreases as the model is reduced. There is no industry standard for distillation, and many techniques are experimental. Distillation requires more work by the customer in tuning and trial and error to shrink the model without affecting accuracy. For more information, refer to Knowledge distillation in deep learning and its applications.

Pruning

Pruning is a model compression technique that reduces the number of operations by removing parameters. To minimize the impact to model accuracy, parameters are first ranked by importance. Parameters that are less important are set to zero or connections to the neuron are removed. This decreases the number of operations with minimal impact to model accuracy. For example, when using a pre-trained model for a narrow use case, parts of the larger model that are less relevant to your application could be pruned away to reduce size without significantly degrading performance for your task.
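
The following PyTorch sketch shows magnitude-based unstructured pruning on a single Linear layer; real-world pruning of an LLM is considerably more involved, so treat this only as an illustration of the mechanism.

import torch.nn as nn
import torch.nn.utils.prune as prune

# Zero out the 30% of weights with the smallest L1 magnitude
layer = nn.Linear(1024, 1024)
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Make the pruning permanent (removes the mask and re-parametrization)
prune.remove(layer, "weight")
print(float((layer.weight == 0).float().mean()))  # ~0.3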

Model partitioning

A model that can’t fit on a single accelerator’s memory must be split into multiple partitions. At a high level, there are two fundamental approaches to partitioning the model (model parallelism): tensor parallelism and pipeline parallelism.

Tensor parallelism is also called intra-layer model parallelism. In this approach, each one of the layers is partitioned across the workers (accelerators). On the positive side, we can handle models with very large layers, because the layers are split across workers. Therefore, we no longer need to fit at least a single layer on a worker, as was the case for pipeline parallelism. However, this leads to an all-to-all communication pattern between the workers after each one of the layers, so there’s a heavy burden on the GPU/accelerator interconnect.

Pipeline parallelism partitions the model into layers. Each worker may end up with one or more layers. This approach uses point-to-point communication and therefore introduces lower communication overhead compared to tensor parallelism. However, this approach won’t help if a single layer can’t fit into one worker’s or accelerator’s memory. It is also prone to pipeline idleness and may reduce the scaling efficiency.
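
To make the tensor parallelism idea concrete, the following toy PyTorch sketch splits one layer's weight matrix column-wise across two simulated workers and gathers the partial outputs; in a real multi-device setup the gather step would be an all-gather over the accelerator interconnect.

import torch

x = torch.randn(2, 8)          # activations entering the layer
w = torch.randn(8, 16)         # full weight matrix of the layer

w0, w1 = w.chunk(2, dim=1)     # each "worker" holds half of the columns
y0, y1 = x @ w0, x @ w1        # partial outputs computed independently
y = torch.cat([y0, y1], dim=1) # gather step

assert torch.allclose(y, x @ w, atol=1e-6)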

Open-source frameworks like DeepSpeed, Hugging Face Accelerate, and FasterTransformer allow per-model optimization to shard the model. For DeepSpeed in particular, the partitioning algorithm is tightly coupled with fused kernel operators. SageMaker LMI containers come with pre-integrated model partitioning frameworks like FasterTransformer, DeepSpeed, Hugging Face Accelerate, and Transformers-NeuronX. Currently, DeepSpeed, FasterTransformer, and Hugging Face Accelerate shard the model at model loading time. Runtime model partitioning can take more than 10 minutes (OPT-66B) and consume extensive CPU, GPU, and accelerator memory. Ahead-of-time (AOT) partitioning can help reduce model loading times. With AOT, models are partitioned before deployment, and the partitions are kept ready for downstream optimization and subsequent ingestion by model parallel frameworks. When a model parallel framework is fed an already partitioned model, runtime partitioning doesn’t happen. This improves model loading time and reduces CPU, GPU, and accelerator memory consumption. DeepSpeed and FasterTransformer support pre-partitioning and saving for models.

Prompt engineering

Prompt engineering refers to efforts to extract accurate, consistent, and fair outputs from large models, such as text-to-image synthesizers or large language models. LLMs are trained on large-scale bodies of text, so they encode a great deal of factual information about the world. A prompt consists of text, and optionally an image, given to a pre-trained model for a prediction task. A prompt may consist of additional components such as context, a task (an instruction, a question, and so on), an image or text, and training samples. Prompt engineering also provides a way for LLMs to do few-shot generalization, in which a machine learning model trained on a set of generic tasks learns a new or related task from just a handful of examples. For more information, refer to EMNLP: Prompt engineering is the new feature engineering. Refer to the following GitHub repo for more information about getting the most out of your large models using prompt engineering on SageMaker.
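
As a simple illustration of few-shot prompting, the following prompt combines an instruction, two in-context examples, and a new input for an instruction-tuned model such as FLAN-T5; the examples are made up.

# An illustrative few-shot prompt: instruction, in-context examples, new input
prompt = (
    "Classify the sentiment of the review as positive or negative.\n\n"
    "Review: The battery lasts all day.\nSentiment: positive\n\n"
    "Review: It broke after a week.\nSentiment: negative\n\n"
    "Review: The screen is bright and sharp.\nSentiment:"
)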

Model downloading and loading

Large language models incur long download times (for example, 40 minutes to download BLOOM-176B). In 2022, SageMaker Hosting added support for larger Amazon Elastic Block Store (Amazon EBS) volumes up to 500 GB, a longer download timeout of up to 60 minutes, and a longer container startup time of 60 minutes. You can enable this configuration to deploy LLMs on SageMaker. SageMaker LMI containers also include model download optimization by using the s5cmd library to speed up model download and container startup times, and eventually speed up auto scaling on SageMaker.
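
The following boto3 sketch shows where these limits surface when creating an endpoint configuration; the model name, instance type, and values are placeholders, and VolumeSizeInGB applies only to instance types that use EBS-backed storage.

import boto3

sm = boto3.client("sagemaker")

# Placeholder names; the SageMaker model must already exist
sm.create_endpoint_config(
    EndpointConfigName="llm-endpoint-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-llm-model",
            "InstanceType": "ml.p3.8xlarge",
            "InitialInstanceCount": 1,
            # Allow up to 60 minutes for model download and container startup
            "ModelDataDownloadTimeoutInSeconds": 3600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            # Larger EBS volume for model artifacts (EBS-backed instance types only)
            "VolumeSizeInGB": 256,
        }
    ],
)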

Diving deep into SageMaker LMI containers

SageMaker maintains large model inference containers with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on AWS infrastructure. With these containers, you can use corresponding open-source libraries such as DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX to partition model parameters using model parallelism techniques to use the memory of multiple GPUs or accelerators for inference. Transformers-NeuronX is a model parallel library introduced by the AWS Neuron team for AWS Inferentia and AWS Trainium to support LLMs. It supports tensor parallelism across Neuron cores.

The LMI container uses DJLServing as the pre-built integrated model server; includes pre-built integrated model partitioning frameworks like DeepSpeed, Accelerate, FasterTransformer, and Transformers-NeuronX; supports PyTorch; and comes with pre-installed cuDNN, cuBLAS, NCCL, and the CUDA Toolkit for GPUs, MKL for CPUs, and the Neuron SDK and runtime for running models on AWS Inferentia and Trainium.

Pre-integrated model partitioning frameworks in SageMaker LMI containers

SageMaker LMI comes with pre-integrated model partitioning frameworks to suit your performance and model support requirements.

Most model parallel frameworks support both pipeline and tensor parallelism. Pipeline parallelism is a simpler implementation than tensor parallelism; however, due to its sequential operating nature, it’s slower than tensor parallelism. Pipeline parallelism and tensor parallelism can also be combined.

The following table summarizes the different model partitioning frameworks and can help you select the right framework for deploying your models on SageMaker.

The comparison covers four frameworks: Hugging Face Accelerate, DeepSpeed, FasterTransformer, and Transformers-NeuronX (Inf2/Trn1).

  • Model parallelism – Hugging Face Accelerate: pipeline parallelism; DeepSpeed: pipeline and tensor parallelism; FasterTransformer: pipeline and tensor parallelism; Transformers-NeuronX: tensor parallelism
  • Supported models – Hugging Face Accelerate: all Hugging Face models; DeepSpeed: all GPT family, Stable Diffusion, and the T5 family; FasterTransformer: GPT2/OPT/BLOOM/T5; Transformers-NeuronX: GPT2/OPT/GPTJ/GPT-NeoX
  • Model loading speed – Hugging Face Accelerate: medium; DeepSpeed: fast; FasterTransformer: fast
  • Performance on model types – Hugging Face Accelerate: all other non-optimized models; DeepSpeed: GPT family; FasterTransformer: T5 and BLOOM; Transformers-NeuronX: all supported models
  • Hardware support – Hugging Face Accelerate: CPU/GPU; DeepSpeed: GPU; FasterTransformer: GPU; Transformers-NeuronX: Inf2/Trn1

The frameworks are also compared on loading Hugging Face checkpoints, runtime versus ahead-of-time partitioning, model partitioning on CPU memory, streaming tokens, fast model loading, and SageMaker multi-model endpoint (MME) support.

Large model deployment pipeline on SageMaker

SageMaker LMI containers offer a low-code/no-code mechanism to set up your large model deployment pipeline with the following capabilities:

  • Faster model download time using s5cmd
  • Pre-built optimized model parallel frameworks including Transformers-NeuronX, DeepSpeed, Hugging Face Accelerate, and FasterTransformer
  • Pre-built foundation software stack including PyTorch, NCCL, and MPI
  • Low-code/no-code deployment of large models by configuring serving.properties
  • SageMaker-compatible containers

The following diagram gives an overview of a SageMaker LMI deployment pipeline you can use to deploy your models.

Deploy a FLAN-T5-XXL model on SageMaker using the newly released LMI container version

FasterTransformer is a library implementing an accelerated engine for the inference of transformer-based neural networks, with a special emphasis on large models, spanning many GPUs and nodes in a distributed manner. FasterTransformer contains the implementation of the highly optimized version of the transformer block that contains the encoder and decoder parts. With this block, you can run the inference of both the full encoder-decoder architectures like T5, as well as encoder-only models such as BERT, or decoder-only models such as GPT. It’s written in C++/CUDA and relies on the highly optimized cuBLAS, cuBLASLt, and cuSPARSELt libraries. This allows you to build the fastest transformer inference pipeline on GPU.

The FasterTransformer model parallel library is now available in a SageMaker LMI container, adding support for popular models such as flan-t5-xxl and flan-ul2. FasterTransformer is an open-source library from NVIDIA that provides an accelerated engine for efficiently running transformer-based neural network inference, designed to handle large models that require multiple GPUs or accelerators and nodes in a distributed manner.

Runtime architecture of hosting a model using an LMI container’s FasterTransformer engine on SageMaker

The FasterTransformer engine in an LMI container supports loading model weights from an Amazon Simple Storage Service (Amazon S3) path or the Hugging Face Hub. After fetching the model, it converts the Hugging Face model checkpoint into FasterTransformer-supported partitioned model artifacts based on input parameters such as the tensor parallel degree, and loads the partitioned artifacts across GPU devices. It loads faster by using multi-process loading in Python, supports AOT compilation, and uses the CPU to partition the model. SageMaker LMI containers improve performance by downloading the models from Amazon S3 using s5cmd and by providing the FasterTransformer engine, which offers a layer of abstraction for developers: it loads the model in Hugging Face checkpoint or PyTorch bin format and uses the FasterTransformer library to convert it into a FasterTransformer-compatible format. These steps happen during container startup, so the model is loaded in memory before inference requests arrive. The FasterTransformer engine provides high-performance C++ and CUDA implementations for the models to run inference. This helps improve the container startup time and reduce inference latency. The following diagram illustrates the runtime architecture of serving models using FasterTransformer on SageMaker. For more information about DJLServing’s runtime architecture, refer to Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Use SageMaker LMI container images

To use a SageMaker LMI container to host a FLAN-T5 model, we have a no-code option or a bring-your-own-script option. We showcase the bring-your-own-script option in this post. The first step in the process is to use the right LMI container image. An example notebook is available in the GitHub repo.

Use the following code to retrieve the SageMaker LMI container image, replacing the Region with the specific Region you’re running the notebook in:

inference_image_uri = image_uris.retrieve(
    framework="djl-fastertransformer", region=sess.boto_session.region_name, version="0.21.0"
)

Download the model weights

An LMI container allows us to download the model weights from the Hugging Face Hub at run time when spinning up the instance for deployment. However, that takes longer because it’s dependent on the network and on the provider. The faster option is to download the model weights into Amazon S3 and then use the LMI container to download them to the container from Amazon S3. This is also a preferred method when we need to scale up our instances. In this post, we showcase how to download the weights to Amazon S3 and then use them when configuring the container. See the following code:

model_name = "google/flan-t5-xxl"
# Only download pytorch checkpoint files
allow_patterns = ["*.json", "*.pt", "*.bin", "*.txt", "*.model"]
# - Leverage the snapshot library to download the model since the model is stored in repository using LFS
model_download_path = snapshot_download(
    repo_id=model_name,
    cache_dir=local_model_path,
    allow_patterns=allow_patterns,
)

# define a variable to contain the s3url of the location that has the model
pretrained_model_location = f"s3://{model_bucket}/{s3_model_prefix}/"

model_artifact = sess.upload_data(path=model_download_path, key_prefix=s3_model_prefix)

Create the model configuration and inference script

First, we create a file called serving.properties that configures the container. This tells the DJL model server to use the FasterTransformer engine to load and shard the model weights. Second, we point to the S3 URI where the model weights were uploaded. The LMI container downloads the model artifacts from Amazon S3 using s5cmd. The file contains the following code:

engine=FasterTransformer
option.tensor_parallel_degree=4
option.s3url={{s3url}}

For the no-code option, the key change is to specify the entry point as the built-in handler. We specify the value as djl_python.fastertransformer. For more details, and for a complete example that illustrates the no-code option, refer to the GitHub repo. You can modify this code for your own use case as needed. The serving.properties file then looks like the following code:

engine=FasterTransformer
option.entryPoint=djl_python.fastertransformer
option.s3url={{s3url}}
option.tensor_parallel_degree=4

Next, we create our model.py file, which defines the code needed to load and then serve the model. The only mandatory method is handle(inputs). We continue to use the functional programming paradigm to build the other helpful methods like load_model(), pipeline_generate(), and more. In our code, we read the tensor_parallel_degree property value (the default value is 1). This sets the number of devices over which the tensor parallel modules are distributed. Second, we get the model weights, which are downloaded under the /tmp location on the container and referenced through the model_dir property. To load the model, we use the FasterTransformer init method as shown in the following code. Note that we load the full-precision weights in FP32. You can also quantize the model at runtime by setting dtype = "fp16" in the following code and setting tensor_parallel_degree = 2 in serving.properties. However, note that the FP16 version of this model may not provide similar output quality compared to the FP32 version. In addition, refer to an existing issue related to the impact on model quality on FasterTransformer for the T5 model for certain NLP tasks.

import fastertransformer as ft
from djl_python import Input, Output
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    T5Tokenizer,
    T5ForConditionalGeneration,
)
import os
import logging
import math
import torch


def load_model(properties):
    model_name = "google/flan-t5-xxl"
    tensor_parallel_degree = properties["tensor_parallel_degree"]
    pipeline_parallel_degree = 1
    model_location = properties["model_dir"]
    if "model_id" in properties:
        model_location = properties["model_id"]
    logging.info(f"Loading model in {model_location}")

    tokenizer = T5Tokenizer.from_pretrained(model_location)
    dtype = "fp32"
    model = ft.init_inference(
        model_location, tensor_parallel_degree, pipeline_parallel_degree, dtype
    )
    return model, tokenizer


model = None
tokenizer = None


def handle(inputs: Input):
    """
    inputs: Contains the configurations from serving.properties
    """
    global model, tokenizer

    if not model:
        model, tokenizer = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Model server makes an empty call to warmup the model on startup
        return None

    data = inputs.get_as_json()

    input_sentences = data["inputs"]
    params = data["parameters"]

    outputs = model.pipeline_generate(input_sentences, **params)
    result = {"outputs": outputs}

    return Output().add_as_json(result)

Create a SageMaker endpoint for inference

In this section, we go through the steps to create a SageMaker model and endpoint for inference.

Create a SageMaker model

We now create a SageMaker model. We use the Amazon Elastic Container Registry (Amazon ECR) image retrieved earlier and the model artifact from the previous step to create the SageMaker model. In the model setup, we configure tensor_parallel_degree to 4 in serving.properties, which means the model is partitioned across 4 GPUs. See the following code:

from sagemaker.utils import name_from_base
model_name = name_from_base(f"flan-xxl-fastertransformer")
print(model_name)
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={
        "Image": inference_image_uri, 
        "ModelDataUrl": s3_code_artifact
    },
)
model_arn = create_model_response["ModelArn"]
print(f"Created Model: {model_arn}")

Create a SageMaker endpoint for inference

You can use any instance with multiple GPUs for testing. In this demo, we use a g5.12xlarge instance. In the following code, note how we set ModelDataDownloadTimeoutInSeconds and ContainerStartupHealthCheckTimeoutInSeconds. We don’t set the VolumeSizeInGB parameter because this instance comes with local SSD storage. The VolumeSizeInGB parameter is applicable to GPU instances that support Amazon EBS volume attachment.

endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "variant1",
            "ModelName": model_name,
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # "VolumeSizeInGB": 200,
            "ModelDataDownloadTimeoutInSeconds": 600,
            "ContainerStartupHealthCheckTimeoutInSeconds": 600,
        },
    ],
)

Lastly, we create a SageMaker endpoint:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

Starting the endpoint might take a few minutes. If you run into an InsufficientInstanceCapacity error, you can retry after a short wait, or you can raise a request to AWS to increase the limit in your account.
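If you prefer to block in the notebook until the endpoint is ready instead of polling the console, you can use a boto3 waiter. The following is a minimal sketch, assuming sm_client and endpoint_name are defined as in the preceding cells:

# Wait until the endpoint transitions to InService (the waiter polls DescribeEndpoint)
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(f"Endpoint status: {status}")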

Invoke the model

This is a generative model, so we pass in text as a prompt, and the model completes the sentence and returns the results.

You can pass a batch of prompts as input to the model. You do this by setting inputs to the list of prompts; the model then returns a result for each prompt. The text generation can be configured using appropriate parameters.

import json
import boto3

# The prompt is sent in the "inputs" field, which matches what we extract in model.py
smr_client = boto3.client("sagemaker-runtime")

response_model = smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps({
        "batch_size": 1,
        "inputs": "Amazon.com is an awesome site",
        "parameters": {},
    }),
    ContentType="application/json",
)
response_model["Body"].read().decode("utf8")

Model parameters at inference time

The following code lists the set of default parameters that are used by the model. You can set these arguments to specific values of your choice when invoking the endpoint.

default_args = dict(
            inputs_embeds=None,
            beam_width=1,
            max_seq_len=200,
            top_k=1,
            top_p=0.0,
            beam_search_diversity_rate=0.0,
            temperature=1.0,
            len_penalty=0.0,
            repetition_penalty=1.0,
            presence_penalty=None,
            min_length=0,
            random_seed=0,
            is_return_output_log_probs=False,
            is_return_cum_log_probs=False,
            is_return_cross_attentions=False,
            bad_words_list=None,
            stop_words_list=None
        )

The following code has a sample invocation to the endpoint we deployed. We use the max_seq_len parameter to control the number of tokens that are generated and temperature to control the randomness of the generated text.

smr_client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(
        {
            "inputs": [
                "Title: \"University has a new facility coming up\"\nGiven the above title of an imaginary article, imagine the article.\n"
            ],
            "parameters": {"max_seq_len": 200, "temperature": 0.7},
            "padding": True,
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

Clean up

When you’re done testing the model, delete the endpoint to save costs if the endpoint is no longer required:

# - Delete the end point
sm_client.delete_endpoint(EndpointName=endpoint_name)
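The endpoint configuration and the SageMaker model also persist after the endpoint is deleted. The following sketch removes them as well, assuming the endpoint_config_name and model_name variables created earlier in this notebook:

# - Delete the endpoint configuration and the model
sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm_client.delete_model(ModelName=model_name)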

Performance tuning

If you intend to use this post and accompanying notebook with a different model, you may want to explore some of the tunable parameters that SageMaker, DeepSpeed, and the DJL offer. Iteratively experimenting with these parameters can have a material impact on the latency, throughput, and cost of your hosted large model. To learn more about tuning parameters such as number of workers, degree of tensor parallelism, job queue size, and others, refer to DJLServing configurations and Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.

Benchmarking results on hosting FLAN-T5 model on SageMaker

The following table summarizes our benchmarking results.

| Model | Model Partitioning and Optimization Engine | Quantization | Batch Size | Tensor Parallel Degree | Number of Workers | Inference Latency P50 (ms) | Inference Latency P90 (ms) | Inference Latency P99 (ms) | Data Quality |
|---|---|---|---|---|---|---|---|---|---|
| flan-t5-xxl | FasterTransformer | FP32 | 4 | 4 | 1 | 327.39 | 331.01 | 612.73 | Normal |

For our benchmark, we used four different types of tasks combined into a single batch and benchmarked the Flan-T5-XXL model. FasterTransformer uses a tensor parallel degree of 4 (the model is partitioned across four accelerator devices on the same host). In our benchmark, FasterTransformer was the most performant in terms of latency and throughput compared to other frameworks for hosting this model. The p99 inference latency was 612 milliseconds.
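If you want a rough latency measurement against your own endpoint, the following is a minimal sketch (not the benchmark harness used for the preceding table), assuming smr_client and endpoint_name from the earlier cells; it sends repeated single requests and computes percentiles with NumPy:

import json
import time
import numpy as np

payload = json.dumps({
    "inputs": "Summarize: FasterTransformer accelerates transformer inference.",
    "parameters": {"max_seq_len": 50},
})

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    smr_client.invoke_endpoint(
        EndpointName=endpoint_name, Body=payload, ContentType="application/json"
    )["Body"].read()
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p90, p99 = np.percentile(latencies_ms, [50, 90, 99])
print(f"p50={p50:.2f} ms, p90={p90:.2f} ms, p99={p99:.2f} ms")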

Conclusion

In this post, we gave an overview of large model hosting challenges, and how SageMaker LMI containers help you address these challenges using its low-code/no-code capabilities. We showcased how to host large models using FasterTransformer with high performance on SageMaker using the SageMaker LMI container. We demonstrated this new capability in an example of deploying a FLAN-T5-XXL model on SageMaker. We also covered options available to tune the performance of your models using different model optimization approaches and how SageMaker LMI containers offer low-code/no-code options to you in hosting and optimizing the large models.


About the authors

Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing, and Artificial Intelligence. He focuses on Deep learning including NLP and Computer Vision domains. He helps customers achieve high performance model inference on SageMaker.

Rohith Nallamaddi is a Software Development Engineer at AWS. He works on optimizing deep learning workloads on GPUs, building high performance ML inference and serving solutions. Prior to this, he worked on building microservices based on AWS for Amazon F3 business. Outside of work he enjoys playing and watching sports.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads deep learning model optimization for applications such as large model inference.

Rupinder Grewal is a Sr. AI/ML Specialist Solutions Architect with AWS. He currently focuses on model serving and MLOps on SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Pinak Panigrahi works with customers to build machine learning driven solutions to solve strategic business problems on AWS. When not occupied with machine learning, he can be found taking a hike, reading a book or catching up with sports.

Qing Lan is a Software Development Engineer in AWS. He has been working on several challenging products in Amazon, including high performance ML inference solutions and high performance logging system. Qing’s team successfully launched the first Billion-parameter model in Amazon Advertising with very low latency required. Qing has in-depth knowledge on the infrastructure optimization and Deep Learning acceleration.

Read More

Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy

Authoring custom transformations in Amazon SageMaker Data Wrangler using NLTK and SciPy

“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” – Andrew Ng

A data-centric AI approach involves building AI systems with quality data involving data preparation and feature engineering. This can be a tedious task involving data collection, discovery, profiling, cleansing, structuring, transforming, enriching, validating, and securely storing the data.

Amazon SageMaker Data Wrangler is a service in Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data using little to no coding. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify data preprocessing and feature engineering, taking data preparation to production faster without the need to author PySpark code, install Apache Spark, or spin up clusters.

For scenarios where you need to add your own custom scripts for data transformations, you can write your transformation logic in Pandas, PySpark, or PySpark SQL. Data Wrangler now supports the NLTK and SciPy libraries for authoring custom transformations to prepare text data for ML and perform constraint optimization.

In this post, we discuss how you can write your own custom transformation in NLTK to prepare text data for ML. We also share some example custom code transforms using other common frameworks such as NLTK, NumPy, SciPy, and scikit-learn, as well as AWS AI services. For the purpose of this exercise, we use the Titanic dataset, a popular dataset in the ML community, which has now been added as a sample dataset within Data Wrangler.

Solution overview

Data Wrangler provides over 40 built-in connectors for importing data. After data is imported, you can build your data analysis and transformations using over 300 built-in transformations. You can then generate industrialized pipelines to push the features to Amazon Simple Storage Service (Amazon S3) or Amazon SageMaker Feature Store. The following diagram shows the end-to-end high-level architecture.

Prerequisites

Data Wrangler is a SageMaker feature available within Amazon SageMaker Studio. You can follow the Studio onboarding process to spin up the Studio environment and notebooks. Although you can choose from a few authentication methods, the simplest way to create a Studio domain is to follow the Quick start instructions. The Quick start uses the same default settings as the standard Studio setup. You can also choose to onboard using AWS IAM Identity Center (successor to AWS Single Sign-On) for authentication (see Onboard to Amazon SageMaker Domain Using IAM Identity Center).

Import the Titanic dataset

Start your Studio environment and create a new Data Wrangler flow. You can either import your own dataset or use a sample dataset (Titanic) as shown in the following screenshot. Data Wrangler allows you to import datasets from different data sources. For our use case, we import the sample dataset from an S3 bucket.

Once imported, you will see two nodes (the source node and the data type node) in the data flow. Data Wrangler automatically identifies the data type for all the columns in the dataset.

Custom transformations with NLTK

For data preparation and feature engineering with Data Wrangler, you can use over 300 built-in transformations or build your own custom transformations. Custom transforms can be written as separate steps within Data Wrangler. They become part of the .flow file within Data Wrangler. The custom transform feature supports Python, PySpark, and SQL as different steps in code snippets. After notebook files (.ipynb) are generated from the .flow file or the .flow file is used as recipes, the custom transform code snippets persist without requiring any changes. This design of Data Wrangler allows custom transforms to become part of a SageMaker Processing job for processing massive datasets with custom transformations.

The Titanic dataset has a couple of features (name and home.dest) that contain text information. We use NLTK to split the name column, extract the last name, and print the frequency of last names. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength natural language processing (NLP) libraries.

To add a new transform, complete the following steps:

  1. Choose the plus sign and choose Add Transform.
  2. Choose Add Step and choose Custom Transform.

You can create a custom transform using Pandas, PySpark, Python user-defined functions, and SQL PySpark.

  1. Choose Python (Pandas) and add the following code to extract the last name from the name column:
    import nltk
    nltk.download('punkt')
    tokens = [nltk.word_tokenize(name) for name in df['name']]
    
    # The first token of each tokenized name is the passenger's last name
    df['last_name'] = [token[0] for token in tokens]

  2. Choose Preview to review the results.

The following screenshot shows the last_name column extracted.

  1. Add another custom transform step to identify the frequency distribution of the last names, using the following code:
    import nltk
    fd = nltk.FreqDist(df["last_name"])
    print(fd.most_common(10))

  2. Choose Preview to review the results of the frequency.

Custom transformations with AWS AI services

AWS pre-trained AI services provide ready-made intelligence for your applications and workflows. AWS AI services easily integrate with your applications to address many common use cases. You can now use the capabilities for AWS AI services as a custom transform step in Data Wrangler.

Amazon Comprehend uses NLP to extract insights about the content of documents. It develops insights by recognizing the entities, key phrases, language, sentiments, and other common elements in a document.

We use Amazon Comprehend to extract the entities from the name column. Complete the following steps:

  1. Add a custom transform step.
  2. Choose Python (Pandas).
  3. Enter the following code to extract the entities:
    import boto3
    comprehend = boto3.client("comprehend")
    
    response = comprehend.detect_entities(LanguageCode = 'en', Text = df['name'].iloc[0])
    
    for entity in response['Entities']:
        print(entity['Type'] + ":" + entity["Text"])

  4. Choose Preview and visualize the results.

We have now added three custom transforms in Data Wrangler.

  1. Choose Data Flow to visualize the end-to-end data flow.

Custom transformations with NumPy and SciPy

NumPy is an open-source library for Python offering comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more. SciPy is an open-source Python library used for scientific computing and technical computing, containing modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transform (FFT), signal and image processing, solvers, and more.

Data Wrangler custom transforms allow you to combine Python, PySpark, and SQL as different steps. In the following Data Wrangler flow, different functions from Python packages, NumPy, and SciPy are applied on the Titanic dataset as multiple steps.

NumPy transformations

The fare column of the Titanic dataset contains the boarding fares of different passengers. The histogram of the fare column shows a uniform distribution, except for the last bin. By applying NumPy transformations like log or square root, we can change the distribution (as shown by the square root transformation).

See the following code:

import pandas as pd
import numpy as np
df["fare_log"] = np.log(df["fare_interpolate"])
df["fare_sqrt"] = np.sqrt(df["fare_interpolate"])
df["fare_cbrt"] = np.cbrt(df["fare_interpolate"])

SciPy transformations

SciPy functions like z-score are applied as part of the custom transform to standardize fare distribution with mean and standard deviation.

See the following code:

df["fare_zscore"] = zscore(df["fare_interpolate"])
from scipy.stats import zscore

Constraint optimization with NumPy and SciPy

Data Wrangler custom transforms can handle advanced transformations like constraint optimization applying SciPy optimize functions and combining SciPy with NumPy. In the following example, fare as a function of age doesn’t show any observable trend. However, constraint optimization can transform fare as a function of age. The constraint condition in this case is that the new total fare remains the same as the old total fare. Data Wrangler custom transforms allow you to run the SciPy optimize function to determine the optimal coefficient that can transform fare as a function of age under constraint conditions.

The optimization definition, objective function, and multiple constraints can be written as separate functions when formulating constraint optimization in a Data Wrangler custom transform using SciPy and NumPy. Custom transforms can also use the different solver methods available as part of the SciPy optimize package. A new transformed variable can be generated by multiplying the optimal coefficient with the original column and adding it to the existing columns of Data Wrangler. See the following code:

import numpy as np
import scipy.optimize as opt
import pandas as pd

df2 = pd.DataFrame({"Y": df["fare_interpolate"], "X1": df["age_interpolate"]})

# optimization definition
def main(df2):
    x0 = [0.1]
    res = opt.minimize(fun=obj, x0=x0, args=(df2,), method="SLSQP", bounds=[(0, 50)], constraints=cons)
    return res

# objective function: minimize the sum of residuals between fare and coefficient * age
def obj(x0, df2):
    sumSquares = np.sum(df2["Y"] - x0 * df2["X1"])
    return sumSquares

# constraint: the new total fare must equal the old total fare
def constraint1(x0):
    sum_cons1 = np.sum(df2["Y"] - x0 * df2["X1"]) - 0
    return sum_cons1

con1 = {'type': 'eq', 'fun': constraint1}
cons = [con1]

print(main(df2))

df["new_fare_age_optimized"] = main(df2).x * df2["X1"]

The Data Wrangler custom transform feature has the UI capability to show the results of SciPy optimize functions like value of optimal coefficient (or multiple coefficients).

Custom transformations with scikit-learn

scikit-learn is a Python module for machine learning built on top of SciPy. It’s an open-source ML library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.

Discretization

Discretization (otherwise known as quantization or binning) provides a way to partition continuous features into discrete values. Certain datasets with continuous features may benefit from discretization, because discretization can transform the dataset of continuous attributes to one with only nominal attributes. One-hot encoded discretized features can make a model more expressive, while maintaining interpretability. For instance, preprocessing with a discretizer can introduce nonlinearity to linear models.

In the following code, we use KBinsDiscretizer to discretize the age column into 10 bins:

# Table is available as variable `df`
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np
# discretization transform the raw data
df = df.dropna()
kbins = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
ages = np.array(df["age"]).reshape(-1, 1)
df["age"] = kbins.fit_transform(ages)
print(kbins.bin_edges_)

You can see the bin edges printed in the following screenshot.

One-hot encoding

Values in the Embarked column are categorical. Therefore, we have to represent these strings as numerical values in order to perform classification with our model. The following example uses scikit-learn’s LabelEncoder to do that; a true one-hot encoding transform is another option (see the sketch after the following code).

There are three values for Embarked: S, C, and Q. We represent these with numbers. See the following code:

# Table is available as variable `df`
from sklearn.preprocessing import LabelEncoder

le_embarked = LabelEncoder()
le_embarked.fit(df["embarked"])

encoded_embarked_training = le_embarked.transform(df["embarked"])
df["embarked"] = encoded_embarked_training

Clean up

When you’re not using Data Wrangler, it’s important to shut down the instance on which it runs to avoid incurring additional fees.

Data Wrangler automatically saves your data flow every 60 seconds. To avoid losing work, save your data flow before shutting Data Wrangler down.

  1. To save your data flow in Studio, choose File, then choose Save Data Wrangler Flow.
  2. To shut down the Data Wrangler instance, in Studio, choose Running Instances and Kernels.
  3. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  4. Choose Shut down all to confirm.

Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from RUNNING INSTANCES when you shut down the Data Wrangler app.

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes.

Conclusion

In this post, we demonstrated how you can use custom transformations in Data Wrangler. We used the libraries and frameworks within the Data Wrangler container to extend the built-in data transformation capabilities. The examples in this post represent a subset of the frameworks that can be used. The transformations in the Data Wrangler flow can now be scaled into a pipeline for DataOps.

To learn more about using data flows with Data Wrangler, refer to Create and Use a Data Wrangler Flow and Amazon SageMaker Pricing. To get started with Data Wrangler, see Prepare ML Data with Amazon SageMaker Data Wrangler. To learn more about Autopilot and AutoML on SageMaker, visit Automate model development with Amazon SageMaker Autopilot.


About the authors

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

 Sovik Kumar Nath is an AI/ML solution architect with AWS. He has extensive experience in end-to-end designs and solutions for machine learning; business analytics within financial, operational, and marketing analytics; healthcare; supply chain; and IoT. Outside work, Sovik enjoys traveling and watching movies.

Abigail is a Software Development Engineer at Amazon SageMaker. She is passionate about helping customers prepare their data in Data Wrangler and building distributed machine learning systems. In her free time, Abigail enjoys traveling, hiking, skiing, and baking.

Read More

Overcome the machine learning cold start challenge in fraud detection using Amazon Fraud Detector

Overcome the machine learning cold start challenge in fraud detection using Amazon Fraud Detector

As more businesses increase their online presence to serve their customers better, new fraud patterns are constantly emerging. In today’s ever-evolving digital landscape, where fraudsters are becoming more sophisticated in their tactics, detecting and preventing such fraudulent activities has become paramount for companies and financial institutions.

Traditional rule-based fraud detection systems are capped in their ability to quickly iterate as they rely on predefined rules and thresholds to flag potentially fraudulent activity. These systems can generate a large number of false positives, significantly increasing the volume of manual investigations performed by the fraud team. Furthermore, humans are also error-prone and have limited capacity to process large amounts of data, making manual efforts to detect fraud time-consuming, which can result in missed fraudulent transactions, increased losses, and reputational damage.

Machine learning (ML) plays a crucial role in detecting fraud because it can quickly and accurately analyze large volumes of data to identify anomalous patterns and possible fraud trends. ML fraud model performance relies heavily on the quality of data it is trained on, and, specifically for the supervised models, accurate labeled data is crucial. In ML, a lack of significant historical data to train a model is called the cold start problem.

In the world of fraud detection, the following are some traditional cold start scenarios:

  • Building an accurate fraud model while lacking a history of transactions or fraud cases
  • Being able to accurately distinguish legitimate activity from fraud for new customers and accounts
  • Risk-decisioning payments to an address or beneficiary never seen before by the fraud system

There are multiple ways to solve for these scenarios. For example, you can use generic models, known as one-size-fits-all models, which are typically trained on top of fraud data sharing platforms like fraud consortiums. The challenge with this approach is that no business is equal, and fraud attack vectors change constantly.

Another option is to use an unsupervised anomaly detection model to monitor and surface unusual behavior among customer events. The challenge with this approach is that not all fraud events are anomalies, and not all anomalies are indeed fraud. Therefore, you can expect higher false positive rates.

In this post, we show how you can quickly bootstrap a real-time fraud prevention ML model with as few as 100 events using the new Amazon Fraud Detector Cold Start feature, dramatically lowering the barrier of entry to custom ML models for many organizations that simply don’t have the time or ability to collect and accurately label large datasets. Moreover, we discuss how, by using Amazon Fraud Detector stored events, you can review results and correctly label the events to retrain your models, thereby improving the effectiveness of fraud prevention measures over time.

Solution overview

Amazon Fraud Detector is a fully managed fraud detection service that automates detecting potentially fraudulent activities online. You can use Amazon Fraud Detector to build customized fraud detection models using your own historical dataset, add decision logic using the built-in rules engine, and orchestrate risk decision workflows with a click of a button.

Previously, you had to provide over 10,000 labeled events with at least 400 examples of fraud to train a model. With the release of the Cold Start feature, you can quickly train a model with a minimum of 100 events and at least 50 classified as fraud. Compared with the initial data requirements, this is a 99% reduction in historical data and an 87% reduction in label requirements.

The new Cold Start feature provides intelligent methods for enriching, extending, and risk modeling small sets of data. Moreover, Amazon Fraud Detector performs label assignments and sampling for unlabeled events.

Experiments performed with public datasets show that, by lowering the limits to 50 fraud and only 100 events, you can build fraud ML models that consistently outperform unsupervised and semi-supervised models.

Cold Start model performance

The ability of an ML model to generalize and make accurate predictions on unseen data is impacted by the quality and diversity of the training dataset. For Cold Start models, this is no different. You should have processes in place as more data is collected to correctly label these events and retrain the models, ultimately leading to an optimal model performance.

With a lower data requirement, the instability of reported performance increases due to the increased variance of the model and the limited test data size. To help you build the right expectation of model performance, besides model AUC, Amazon Fraud Detector also reports uncertainty range metrics. The following table defines these metrics.

| AUC uncertainty interval | AUC < 0.6 | AUC 0.6 – 0.8 | AUC >= 0.8 |
|---|---|---|---|
| > 0.3 | The model performance is very low and might vary greatly. Expect low fraud detection performance. | The model performance is low and might vary greatly. Expect limited fraud detection performance. | The model performance might vary greatly. |
| 0.1 – 0.3 | The model performance is very low and might vary significantly. Expect low fraud detection performance. | The model performance is low and might vary significantly. Expect limited fraud detection performance. | The model performance might vary significantly. |
| < 0.1 | The model performance is very low. Expect low fraud detection performance. | The model performance is low. Expect limited fraud detection performance. | No warning. |

Train a Cold Start model

Training a Cold Start fraud model is identical to training any other Amazon Fraud Detector model; what differs is the dataset size. You can find sample datasets for Cold Start training in our GitHub repo. To train an Amazon Fraud Detector custom model, you can follow our hands-on tutorial. You can either use the Amazon Fraud Detector console tutorial or the SDK tutorial to build, train, and deploy a fraud detection model.

After your model is trained, you can review performance metrics and then deploy it by changing its status to Active. To learn more about model scores and performance metrics, see Model scores and Model performance metrics. At this point, you can now add your model to your detector, add business rules to interpret the risk scores that the model outputs, and make real-time predictions using the GetEventPrediction API.

Fraud ML model continuous improvement and feedback loop

With the Amazon Fraud Detector Cold Start feature, you can quickly bootstrap a fraud detector endpoint and start protecting your businesses immediately. However, new fraud patterns are constantly emerging, so it’s critical to retrain Cold Start models with newer data to improve the accuracy and effectiveness of the predictions over time.

To help you iterate on your models, Amazon Fraud Detector automatically stores all events sent to the service for inference. You can change or validate the event ingestion flag is on at the event type level, as shown in the following screenshot.

With the stored events feature, you can use the Amazon Fraud Detector SDK to programmatically access an event, review the event metadata and the prediction explanation, and make an informed risk decision. Moreover, you can label the event for future model retraining and continuous model improvement. The following diagram shows an example of this workflow.

In the following code snippets, we demonstrate the process to label a stored event:

  • To do a real-time fraud prediction on an event, call the GetEventPrediction API:
import boto3

def get_event_prediction():
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction(
        detectorId='your_detector_name',
        detectorVersionId='1',
        eventId='my-event-id-1234',
        eventTypeName='your_event_type',
        entities=[
            {
                'entityType': 'user',
                'entityId': 'A12345'
            },
        ],
        eventTimestamp= '2023-03-23T21:42:03.658Z',
        eventVariables={
            'email': 'test@anymockcompany.com',
            'ip': '123.123.123.123',
            'card_bin': '400022',
            'billing_zip': '50401'
        }
    )
    return(prediction)

API Response:
{
  "modelScores": [
    {
      "modelVersion": {
        "modelId": "your_model_name",
        "modelType": "TRANSACTION_FRAUD_INSIGHTS",
        "modelVersionNumber": "1.0"
      },
      "scores": {
        "your_model_insightscore": 932
      }
    }
  ],
  "ruleResults": [
    {
      "ruleId": "high_risk_score",
      "outcomes": [
        "high_risk_send_for_manual_review"
      ]
    }
  ]

As seen in the response, based on the decision engine rule matched, the event should be sent for manual review by the fraud team. By gathering the prediction explanation metadata, you can gain insights into how each event variable impacted the model’s fraud prediction score.

  • To collect these insights, we use the get_event_prediction_metadata API:
import boto3

def get_event_prediction_metadata(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.get_event_prediction_metadata(
        eventId = 'my-event-id-1234',
        eventTypeName = 'your_event_type',
        predictionTimestamp = '2023-03-23T21:44:39.318Z',
        detectorId = 'your_detector_name',
        detectorVersionId = '1'
    )
    return(prediction)

API Response:

{
  "eventId": "my-event-id-1234",
  …
  <REDACTED>
  …
  "eventVariables": [
    {
      "name": "ip",
      "value": "123.123.123.123"
    },
    {
      "name": "billing_zip",
      "value": "50401"
    },
    {
      "name": "email",
      "value": "test@anymockcompany.com"
    },
    {
      "name": "card_bin",
      "value": "400022"
    }
  ],
…
 <REDACTED>
…
   "evaluations": [
        {
          "evaluationScore": "932.0",
          "predictionExplanations": {
            "variableImpactExplanations": [
              {
                "eventVariableName": "billing_zip",
                "relativeImpact": "1",
                "logOddsImpact": 1.018196990713477135
              },
              {
                "eventVariableName": "ip",
                "relativeImpact": "0",
                "logOddsImpact": -0.23122438788414001
              },
              {
                "eventVariableName": "email",
                "relativeImpact": "0",
                "logOddsImpact": 0.004304269328713417
              },
              {
                "eventVariableName": "card_bin",
                "relativeImpact": "0",
                "logOddsImpact": -0.011150157079100609
              } 
            ]
          }
        }
      ]
}

With these insights, the fraud analyst can make an informed risk decision about the event in question and update the event label.
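For example, a short sketch like the following (assuming the get_event_prediction_metadata response structure shown above) ranks the event variables by their absolute log-odds impact so the analyst sees the strongest drivers of the score first:

# `metadata` is the response returned by get_event_prediction_metadata(...)
metadata = get_event_prediction_metadata(None, None)

explanations = metadata['evaluations'][0]['predictionExplanations']['variableImpactExplanations']
ranked = sorted(explanations, key=lambda e: abs(float(e['logOddsImpact'])), reverse=True)
for e in ranked:
    print(f"{e['eventVariableName']}: logOddsImpact={e['logOddsImpact']}, relativeImpact={e['relativeImpact']}")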

  • To update the event label, call the update_event_label API:
import boto3

def update_event_label(event, context):
    fraudDetector = boto3.client('frauddetector')
    
    prediction = fraudDetector.update_event_label(
        eventId = "my-event-id-1234",
        eventTypeName = "your_event_type",
        assignedLabel='1', # Fraud
        labelTimestamp='2023-03-25T11:20:03.658Z'
    )
    
    return(prediction)

API Response

{
  "ResponseMetadata": {
    "RequestId": "3e28caa0-2a06-4b8d-9a10-9081811bf22d",
    "HTTPStatusCode": 200,
    …
     <REDACTED>
    …

    "RetryAttempts": 0
  }
}

As a final step, you can verify if the event label was correctly updated.

  • To verify the event label, call the get_event API:
import boto3

def get_event():
    fraudDetector = boto3.client('frauddetector')
    
    event = fraudDetector.get_event(
        eventId='my-event-id-1234',
        eventTypeName='your_event_type'
    )
    
    return(event)

API Response

{
  "event": {
    "eventId": "my-event-id-1234",
    "eventTimestamp": "2023-03-23T21:42:03.658Z",
    "eventVariables": {
      "billing_zip": "50401",
      "card_bin": "400022",
      "email": "test@anymockcompany.com",
      "ip": "123.123.123.123"
    },
    "currentLabel": "1",
    "labelTimestamp": "2023-03-25T11:20:03.658Z",
    "entities": [
      {
        "entityType": "user",
        "entityId": "A12345"
      }
    ]
  }
}

Clean up

To avoid incurring future charges, delete the resources created for the solution.

Conclusion

This post demonstrated how you can quickly bootstrap a real-time fraud prevention system with as few as 100 events using the new Amazon Fraud Detector Cold Start feature. We discussed how you can use stored events to review results, correctly label the events, and retrain your models, improving the effectiveness of fraud prevention measures over time.

Fully managed AWS services such as Amazon Fraud Detector help reduce the time businesses spend analyzing user behavior to identify fraud in their platforms and focus more on driving business value. To learn more about how Amazon Fraud Detector can help your business, visit Amazon Fraud Detector.


About the Authors

Marcel Pividal is a Global Sr. AI Services Solutions Architect in the World-Wide Specialist Organization. Marcel has more than 20 years of experience solving business problems through technology for FinTechs, payment providers, pharma, and government agencies. His current areas of focus are risk management, fraud prevention, and identity verification.

Julia Xu is a Research Scientist with Amazon Fraud Detector. She is passionate about solving customer challenges using machine learning techniques. In her free time, she enjoys hiking, painting, and exploring new coffee shops.

Guilherme Ricci is a Senior Solution Architect at AWS, helping Startups to modernize and optimize the costs of their applications. With over 10 years of experience with companies in the financial sector, he is currently working together with the team of AI/ML specialists.

Read More

Connect Amazon EMR and RStudio on Amazon SageMaker

Connect Amazon EMR and RStudio on Amazon SageMaker

RStudio on Amazon SageMaker is the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

In conjunction with tools like RStudio on SageMaker, users are analyzing, transforming, and preparing large amounts of data as part of the data science and ML workflow. Data scientists and data engineers use Apache Spark, Hive, and Presto running on Amazon EMR for large-scale data processing. Using RStudio on SageMaker and Amazon EMR together, you can continue to use the RStudio IDE for analysis and development, while using Amazon EMR managed clusters for larger data processing.

In this post, we demonstrate how you can connect your RStudio on SageMaker domain with an EMR cluster.

Solution overview

We use an Apache Livy connection to submit a sparklyr job from RStudio on SageMaker to an EMR cluster. This is demonstrated in the following diagram.

All code demonstrated in the post is available in our GitHub repository. We implement the following solution architecture.

Prerequisites

Prior to deploying any resources, make sure you have all the requirements for setting up and using RStudio on SageMaker and Amazon EMR.

We’ll also build a custom RStudio on SageMaker image, so ensure you have Docker running and all required permissions. For more information, refer to Use a custom image to bring your own development environment to RStudio on Amazon SageMaker.

Create resources with AWS CloudFormation

We use an AWS CloudFormation stack to generate the required infrastructure.

If you already have an RStudio domain and an existing EMR cluster, you can skip this step and start building your custom RStudio on SageMaker image. Substitute the information of your EMR cluster and RStudio domain in place of the EMR cluster and RStudio domain created in this section.

Launching this stack creates the following resources:

  • Two private subnets
  • EMR Spark cluster
  • AWS Glue database and tables
  • SageMaker domain with RStudio
  • SageMaker RStudio user profile
  • IAM service role for the SageMaker RStudio domain
  • IAM service role for the SageMaker RStudio user profile

Complete the following steps to create your resources:

Choose Launch Stack to create the stack.

  1. On the Create stack page, choose Next.
  2. On the Specify stack details page, provide a name for your stack and leave the remaining options as default, then choose Next.
  3. On the Configure stack options page, leave the options as default and choose Next.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names and I acknowledge that AWS CloudFormation might require the following capability: CAPABILITY_AUTO_EXPAND.
  5. Choose Create stack.

The template generates five stacks.

To see the EMR Spark cluster that was created, navigate to the Amazon EMR console. You will see a cluster created for you called sagemaker. This is the cluster we connect to through RStudio on SageMaker.

Build the custom RStudio on SageMaker image

We have created a custom image that will install all the dependencies of sparklyr, and will establish a connection to the EMR cluster we created.

If you’re using your own EMR cluster and RStudio domain, modify the scripts accordingly.

Make sure Docker is running. Start by changing into our project repository and building the image:

cd sagemaker-rstudio-emr/sparklyr-image
./build-r-image.sh

Next, we attach the Docker image we just built to our RStudio on SageMaker domain.

  1. On the SageMaker console, choose Domains in the navigation pane.
  2. Choose the domain rstudio-domain.
  3. On the Environment tab, choose Attach image.

    Now we attach the sparklyr image that we created earlier to the domain.
  4. For Choose image source, select Existing image.
  5. Select the sparklyr image we built.
  6. For Image properties, leave the options as default.
  7. For Image type, select RStudio image.
  8. Choose Submit.

    Validate the image has been added to the domain. It may take a few minutes for the image to attach fully.
  9. When it’s available, log in to the RStudio on SageMaker console using the rstudio-user profile that was created.
  10. From here, create a session with the sparklyr image that we created earlier.

    First, we have to connect to our EMR cluster.
  11. In the connections pane, choose New Connection.
  12. Select the EMR cluster connect code snippet and choose Connect to Amazon EMR Cluster.

    After the connect code has run, you will see a Spark connection through Livy, but no tables.
  13. Change the database to credit_card:
    tbl_change_db(sc, "credit_card")
  14. Choose Refresh Connection Data.
    You can now see the tables.
  15. Now navigate to the rstudio-sparklyr-code-walkthrough.md file.

This has a set of Spark transformations we can use on our credit card dataset to prepare it for modeling. The following code is an excerpt:

Let’s count() how many transactions are in the transactions table. But first we need to cache the tables, using the tbl() function.

users_tbl &amp;lt;- tbl(sc, "users")
cards_tbl &amp;lt;- tbl(sc, "cards")
transactions_tbl &amp;lt;- tbl(sc, "transactions")

Let’s run a count of the number of rows for each table.

count(users_tbl)
count(cards_tbl)
count(transactions_tbl)

Now let’s register our tables as Spark DataFrames and pull them into the cluster-wide in-memory cache for better performance. We also filter out the header row that gets placed in the first row of each table.

users_tbl &lt;- tbl(sc, 'users') %&gt;%
  filter(gender != 'Gender')
sdf_register(users_tbl, "users_spark")
tbl_cache(sc, 'users_spark')
users_sdf &lt;- tbl(sc, 'users_spark')

cards_tbl &lt;- tbl(sc, 'cards') %&gt;%
  filter(expire_date != 'Expires')
sdf_register(cards_tbl, "cards_spark")
tbl_cache(sc, 'cards_spark')
cards_sdf &lt;- tbl(sc, 'cards_spark')

transactions_tbl &lt;- tbl(sc, 'transactions') %&gt;%
  filter(amount != 'Amount')
sdf_register(transactions_tbl, "transactions_spark")
tbl_cache(sc, 'transactions_spark')
transactions_sdf &lt;- tbl(sc, 'transactions_spark')

To see the full list of commands, refer to the rstudio-sparklyr-code-walkthrough.md file.

Clean up

To avoid incurring recurring costs, delete the root CloudFormation template. Also delete all Amazon Elastic File System (Amazon EFS) mounts created and any Amazon Simple Storage Service (Amazon S3) buckets and objects created.

Conclusion

The integration of RStudio on SageMaker with Amazon EMR provides a powerful solution for data analysis and modeling tasks in the cloud. By connecting RStudio on SageMaker and establishing a Livy connection to Spark on EMR, you can take advantage of the computing resources of both platforms for efficient processing of large datasets. RStudio, one of the most widely used IDEs for data analysis, allows you to take advantage of the fully managed infrastructure, access control, networking, and security capabilities of SageMaker. Meanwhile, the Livy connection to Spark on Amazon EMR provides a way to perform distributed processing and scaling of data processing tasks.

If you’re interested in learning more about using these tools together, this post serves as a starting point. For more information, refer to RStudio on Amazon SageMaker. If you have any suggestions or feature improvements, please create a pull request on our GitHub repo or leave a comment on this post!


About the Authors

Ryan Garner is a Data Scientist with AWS Professional Services. He is passionate about helping AWS customers use R to solve their Data Science and Machine Learning problems.


Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM), and Machine Learning infrastructure and operations projects (MLOps).

Saiteja Pudi is a Solutions Architect at AWS, based in Dallas, TX. He has been with AWS for more than 3 years, helping customers derive the true potential of AWS by being their trusted advisor. He comes from an application development background and is interested in Data Science and Machine Learning.

Read More