Forecasting AWS spend using the AWS Cost and Usage Reports, AWS Glue DataBrew, and Amazon Forecast

AWS Cost Explorer enables you to view and analyze your AWS Cost and Usage Reports (AWS CUR). You can also predict your overall future cost for AWS services by creating a forecast in AWS Cost Explorer, but you can’t view historical data beyond 12 months. Moreover, running custom machine learning (ML) models on historical data can be labor and knowledge intensive, often requiring a programming language for data transformation and model building.

In this post, we show you how to use Amazon Forecast, a fully managed service that uses ML to deliver highly accurate forecasts, with data collected from AWS CUR. AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and ML. We can use DataBrew to transform CUR data into the appropriate dataset format, which Forecast can later ingest to train a predictor and create a forecast. We can transform the data into the required format and predict the cost for a given service or member account ID without writing a single line of code.

The following is the architecture diagram of the solution.

Walkthrough

You can choose hourly, daily, or monthly reports that break out costs by product or resource (including self-defined tags) using the AWS CUR. AWS delivers the CUR to an Amazon Simple Storage Service (Amazon S3) bucket, where this data is securely retained and accessible. Because cost and usage data are timestamped, you can easily deliver the data to Forecast. In this post, we use CUR data that is collected on a daily basis. After you set up the CUR, you can use DataBrew to transform the CUR data into the appropriate format for Forecast to train a predictor and create a forecast output without writing a single line of code. In this post, we walk you through the following high-level tasks:

  1. Prepare the data.
  2. Transform the data.
  3. Prepare the model.
  4. Validate the predictors.
  5. Generate a forecast.

Prerequisites

Before we begin, let’s create an S3 bucket to store the results from the DataBrew job. Remember the bucket name, because we refer to this bucket during deployment. With DataBrew, you can determine the transformations and then schedule them to run automatically on new daily, weekly, or monthly data as it comes in without having to repeat the data preparation manually.

Preparing the data

To prepare your data, complete the following steps:

  1. On the DataBrew console, create a new project.
  2. For Select a dataset, select New dataset.

When selecting CUR data, you can select a single object, or the contents of an entire folder.

If you don’t have substantial or interesting usage in your report, you can use a sample file available on Verify Your CUR Files Are Being Delivered. Make sure you follow the folder structure when uploading the Parquet files, and make sure the folder only contains the Parquet files needed. If the folder has other random files, it errors out.

  1. Create a role in AWS Identity and Access Management (IAM) that allows DataBrew to access CUR files.

You can either create a custom role or have DataBrew create one on your behalf.

  1. Choose Create project.

DataBrew takes you to the project screen to view and analyze your data.

First, we need to select only those columns required for Forecast. We can do this by grouping the columns by our desired dimensions and creating a summed column of unblended costs.

  1. Choose the Group icon on the navigation bar and group by the following columns:
    1. line_item_usage_start_date, GROUP BY
    2. product_product_name, GROUP BY
    3. line_item_usage_account_id, GROUP BY
    4. line_item_unblended_cost, SUM
  1. For Group type, select Group as new table to replace all existing columns with new columns.

This extracts only the required columns for our forecast.

  1. Choose Finish.

In DataBrew, a recipe is a set of data transformation steps. As you progress, DataBrew documents your data transformation steps. You can save and use these recipes in the future for new datasets and transformation iterations.

  1. Choose Add step.
  2. For Create column options, choose Based on functions.
  3. For Select a function, choose DATEFORMAT.
  4. For Values using, choose Source column.
  5. For Source column, choose line_item_usage_start_date.
  6. For Date format, choose yyyy-mm-dd.

  1. Add a destination column of your choice.
  2. Choose Apply.

We can delete the original timestamp column because it’s a duplicate.

  1. Choose Add step.
  2. Delete the original line_item_usage_start_date column.
  3. Choose Apply.

Finally, let’s change our summed cost column to a numeric data type.

  1. Choose the Setting icon for the line_item_unblended_cost_sum column.
  2. For Change type to, choose # number.
  3. Choose Apply.

DataBrew documented all four steps of this recipe. You can version recipes as your analytical needs change.

  1. Choose Publish to publish an initial version of the recipe.
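If you want to see what the four recipe steps do to the data, the following is a rough plain-Python equivalent. It is only an illustration of the transformation, not what DataBrew runs internally. The sample rows, the function name apply_recipe, and the output column name usage_date are assumptions made for this sketch.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows that use the CUR column names from the steps above.
sample_rows = [
    {"line_item_usage_start_date": "2021-01-01T00:00:00Z",
     "product_product_name": "Amazon Relational Database Service",
     "line_item_usage_account_id": "111122223333",
     "line_item_unblended_cost": "1.25"},
    {"line_item_usage_start_date": "2021-01-01T00:00:00Z",
     "product_product_name": "Amazon Relational Database Service",
     "line_item_usage_account_id": "111122223333",
     "line_item_unblended_cost": "0.75"},
    {"line_item_usage_start_date": "2021-01-01T00:00:00Z",
     "product_product_name": "Amazon Elastic Compute Cloud",
     "line_item_usage_account_id": "111122223333",
     "line_item_unblended_cost": "3.5"},
]


def apply_recipe(rows):
    """Group and sum costs, reformat the date, and keep the cost numeric."""
    # Step 1: group by date, product, and account, summing the unblended cost.
    totals = defaultdict(float)
    for row in rows:
        key = (
            row["line_item_usage_start_date"],
            row["product_product_name"],
            row["line_item_usage_account_id"],
        )
        # Step 4 (numeric type): costs are parsed as floats before summing.
        totals[key] += float(row["line_item_unblended_cost"])

    prepared = []
    for (start, product, account), cost in sorted(totals.items()):
        # Step 2: derive a yyyy-MM-dd date string from the timestamp.
        day = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").strftime("%Y-%m-%d")
        # Step 3: the original timestamp column is not copied to the output.
        prepared.append({
            "usage_date": day,
            "product_product_name": product,
            "line_item_usage_account_id": account,
            "line_item_unblended_cost_sum": cost,
        })
    return prepared


prepared_rows = apply_recipe(sample_rows)
```

The two Amazon RDS rows collapse into a single row per date, product, and account, which is the shape Forecast expects for a target time series.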

Transforming the data

Now that we have finished all necessary steps to transform our data, we can instruct DataBrew to run the actual transformation and output the results into an S3 bucket.

  1. On the DataBrew project page, choose Create job.
  2. For Job name, enter a name for your job.
  3. Under Job output settings, for File type, choose CSV.
  4. For S3 location, enter the S3 bucket we created in the prerequisites.

  1. Choose either the IAM role you created or the one DataBrew created for you.
  2. Choose Run job.

It may take several minutes for the job to complete. When it’s complete, navigate to the Amazon S3 console and choose the S3 bucket to find the results. The bucket contains multiple CSV files with keys starting with _part0000.csv. As a fully managed service, DataBrew can run jobs in parallel on multiple nodes to process large files on your behalf, which is why the output is split into several files. This isn’t a problem because you can specify an entire folder for Forecast to process.

Scheduling DataBrew jobs

You can schedule DataBrew jobs to transform new CUR data as it arrives, which provides Forecast with a refreshed dataset.

  1. On the Job menu, choose the Schedules tab.
  2. Provide a schedule name.
  3. Specify the run frequency on the day and hour.
  4. Optionally, provide the start time for the job run.

DataBrew then runs the job per your configuration.
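The same schedule can also be created programmatically. The following is a minimal sketch that assumes the boto3 DataBrew client and its create_schedule call. The helper names daily_cron and schedule_job, and the job and schedule names you pass in, are hypothetical.

```python
def daily_cron(hour_utc, minute=0):
    """Return an AWS cron expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    # Fields: minutes, hours, day of month, month, day of week, year.
    return f"cron({minute} {hour_utc} * * ? *)"


def schedule_job(job_name, schedule_name, hour_utc):
    """Attach a daily schedule to an existing DataBrew job (requires AWS credentials)."""
    import boto3  # imported lazily so daily_cron works without boto3 installed

    databrew = boto3.client("databrew")
    return databrew.create_schedule(
        Name=schedule_name,
        JobNames=[job_name],
        CronExpression=daily_cron(hour_utc),
    )


expression = daily_cron(6)
```

Because CUR data in this post is delivered daily, a once-a-day schedule that runs after the report usually lands is a reasonable starting point.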

Preparing the model

To prepare your model, complete the following steps:

  1. On the Forecast console, choose Dataset groups.
  2. Choose Create dataset group.
  3. For Dataset group name, enter a name.
  4. For Forecasting domain, choose Custom.
  5. Choose Next.

We need to create a target time series dataset.

  1. For Frequency of data, choose 1 day.

We can define the data schema to help Forecast become aware of our data types.

  1. Drag the timestamp attribute name column to the top of the list.
  2. For Timestamp Format, choose yyyy-MM-dd. 

This aligns to the data transformation step we completed in DataBrew.

  1. Add an attribute as the third column with the name as account_id and attribute type as string.

  1. For Dataset import name, enter a name.
  2. For Select time zone, choose your time zone.
  3. For Data location, enter the path to the files in your S3 bucket.

If you browse Amazon S3 for the data location, you can only choose individual files. Because our DataBrew output consists of multiple files, enter the S3 path to the files’ location and make sure that a slash (/) is at the end of the path.

  1. For IAM role, you can create a role so Forecast has permissions to only access the S3 bucket containing the DataBrew output.
  2. Choose Start import.

Forecast takes approximately 45 minutes to import your data. You can view the status of your import on the Datasets page.
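For reference, the schema and data location chosen on the console correspond roughly to the request parameters sketched below. This assumes the boto3 Forecast client (create_dataset and create_dataset_import_job). The helper names are hypothetical, and the attribute order shown is an assumption that must match the column order of your DataBrew CSV output.

```python
def forecast_schema():
    """Schema for a Custom-domain target time series with an extra account_id dimension."""
    # The order must match the column order of the CSV files written by DataBrew.
    return {
        "Attributes": [
            {"AttributeName": "timestamp", "AttributeType": "timestamp"},
            {"AttributeName": "item_id", "AttributeType": "string"},
            {"AttributeName": "account_id", "AttributeType": "string"},
            {"AttributeName": "target_value", "AttributeType": "float"},
        ]
    }


def folder_path(bucket, prefix):
    """Build an S3 folder path that always ends with a slash."""
    cleaned = prefix.strip("/")
    return f"s3://{bucket}/{cleaned}/" if cleaned else f"s3://{bucket}/"


def import_request(import_name, dataset_arn, role_arn, bucket, prefix):
    """Parameters for forecast.create_dataset_import_job(**request)."""
    return {
        "DatasetImportJobName": import_name,
        "DatasetArn": dataset_arn,
        "DataSource": {
            "S3Config": {"Path": folder_path(bucket, prefix), "RoleArn": role_arn}
        },
        # Matches the date format produced in the DataBrew recipe.
        "TimestampFormat": "yyyy-MM-dd",
    }


# Hypothetical names and ARNs, for illustration only.
request = import_request(
    "cur_daily_import",
    "arn:aws:forecast:us-east-1:111122223333:dataset/cur_target",
    "arn:aws:iam::111122223333:role/ForecastS3Access",
    "my-databrew-output",
    "cur/output",
)
```

With credentials in place, you could pass the result to boto3.client("forecast").create_dataset_import_job(**request).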

  1. When the latest import status shows as Active, proceed with training a predictor.
  2. For Predictor name, enter a name.
  3. For Forecast horizon, enter a number that tells Forecast how far into the future to predict your data.

The forecast horizon can’t exceed one-third of the length of the target time series.

  1. For Forecast frequency, leave at 1 day.
  2. If you’re unsure of which algorithm to use to train your model, for Algorithm selection, select Automatic (AutoML). 

This option lets Forecast select the optimal algorithm for your datasets. It automatically trains a model, provides accuracy metrics, and generates forecasts. Otherwise, you can manually select one of the built-in algorithms. For this post, we use AutoML.

  1. For Forecast dimension, choose account_id.

This allows Forecast to predict cost by account ID in addition to product name.

  1. Leave all other options at their default and choose Train predictor.

Forecast begins training the optimal ML model on your dataset. This could take up to an hour to complete. You can check on the training status on the Predictors page. You can generate a forecast after the predictor training status shows as Active.

When it’s complete, you can see that Forecast chose DeepAR+ as the optimal ML algorithm. DeepAR+ works well when the data contains many similar time series across a set of cross-sectional units. Here, those units are the different product names and account IDs. In this case, it can be beneficial to train a single model jointly over all time series.
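The predictor settings above, including the forecast horizon limit, can be sketched as request parameters for the boto3 Forecast client's create_predictor call. The helper names are hypothetical. The 500 time-step cap comes from the Forecast service documentation rather than from this post, so treat it as an assumption that may change.

```python
def max_forecast_horizon(history_length):
    """Largest allowed horizon: one-third of the history, capped at 500 time steps."""
    return min(500, history_length // 3)


def predictor_request(name, dataset_group_arn, history_days, horizon_days):
    """Parameters for forecast.create_predictor(**request) with AutoML enabled."""
    limit = max_forecast_horizon(history_days)
    if horizon_days > limit:
        raise ValueError(
            f"horizon of {horizon_days} days exceeds the limit of {limit} days"
        )
    return {
        "PredictorName": name,
        "ForecastHorizon": horizon_days,
        "PerformAutoML": True,
        "InputDataConfig": {"DatasetGroupArn": dataset_group_arn},
        "FeaturizationConfig": {
            "ForecastFrequency": "D",  # 1 day, as chosen on the console
            "ForecastDimensions": ["account_id"],
        },
    }


# With 90 days of daily CUR history, the horizon can be at most 30 days.
limit_for_90_days = max_forecast_horizon(90)
```

Checking the horizon before you submit the request avoids waiting for a training job that the service would reject.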

Validating the predictors

Forecast provides comprehensive accuracy metrics to help you understand the performance of your forecasting model, and can compare it to the previous forecasting models you’ve created that may have looked at a different set of variables or used a different period of time for the historical data.

By validating our predictors, we can measure the accuracy of forecasts for individual items. In the Algorithm metrics section on the predictor details page, you can view the accuracy metrics of the predictor, which include the following:

  • WQL – Weighted quantile loss at a given quantile
  • WAPE – Weighted absolute percentage error
  • RMSE – Root mean square error
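To make these metrics concrete, the following sketch computes them in plain Python using their standard definitions. The sample actuals and forecasts are made-up numbers, and Forecast computes these values for you during backtesting, so this is only for intuition.

```python
import math


def wape(actuals, forecasts):
    """Weighted absolute percentage error: total absolute error over total absolute demand."""
    total_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return total_error / sum(abs(a) for a in actuals)


def rmse(actuals, forecasts):
    """Root mean square error."""
    squared = sum((a - f) ** 2 for a, f in zip(actuals, forecasts))
    return math.sqrt(squared / len(actuals))


def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """Weighted quantile loss at quantile tau (for example, 0.1, 0.5, or 0.9)."""
    loss = sum(
        # Under-forecasts are weighted by tau, over-forecasts by (1 - tau).
        tau * max(a - q, 0.0) + (1 - tau) * max(q - a, 0.0)
        for a, q in zip(actuals, quantile_forecasts)
    )
    return 2 * loss / sum(abs(a) for a in actuals)


# Hypothetical daily costs and their forecasts.
actual_costs = [10.0, 20.0, 30.0]
forecast_costs = [12.0, 18.0, 33.0]

wape_value = wape(actual_costs, forecast_costs)
rmse_value = rmse(actual_costs, forecast_costs)
wql_median = weighted_quantile_loss(actual_costs, forecast_costs, 0.5)
```

At the 0.5 quantile, the weighted quantile loss reduces to the same expression as WAPE, which is a useful sanity check when reading the metrics side by side.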

As another method, we can export the accuracy metrics and forecasted values using algorithm metrics for our predictor. With this method, you can view accuracy metrics for specific services when forecasting, such as Amazon Elastic Compute Cloud (Amazon EC2) or Amazon Relational Database Service (Amazon RDS).

  1. Select the predictor you created.
  2. Choose Export backtest results.
  3. For Export name, enter a name.
  4. For IAM role, choose the role you used earlier.
  5. For S3 predictor backtest export location, enter the S3 path where you want Forecast to export the accuracy metrics and forecasted values.

  1. Choose Create predictor backtest report.

After some time, Forecast delivers the export results to the S3 location you specified. Forecast exports two files to Amazon S3 in two different folders: forecasted-values and accuracy-metric-values. For more information about accuracy metrics, see Amazon Forecast now supports accuracy measurements for individual items. 

Generating a forecast

To create a forecast, complete the following steps:

  1. For Forecast name, enter a name.
  2. For Predictor, choose the predictor you created.
  3. For Forecast types, you can specify up to five quantile values. For this post, we leave it blank to use the defaults.
  4. Choose Create new forecast.

When it’s complete, the forecast status shows as Active. Let’s now create a forecast lookup.

  1. For Forecast, choose the forecast you just created.
  2. Specify the start and end date within the bounds of the forecast.
  3. For Value, enter a service name (for this post, Amazon RDS).
  4. Choose Get Forecast.

Forecast returns P10, P50, and P90 estimates as the default lower, middle, and upper bounds, respectively. For more information about predictor metrics, see Evaluating Predictor Accuracy. Feel free to explore different forecasts for different services in addition to creating a forecast for account ID.
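The console lookup has a programmatic counterpart in the boto3 forecastquery client's query_forecast call. The following sketch builds the request and flattens a response into one row per day. The helper names, the ARN, and the sample response values are hypothetical.

```python
def lookup_request(forecast_arn, item_id, start_date, end_date):
    """Parameters for forecastquery.query_forecast(**request)."""
    return {
        "ForecastArn": forecast_arn,
        "StartDate": f"{start_date}T00:00:00",
        "EndDate": f"{end_date}T00:00:00",
        "Filters": {"item_id": item_id},
    }


def flatten_predictions(response):
    """Turn a QueryForecast response into rows of timestamp, p10, p50, and p90."""
    by_time = {}
    for quantile, points in response["Forecast"]["Predictions"].items():
        for point in points:
            by_time.setdefault(point["Timestamp"], {})[quantile] = point["Value"]
    return [dict(timestamp=ts, **values) for ts, values in sorted(by_time.items())]


# Hand-written response in the shape QueryForecast returns; the values are made up.
sample_response = {
    "Forecast": {
        "Predictions": {
            "p10": [{"Timestamp": "2021-02-01T00:00:00", "Value": 10.0},
                    {"Timestamp": "2021-02-02T00:00:00", "Value": 11.0}],
            "p50": [{"Timestamp": "2021-02-01T00:00:00", "Value": 12.0},
                    {"Timestamp": "2021-02-02T00:00:00", "Value": 13.0}],
            "p90": [{"Timestamp": "2021-02-01T00:00:00", "Value": 15.0},
                    {"Timestamp": "2021-02-02T00:00:00", "Value": 16.0}],
        }
    }
}

forecast_rows = flatten_predictions(sample_response)
```

The item_id value must match a product name exactly as it appears in your dataset, so check the DataBrew output if a lookup returns nothing.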

Congratulations! You’ve just created a solution to retrieve forecasts on your CUR by using Amazon S3, DataBrew, and Forecast without writing a single line of code. With these services, you only pay for what you use and don’t have to worry about managing the underlying infrastructure to run transformations and ML inferences. 

Conclusion

In this post, we illustrated how to use DataBrew to transform the CUR into a format for Forecast to make predictions without any need for ML expertise. We created datasets, predictors, and a forecast, and used Forecast to predict costs for specific AWS services. To get started with Amazon Forecast, visit the product page. We also recently announced that CUR is available to member (linked) accounts. Now any entity with the proper permissions under any account within an AWS organization can use the CUR to view and manage costs.


About the Authors

Jyoti Tyagi is a Solutions Architect with great passion for artificial intelligence and machine learning. She helps customers to architect highly secured and well-architected applications on the AWS Cloud with best practices. In her spare time, she enjoys painting and meditation.

Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. Outside of work, he enjoys cooking and spending time with his family.

Managing your machine learning lifecycle with MLflow and Amazon SageMaker

With the rapid adoption of machine learning (ML) and MLOps, enterprises want to increase the velocity of ML projects from experimentation to production.

During the initial phase of an ML project, data scientists collaborate and share experiment results in order to find a solution to a business need. During the operational phase, you also need to manage the different model versions going to production and your lifecycle. In this post, we’ll show how the open-source platform MLflow helps address these issues. For those interested in a fully managed solution, Amazon Web Services recently announced Amazon SageMaker Pipelines at re:Invent 2020, the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). You can learn more about SageMaker Pipelines in this post.

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It includes the following components:

  • Tracking – Record and query experiments: code, data, configuration, and results
  • Projects – Package data science code in a format to reproduce runs on any platform
  • Models – Deploy ML models in diverse serving environments
  • Registry – Store, annotate, discover, and manage models in a central repository

The following diagram illustrates our architecture.

In the following sections, we show how to deploy MLflow on AWS Fargate and use it during your ML project with Amazon SageMaker. We use SageMaker to develop, train, tune, and deploy a Scikit-learn based ML model (random forest) using the Boston House Prices dataset. During our ML workflow, we track experiment runs and our models with MLflow.

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Walkthrough overview

This post demonstrates how to do the following:

  • Host a serverless MLflow server on Fargate
  • Set Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS) as artifact and backend stores, respectively
  • Track experiments running on SageMaker with MLflow
  • Register models trained in SageMaker in the MLflow Model Registry
  • Deploy an MLflow model into a SageMaker endpoint

The detailed step-by-step code walkthrough is available in the GitHub repo.

Architecture overview

You can set up a central MLflow tracking server during your ML project. You use this remote MLflow server to manage experiments and models collaboratively. In this section, we show you how you can Dockerize your MLflow tracking server and host it on Fargate.

An MLflow tracking server also has two components for storage: a backend store and an artifact store.

We use an S3 bucket as our artifact store and an Amazon RDS for MySQL instance as our backend store.

The following diagram illustrates this architecture.

Running an MLflow tracking server on a Docker container

You can install MLflow using pip install mlflow and start your tracking server with the mlflow server command.

By default, the server runs on port 5000, so we expose it in our container. Use 0.0.0.0 to bind to all addresses if you want to access the tracking server from other machines. We install boto3 and pymysql dependencies for the MLflow server to communicate with the S3 bucket and the RDS for MySQL database. See the following code:

FROM python:3.8.0

RUN pip install \
    mlflow \
    pymysql \
    boto3 && \
    mkdir /mlflow/

EXPOSE 5000

## Environment variables made available through the Fargate task.
## Do not enter values
CMD mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --default-artifact-root ${BUCKET} \
    --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE}

Hosting an MLflow tracking server with Fargate

In this section, we show how you can run your MLflow tracking server on a Docker container that is hosted on Fargate.

Fargate is an easy way to deploy your containers on AWS. It allows you to use containers as a fundamental compute primitive without having to manage the underlying instances. All you need is to specify an image to deploy and the amount of CPU and memory it requires. Fargate handles updating and securing the underlying Linux OS, Docker daemon, and Amazon Elastic Container Service (Amazon ECS) agent, as well as all the infrastructure capacity management and scaling.

For more information about running an application on Fargate, see Building, deploying, and operating containerized applications with AWS Fargate.

The MLflow container first needs to be built and pushed to an Amazon Elastic Container Registry (Amazon ECR) repository. The container image URI is used at registration of our Amazon ECS task definition. The ECS task has an AWS Identity and Access Management (IAM) role attached to it, allowing it to interact with AWS services such as Amazon S3.

The following screenshot shows our task configuration.

The Fargate service is set up with autoscaling and a network load balancer so it can adjust to the required compute load with minimal maintenance effort on our side.

When running our ML project, we set mlflow.set_tracking_uri(<load balancer uri>) to interact with the MLflow server via the load balancer.

Using Amazon S3 as the artifact store and Amazon RDS for MySQL as backend store

The artifact store is suitable for large data (such as an S3 bucket or shared NFS file system) and is where clients log their artifact output (for example, models). MLflow natively supports Amazon S3 as an artifact store, and you can use --default-artifact-root ${BUCKET} to refer to the S3 bucket of your choice.

The backend store is where the MLflow tracking server stores experiment and run metadata, as well as parameters, metrics, and tags for runs. MLflow supports two types of backend stores: file store and database-backed store. It’s better to use an external database-backed store to persist the metadata.

As of this writing, you can use databases such as MySQL, SQLite, and PostgreSQL as a backend store with MLflow. For more information, see Backend Stores.

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database and can also be used for this.

For this example, we set up an RDS for MySQL instance. Amazon RDS makes it easy to set up, operate, and scale MySQL deployments in the cloud. With Amazon RDS, you can deploy scalable MySQL servers in minutes with cost-efficient and resizable hardware capacity.

You can use --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE} to refer MLflow to the MySQL database of your choice.
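If you assemble the backend store URI in Python rather than in the shell, it can help to URL-encode the credentials so that special characters in a password don't break the URI. The following is a minimal sketch under that assumption. The helper name backend_store_uri and the example values are hypothetical, and the Dockerfile in this post interpolates the environment variables directly instead.

```python
import os
from urllib.parse import quote_plus


def backend_store_uri(env=os.environ):
    """Build the MySQL backend store URI from the same variables the container uses."""
    user = quote_plus(env["USERNAME"])
    # Characters such as @ or / in a password would otherwise be read as URI syntax.
    password = quote_plus(env["PASSWORD"])
    return (
        f"mysql+pymysql://{user}:{password}"
        f"@{env['HOST']}:{env['PORT']}/{env['DATABASE']}"
    )


# Example values for illustration only; never hard-code real credentials.
example_uri = backend_store_uri({
    "USERNAME": "mlflow",
    "PASSWORD": "p@ss/word",
    "HOST": "db.example.internal",
    "PORT": "3306",
    "DATABASE": "mlflow",
})
```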

Launching the example MLflow stack

To launch your MLflow stack, follow these steps:

  1. Launch the AWS CloudFormation stack provided in the GitHub repo.
  2. Choose Next.
  3. Leave all options as default until you reach the final screen.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create.

The stack takes a few minutes to launch the MLflow server on Fargate, with an S3 bucket and a MySQL database on RDS. The load balancer URI is available on the Outputs tab of the stack.

You can then use the load balancer URI to access the MLflow UI.

In this illustrative example stack, our load balancer is launched on a public subnet and is internet facing.

For security purposes, you may want to provision an internal load balancer in your VPC private subnets where there is no direct connectivity from the outside world. For more information, see Access Private applications on AWS Fargate using Amazon API Gateway PrivateLink.

Tracking SageMaker runs with MLflow

You now have a remote MLflow tracking server running, which is accessible through a REST API via the load balancer URI.

You can use the MLflow Tracking API to log parameters, metrics, and models when running your ML project with SageMaker. For this you need to install the MLflow library when running your code on SageMaker and set the remote tracking URI to be your load balancer address.

The following Python API command allows you to point your code running on SageMaker to your MLflow remote server:

import mlflow
mlflow.set_tracking_uri('<YOUR LOAD BALANCER URI>')

Connect to your notebook instance and set the remote tracking URI. The following diagram shows the updated architecture.

Managing your ML lifecycle with SageMaker and MLflow

You can follow this example lab by running the notebooks in the GitHub repo.

This section describes how to develop, train, tune, and deploy a random forest model using Scikit-learn with the SageMaker Python SDK. We use the Boston Housing dataset, present in Scikit-learn, and log our ML runs in MLflow.

You can find the original lab in the SageMaker Examples GitHub repo for more details on using custom Scikit-learn scripts with SageMaker.

Creating an experiment and tracking ML runs

In this project, we create an MLflow experiment named boston-house and launch training jobs for our model in SageMaker. For each training job run in SageMaker, our Scikit-learn script records a new run in MLflow to keep track of input parameters, metrics, and the generated random forest model.

The following example API calls can help you start and manage MLflow runs:

  • start_run() – Starts a new MLflow run, setting it as the active run under which metrics and parameters are logged
  • log_params() – Logs a batch of parameters under the current run
  • log_metric() – Logs a metric under the current run
  • sklearn.log_model() – Logs a Scikit-learn model as an MLflow artifact for the current run

For a complete list of commands, see MLflow Tracking.

The following code demonstrates how you can use those API calls in your train.py script:

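import logging

import mlflow
import mlflow.sklearn
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# args, X_train, y_train, X_test, and y_test are assumed to be defined earlier in train.py

# set remote mlflow server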
mlflow.set_tracking_uri(args.tracking_uri)
mlflow.set_experiment(args.experiment_name)

with mlflow.start_run():
    params = {
        "n-estimators": args.n_estimators,
        "min-samples-leaf": args.min_samples_leaf,
        "features": args.features
    }
    mlflow.log_params(params)
    
    # TRAIN
    logging.info('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1
    )

    model.fit(X_train, y_train)

    # ABS ERROR AND LOG COUPLE PERF METRICS
    logging.info('evaluating model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    for q in [10, 50, 90]:
        logging.info(f'AE-at-{q}th-percentile: {np.percentile(a=abs_err, q=q)}')
        mlflow.log_metric(f'AE-at-{str(q)}th-percentile', np.percentile(a=abs_err, q=q))

    # SAVE MODEL
    logging.info('saving model in MLflow')
    mlflow.sklearn.log_model(model, "model")

Your train.py script needs to know which MLflow tracking_uri and experiment_name to use to log the runs. You can pass those values to your script using the hyperparameters of the SageMaker training jobs. See the following code:

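from sagemaker.sklearn.estimator import SKLearn

# role and metric_definitions are assumed to be defined earlier in the notebook

# uri of your remote mlflow server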
tracking_uri = '<YOUR LOAD BALANCER URI>' 
experiment_name = 'boston-house'

hyperparameters = {
    'tracking_uri': tracking_uri,
    'experiment_name': experiment_name,
    'n-estimators': 100,
    'min-samples-leaf': 3,
    'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
    'target': 'target'
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    train_instance_count=1,
    train_instance_type='local',
    framework_version='0.23-1',
    base_job_name='mlflow-rf',
)
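SageMaker passes these hyperparameters to the training script as command-line arguments, so train.py has to parse them before it can call MLflow. The following is a minimal sketch of what that parsing might look like. The default values and the parse_args helper are assumptions, and the actual script in the GitHub repo may differ.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters that SageMaker passes as --key value pairs."""
    parser = argparse.ArgumentParser()

    # MLflow settings
    parser.add_argument("--tracking_uri", type=str)
    parser.add_argument("--experiment_name", type=str)

    # Model hyperparameters; argparse exposes --n-estimators as args.n_estimators
    parser.add_argument("--n-estimators", type=int, default=10)
    parser.add_argument("--min-samples-leaf", type=int, default=3)

    # Dataset columns
    parser.add_argument("--features", type=str)
    parser.add_argument("--target", type=str)

    # Ignore any additional arguments the container may add
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_args([
    "--tracking_uri", "http://example-load-balancer",
    "--experiment_name", "boston-house",
    "--n-estimators", "100",
    "--min-samples-leaf", "3",
    "--features", "CRIM ZN INDUS",
    "--target", "target",
])
```

Note that argparse converts the hyphens in --n-estimators and --min-samples-leaf to underscores, which is why the training code reads args.n_estimators and args.min_samples_leaf.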

Performing automatic model tuning with SageMaker and tracking with MLflow

SageMaker automatic model tuning, also known as Hyperparameter Optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

In the 2_track_experiments_hpo.ipynb example notebook, we show how you can launch a SageMaker tuning job and track its training jobs with MLflow. It uses the same train.py script and data used in single training jobs, so you can accelerate your hyperparameter search for your MLflow model with minimal effort.
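A tuning job needs an objective metric, which SageMaker extracts from the training logs with a regular expression supplied in metric_definitions. The sketch below shows one plausible definition that matches the log lines written by train.py, and checks it locally against a sample line. The metric name median-AE, the helper extract_metric, and the sample log line are assumptions for illustration.

```python
import re

# Each definition pairs a metric name with a regex whose first group captures the value.
metric_definitions = [
    {"Name": "median-AE", "Regex": r"AE-at-50th-percentile: ([0-9.]+)"},
]


def extract_metric(log_line, definition):
    """Return the metric value captured from a log line, or None if there is no match."""
    match = re.search(definition["Regex"], log_line)
    return float(match.group(1)) if match else None


# A log line in the format produced by logging.info in train.py.
sample_log_line = "INFO:root:AE-at-50th-percentile: 1.9875"
median_ae = extract_metric(sample_log_line, metric_definitions[0])
```

Testing the regular expression locally before launching the tuning job is cheaper than finding out from a job that reports no objective metric.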

When the SageMaker jobs are complete, you can navigate to the MLflow UI and compare results of different runs (see the following screenshot).

This can be useful to promote collaboration within your development team.

Managing models trained with SageMaker using the MLflow Model Registry

The MLflow Model Registry component allows you and your team to collaboratively manage the lifecycle of a model. You can add, modify, update, transition, or delete models created during the SageMaker training jobs in the Model Registry through the UI or the API.

In your project, you can select a run with the best model performance and register it into the MLflow Model Registry. The following screenshot shows example registry details.

After a model is registered, you can navigate to the Registered Models page and view its properties.
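Registration can also be done through the API. The following sketch assumes the mlflow.register_model function and the "model" artifact path used by the training script. The helper names, the run ID, and the registered model name are hypothetical.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the URI of a model logged under a given MLflow run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_run(run_id, registered_name, tracking_uri):
    """Register the model from a run in the MLflow Model Registry."""
    import mlflow  # imported lazily so run_model_uri works without MLflow installed

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.register_model(
        model_uri=run_model_uri(run_id),
        name=registered_name,
    )


# Hypothetical run ID, for illustration only.
example_uri = run_model_uri("0123456789abcdef")
```

You can copy the run ID of your best run from the MLflow UI and pass it to register_best_run along with your load balancer URI.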

Deploying your model in SageMaker using MLflow

This section shows how to use the mlflow.sagemaker module provided by MLflow to deploy a model into a SageMaker-managed endpoint. As of this writing, MLflow only supports deployments to SageMaker endpoints, but you can use the model binaries from the Amazon S3 artifact store and adapt them to your deployment scenarios.

Next, you need to build a Docker container with inference code and push it to Amazon ECR.

You can build your own image or use the mlflow sagemaker build-and-push-container command to have MLflow create one for you. This builds an image locally and pushes it to an Amazon ECR repository called mlflow-pyfunc.

The following example code shows how to use mlflow.sagemaker.deploy to deploy your model into a SageMaker endpoint:

import mlflow.sagemaker

# role and region are assumed to be defined earlier in the notebook

# URL of the ECR-hosted Docker image the model should be deployed into
image_uri = '<YOUR mlflow-pyfunc ECR IMAGE URI>'
endpoint_name = 'boston-housing'
# The location, in URI format, of the MLflow model to deploy to SageMaker.
model_uri = '<YOUR MLFLOW MODEL LOCATION>'

mlflow.sagemaker.deploy(
    mode='create',
    app_name=endpoint_name,
    model_uri=model_uri,
    image_url=image_uri,
    execution_role_arn=role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    region_name=region
)

The command launches a SageMaker endpoint into your account, and you can use the following code to generate predictions in real time:

import json

import boto3
import pandas as pd
from sklearn.datasets import load_boston

# load boston dataset
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

runtime = boto3.client('runtime.sagemaker')
# predict on the first row of the dataset
payload = df.iloc[[0]].to_json(orient="split")

runtime_response = runtime.invoke_endpoint(EndpointName=endpoint_name, ContentType='application/json', Body=payload)
result = json.loads(runtime_response['Body'].read().decode())
print(f'Payload: {payload}')
print(f'Prediction: {result}')

Current limitation on user access control

As of this writing, the open-source version of MLflow doesn’t provide user access control features in case you have multiple tenants on your MLflow server. This means any user with access to the server can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to keep strong model governance for audit purposes.

Summary

In this post, we covered how you can host an open-source MLflow server on AWS using Fargate, Amazon S3, and Amazon RDS. We then showed an example ML project lifecycle of tracking SageMaker training and tuning jobs with MLflow, managing model versions in the MLflow Model Registry, and deploying an MLflow model into a SageMaker endpoint for prediction. Try out the solution on your own by accessing the GitHub repo and let us know if you have any questions in the comments!


About the Authors

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect, and helps Manufacturing, Industrial, Automotive and Aerospace customers build Machine Learning and optimization related architectures to solve their business challenges using the AWS platform.

Understanding the key capabilities of Amazon SageMaker Feature Store

One of the challenging parts of machine learning (ML) is feature engineering, the process of transforming data to create features for ML. Features are processed data signals used for training ML models and for deployed models to make accurate predictions. Data scientists and ML engineers can spend up to 60-70% of their time on feature engineering. It’s also typical to have this work repeated by different teams within an organization who use the same data to build ML models for different solutions, further increasing effort levels for feature engineering. Moreover, it’s important that the generated features should be available for both training and real-time inference use cases, to ensure consistency between model training and inference serving.

A purpose-built feature store for ML is needed to ensure both high-quality ML predictions with a consistent set of features, and cost reduction by eliminating duplicate feature engineering effort and storage overhead. Consistent features are needed between different parts of an organization, and between training and inference for any given ML model. There is also a need for feature stores to meet the high performance, scale, and availability requirements to serve features in near-real time for inferences. Because of this, organizations are often forced to do the heavy lifting of building and maintaining feature store systems, which can become expensive and difficult to maintain.

At AWS we are constantly listening to our customers and building solutions and services that delight them. We heard from many customers about the pain their data science and data engineering teams face when managing features, and used those inputs to build the Amazon SageMaker Feature Store, which was launched at re:Invent on December 1, 2020. Amazon SageMaker Feature Store is a fully managed, purpose-built repository to securely store, update, retrieve, and share ML features.

Although there is a lot to unpack in terms of the capabilities that SageMaker Feature Store brings to the table, in this post, we focus on key capabilities for data ingestion, data access, and security and access control.

Overview of SageMaker Feature Store

As a purpose-built feature store for ML, SageMaker Feature Store is designed to enable you to do the following:

  • Securely store and serve features for real-time and batch applications – SageMaker Feature Store serves features at very low latency for real-time use cases. It lets you use ML to make near-real-time decisions by supporting feature vector retrievals with low millisecond latencies (p95 latency lower than 10 milliseconds for a 15-kilobyte payload).
  • Accelerate model development by sharing and reusing features – Feature engineering is a long and tedious process that often gets repeated by multiple teams within an organization working on the same data. SageMaker Feature Store enables data scientists to spend less time on data preparation and feature computation, and more time on ML innovation, by letting them discover and reuse existing engineered features across the organization.
  • Provide historical data access – Features are used for training purposes, and a good feature store should provide easy and quick access to historical feature values to recreate training datasets at a given point in time in the past. Amazon SageMaker Feature Store enables this with support for time-travel queries—querying data at a point in time—which enables you to re-create features at specific points of time in the past.
  • Reduce training-serving skew – Data science teams often struggle with training-serving skew caused by data discrepancy between model training and inference serving, which can cause models to perform worse than expected in production. SageMaker Feature Store reduces training-serving skew by keeping feature consistency between training and inference.
  • Enable data encryption and access control – As with other data assets, ML feature data security is paramount in all organizations. At AWS, security and operational performance are our top priorities, and SageMaker Feature Store provides a suite of capabilities for enterprise-grade data security and access control, including encryption at rest and in transit, and role-based access control using AWS Identity and Access Management (IAM).
  • Guarantee a robust service level – Managed feature store production use cases need service-level guarantees, ensuring that you get the desired performance and availability, and you can rely on expert help should something go wrong. SageMaker Feature Store is backed by AWS’s unmatched reliability, scale and operational efficiency.

SageMaker Feature Store is designed to play a central role in ML architectures, helping you streamline the ML lifecycle and integrating with many other services. For example, you can use tools like AWS Glue DataBrew and SageMaker Data Wrangler for feature authoring. You can use Amazon EMR, AWS Glue, and SageMaker Processing in conjunction with SageMaker Feature Store to perform feature transformation tasks. You can use a suite of tools, including SageMaker Pipelines, AWS Step Functions, or Apache Airflow, to schedule and orchestrate feature pipelines and automate the feature engineering process flow. When you have features in the feature store, you can pull them with low latency from the online store to feed models hosted with services like SageMaker Hosting. You can use existing tools like Amazon Athena, Presto, Spark, and EMR to extract datasets from the offline store for use with SageMaker Training and batch transforms. Lastly, you can use Amazon Kinesis, Apache Kafka, and AWS Lambda for streaming feature engineering and ingestion. The following diagram illustrates some of the services that can be integrated with SageMaker Feature Store.

Before we go into more detail, we briefly introduce some SageMaker Feature Store concepts:

  • Feature group – a logical grouping of ML features
  • Record – a set of values for features in a feature group
  • Online store – the low latency, high availability store that enables real-time lookup of records
  • Offline store – the store that manages historical data in your Amazon Simple Storage Service (Amazon S3) bucket, and is used for exploration and model training use cases

For more information, see Get started with Amazon SageMaker Feature Store.

Data ingestion

SageMaker Feature Store provides multiple ways to ingest data, including batch ingestion, ingestion from streaming data sources, and a combination of both. SageMaker Feature Store is built in a modular fashion and is designed to ingest data from a variety of sources, including directly from SageMaker Data Wrangler, or sources like Kinesis or Apache Kafka. The following diagram shows the various data ingestion mechanisms supported by SageMaker Feature Store.

Streaming ingestion

SageMaker Feature Store provides the low latency PutRecord API, which is designed to give you millisecond-level latency and high throughput cost-optimized data ingestion. The API is designed to be called by different streams, and you can use streaming sources such as Apache Kafka, Kinesis, Spark Streaming, or another source to extract features in real-time and feed them directly into the online store, or both the online and offline store.

For even faster ingestion, the PutRecord API can be parallelized to support higher throughput writes. The data from all these PUT requests is synchronously written to the online store, and buffered and written to an offline store (Amazon S3) if that option is selected. The data is written to the offline store within a few minutes of ingestion. SageMaker Feature Store provides data and schema validations at ingestion time to ensure data quality is maintained. Validations are done to make sure that input data conforms to the defined data types and that the input record contains all features. If you have configured an offline store, SageMaker Feature Store provides automatic replication of the ingested data into the offline store for future training and historical record access use cases.
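As a rough illustration of parallelized PutRecord calls, the following Python sketch sends one record per call from a small thread pool. The feature group name, the feature names, and the RecordingClient stand-in are hypothetical and exist only so the sketch runs without AWS credentials. In practice you would pass a boto3 "sagemaker-featurestore-runtime" client, whose put_record call takes the same FeatureGroupName and Record arguments.

```python
from concurrent.futures import ThreadPoolExecutor


def to_record(features):
    """Convert a plain dict into the list of feature values that PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def put_records(client, feature_group_name, rows, max_workers=4):
    """Issue one PutRecord call per row from a thread pool; return how many were sent."""
    def put_one(row):
        client.put_record(FeatureGroupName=feature_group_name, Record=to_record(row))
        return 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(put_one, rows))


class RecordingClient:
    """Hypothetical stand-in for the runtime client so the sketch runs offline."""

    def __init__(self):
        self.calls = []

    def put_record(self, FeatureGroupName, Record):
        self.calls.append((FeatureGroupName, Record))


# Hypothetical rows: an identifier, an event time (Unix seconds), and one feature.
rows = [
    {"customer_id": "c-1", "event_time": 1609459200.0, "order_count": 3},
    {"customer_id": "c-2", "event_time": 1609459260.0, "order_count": 7},
]

# In practice: client = boto3.client("sagemaker-featurestore-runtime")
client = RecordingClient()
sent = put_records(client, "customers-demo", rows)
print(f"sent {sent} records")
```

Every value is passed as a string because the Record structure carries values in a ValueAsString field. The right number of workers depends on your throughput needs and service quotas, so treat max_workers as a tuning knob rather than a recommendation.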

Batch ingestion

You can perform batch ingestion to SageMaker Feature Store by integrating it with your feature generation and processing pipelines. You have the flexibility to build feature pipelines with your choice of technology. After performing any data transformations and batch aggregations, the processing pipelines can ingest feature data into the SageMaker Feature Store via batch ingestion.

You can perform batch ingestion in the following three modes:

  • Batch ingest into the online store – This can be done by calling the synchronous PutRecord API. SageMaker Feature Store gives you the flexibility to set up an online-only feature store for use cases that don’t require offline feature access, keeping your costs low by avoiding any unnecessary storage. If you have configured your feature group as online-only, the latest values of a record override older values.
  • Batch ingest into the offline store – You can choose to ingest data directly into your offline store. This is useful when you want to backfill historical records for training use cases. This can be done from SageMaker Data Wrangler or directly through a SageMaker Processing job Spark container. The offline store resides in your account and uses Amazon S3 to store data. This gives you the benefits of Amazon S3, including low cost of storage, durability, reliability and flexible access control. In addition, the feature group created in the offline store can be registered with appropriate metadata to provide support for search, discovery, and automatic creation of an AWS Glue Data Catalog.
  • Batch ingest into both the online and offline store – If your feature group is configured to have both online and offline stores, you can do batch ingestion by calling the PutRecord API. In this case, only the latest values are stored in the online store, and the offline store maintains both your older records and the latest record. The offline store is an append-only data store.
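Which of these modes applies is decided when the feature group is created. The following Python sketch builds the request for the SageMaker CreateFeatureGroup API for each mode. The group name, feature names, bucket, and role ARN are hypothetical placeholders, and the helper only assembles the arguments. You would pass the result to a boto3 "sagemaker" client as create_feature_group(**request).

```python
def build_feature_group_request(name, features, record_id, event_time,
                                online=True, offline_s3_uri=None, role_arn=None):
    """Assemble CreateFeatureGroup arguments for online-only, offline-only, or both."""
    if not online and offline_s3_uri is None:
        raise ValueError("enable the online store, the offline store, or both")

    request = {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id,
        "EventTimeFeatureName": event_time,
        "FeatureDefinitions": [
            {"FeatureName": feature, "FeatureType": feature_type}
            for feature, feature_type in features.items()
        ],
    }
    if online:
        request["OnlineStoreConfig"] = {"EnableOnlineStore": True}
    if offline_s3_uri is not None:
        # The offline store lives in an S3 bucket in your own account.
        request["OfflineStoreConfig"] = {"S3StorageConfig": {"S3Uri": offline_s3_uri}}
        # The execution role is used to persist data into the offline store.
        request["RoleArn"] = role_arn
    return request


# Hypothetical schema; feature types are "String", "Integral", or "Fractional".
features = {"customer_id": "String", "event_time": "Fractional", "order_count": "Integral"}

online_only = build_feature_group_request(
    "customers-online", features, "customer_id", "event_time")

online_and_offline = build_feature_group_request(
    "customers-both", features, "customer_id", "event_time",
    offline_s3_uri="s3://example-bucket/feature-store",            # placeholder bucket
    role_arn="arn:aws:iam::111122223333:role/ExampleFeatureRole",  # placeholder role
)

# In practice: boto3.client("sagemaker").create_feature_group(**online_and_offline)
print(sorted(online_and_offline))
```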

To see an example of how you can couple streaming and batch feature engineering for an ML use case, see Using streaming ingestion with Amazon SageMaker Feature Store to make ML-backed decisions in near-real time.

Data access

In this section, we discuss the details of real-time data access, access from the offline store, and advanced query patterns.

Real-time data access

SageMaker Feature Store provides a low latency GetRecord API, which is designed to serve real-time inference use cases. This is a synchronous API that provides strong read consistency and can be parallelized to support high-throughput applications. The GetRecord API lets you retrieve an entire record with all its features or a specific subset of features, which helps optimize access for shared feature groups with hundreds or thousands of features.
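The following Python sketch shows the general shape of a GetRecord lookup that asks for only a subset of features. The feature group name, the feature names, and the FakeRuntimeClient class are hypothetical and let the sketch run offline. A boto3 "sagemaker-featurestore-runtime" client accepts the same get_record arguments used here.

```python
def get_features(client, feature_group_name, record_id, feature_names=None):
    """Fetch one record and return it as a {feature name: string value} dict."""
    kwargs = {
        "FeatureGroupName": feature_group_name,
        "RecordIdentifierValueAsString": str(record_id),
    }
    if feature_names:
        # Restrict the response to the features the model actually needs.
        kwargs["FeatureNames"] = list(feature_names)
    response = client.get_record(**kwargs)
    return {f["FeatureName"]: f["ValueAsString"] for f in response.get("Record", [])}


class FakeRuntimeClient:
    """Hypothetical in-memory stand-in for the runtime client."""

    def __init__(self, rows):
        self.rows = rows

    def get_record(self, FeatureGroupName, RecordIdentifierValueAsString, FeatureNames=None):
        row = self.rows.get(RecordIdentifierValueAsString)
        if row is None:
            return {}
        names = FeatureNames or list(row)
        return {"Record": [{"FeatureName": n, "ValueAsString": str(row[n])} for n in names]}


# In practice: client = boto3.client("sagemaker-featurestore-runtime")
client = FakeRuntimeClient(
    {"c-1": {"customer_id": "c-1", "order_count": 3, "avg_basket": 41.5}}
)
subset = get_features(client, "customers-demo", "c-1", ["order_count"])
missing = get_features(client, "customers-demo", "c-404")
print(subset, missing)
```

Values come back as strings, so the caller is responsible for casting them to the types the model expects.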

Data access from the offline store

SageMaker Feature Store uses an S3 bucket in your account to store offline data. You can use query engines like Athena against the offline data store in Amazon S3 to analyze feature data or to join more than one feature group in a single query. SageMaker Feature Store automatically builds the AWS Glue Data Catalog for feature groups during feature group creation, and you can then use this catalog to access and query the data from the offline store using Athena or even open-source tools like Presto. You can set up an AWS Glue crawler to run on a schedule to make sure your catalog is always up to date. Because the offline store is in Amazon S3 in your account, you can use any of the capabilities that Amazon S3 provides, like replication.

For an example showing how you can run an Athena query on a dataset containing two feature groups using a Data Catalog that was built automatically, see Build Training Dataset. For detailed sample queries, see Athena and AWS Glue in Feature Store. These queries are also available in SageMaker Studio.

Advanced query patterns

The SageMaker Feature Store design allows you to access your data using advanced query patterns. For example, it’s easy to run a query against your offline store to see what your data looked like a month ago (time-travel). SageMaker Feature Store requires a parameter called EventTimeFeatureName in your feature group to store an event time for each record. This, combined with the append-only capability of the offline store, allows you to easily use query engines to get a snapshot of your data based on the event time feature. Other patterns include querying data after removing duplicates, reconstructing a dataset based on past events for training models, and gathering data needed for ensuring compliance with regulations.
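To make the time-travel idea concrete, the following Python sketch reproduces the logic of a point-in-time snapshot over an append-only list of rows. It is an in-memory illustration only, not how you would query the offline store in practice. With Athena, the same logic is typically expressed in SQL with a filter on the event time and a window function per record identifier. The column names and rows are hypothetical.

```python
def snapshot_as_of(offline_rows, record_id, event_time, as_of):
    """Return the latest row per record identifier with event time <= as_of."""
    latest = {}
    for row in offline_rows:
        if row[event_time] > as_of:
            continue  # written after the point in time we are reconstructing
        current = latest.get(row[record_id])
        if current is None or row[event_time] >= current[event_time]:
            latest[row[record_id]] = row
    return latest


# Append-only history: every update is a new row, and older rows are never overwritten.
history = [
    {"customer_id": "c-1", "event_time": 100.0, "order_count": 1},
    {"customer_id": "c-1", "event_time": 200.0, "order_count": 2},
    {"customer_id": "c-2", "event_time": 150.0, "order_count": 5},
    {"customer_id": "c-1", "event_time": 300.0, "order_count": 3},
]

snapshot = snapshot_as_of(history, "customer_id", "event_time", as_of=250.0)
for key in sorted(snapshot):
    print(key, snapshot[key]["order_count"])
```

Because the offline store keeps every version of a record, choosing a different as_of value is enough to recreate the training data as it looked at that moment.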

We plan to publish a detailed post on how to use advanced query patterns (including time-travel) very soon.

Security: Encryption and access control

At AWS, we take data security very seriously, and as such, SageMaker Feature Store is architected to provide end-to-end encryption, fine-grained access control mechanisms, and the ability to set up private access via VPC.

Encryption at rest and in transit

After you ingest data, your data is always encrypted at rest and in transit. When you create a feature group for online or offline access, you can provide an AWS Key Management Service (AWS KMS) customer master key (CMK) to encrypt all your data at rest. If you don’t provide a CMK, we ensure that your data is encrypted on the server side using an AWS-managed CMK. We also support having different CMKs for online and offline stores.

Access control

SageMaker Feature Store lets you set up fine-grained access control to your data and APIs by using IAM user roles and policies to allow or deny specific actions. You can set up access control at the API or account level to enforce policies across all feature groups, or for individual feature groups. Creating, deleting, describing, and listing feature groups are all operations that can be managed by IAM policies. You can also set up private access to all operations in your app from your VPC via AWS PrivateLink.
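As one possible illustration, the following Python sketch generates an IAM policy document that allows reads from a single feature group and denies writes and deletes. The Region, account ID, and feature group name are placeholders, and the exact set of actions you allow or deny is an assumption that depends on your own governance requirements.

```python
import json


def read_only_feature_group_policy(region, account_id, feature_group_name):
    """Build a policy allowing reads on one feature group and denying modifications."""
    arn = f"arn:aws:sagemaker:{region}:{account_id}:feature-group/{feature_group_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowReads",
                "Effect": "Allow",
                "Action": ["sagemaker:DescribeFeatureGroup", "sagemaker:GetRecord"],
                "Resource": arn,
            },
            {
                "Sid": "DenyWrites",
                "Effect": "Deny",
                "Action": [
                    "sagemaker:PutRecord",
                    "sagemaker:DeleteRecord",
                    "sagemaker:DeleteFeatureGroup",
                ],
                "Resource": arn,
            },
        ],
    }


# Placeholder Region, account ID, and feature group name.
policy = read_only_feature_group_policy("us-east-1", "111122223333", "customers-demo")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

You could attach a document like this to the role used by an inference service that should look up features but never change them.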

Summary

At Amazon, customer obsession is in our DNA. We have spent countless hours listening to many customers and understanding their key pain points with managing features at an enterprise level for ML, and have used those requirements to develop SageMaker Feature Store.

SageMaker Feature Store is a purpose-built store that lets you define features one time for both large-scale offline model building and batch inference use cases, and also to get up to single-digit millisecond retrievals for real-time inference. You can easily name, organize, find, and share feature groups among teams of developers and data scientists—all from Amazon SageMaker Studio. SageMaker Feature Store offers feature consistency between training and inference by automatically replicating feature values from the online store to the historical offline store for model building. It’s tightly integrated with SageMaker Data Wrangler and SageMaker Pipelines to build repeatable feature engineering pipelines, but is also modular enough to easily integrate with your existing data processing and inferencing workflows. SageMaker Feature Store provides end-to-end encryption, secure data access, and API level controls to ensure that your data is adequately protected. For more information, see New – Store, Discover, and Share Machine Learning Features with Amazon SageMaker Feature Store.

We understand how crucial it is for you to get the right service guarantees when running your mission-critical applications on Amazon SageMaker Feature Store. For that reason, SageMaker Feature Store is backed by the same service assurances that customers rely on across AWS.


About the Authors

Lakshmi Ramakrishnan is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team at AWS, providing technical leadership for the product. He has worked in several engineering roles at Amazon for over 9 years. He has a Bachelor of Engineering degree in Information Technology from the National Institute of Technology, Karnataka, India and a Master of Science degree in Computer Science from the University of Minnesota Twin Cities.

Mark Roy is a Principal Machine Learning Architect for AWS, helping customers design and build AI/ML solutions. Mark’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, media and entertainment, healthcare, utilities, and manufacturing. Mark holds six AWS certifications, including the ML Specialty Certification. Prior to joining AWS, Mark was an architect, developer, and technology leader for over 25 years, including 19 years in financial services.

Ravi Khandelwal is a Software Dev Manager on the Amazon SageMaker team, leading engineering for SageMaker Feature Store. Prior to joining AWS, he held engineering leadership roles at Amazon.com, FICO, and Thomson Reuters. He has an MBA from Carlson School of Management and an engineering degree from Indian Institute of Technology, Varanasi. He enjoys backpacking in the Pacific Northwest and is working towards a goal to hike in all US National Parks.

Dr. Romi Datta is a Principal Product Manager on the Amazon SageMaker team, responsible for training and feature store. He has been at AWS for over 2 years, holding several product management leadership roles in S3 and IoT. Prior to AWS, he worked in various product management, engineering, and operational leadership roles at IBM, Texas Instruments, and Nvidia. He has an M.S. and Ph.D. in Electrical and Computer Engineering from the University of Texas at Austin, and an MBA from the University of Chicago Booth School of Business.

Saving time with personalized videos using AWS machine learning

CLIPr aspires to help save 1 billion hours of people’s time. We organize video into a first-class, searchable data source that unlocks the content most relevant to your interests using AWS machine learning (ML) services. CLIPr simplifies the extraction of information in videos, saving you hours by eliminating the need to skim through them manually to find the most relevant information. CLIPr provides simple AI-enabled tools to find, interact, and share content across videos, uncovering your buried treasure by converting unstructured information into actionable data and insights.

How CLIPr uses AWS ML services

At CLIPr, we’re leveraging the best of what AWS and the ML stack offer to delight our customers. At its core, CLIPr uses the latest ML, serverless, and infrastructure as code (IaC) design principles. AWS allows us to consume cloud resources just when we need them, and we can deploy a completely new customer environment in a couple of minutes with just one script. The second benefit is scale. Processing video requires an architecture that can scale vertically and horizontally by running many jobs in parallel.

As an early-stage startup, time to market is critical. Building models from the ground up for key CLIPr features like entity extraction, topic extraction, and classification would have taken us a long time to develop and train. We quickly delivered advanced capabilities by using AWS AI services for our applications and workflows. We used Amazon Transcribe to convert audio into searchable transcripts, Amazon Comprehend for text classification and organizing by relevant topics, Amazon Comprehend Medical to extract medical ontologies for a health care customer, and Amazon Rekognition to detect people’s names, faces, and meeting types for our first MVP. We were able to iterate fairly quickly and deliver quick wins that helped us close our pre-seed round with our investors.

Since then, we have started to upgrade our workflows and data pipelines to build in-house proprietary ML models, using the data we gathered in our training process. Amazon SageMaker has become an essential part of our solution. It’s a fabric that enables us to provide ML in a serverless model with unlimited scaling. The ease of use and flexibility to use any ML and deep learning framework of choice was an influencing factor. We’re using TensorFlow, Apache MXNet, and SageMaker notebooks.

Because we used open-source frameworks, we were able to attract and onboard data scientists to our team who are familiar with these platforms and quickly scale it in a cost-effective way. In just a few months, we integrated our in-house ML algorithms and workflows with SageMaker to improve customer engagement.

The following diagram shows our architecture of AWS services.

The more complex user experience is our Trainer UI, which allows human reviews of data collected via CLIPr’s AI processing engine in a timeline view. Humans can augment the AI-generated data and also fix potential issues. Human oversight helps us ensure accuracy and continuously improve and retrain models with updated predictions. An excellent example of this is speaker identification. We construct spectrographs from samples of the meeting speakers’ voices and video frames, and can identify and correlate the names and faces (if there is a video) of meeting participants. The Trainer UI also includes the ability to inspect the process workflow, and issues are flagged to help our data scientists understand what additional training may be required. A typical example of this is the visual cues used to identify when speaker names differ across meeting platforms.

Using CLIPr to create a personalized re:Invent video

We used CLIPr to process all the AWS re:Invent 2020 keynotes and leadership sessions to create a searchable video collection so you can easily find, interact, and share the moments you care about most across hundreds of re:Invent sessions. CLIPr became generally available in December 2020, and today we launched the ability for customers to upload their own content.

The following is an example of a CLIPr processed video of Andy’s keynote. You get to apply filters to the entire video to match topics that are auto-generated by CLIPr ML algorithms.

CLIPr dynamically creates a custom video from the keynote by aggregating the topics and moments that you select. Upon choosing Watch now, you can view your video composed of the topics and moments you selected. In this way, CLIPr is a video enrichment platform.

Our commenting and reaction features provide a co-viewing experience where you can see and interact with other users’ reactions and comments, adding collaborative value to the content. Back in the early days of AWS, low-flying-hawk was a huge contributor to the AWS user forums. The AWS team often sought low-flying-hawk’s thoughts on new features, pricing, and issues we were experiencing. Low-flying-hawk was like having a customer in our meetings without actually being there. Imagine what it would be like to have customers, AWS service owners, and presenters chime in and add context to the re:Invent presentations at scale.

Our customers very much appreciate the Smart Skip feature, where CLIPr gives you the option to skip to the beginning of the next topic of interest.

We built a natural language query and search capability so our customers can find moments quickly and easily. For instance, you can search “SageMaker” in CLIPr search. We do a deep search across our entire media assets, ranging from keywords and video transcripts to topics and moments, to present instant results. In a similar search (see the following screenshot), CLIPr highlights Andy’s keynote sessions, and also includes specific moments when SageMaker is mentioned in Swami Sivasubramanian and Matt Wood’s sessions.

CLIPr also enables advanced analytics capabilities using knowledge graphs, allowing you to understand the most important moments, including correlations across your entire video assets. The following is an example of the knowledge graph correlations from all the re:Invent 2020 videos filtered by topics, speakers, or specific organizations.

We provide a content library of re:Invent sessions, with all the keynotes and leadership sessions, to save you time and help you make the most of re:Invent. Try CLIPr in action with re:Invent videos, and see how CLIPr uses AWS to make it all happen.

Conclusion

Create an account at www.clipr.ai and create a personalized view of re:Invent content. You can also upload your own videos, so you can spend more time building and less time watching!

About the Authors

Humphrey Chen’s experience spans from product management at AWS and Microsoft to advisory roles with Noom, Dialpad, and GrayMeta. At AWS, he was Head of Product and then Key Initiatives for Amazon’s Computer Vision. Humphrey knows how to take an idea and make it real. His first startup was the equivalent of Shazam for FM radio and launched in 20 cities with AT&T and Sprint in 1999. Humphrey holds a Bachelor of Science degree from MIT and an MBA from Harvard.

Aaron Sloman is a Microsoft alum who launched several startups before joining CLIPr, with ventures including Nimble Software Systems, Inc., CrossFit Chalk, and speakTECH. Aaron was recently the architect and CTO for OWNZONES, a media supply chain and collaboration company, using advanced cloud and AI technologies for video processing.

Deepset achieves a 3.9x speedup and 12.8x cost reduction for training NLP models by working with AWS and NVIDIA

This is a guest post from deepset (creators of the open source frameworks FARM and Haystack), and was contributed to by authors from NVIDIA and AWS. 

At deepset, we’re building the next-level search engine for business documents. Our core product, Haystack, is an open-source framework that enables developers to utilize the latest NLP models for semantic search and question answering at scale. Our software as a service (SaaS) platform, Haystack Hub, is used by developers from various industries, including finance, legal, and automotive, to find answers in all kinds of text documents. You can use these answers to improve the search experience, cover the long-tail of chat bot queries, extract structured data from documents, or automate invoicing processes.

Pretrained language models like BERT, RoBERTa, and ELECTRA form the core for this latest type of semantic search and many other NLP applications. Although plenty of English models are available, the availability for other languages and more industry-specific terms (such as finance or automotive) is usually very limited and often complicates applications in the industry. Therefore, we regularly train language models for languages not covered by existing models (such as German BERT and German ELECTRA), models for special domains (such as finance and aerospace), or even models for client-specific jargon.

Challenge

Pretraining language models from scratch typically involves two major challenges: cost and development effort.

Training a language model is an extremely compute-intensive task and requires multiple GPUs running for multiple days. To give you a rough idea, training the original RoBERTa model took about 1 day on 1024 NVIDIA V100 GPUs.

Computation costs aren’t the only thing that can stress your budget. A considerable amount of manual development is required to create the training data and vocabulary, configure hyperparameters, start and monitor training jobs, and run periodic evaluations of different model checkpoints. In our first training runs, we also found several bugs only after multiple hours of training, resulting in a slow development cycle. In summary, language model training can be a painful job for a developer and easily consumes multiple days of work.

Solution

In a collaborative effort, AWS, NVIDIA, and deepset were able to complete training 3.9 times faster while lowering cost by 12.8 times and reducing developer effort from days to hours. We optimized the GPU utilization during training via PyTorch’s DistributedDataParallel (DDP) and enabled larger batch sizes by switching to Automatic Mixed Precision (AMP). Furthermore, we introduced a StreamingDataSilo that allows us to load the training data lazily from disk and to do the preprocessing on the fly, leading to a lower memory footprint and no initial preprocessing time. Last but not least, we integrated the training with Amazon SageMaker to reduce manual development effort and benefit from around a 70% cost reduction by using Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances.  

In this post, we explore each of these technologies and their impact on improving BERT training performance. 

DistributedDataParallel

DistributedDataParallel (DDP) implements distributed data parallelism in PyTorch. This is a key module that’s essential for running training jobs at scale, on multiple machines or on multiple GPUs in a single machine. DDP parallelizes a given network module by splitting the input across specified devices (GPUs). The input is split in the batch dimension. The network module is replicated on each device, and each such replica handles a slice of the input. The gradients from each device are averaged during a backward pass. DDP is used in conjunction with the torch.distributed framework, which handles all communication and synchronization in distributed training jobs with PyTorch.

Before DDP was introduced, torch.nn.DataParallel (DP) was the standard module for doing single-machine multi-GPU training in PyTorch. The system in DP works as follows:

  1. The entire batch of input is loaded on the main thread.
  2. The batch is split and scattered across all the GPUs in the network.
  3. Each GPU runs the forward pass on its split of the batch, on a separate thread.
  4. The network outputs are gathered on the master GPU, the loss value is computed, and this loss value is then scattered across the other GPUs.
  5. Each GPU uses the loss value to run a backward pass and compute the gradients.
  6. The gradients are reduced on the master GPU and the model parameters on the master GPU are updated. This completes one iteration of training.
  7. To ensure all GPUs have the latest model parameters, they are broadcast to the other GPUs at the start of the next iteration.

This design of DP has several inefficiencies:

  • Uneven GPU utilization – Only the primary GPU handles loss calculation, gradient reduction, and parameter updates, which leads to higher GPU memory consumption compared to the rest of the GPUs
  • Unnecessary broadcast at the beginning of each iteration – Because the model parameters are only updated in the master GPU, they need to be broadcast to the other GPUs before every iteration
  • Unnecessary gathering step – The model outputs are gathered on the master GPU, a step that isn’t needed if each GPU computes its own loss
  • Redundant data copies – Data is first copied to the master GPU, which is then split and copied over to the other GPUs
  • Multithreading overhead – Performance overhead caused by the Global Interpreter Lock (GIL) of the Python interpreter

DDP eliminates all such inefficiencies of DP. DDP uses a multiprocessing architecture, unlike the multithreaded one in DP. This means each GPU has its own dedicated process, which runs independently, and there is no master GPU anymore. Each process starts by loading its own split of data from the disk. Then the forward pass and loss computation are run independently on each GPU. This eliminates the need for gathering network outputs. During the backward pass, the gradients are AllReduced across the GPUs. Averaging the gradients with AllReduce ensures that the gradients in each GPU are identical. As a result, the model updates in each GPU are identical as well, which eliminates the need for parameter broadcast at the start of the next iteration.
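To see why averaged gradients keep every replica in sync, the following pure-Python sketch simulates one DDP-style step for a one-parameter linear model. It is a toy simulation under simplifying assumptions (a single scalar weight, a mean squared error loss, and a batch that divides evenly across workers) and does not use PyTorch or real GPUs. All function names are hypothetical.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Simulated AllReduce: every worker ends up with the mean of all values."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def ddp_step(w, xs, ys, world_size, lr=0.01):
    """One simulated step: local gradients per shard, AllReduce, identical updates."""
    shard = len(xs) // world_size  # assumes the batch divides evenly
    local_grads = [
        mse_gradient(w, xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(world_size)
    ]
    synced = all_reduce_mean(local_grads)
    # Each worker applies the same averaged gradient to its own copy of w.
    return [w - lr * g for g in synced]


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]  # the true weight is 3.0

replicas = ddp_step(0.5, xs, ys, world_size=4)
single_process = 0.5 - 0.01 * mse_gradient(0.5, xs, ys)
print(replicas, single_process)
```

With equal-sized shards, the mean of the per-shard gradients equals the gradient over the whole batch, so every replica lands on the same weight that a single process would compute and no parameter broadcast is needed.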

Results with DDP

With DP, the GPU memory utilization was skewed towards the master GPU, which consumed 15 GB, whereas all other GPUs consumed 6 GB. With DDP, the memory consumption was split equally across 4 GPUs, or about 9 GB on each. This also allowed us to increase the per-GPU batch size, which further contributed to the speedup in throughput. We reduced the number of gradient accumulation steps to keep the effective batch size roughly constant, to ensure that there was no impact on convergence.

The following table shows the results on a p3.8xlarge instance with 4 NVIDIA V100 GPUs. With DDP, the estimated training time dropped from 616 hours to 347 hours.

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
BERT Training with DP | 105 | 9 | 945 | 811 | 616
BERT Training with DDP | 240 | 4 | 960 | 1415 | 347
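The arithmetic behind the table can be checked with a few lines of Java. This sketch assumes, as the column values suggest, that the effective batch size is the batch size multiplied by the number of gradient accumulation steps; the class and method names are hypothetical.

```java
// Hypothetical helper that reproduces the arithmetic in the table above.
final class BatchMath {

    /** Effective batch size = batch size per step x gradient accumulation steps. */
    static int effectiveBatchSize(int batchSize, int accumulationSteps) {
        return batchSize * accumulationSteps;
    }

    /** Ratio of the baseline training time to the improved training time. */
    static double speedup(double baselineHours, double improvedHours) {
        return baselineHours / improvedHours;
    }

    public static void main(String[] args) {
        System.out.println("DP effective batch size:  " + effectiveBatchSize(105, 9));
        System.out.println("DDP effective batch size: " + effectiveBatchSize(240, 4));
        System.out.printf("Speedup from DP to DDP: %.2fx%n", speedup(616, 347));
    }
}
```

The two effective batch sizes (945 and 960) are close but not identical, because the larger per-step batch does not divide the original value evenly.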

The following screenshots show the GPU performance profiles captured by the NVIDIA Nsight Systems. The first screenshot shows the profile while running with DP. Light blue bars in the box marked as “GPU Utilization” show if GPUs are busy. Gaps between the blue areas show that the GPU utilization is zero. The red blocks in between are CPU to GPU memory copy operations. An additional blue block is between the memory copies, which is the aggregation operation that is computed on a single GPU.

The following screenshot shows high GPU utilization with DDP, which effectively eliminates all those inefficiencies.

Training with Automatic Mixed Precision

Automatic Mixed Precision (AMP) speeds up deep learning training with minimal impact to final accuracy. Traditionally, 32-bit precision floating point (FP32) variables are commonly used in deep learning training. You can improve speed of training with 16-bit precision floating point (FP16) variables because it requires lower storage and less memory bandwidth. However, training with lower precision could decrease the accuracy of the results. Mixed precision training is a balanced approach to achieve the computational speed up of lower precision training while maintaining accuracy close to FP32 precision training. Training with mixed precision provides additional speedup by using NVIDIA Tensor Cores, which are specialized hardware available on NVIDIA GPUs for accelerated computation.

To maintain accuracy, training with mixed precision involves the following steps:

  1. Port the model to use the FP16 datatype where appropriate.
  2. Handle specific functions or operations that must be done in FP32 to maintain accuracy.
  3. Add loss scaling to preserve small gradient values.

The AMP feature handles all these steps for deep learning training. As of this writing, all popular deep learning frameworks like PyTorch, TensorFlow, and Apache MXNet support AMP.

AMP in PyTorch was originally supported via the NVIDIA APEX library. Native AMP support was added to PyTorch core with the 1.6 release.

AMP in the APEX library provides four optimization levels (O0 through O3) for different use cases. O1 and O2 are both mixed precision modes with slight differences: O1 is the recommended level for typical use cases, whereas O2 more aggressively converts most layers to FP16. O0 and O3 are the pure FP32 and pure FP16 modes, respectively, and are intended for reference only.

The following table shows the impact of applying AMP along with DDP. Without AMP, a batch size of up to 240 could be run on the GPU. With AMP, the larger batch size of 320 could be supported, reducing the total training time from 347 hours to 223 hours.

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
BERT Training with DDP | 240 | 4 | 960 | 1415 | 347
BERT Training with DDP & AMP O1 | 304 | 3 | 912 | 2025 | 243
BERT Training with DDP & AMP O2 | 320 | 3 | 960 | 2210 | 223

As mentioned earlier, AMP O2 converts more layers into FP16 mode, so we can run DDP and AMP O2 with a larger batch size and get a better throughput compared to DDP and AMP O1. When selecting between these two opt levels, you should do a validation of the prediction results to make sure AMP O2 meets your accuracy requirements.

The following screenshot shows the GPU performance profile after applying DDP but running the deep learning training with FP32 variables. In this profile, we have added custom markers called NVTX markers, which show the time taken for each epoch, each step, and the time for the forward and backward pass.

The following screenshot shows the profile after enabling AMP with opt level O2. The time to run a forward and backward pass reduced significantly even though we increased the batch size for training when using AMP.

Earlier, we mentioned that AMP utilizes Tensor Cores available on NVIDIA GPU hardware for significant speedup for deep learning training. GPU performance profiles show when operations are utilizing Tensor Cores. The following screenshot shows a sample GPU kernel that is run in FP32 mode. The GPU operation is marked here as the volta_sgemm kernel.

The following screenshot shows similar operations run in FP16 mode, which utilizes Tensor Cores. Kernels running with Tensor Cores are marked as volta_fp16_s884gemm.

Data pipeline

The datasets used for training language models typically contain 10–200 GB of raw text data. Loading the whole dataset in RAM can be challenging. Furthermore, the typical pipeline of first running the preprocessing for all data and then pulling batches during training isn’t optimal because the up-front preprocessing can take multiple hours in which we don’t utilize the GPUs on the server.

Therefore, we introduced a StreamingDataSilo, which loads data lazily from disk just in time when it’s needed in the training loop. The whole preprocessing happens on the fly. Our implementation builds upon PyTorch’s IterableDataset and DistributedSampler, but requires some custom parts to ensure enough preprocessed batches are always in the queue for our trainer, so that the GPU never has to wait for the next batch to be ready. For implementation steps, see the GitHub repo. Together with an increased number of workers to fill the queue, this raised throughput by another 39% and cut the estimated training time by another 28%, as shown in the following table.

Run | Batch Size | Accumulation Steps | Effective Batch Size | Throughput (Effective Batches per Hour) | Total Estimated Training Time (Hours)
DDP with 8 workers | 320 | 3 | 960 | 2210 | 223
DDP with 16 workers | 320 | 3 | 960 | 3077 | 160
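The idea of keeping a queue of preprocessed batches filled by several workers can be sketched with a bounded queue. The following Java example is only an analogy, not the StreamingDataSilo implementation (which is written in Python on top of PyTorch); the class name, method name, and the use of integers as stand-ins for batches are all assumptions made for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: worker threads "preprocess" batches into a bounded queue
// while the trainer (the calling thread) consumes them as soon as they are ready.
final class PrefetchQueueSketch {

    /** Produces numBatches batch IDs with numWorkers threads and returns the sum of consumed IDs. */
    static long consumeAll(int numBatches, int numWorkers, int queueCapacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueCapacity);
        ExecutorService workers = Executors.newFixedThreadPool(numWorkers);
        for (int w = 0; w < numWorkers; w++) {
            final int workerId = w;
            workers.execute(() -> {
                // Each worker handles its own slice: workerId, workerId + numWorkers, ...
                for (int b = workerId; b < numBatches; b += numWorkers) {
                    try {
                        queue.put(b); // blocks while the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                }
            });
        }
        long sum = 0;
        try {
            for (int i = 0; i < numBatches; i++) {
                sum += queue.take(); // blocks only if no batch is ready yet
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } finally {
            workers.shutdownNow();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(consumeAll(100, 4, 8));
    }
}
```

With more workers filling the queue, the consumer is less likely to block on take(), which is the effect the worker count in the table is meant to capture.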

A second tricky case that we had to handle was related to the unknown number of batches in our dataset and the distributed training via DDP. If the batches in your dataset can’t be evenly distributed across the number of workers, some workers don’t get any batches in the last step of the epoch while others do. This asynchronicity can crash your whole training run or result in a deadlock (for a related PyTorch issue, see [RFC] Join-based API to support uneven inputs in DDP). We handled this by adding a small synchronization step where all workers communicate if they still have data left. For implementation details, see the GitHub repo.
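The effect of such a synchronization step can be illustrated with a small simulation. This Java sketch is an assumption about the general pattern, not FARM’s actual code: in a real DDP job, the "does everyone still have data?" check would be a collective operation across processes, whereas here it is a simple loop. All names are hypothetical.

```java
// Hypothetical simulation: every worker reports whether it still has a batch,
// and all workers stop together as soon as any one of them has run out.
final class UnevenBatchSync {

    /** Returns the number of training steps that all workers run together. */
    static int stepsAllWorkersRun(int[] batchesPerWorker) {
        if (batchesPerWorker.length == 0) {
            return 0;
        }
        int[] remaining = batchesPerWorker.clone();
        int steps = 0;
        while (true) {
            // Stand-in for the collective check across workers.
            boolean allHaveData = true;
            for (int r : remaining) {
                allHaveData &= r > 0;
            }
            if (!allHaveData) {
                break; // every worker leaves the epoch at the same step
            }
            for (int w = 0; w < remaining.length; w++) {
                remaining[w]--; // each worker consumes one batch
            }
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        System.out.println(stepsAllWorkersRun(new int[] {3, 3, 2, 3}));
    }
}
```

The trade-off is that a few leftover batches on some workers are skipped at the end of the epoch, in exchange for never leaving a worker waiting on a gradient exchange that its peers will not join.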

Spot Instances

Besides reducing the total training time, using EC2 Spot Instances is another compelling approach to reduce training costs. This is straightforward to configure in SageMaker: set the parameter EnableManagedSpotTraining to True when launching your training job. SageMaker launches your training job and saves checkpoints periodically. When your Spot Instance is interrupted, SageMaker takes care of spinning up a new instance and loading the latest checkpoint to continue training from there.

In your code, you need to make sure to save regular checkpoints containing the states of all the objects that are relevant for your training session. This includes not only your model weights, but also the states of your optimizer, the data loader, and all random number generators to replicate the results from your continuous runs without Spot Instances. For implementation details, see the GitHub repo.

In our test runs, we achieved around 70% cost savings in comparison to regular On-Demand Instances.

Conclusion

Language models have become the backbone of modern NLP. Although using existing public models works well in many cases, many domains with special languages can benefit from training a new model from scratch. Having a fast, simple, and cheap training pipeline is essential for these big training jobs. In addition, the increased efficiency of training jobs reduces our energy usage and lowers our carbon footprint. By tackling different areas of FARM’s training pipeline, we were able to significantly optimize the resource utilization. In the end, we achieved a 3.9 times speedup in training time, a 12.8 times reduction in training cost, and a reduction in the required developer effort from days to hours.

If you’re interested in training your own BERT model, you can look at the open-source code in FARM or try our free SageMaker algorithm on the AWS Marketplace.


About the Authors

Abhinav Sharma is a Software Engineer at AWS Deep Learning. He works on bringing state-of-the-art deep learning research to customers, building products that help customers use deep learning engines. Outside of work, he enjoys playing tennis, noodling on his guitar and watching thriller movies.

Malte Pietsch is Co-Founder & CTO at deepset, where he builds the next-level enterprise search engine fueled by open source and NLP. He holds a M.Sc. with honors from TU Munich and conducted research at Carnegie Mellon University. He is an open-source lover, likes reading papers before breakfast, and is obsessed with automating the boring parts of our work.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for AWS machine learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Jiahong Liu is a Solution Architect on the NVIDIA Cloud Service Provider team, where he helps customers adopt ML and AI solutions with better utilization of NVIDIA’s GPU to solve their business challenges.

Anish Mohan is a Machine Learning Architect at NVIDIA and the technical lead for ML and DL engagements with key NVIDIA customers in the greater Seattle region. Before NVIDIA, he was at Microsoft’s AI Division, working to develop and deploy AI and ML algorithms and solutions.

How to deliver natural conversational experiences using Amazon Lex Streaming APIs

Natural conversations often include pauses and interruptions. During customer service calls, a caller may ask to pause the conversation or hold the line while they look up the necessary information before continuing to answer a question. For example, callers often need time to retrieve credit card details when making bill payments. Interruptions are also common. Callers may interrupt a human agent with an answer before the agent finishes asking the entire question (for example, “What’s the CVV code for your credit card? It is the three-digit code top right corner.…”). Just like conversing with human agents, a caller interacting with a bot may interrupt or instruct the bot to hold the line. Previously, you had to orchestrate such dialog on Amazon Lex by managing client attributes and writing code via an AWS Lambda function. Implementing a hold pattern required code to keep track of the previous intent so that the bot could continue the conversation. The orchestration of these conversations was complex to build and maintain, and impacted the time to market for conversational interfaces. Moreover, the user experience was disjointed because the properties of prompts such as ability to interrupt were defined in the session attributes on the client.

Amazon Lex’s new streaming conversation APIs allow you to deliver sophisticated natural conversations across different communication channels. You can now easily configure pauses, interruptions and dialog constructs while building a bot with the Wait and Continue and Interrupt features. This simplifies the overall design and implementation of the conversation and makes it easier to manage. By using these features, the bot builder can quickly enhance the conversational capability of virtual agents or IVR systems.

In the new Wait and Continue feature, the ability to put the conversation into a waiting state is surfaced during slot elicitation. You can configure the slot to respond with a “Wait” message such as “Sure, let me know when you’re ready” when a caller asks for more time to retrieve information. You can also configure the bot to continue the conversation with a “Continue” response based on defined cues such as “I’m ready for the policy ID. Go ahead.” Optionally, you can set a “Still waiting” prompt to play messages like “I’m still here” or “Let me know if you need more time.” You can set the frequency of these messages to play and configure a maximum wait time for user input. If the caller doesn’t provide any input within the maximum wait duration, Amazon Lex resumes the dialog by prompting for the slot. The following screenshot shows the wait and continue configuration options on the Amazon Lex console.

The Interrupt feature enables callers to barge-in while a prompt is played by the bot. A caller may interrupt the bot and answer a question before the prompt is completed. This capability is surfaced at the prompt level and provided as a default setting. On the Amazon Lex console, navigate to the Advanced Settings and under Slot prompts, enable the setting to allow users to interrupt the prompt.

After configuring these features, you can initiate a streaming interaction with the Lex bot by using the StartConversation API. The streaming capability enables you to capture user input, manage state transitions, handle events, and deliver the responses required as part of a conversation. The input can be one of three types: audio, text, or DTMF, whereas the response can be either audio or text. The dialog progresses by eliciting an intent, populating any slots, confirming the intent, and finally closing the intent. Streaming allows intents to be defined based on different conversation states such as InProgress, Waiting, Confirmed, Denied, Fulfilled, ReadyForFulfillment, or Failed. For a detailed list of dialog and intent states, see the documentation for Amazon Lex intents.

The StartConversation API uses HTTP/2 for bidirectional communication between the client and Amazon Lex. Clients are expected to maintain a single long-running stream throughout the conversation and exchange required information with the server (and vice versa) via different events. To learn more about streaming for Amazon Lex bots, see the Lex streaming documentation.

Amazon Lex Streaming APIs in action

Let’s put all of this together in a bot. We first use the Amazon Lex console to configure a bot and then use a sample telephony application to interact with the bot. For the purposes of this post, we use a simple bot to complete a bill payment with a credit card. Because callers often need time to retrieve their credit card details, we enable the wait and continue construct when eliciting the credit card number. We also set up a service-side interrupt to manage interruption.

The following is a sample conversation with our bot:

Caller:    I’d like to make a payment on my account
Bot:        Sure. Can we start off with your account ID?
Caller:    Yes the account ID is 1234
Bot:        Great. The balance due is $100. How much would you like to pay today?
Caller:    $100
Bot:        Ok. Can you share the 16-digit number on your credit card?
Caller:    Can you wait a minute?
Bot:        Sure

—————- a few seconds later —————-

Bot:        Let me know when ready. No rush.

—————- a few seconds later —————-

Caller:    Ok I am back
Bot:        Ok please go ahead
Caller:    The credit card number is 1234 5678 1234 5678
Bot:        Got it. Let me repeat that for you 1234 5678 1234 5678. Is that correct?
Caller:    Yes
Bot:        Ok. What’s the CVV code? You can find it in the top right on the signature panel on the back of the card.
Caller:    123
Bot:        Awesome. I will go ahead and process the $100 payment on your account using the card ending in 5678.
Caller:    Ok
Bot:        The payment went through. Your confirmation code is 1234.

The first step is to build an Amazon Lex bot with intents to process payment and get balance on the account. The ProcessPayment intent elicits the information required to process the payment, such as the payment amount, credit card number, CVV code, and expiration date. The GetBalanceAmount intent provides the balance on the account. The FallbackIntent is triggered when the user input can’t be processed by either of the two configured intents.

Deploying the sample bot

To create the sample bot, complete the following steps. This creates an Amazon Lex bot called PaymentsBot.

  1. On the Amazon Lex console, choose Create Bot.
  2. In the Bot configuration section, give the bot the name PaymentsBot.
  3. Specify AWS Identity and Access Management (IAM) permissions and COPPA flag.
  4. Choose Next.
  5. Under Languages, choose English (US).
  6. Choose Done.
  7. Add the ProcessPayment and GetBalanceAmount intents to your bot.
  8. For the ProcessPayment intent, add the following slots:
    1. PaymentAmount slot using the built-in AMAZON.Number slot type
    2. CreditCardNumber slot using the built-in AMAZON.AlphaNumeric slot type
    3. CVV slot using the built-in AMAZON.Number slot type
    4. ExpirationDate slot using the built-in AMAZON.Date slot type
  9. Configure slot elicitation prompts for each slot.
  10. Configure a closing response for the ProcessPayment intent.
  11. Similarly, add and configure slots and prompts for the GetBalanceAmount intent.
  12. Choose Build to test your bot.

For more information about creating a bot, see the Lex V2 documentation.

Configuring Wait and Continue

  1. Choose the ProcessPayment intent and navigate to the CreditCardNumber slot.
  2. Choose Advanced Settings to open the slot editor.
  3. Enable Wait and Continue for the slot.
  4. Provide the Wait, Still Waiting, and Continue responses.
  5. Save the intent and choose Build.

The bot is now configured to support the Wait and Continue dialog construct. Now let’s configure the client code. You can use a telephony application to interact with your Lex bot. You can download the code for setting up a telephony IVR interface via Twilio at the GitHub project. The link contains information to set up a telephony interface as well as a client application code to communicate between the telephony interface and Amazon Lex.

Now, let us review the client-side setup to use the bot configuration that we just enabled on the Amazon Lex console. The client application uses the Java SDK to capture payment information. In the beginning, you use the ConfigurationEvent to set up the conversation parameters. Then, you start sending an input event (AudioInputEvent, TextInputEvent or DTMFInputEvent) to send user input to the bot depending on the input type. When sending audio data, you would need to send multiple AudioInputEvent events, with each event containing a slice of the data.

The service first responds with TranscriptEvent to give transcription, then sends the IntentResultEvent to surface the intent classification results. Subsequently, Amazon Lex sends a response event (TextResponseEvent or AudioResponseEvent) that contains the response to play back to caller. If the caller requests the bot to hold the line, the intent is moved to the Waiting state and Amazon Lex sends another set of TranscriptEvent, IntentResultEvent and a response event. When the caller requests to continue the conversation, the intent is set to the InProgress state and the service sends another set of TranscriptEvent, IntentResultEvent and a response event. While the dialog is in the Waiting state, Amazon Lex responds with a set of IntentResultEvent and response event for every “Still waiting” message (there is no transcript event for server-initiated responses). If the caller interrupts the bot prompt at any time, Amazon Lex returns a PlaybackInterruptionEvent.
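The event sequences described above can be summarized as a small state machine. The following Java sketch is only a model of that description, not part of the AWS SDK: the enum and method names are hypothetical, and "ResponseEvent" stands in for either TextResponseEvent or AudioResponseEvent.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the Wait and Continue flow described in the text.
final class WaitAndContinueSketch {

    enum IntentState { IN_PROGRESS, WAITING }

    enum Trigger { CALLER_ANSWERS, CALLER_ASKS_TO_WAIT, CALLER_ASKS_TO_CONTINUE, STILL_WAITING_PROMPT }

    /** Returns the intent state after the given trigger. */
    static IntentState nextState(IntentState current, Trigger trigger) {
        switch (trigger) {
            case CALLER_ASKS_TO_WAIT:
                return IntentState.WAITING;
            case CALLER_ASKS_TO_CONTINUE:
                return IntentState.IN_PROGRESS;
            default:
                return current; // answers and "Still waiting" prompts don't change the state
        }
    }

    /** Returns the events the client should expect from the service for the given trigger. */
    static List<String> expectedEvents(Trigger trigger) {
        if (trigger == Trigger.STILL_WAITING_PROMPT) {
            // Server-initiated, so there is no transcript of caller input.
            return Arrays.asList("IntentResultEvent", "ResponseEvent");
        }
        return Arrays.asList("TranscriptEvent", "IntentResultEvent", "ResponseEvent");
    }

    public static void main(String[] args) {
        IntentState state = IntentState.IN_PROGRESS;
        for (Trigger t : Trigger.values()) {
            state = nextState(state, t);
            System.out.println(t + " -> " + state + " " + expectedEvents(t));
        }
    }
}
```

A PlaybackInterruptionEvent is left out of this model because, per the description above, it can arrive at any time the caller interrupts a prompt.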

Let’s walk through the main elements of the client code:

  1. Create the Amazon Lex client:
    AwsCredentialsProvider awsCredentialsProvider = StaticCredentialsProvider
            .create(AwsBasicCredentials.create(accessKey, secretKey));
    
    LexRuntimeV2AsyncClient lexRuntimeServiceClient = LexRuntimeV2AsyncClient.builder()
            .region(region)
            .credentialsProvider(awsCredentialsProvider)
            .build();

  2. Create a handler to publish data to server:
    EventsPublisher eventsPublisher = new EventsPublisher();

  3. Create a handler to process bot responses:
    public class BotResponseHandler implements StartConversationResponseHandler {
    
        private static final Logger LOG = Logger.getLogger(BotResponseHandler.class);
    
    
        @Override
        public void responseReceived(StartConversationResponse startConversationResponse) {
            LOG.info("successfully established the connection with server. request id:" + startConversationResponse.responseMetadata().requestId()); // would have 2XX, request id.
        }
    
        @Override
        public void onEventStream(SdkPublisher<StartConversationResponseEventStream> sdkPublisher) {
    
            sdkPublisher.subscribe(event -> {
                if (event instanceof PlaybackInterruptionEvent) {
                    handle((PlaybackInterruptionEvent) event);
                } else if (event instanceof TranscriptEvent) {
                    handle((TranscriptEvent) event);
                } else if (event instanceof IntentResultEvent) {
                    handle((IntentResultEvent) event);
                } else if (event instanceof TextResponseEvent) {
                    handle((TextResponseEvent) event);
                } else if (event instanceof AudioResponseEvent) {
                    handle((AudioResponseEvent) event);
                }
            });
        }
    
        @Override
        public void exceptionOccurred(Throwable throwable) {
            LOG.error(throwable);
            System.err.println("got an exception:" + throwable);
        }
    
        @Override
        public void complete() {
            LOG.info("on complete");
        }
    
        private void handle(PlaybackInterruptionEvent event) {
            LOG.info("Got a PlaybackInterruptionEvent: " + event);
    
            LOG.info("Done with a  PlaybackInterruptionEvent: " + event);
        }
    
        private void handle(TranscriptEvent event) {
            LOG.info("Got a TranscriptEvent: " + event);
        }
    
    
        private void handle(IntentResultEvent event) {
            LOG.info("Got an IntentResultEvent: " + event);
    
        }
    
        private void handle(TextResponseEvent event) {
            LOG.info("Got a TextResponseEvent: " + event);
    
        }
    
        private void handle(AudioResponseEvent event) {//synthesize speech
            LOG.info("Got an AudioResponseEvent: " + event);
        }
    
    }

  4. Initiate the connection with the bot:
    StartConversationRequest.Builder startConversationRequestBuilder = StartConversationRequest.builder()
            .botId(botId)
            .botAliasId(botAliasId)
            .localeId(localeId);
    
    // configure the conversation mode with bot (defaults to audio)
    startConversationRequestBuilder = startConversationRequestBuilder.conversationMode(ConversationMode.AUDIO);
    
    // assign a unique identifier for the conversation
    startConversationRequestBuilder = startConversationRequestBuilder.sessionId(sessionId);
    
    // build the initial request
    StartConversationRequest startConversationRequest = startConversationRequestBuilder.build();
    
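    // create an instance of the BotResponseHandler class shown earlier to receive bot responses
    BotResponseHandler botResponseHandler = new BotResponseHandler();
    
    CompletableFuture<Void> conversation = lexRuntimeServiceClient.startConversation(
            startConversationRequest,
            eventsPublisher,
            botResponseHandler);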

  5. Establish the configurable parameters via ConfigurationEvent:
    public void configureConversation() {
        String eventId = "ConfigurationEvent-" + eventIdGenerator.incrementAndGet();
    
        ConfigurationEvent configurationEvent = StartConversationRequestEventStream
                .configurationEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .responseContentType(RESPONSE_TYPE)
                .build();
    
        eventWriter.writeConfigurationEvent(configurationEvent);
        LOG.info("sending a ConfigurationEvent to server:" + configurationEvent);
    }

  6. Send audio data to the server:
    public void writeAudioEvent(ByteBuffer byteBuffer) {
        String eventId = "AudioInputEvent-" + eventIdGenerator.incrementAndGet();
    
        AudioInputEvent audioInputEvent = StartConversationRequestEventStream
                .audioInputEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .audioChunk(SdkBytes.fromByteBuffer(byteBuffer))
                .contentType(AUDIO_CONTENT_TYPE)
                .build();
    
        eventWriter.writeAudioInputEvent(audioInputEvent);
    }

  7. Manage interruptions on the client side:
    private void handle(PlaybackInterruptionEvent event) {
        LOG.info("Got a PlaybackInterruptionEvent: " + event);
    
        callOperator.pausePlayback();
    
        LOG.info("Done with a  PlaybackInterruptionEvent: " + event);
    }

  8. Close the connection:
    public void disconnect() {
    
        String eventId = "DisconnectionEvent-" + eventIdGenerator.incrementAndGet();
    
        DisconnectionEvent disconnectionEvent = StartConversationRequestEventStream
                .disconnectionEventBuilder()
                .eventId(eventId)
                .clientTimestampMillis(System.currentTimeMillis())
                .build();
    
        eventWriter.writeDisconnectEvent(disconnectionEvent);
    
        LOG.info("sending a DisconnectionEvent to server:" + disconnectionEvent);
    }

You can now deploy the bot on your desktop to test it out.

Things to know

The following are a few important things to keep in mind when you’re using the Amazon Lex V2 console and APIs:

  • Regions and languages – The Streaming APIs are available in all existing Regions and support all current languages.
  • Interoperability with Lex V1 console – Streaming APIs are only available in the Lex V2 console and APIs.
  • Integration with Amazon Connect – As of this writing, Lex V2 APIs are not supported on Amazon Connect. We plan to provide this integration as part of our near-term roadmap.
  • Pricing – Please see the details on the Lex pricing page.

Try it out

Amazon Lex Streaming API is available now and you can start using it today. Give it a try, design a bot, launch it and let us know what you think! To learn more, please see the Lex streaming API documentation.


About the Authors

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.

 

 

 

Swapandeep Singh is an engineer with Amazon Lex team. He works on making interactions with bot smoother and more human-like. Outside of work, he likes to travel and learn about different cultures.

Model serving in Java with AWS Elastic Beanstalk made easy with Deep Java Library

Deploying your machine learning (ML) models to run on a REST endpoint has never been easier. Using AWS Elastic Beanstalk and Amazon Elastic Compute Cloud (Amazon EC2) to host your endpoint and Deep Java Library (DJL) to load your deep learning models for inference makes the model deployment process extremely easy to set up. Setting up a model on Elastic Beanstalk is great if you require fast response times on all your inference calls. In this post, we cover deploying a model on Elastic Beanstalk using DJL and sending an image through a POST call to get inference results on what the image contains.

About DJL

DJL is a deep learning framework written in Java that supports training and inference. DJL is built on top of modern deep learning engines (such as TensorFlow, PyTorch, and MXNet). You can easily use DJL to train your model or deploy your favorite models from a variety of engines without any additional conversion. It contains a powerful model zoo design that allows you to manage trained models and load them in a single line. The built-in model zoo currently supports more than 70 pre-trained and ready-to-use models from GluonCV, HuggingFace, TorchHub, and Keras.

Benefits

The primary benefit of hosting your model using Elastic Beanstalk and DJL is that it’s very easy to set up and provides consistent sub-second responses to a POST request. With DJL, you don’t need to download any other libraries or worry about importing dependencies for your chosen deep learning framework. Using Elastic Beanstalk has two advantages:

  • No cold startup – Compared to an AWS Lambda solution, the EC2 instance is running all the time, so any call to your endpoint runs instantly and there isn’t any overhead when starting up new containers.
  • Scalable – Compared to a server-based solution, you can allow Elastic Beanstalk to scale horizontally.

Configurations

You need to have the following Gradle dependencies set up to run our PyTorch model:

plugins {
    id 'org.springframework.boot' version '2.3.0.RELEASE'
    id 'io.spring.dependency-management' version '1.0.9.RELEASE'
    id 'java'
}

dependencies {
    implementation platform("ai.djl:bom:0.8.0")
    implementation "ai.djl.pytorch:pytorch-model-zoo"
    implementation "ai.djl.pytorch:pytorch-native-auto"
    
    implementation "org.springframework.boot:spring-boot-starter"
    implementation "org.springframework.boot:spring-boot-starter-web"
}

The code

We first create a RESTful endpoint using Java Spring Boot and have it accept an image request. We decode the image and turn it into an Image object to pass into our model. The model is autowired by the Spring framework, which calls the model() method. For simplicity, we create the predictor object on each request, where we pass our image for inference (you can optimize this by using an object pool). When inference is complete, we return the results to the requester. See the following code:

    @Autowired ZooModel<Image, Classifications> model;

    /**
     * This method is the REST endpoint where the user can post their images
     * to run inference against a model of their choice using DJL.
     *
     * @param input the request body containing the image
     * @return returns the top 3 probable items from the model output
     * @throws IOException if the HTTP request can't be read
     */
    @PostMapping(value = "/doodle")
    public String handleRequest(InputStream input) throws IOException {
        Image img = ImageFactory.getInstance().fromInputStream(input);
        try (Predictor<Image, Classifications> predictor = model.newPredictor()) {
            Classifications classifications = predictor.predict(img);
            return GSON.toJson(classifications.topK(3)) + System.lineSeparator();
        } catch (RuntimeException | TranslateException e) {
            logger.error("", e);
            Map<String, String> error = new ConcurrentHashMap<>();
            error.put("status", "Invoke failed: " + e.toString());
            return GSON.toJson(error) + System.lineSeparator();
        }
    }

    @Bean
    public ZooModel<Image, Classifications> model() throws ModelException, IOException {
        Translator<Image, Classifications> translator =
                ImageClassificationTranslator.builder()
                        .optFlag(Image.Flag.GRAYSCALE)
                        .setPipeline(new Pipeline(new ToTensor()))
                        .optApplySoftmax(true)
                        .build();
        Criteria<Image, Classifications> criteria = Criteria.builder()
                .setTypes(Image.class, Classifications.class)
                .optModelUrls(MODEL_URL)
                .optTranslator(translator)
                .build();
        return ModelZoo.loadModel(criteria);
    }
    

A full copy of the code is available on the GitHub repo.
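
The object pool mentioned earlier can be sketched as follows. This is a minimal illustration rather than the demo's implementation: SimplePool and its Task interface are hypothetical names, not part of DJL or the demo repo, and the pool is generic so the sketch runs without DJL on the classpath.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Supplier;

/** Minimal fixed-size object pool (hypothetical helper, not part of DJL). */
class SimplePool<T> {

    /** Work to run against a borrowed object; may throw checked exceptions. */
    interface Task<I, R> {
        R apply(I item) throws Exception;
    }

    private final BlockingQueue<T> idle;

    /** Creates all pooled objects up front using the given factory. */
    SimplePool(int size, Supplier<T> factory) {
        idle = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            idle.add(factory.get());
        }
    }

    /** Borrows an object (blocking if none is idle), runs the task, and always returns the object. */
    <R> R use(Task<T, R> task) {
        T item;
        try {
            item = idle.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a pooled object", e);
        }
        try {
            return task.apply(item);
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            // Wrap checked exceptions so callers are not forced to declare them.
            throw new IllegalStateException("Pooled task failed", e);
        } finally {
            idle.add(item);
        }
    }

    /** Number of objects currently idle in the pool. */
    int available() {
        return idle.size();
    }
}
```

In the endpoint above, such a pool could be created once with model::newPredictor as the factory, and handleRequest could call pool.use(p -> p.predict(img)) instead of opening a new predictor per request. The pooled predictors would then need to be closed when the application shuts down.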

Building your JAR file

Go into the beanstalk-model-serving directory and run the following commands:

cd beanstalk-model-serving
./gradlew build

This creates a JAR file at build/libs/beanstalk-model-serving-0.0.1-SNAPSHOT.jar.

Deploying to Elastic Beanstalk

To deploy this model, complete the following steps:

  1. On the Elastic Beanstalk console, create a new environment.
  2. For our use case, we name the environment DJL-Demo.
  3. For Platform, select Managed platform.
  4. For Platform settings, choose Java 8 and the appropriate branch and version.
  5. When selecting your application code, choose Choose file and upload the beanstalk-model-serving-0.0.1-SNAPSHOT.jar that was created in your build.
  6. Choose Create environment.

After Elastic Beanstalk creates the environment, we need to update the Software and Capacity boxes in our configuration, located on the Configuration overview page.

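  1. For the Software configuration, we add a setting in the Environment Properties section with the name SERVER_PORT and value 5000. Elastic Beanstalk's proxy forwards requests to port 5000 by default, and Spring Boot reads SERVER_PORT as the port to listen on.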
  2. For the Capacity configuration, we change the instance type to t2.small to give our endpoint a little more compute and memory.
  3. Choose Apply configuration and wait for your endpoint to update.

Calling your endpoint

Now we can call our Elastic Beanstalk endpoint with our image of a smiley face.

See the following code (the path must match the value of the endpoint's @PostMapping annotation):

curl -X POST -T smiley.png <endpoint>.elasticbeanstalk.com/inference

We get the following response:

[
  {
    "className": "smiley_face",
    "probability": 0.9874626994132996
  },
  {
    "className": "face",
    "probability": 0.004804758355021477
  },
  {
    "className": "mouth",
    "probability": 0.0015588520327582955
  }
]

The output predicts that a smiley face is the most probable item in our image. Success!
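
The same call can also be made from Java. The following is a minimal sketch, assuming a client running on Java 11 or later (for java.net.http), which is separate from the Java 8 platform the endpoint runs on. DoodleClient and its method names are hypothetical, and the endpoint URL is passed in by the caller.

```java
import java.io.FileNotFoundException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Hypothetical client that posts an image file to the inference endpoint. */
class DoodleClient {

    /** Builds a POST request whose body is the raw bytes of the image file. */
    static HttpRequest buildRequest(String endpointUrl, Path image) {
        try {
            return HttpRequest.newBuilder(URI.create(endpointUrl))
                    .POST(HttpRequest.BodyPublishers.ofFile(image))
                    .build();
        } catch (FileNotFoundException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sends the request and returns the JSON response body as text. */
    static String classify(String endpointUrl, Path image) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(endpointUrl, image), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling DoodleClient.classify with the full URL of your endpoint (including the http:// scheme) and Path.of("smiley.png") should return the same JSON array that curl prints.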

Limitations

If your model isn’t called often and you don’t need fast inference, we recommend deploying it on a serverless service such as Lambda, although the cold starts of such a service add overhead. Hosting your models through Elastic Beanstalk may be slightly more expensive because the EC2 instance runs 24 hours a day, so you pay for the service even when you’re not using it. The gap narrows as monthly request volume grows: we have found that the cost of model serving on Lambda equals the cost of Elastic Beanstalk using a t3.small at about 2.57 million inference requests to the endpoint.
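
The break-even reasoning can be written out as a small calculation. This is only a sketch of the arithmetic: the post doesn't state the prices, memory size, or request duration behind the 2.57 million figure, so every input below is a parameter you supply, and the BreakEven class and its method names are hypothetical.

```java
/** Hypothetical helper for comparing per-request pricing with a fixed monthly instance cost. */
class BreakEven {

    /**
     * Cost of one serverless invocation: a flat per-request fee plus a
     * compute charge of (price per GB-second) x (memory in GB) x (duration in seconds).
     */
    static double lambdaCostPerRequest(double requestFee, double gbSecondPrice,
                                       double memoryGb, double durationSeconds) {
        return requestFee + gbSecondPrice * memoryGb * durationSeconds;
    }

    /** Monthly request count at which per-request spend equals the fixed monthly instance cost. */
    static double breakEvenRequests(double monthlyInstanceCost, double costPerRequest) {
        return monthlyInstanceCost / costPerRequest;
    }
}
```

Below the break-even count, paying per request is cheaper. Above it, the always-on instance is cheaper, provided a single instance can handle the load.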

Conclusion

In this post, we demonstrated how to start deploying and serving your deep learning models using Elastic Beanstalk and DJL. You just need to set up your endpoint with Spring Boot, build your JAR file, upload that file to Elastic Beanstalk, update some configurations, and it’s deployed!

We also discussed some of the pros and cons of this deployment process, namely that it’s ideal if you need fast inference calls, but it costs more than a serverless endpoint when utilization is low.

This demo is available in full in the DJL demo GitHub repo. You can also find other examples of serving models with DJL across different JVM tools like Spark and AWS products like Lambda. Whatever your requirements, there is an option for you.

Follow our GitHub, demo repository, Slack channel, and Twitter for more documentation and examples of DJL!

About the Author

Frank Liu is a Software Engineer for AWS Deep Learning. He focuses on building innovative deep learning tools for software engineers and scientists. In his spare time, he enjoys hiking with friends and family.
