Train a time series forecasting model faster with Amazon SageMaker Canvas Quick build

Today, Amazon SageMaker Canvas introduces the ability to use the Quick build feature with time series forecasting use cases. This allows you to train models and generate the associated explainability scores in under 20 minutes, at which point you can generate predictions on new, unseen data. Quick build training enables faster experimentation to understand how well the model fits the data and which columns drive the prediction, and allows business analysts to run experiments with varied datasets so they can select the best-performing model.

Canvas expands access to machine learning (ML) by providing business analysts with a visual point-and-click interface that allows you to generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.

In this post, we showcase how to train a time series forecasting model faster with Quick build training in Canvas.

Solution overview

Until today, training a time series forecasting model took up to 4 hours via the standard build method. Although that approach has the benefit of prioritizing accuracy over training time, it frequently led to long training times, which in turn prevented the fast experimentation that business analysts across all sorts of organizations seek. Starting today, Canvas lets you use the Quick build feature to train a time series forecasting model, adding to the use cases for which it was already available (binary and multi-class classification and numerical regression). Now you can train a model and get explainability information in under 20 minutes, with everything in place to start generating predictions.

To use the Quick build feature for time series forecasting ML use cases, all you need to do is upload your dataset to Canvas, configure the training parameters (such as the target column), and then choose Quick build instead of Standard build (which was the only available option for this type of ML use case before today). Note that Quick build is only available for datasets with fewer than 50,000 rows.

Let’s walk through a scenario of applying the Quick build feature to a real-world ML use case involving time series data and getting actionable results.

Create a Quick build in Canvas

Anyone who has worked with ML knows that the end result is only as good as the training dataset. No matter how well suited the algorithm you use to train the model is, the quality of the inferences on unseen data won't be satisfactory if the training data isn't representative of the use case, is biased, or has frequent missing values.

For the purposes of this post, we use a sample synthetic dataset that contains demand and pricing information for various items at a given time period, specified with a timestamp (a date field in the CSV file). The dataset is available on GitHub. The following screenshot shows the first ten rows.

Solving a business problem using no-code ML with Canvas is a four-step process: import the dataset, build the ML model, check its performance, and then use the model to generate predictions (also known as inference in ML terminology). If you’re new to Canvas, a prompt walking you through the process appears. Feel free to spend a couple of minutes with the in-app tutorial if you want, otherwise you can choose Skip for now. There’s also a dedicated Getting Started guide you can follow to immerse yourself fully in the service if you want a more detailed introduction.

We start by uploading the dataset. Complete the following steps:

  1. On the Datasets page, choose Import Data.
  2. Upload data from local disk or other sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Snowflake, to load the sample dataset. The product_demand.csv file now shows in the list of datasets.
  3. Open product_demand.csv and choose Create a model to start the model creation process.
    You’re redirected to the Build tab of the Canvas app to start the next step of the Canvas workflow.
  4. First, we select the target variable, the value that we’re trying to predict as a function of the other variables available in the dataset. In our case, that’s the demand variable.
    Canvas automatically infers that this is a time series forecasting problem.
    For Canvas to solve the time series forecasting use case, we need to set up a couple of configuration options.
  5. Specify which column uniquely identifies the items in the dataset, where the timestamps are stored, and the horizon of predictions (how many months into the future we want to look at).
  6. Additionally, we can provide a holiday schedule, which can be helpful in some use cases that benefit from having this information, such as retail or supply chain use cases.
  7. Choose Save.

    Choosing the right prediction horizon is of paramount importance for a good time series forecasting use case. The greater the value, the further into the future we generate the prediction; however, the forecast is less likely to be accurate due to its probabilistic nature. A higher value also means a longer time to train, as well as more resources needed for both training and inference. Finally, it's a best practice to have historical data points covering at least 3–5 times the forecast horizon. If you want to predict 6 months into the future (as in our example), you should have at least 18 months of historical data, and ideally up to 30 months.
  8. After you save these configurations, choose Quick build.

Canvas launches an in-memory AutoML process that trains multiple time series forecasting models with different hyperparameters. In less than 20 minutes (depending on the dataset), Canvas will output the best model performance in the form of five metrics.

Let’s dive deep into the advanced metrics for time series forecasts in Canvas, and how we can make sense of them:

  • Average weighted quantile loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles. A lower value indicates a more accurate model.
  • Weighted absolute percent error (WAPE) – The sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
  • Root mean square error (RMSE) – The square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
  • Mean absolute percent error (MAPE) – The percentage error (percent difference of the mean forecasted value versus the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
  • Mean absolute scaled error (MASE) – The mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.

For more information about advanced metrics, refer to Use advanced metrics in your analyses.
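To make these definitions concrete, the following is a hedged illustration (not Canvas' internal implementation) of how the point-forecast metrics can be computed with NumPy. The arrays y_train, y_true, and y_pred are hypothetical actual and forecasted demand values used only for this example; average wQL is computed analogously from the P10, P50, and P90 quantile forecasts.

import numpy as np

def wape(y_true, y_pred):
    # Sum of absolute errors normalized by the sum of absolute actuals
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

def rmse(y_true, y_pred):
    # Square root of the average squared error
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    # Absolute percentage error averaged over all time points
    return np.mean(np.abs((y_true - y_pred) / y_true))

def mase(y_true, y_pred, y_train, seasonality=1):
    # Forecast MAE normalized by the MAE of a naive (lag) baseline on the history
    naive_mae = np.mean(np.abs(y_train[seasonality:] - y_train[:-seasonality]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([112.0, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])
y_true = np.array([115.0, 126, 141])
y_pred = np.array([118.0, 123, 138])
print(wape(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred), mase(y_true, y_pred, y_train))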

Built-in explainability is part of the value proposition of Canvas, because it provides information about column impact on the Analyze tab. In this use case, we can see that price has a great impact on the value of demand. This makes sense because a very low price would increase demand by a large margin.

Predictions and what-if scenarios

After we’ve analyzed the performance of our model, we can use it to generate predictions and test what-if scenarios.

  1. On the Predict tab, choose Single item.
  2. Choose an item (for this example, item_002).

The following screenshot shows the forecast for item_002.

We can expect an increase in demand in the coming months. Canvas also provides a probabilistic threshold around the expected forecast, so we can decide whether to take the upper bound of the prediction (with the risk of over-allocation) or the lower bound (risking under-allocation). Use these values with caution, and apply your domain knowledge to determine the best prediction for your business.

Canvas also supports what-if scenarios, which make it possible to see how changing values in the dataset affects the overall forecast for a single item, directly on the forecast plot. For the purposes of this post, we simulate a 2-month campaign where we introduce a 50% discount, cutting the price from $120 to $60.

  1. Choose What if scenario.
  2. Choose the values you want to change (for this example, November and December).
  3. Choose Generate prediction.

    We can see that the changed price introduces a spike in the demand for the product during the months affected by the discount campaign, after which demand slowly returns to the values from the previous forecast.
    As a final test, we can determine the impact of definitively changing the price of a product.
  4. Choose Try new what-if scenario.
  5. Select Bulk edit all values.
  6. For New Value, enter 70.
  7. Choose Generate prediction.

This is a lower price than the initial $100–120, so we expect a sharp increase in product demand. This is confirmed by the forecast, as shown in the following screenshot.

Clean up

To avoid incurring future session charges, log out of SageMaker Canvas.

Conclusion

In this post, we walked you through the Quick build feature for time series forecasting models and the updated metrics analysis view. Both are available as of today in all Regions where Canvas is available. For more information, refer to Build a model and Use advanced metrics in your analyses.


Start experimenting with Canvas today, and build your time series forecasting models in under 20 minutes, using the 2-month Free Tier that Canvas offers.


About the Authors

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Nikiforos Botis is a Solutions Architect at AWS, looking after the public sector of Greece and Cyprus, and is a member of the AWS AI/ML technical community. He enjoys working with customers on architecting their applications in a resilient, scalable, secure, and cost-optimized way.


Use Amazon SageMaker Canvas for exploratory data analysis

Exploratory data analysis (EDA) is a common task performed by business analysts to discover patterns, understand relationships, validate assumptions, and identify anomalies in their data. In machine learning (ML), it’s important to first understand the data and its relationships before getting into model building. Traditional ML development cycles can sometimes take months and require advanced data science and ML engineering skills, whereas no-code ML solutions can help companies accelerate the delivery of ML solutions to days or even hours.

Amazon SageMaker Canvas is a no-code ML tool that helps business analysts generate accurate ML predictions without writing code or requiring any ML experience. Canvas provides an easy-to-use visual interface to load, cleanse, and transform datasets, followed by building ML models and generating accurate predictions.

In this post, we walk through how to perform EDA to gain a better understanding of your data before building your ML model, thanks to Canvas’ built-in advanced visualizations. These visualizations help you analyze the relationships between features in your datasets and comprehend your data better. This is done intuitively, with the ability to interact with the data and discover insights that may go unnoticed with ad hoc querying. They can be created quickly through the ‘Data visualizer’ within Canvas prior to building and training ML models.

Solution overview

These visualizations add to the range of capabilities for data preparation and exploration already offered by Canvas, including the ability to correct missing values and replace outliers; filter, join, and modify datasets; and extract specific time values from timestamps. To learn more about how Canvas can help you cleanse, transform, and prepare your dataset, check out Prepare data with advanced transformations.

For our use case, we look at why customers churn in any business and illustrate how EDA can help from the viewpoint of an analyst. The dataset we use in this post is a synthetic dataset from a telecommunications mobile phone carrier for customer churn prediction that you can download (churn.csv), or you can bring your own dataset to experiment with. For instructions on importing your own dataset, refer to Importing data in Amazon SageMaker Canvas.

Prerequisites

Follow the instructions in Prerequisites for setting up Amazon SageMaker Canvas before you proceed further.

Import your dataset to Canvas

To import the sample dataset to Canvas, complete the following steps:

  1. Log in to Canvas as a business user. First, we upload the dataset mentioned previously from our local computer to Canvas. If you want to use other sources, such as Amazon Redshift, refer to Connect to an external data source.
  2. Choose Import.
  3. Choose Upload, then choose Select files from your computer.
  4. Select your dataset (churn.csv) and choose Import data.
  5. Select the dataset and choose Create model.
  6. For Model name, enter a name (for this post, we have given the name Churn prediction).
  7. Choose Create.

    As soon as you select your dataset, you’re presented with an overview that outlines the data types, missing values, mismatched values, unique values, and the mean or mode values of the respective columns.
    From an EDA perspective, you can observe there are no missing or mismatched values in the dataset. As a business analyst, you may want to get an initial insight into the model build even before starting the data exploration to identify how the model will perform and what factors are contributing to the model’s performance. Canvas gives you the ability to get insights from your data before you build a model by first previewing the model.
  8. Before you do any data exploration, choose Preview model.
  9. Select the column to predict (churn). Canvas automatically detects that this is a two-category prediction.
  10. Choose Preview model. SageMaker Canvas uses a subset of your data to build a model quickly to check if your data is ready to generate an accurate prediction. Using this sample model, you can understand the current model accuracy and the relative impact of each column on predictions.

The following screenshot shows our preview.

The model preview indicates that the model predicts the correct target (churn?) 95.6% of the time. You can also see the initial column impact (influence each column has on the target column). Let’s do some data exploration, visualization, and transformation, and then proceed to build a model.

Data exploration

Canvas already provides some common basic visualizations, such as data distribution in a grid view on the Build tab. These are great for getting a high-level overview of the data, understanding how the data is distributed, and getting a summary overview of the dataset.

As a business analyst, you may need to get high-level insights on how the data is distributed as well as how the distribution reflects against the target column (churn) to easily understand the data relationship before building the model. You can now choose Grid view to get an overview of the data distribution.

The following screenshot shows the overview of the distribution of the dataset.

We can make the following observations:

  • Phone takes on too many unique values to be of any practical use. We know phone is a customer ID and don’t want to build a model that might consider specific customers, but rather learn in a more general sense what could lead to churn. You can remove this variable.
  • Most of the numeric features are nicely distributed, following a Gaussian bell curve. In ML, you want the data to be normally distributed, because variables that follow a normal distribution can generally be predicted with higher accuracy.

Let’s go deeper and check out the advanced visualizations available in Canvas.

Data visualization

As business analysts, you want to see if there are relationships between data elements, and how they’re related to churn. With Canvas, you can explore and visualize your data, which helps you gain advanced insights into your data before building your ML models. You can visualize using scatter plots, bar charts, and box plots, which can help you understand your data and discover the relationships between features that could affect the model accuracy.

To start creating your visualizations, complete the following steps:

  • On the Build tab of the Canvas app, choose Data visualizer.

A key accelerator of visualization in Canvas is the Data visualizer. Let’s change the sample size to get a better perspective.

  • Choose number of rows next to Visualization sample.
  • Use the slider to select your desired sample size.

  • Choose Update to confirm the change to your sample size.

You may want to change the sample size based on your dataset. If you have a few hundred to a few thousand rows, you can select the entire dataset. If you have many thousands of rows, select a sample of a few hundred or a few thousand rows based on your use case.

A scatter plot shows the relationship between two quantitative variables measured for the same individuals. In our case, it’s important to understand the relationship between values to check for correlation.

Because we have Calls, Mins, and Charge, we will plot the correlation between them for Day, Evening, and Night.

First, let’s create a scatter plot between Day Charge vs. Day Mins.

We can observe that as Day Mins increases, Day Charge also increases.

The same applies for evening calls.

Night calls also have the same pattern.

Because mins and charge increase linearly together, you can observe that they are highly correlated with one another. Including both features of such a pair in some ML algorithms can take additional storage and reduce the speed of training, and having similar information in more than one column might lead the model to overemphasize its impact and introduce undesired bias. Let's remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
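If you want to double-check these correlations outside Canvas, the following is a minimal pandas sketch. It assumes the churn.csv column names shown in the screenshots; adjust them to match your own dataset.

import pandas as pd

df = pd.read_csv("churn.csv")
# Pearson correlation between each mins/charge pair
pairs = [("Day Mins", "Day Charge"), ("Eve Mins", "Eve Charge"),
         ("Night Mins", "Night Charge"), ("Intl Mins", "Intl Charge")]
for mins_col, charge_col in pairs:
    print(mins_col, "vs", charge_col, round(df[mins_col].corr(df[charge_col]), 3))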

Data balance and variation

A bar chart is a plot of a categorical variable on the x-axis against a numerical variable on the y-axis, used to explore the relationship between the two. Let's create a bar chart to see how the calls are distributed across our target column Churn for True and False. Choose Bar chart and drag and drop day calls and churn to the y-axis and x-axis, respectively.

Now, let’s create same bar chart for evening calls vs churn.

Next, let’s create a bar chart for night calls vs. churn.

It looks like there is a difference in behavior between customers who have churned and those that didn’t.

Box plots are useful because they show differences in behavior of data by class (churn or not). Because we’re going to predict churn (target column), let’s create a box plot of some features against our target column to infer descriptive statistics on the dataset such as mean, max, min, median, and outliers.

Choose Box plot and drag and drop Day mins and Churn to the y-axis and x-axis, respectively.

You can also try the same approach to other columns against our target column (churn).

Let’s now create a box plot of day mins against customer service calls to understand how the customer service calls spans across day mins value. You can see that customer service calls don’t have a dependency or correlation on the day mins value.

From our observations, we can determine that the dataset is fairly balanced. We want the data to be evenly distributed across true and false values so that the model isn’t biased towards one value.
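As a quick cross-check, the following pandas sketch reproduces the balance check and one of the box plots outside Canvas. It again assumes the column names from the sample churn.csv file (the target column is named Churn? there).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("churn.csv")
print(df["Churn?"].value_counts(normalize=True))  # class balance across True/False
df.boxplot(column="Day Mins", by="Churn?")        # distribution of day mins by churn class
plt.show()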

Transformations

Based on our observations, we drop the Phone column because it's just an account number, and the Day Charge, Eve Charge, and Night Charge columns because they contain information that overlaps with the mins columns, but we can run a preview again to confirm.

After the data analysis and transformation, let’s preview the model again.

You can observe that the model's estimated accuracy changed from 95.6% to 93.6% (this could vary); however, the column impact (feature importance) for specific columns has changed considerably. Removing the redundant columns also improves training speed and clarifies each column's influence on the prediction as we move to the next steps of model building. Our dataset doesn't require additional transformation, but if you needed to, you could take advantage of ML data transforms to clean, transform, and prepare your data for model building.

Build the model

You can now proceed to build a model and analyze results. For more information, refer to Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.

Clean up

To avoid incurring future session charges, log out of Canvas.

Conclusion

In this post, we showed how you can use Canvas visualization capabilities for EDA to better understand your data before model building, create accurate ML models, and generate predictions using a no-code, visual, point-and-click interface.


About the Authors

Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.

Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.

Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings more than 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.


Run ensemble ML models on Amazon SageMaker

Model deployment in machine learning (ML) is becoming increasingly complex. You often want to deploy not just one ML model but large groups of ML models represented as ensemble workflows, which are composed of multiple ML models. Productionizing these ML models is challenging because you need to adhere to various performance and latency requirements.

Amazon SageMaker supports single-instance ensembles with Triton Inference Server. This capability allows you to run model ensembles that fit on a single instance. Behind the scenes, SageMaker uses Triton Inference Server to manage the ensemble on every instance behind the endpoint to maximize throughput and hardware utilization with ultra-low (single-digit millisecond) inference latency. With Triton, you can also choose from a wide range of supported ML frameworks (including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT) and infrastructure backends, including GPUs, CPUs, and AWS Inferentia.

With this capability on SageMaker, you can optimize your workloads by avoiding costly network latency and reaping the benefits of compute and data locality for ensemble inference pipelines. In this post, we discuss the benefits of using Triton Inference Server, along with considerations on whether this is the right option for your workload.

Solution overview

Triton Inference Server is designed to enable teams to deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure. In addition, it has been optimized to offer high-performance inference at scale with features like dynamic batching, concurrent runs, optimal model configuration, model ensemble capabilities, and support for streaming inputs.

Workloads should take into account the capabilities that Triton provides to ensure their models can be served. Triton supports a number of popular frameworks out of the box, including TensorFlow, PyTorch, ONNX, XGBoost, and NVIDIA TensorRT. Triton also supports various backends that are required for algorithms to run properly. You should ensure that your models are supported by these backends; in the event that a backend doesn't exist, Triton allows you to implement your own and integrate it. You should also verify that your algorithm version is supported, and ensure that the model artifacts are accepted by the corresponding backend. To check if your particular algorithm is supported, refer to Triton Inference Server Backend for a list of natively supported backends maintained by NVIDIA.

There may be some scenarios where your models or model ensembles won't work on Triton without more effort, such as when a natively supported backend doesn't exist for your algorithm. There are some other considerations to take into account, such as the payload format, which may not be ideal, especially when your request payload is large. As always, you should validate your performance after deploying these workloads to ensure that your expectations are met.

Let’s take an image classification neural network model and see how we can accelerate our workloads. In this example, we use the NVIDIA DALI backend to accelerate our preprocessing in the context of our ensemble.

Create Triton model ensembles

Triton Inference Server simplifies the deployment of AI models at scale and comes with a convenient solution that simplifies building preprocessing and postprocessing pipelines. The Triton Inference Server platform provides the ensemble scheduler, which you can use to pipeline the models participating in the inference process while ensuring efficiency and optimizing throughput.

NVIDIA Triton Ensemble

Triton Inference Server serves models from model repositories. Let's look at the model repository layout for an ensemble model containing the DALI preprocessing model, the TensorFlow Inception V3 model, and the model ensemble configuration. Each subdirectory contains the repository information for the corresponding model. The config.pbtxt file describes the model configuration for each model. Each model directory must have one numeric subdirectory for each version of the model, and each model is run by a specific backend that Triton supports.

NVIDIA Triton Model Repository

NVIDIA DALI

For this post, we use the NVIDIA Data Loading Library (DALI) as the preprocessing model in our model ensemble. NVIDIA DALI is a library for data loading and preprocessing to accelerate deep learning applications. It provides a collection of optimized building blocks for loading and processing image, video, and audio data. You can use it as a portable drop-in replacement for built-in data loaders and data iterators in popular deep learning frameworks.

NVIDIA Dali

The following code shows the model configuration for a DALI backend:

name: "dali"
backend: "dali"
max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "DALI_OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 299, 299, 3 ]
  }
]
parameters: [
  {
    key: "num_threads"
    value: { string_value: "12" }
  }
]
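The dali model directory also contains the serialized DALI pipeline itself (typically a model.dali file) that this configuration points to. The following is a minimal, hypothetical sketch of how such a preprocessing pipeline could be defined and serialized with the DALI Python API for the 299x299x3 input that Inception V3 expects; the exact operators and normalization values depend on your preprocessing needs.

import nvidia.dali as dali
import nvidia.dali.types as types

@dali.pipeline_def(batch_size=256, num_threads=12, device_id=0)
def preprocessing_pipeline():
    # Encoded image bytes arrive from Triton as the DALI_INPUT_0 tensor
    images = dali.fn.external_source(device="cpu", name="DALI_INPUT_0")
    # Decode on the GPU and resize to the shape the downstream model expects
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = dali.fn.resize(images, resize_x=299, resize_y=299)
    # Normalize to float32 while keeping the HWC layout declared in the config
    return dali.fn.crop_mirror_normalize(
        images,
        dtype=types.FLOAT,
        output_layout="HWC",
        crop=(299, 299),
        mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
        std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
    )

pipe = preprocessing_pipeline()
pipe.serialize(filename="model_repository/dali/1/model.dali")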

Inception V3 model

For this post, we show how DALI is used in a model ensemble with Inception V3. The Inception V3 TensorFlow pre-trained model is saved in GraphDef format as a single file named model.graphdef. The config.pbtxt file has information about the model name, platform, max_batch_size, and input and output contracts. We recommend setting the max_batch_size configuration to less than the Inception V3 model batch size. The labels file (inception_labels.txt), which we copy to the inception_graphdef directory in the model repository, contains the 1,000 class labels of the ImageNet classification dataset.

name: "inception_graphdef"
platform: "tensorflow_graphdef"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_FP32
    format: FORMAT_NHWC
    dims: [ 299, 299, 3 ]
  }
]
output [
  {
    name: "InceptionV3/Predictions/Softmax"
    data_type: TYPE_FP32
    dims: [ 1001 ]
    label_filename: "inception_labels.txt"
  }
]

Triton ensemble

The following code shows a model configuration of an ensemble model for DALI preprocessing and image classification:

name: "ensemble_dali_inception"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "INPUT"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_FP32
    dims: [ 1001 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "INPUT"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "inception_graphdef"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "InceptionV3/Predictions/Softmax"
        value: "OUTPUT"
      }
    }
  ]
}

Create a SageMaker endpoint

SageMaker endpoints allow for real-time hosting where millisecond response time is required. SageMaker takes on the undifferentiated heavy lifting of model hosting management and has the ability to auto scale. In addition, a number of capabilities are also provided, including hosting multiple variants of your model, A/B testing of your models, integration with Amazon CloudWatch to gain observability of model performance, and monitoring for model drift.

Let’s create a SageMaker model from the model artifacts we uploaded to Amazon Simple Storage Service (Amazon S3).

Next, we also provide an additional environment variable: SAGEMAKER_TRITON_DEFAULT_MODEL_NAME, which specifies the name of the model to be loaded by Triton. The value of this key should match the folder name in the model package uploaded to Amazon S3. This variable is optional in cases where you’re using a single model. In the case of ensemble models, this key must be specified for Triton to start up in SageMaker.

Additionally, you can set SAGEMAKER_TRITON_BUFFER_MANAGER_THREAD_COUNT and SAGEMAKER_TRITON_THREAD_COUNT for optimizing the thread counts.

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "ensemble_dali_inception"},
}
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

With the preceding model, we create an endpoint configuration where we can specify the type and number of instances we want in the endpoint:

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
endpoint_config_arn = create_endpoint_config_response["EndpointConfigArn"]

We use this endpoint configuration to create a new SageMaker endpoint and wait for the deployment to finish. The status changes to InService when the deployment is successful.

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)
endpoint_arn = create_endpoint_response["EndpointArn"]
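
A common pattern (not shown in the original snippet) is to block until the endpoint is ready using the built-in boto3 waiter:

# Poll until the endpoint status becomes InService
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)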

Inference payload

The input payload image goes through the preprocessing DALI pipeline and is used in the ensemble scheduler provided by Triton Inference Server. We construct the payload to be passed to the inference endpoint:

payload = {
    "inputs": [
        {
            "name": "INPUT",
            "shape": rv2.shape,
            "datatype": "UINT8",
            "data": rv2.tolist(),
        }
    ]
}
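
For context, rv2 here is a NumPy array of the raw, still-encoded image bytes; the decoding and resizing happen inside the DALI step of the ensemble. A hypothetical sketch of how it might be prepared (the image filename is a placeholder):

import numpy as np

with open("sample_image.jpg", "rb") as f:
    raw_bytes = np.frombuffer(f.read(), dtype=np.uint8)
rv2 = np.expand_dims(raw_bytes, axis=0)  # add a batch dimension for the dims: [ -1 ] input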

Ensemble inference

When we have the endpoint running, we can use the sample image to perform an inference request using JSON as the payload format. For the inference request format, Triton uses the KFServing community standard inference protocols.

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/octet-stream", Body=json.dumps(payload)
)
print(json.loads(response["Body"].read().decode("utf8")))

With the binary+json format, we have to specify the length of the request metadata in the header to allow Triton to correctly parse the binary payload. This is done using a custom Content-Type header application/vnd.sagemaker-triton.binary+json;json-header-size={}.

This is different from using an Inference-Header-Content-Length header on a standalone Triton server because custom headers aren’t allowed in SageMaker.

The tritonclient package provides utility methods to generate the payload without having to know the details of the specification. We use the following methods to convert our inference request into a binary format, which provides lower latencies for inference. Refer to the GitHub notebook for implementation details.
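A sketch of what that conversion can look like with the tritonclient HTTP utilities, producing the request_body and header_length used in the call below (the tensor names mirror the ensemble's INPUT and OUTPUT):

import tritonclient.http as httpclient

inputs = [httpclient.InferInput("INPUT", list(rv2.shape), "UINT8")]
inputs[0].set_data_from_numpy(rv2, binary_data=True)
outputs = [httpclient.InferRequestedOutput("OUTPUT", binary_data=True)]

# Serialize the request into the binary+json format and capture the JSON header size
request_body, header_length = httpclient.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)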

response = runtime_sm_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        header_length
    ),
    Body=request_body,
)

Conclusion

In this post, we showcased how you can productionize model ensembles that run on a single instance on SageMaker. This design pattern can be useful for combining any preprocessing and postprocessing logic along with inference predictions. SageMaker uses Triton to run the ensemble inference on a single container on an instance that supports all major frameworks.

For more samples of Triton ensembles on SageMaker, refer to the GitHub repo. Try it out!


About the Authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends.

Vikram Elango is a Senior AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, US. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.


Host code-server on Amazon SageMaker

Machine learning (ML) teams need the flexibility to choose their integrated development environment (IDE) when working on a project. It allows you to have a productive developer experience and innovate at speed. You may even use multiple IDEs within a project. Amazon SageMaker lets ML teams choose to work from fully managed, cloud-based environments within Amazon SageMaker Studio or SageMaker Notebook Instances, or from their local machine using local mode.

SageMaker provides a one-click experience to Jupyter and RStudio to build, train, debug, deploy, and monitor ML models. In this post, we will also share a solution for hosting code-server on SageMaker.

With code-server, users can run VS Code on remote machines and access it in a web browser. For ML teams, hosting code-server on SageMaker provides minimal changes to a local development experience, and allows you to code from anywhere, on scalable cloud compute. With VS Code, you can also use built-in Conda environments with AWS-optimized TensorFlow and PyTorch, managed Git repositories, local mode, and other features provided by SageMaker to speed up your delivery. For IT admins, it allows you to standardize and expedite the provisioning of managed, secure IDEs in the cloud, to quickly onboard and enable ML teams in their projects.

Solution overview

In this post, we cover installation for both Studio environments (Option A), and notebook instances (Option B). For each option, we walk through a manual installation process that ML teams can run in their environment, and an automated installation that IT admins can set up for them via the AWS Command Line Interface (AWS CLI).

The following diagram illustrates the architecture overview for hosting code-server on SageMaker.


Our solution speeds up the install and setup of code-server in your environment. It works for both JupyterLab 3 (recommended) and JupyterLab 1 that run within Studio and SageMaker notebook instances. It is made of shell scripts that do the following based on the option.

For Studio (Option A), the shell script does the following:

For SageMaker notebook instances (Option B), the shell script does the following:

  • Installs code-server.
  • Adds a code-server shortcut on the Jupyter notebook file menu and JupyterLab launcher for fast access to the IDE.
  • Creates a dedicated Conda environment for managing dependencies.
  • Installs the Python and Docker extensions on the IDE.

In the following sections, we walk through the solution install process for Option A and Option B. Make sure you have access to Studio or a notebook instance.

Option A: Host code-server on Studio

To host code-server on Studio, complete the following steps:

  1. Choose System terminal in your Studio launcher.
  2. To install the code-server solution, run the following commands in your system terminal:
    curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
    tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz
    
    cd amazon-sagemaker-codeserver/install-scripts/studio
     
    chmod +x install-codeserver.sh
    ./install-codeserver.sh
    
    # Note: when installing on JL1, please prepend the nohup command to the install command above and run as follows: 
    # nohup ./install-codeserver.sh

    The commands should take a few seconds to complete.

  3. Reload the browser page, where you can see a Code Server button in your Studio launcher.
  4. Choose Code Server to open a new browser tab, allowing you to access code-server from your browser.
    The Python extension is already installed, and you can get to work in your ML project.

You can open your project folder in VS Code and select the pre-built Conda environment to run your Python scripts.


Automate the code-server install for users in a Studio domain

As an IT admin, you can automate the installation for Studio users by using a lifecycle configuration. It can be done for all users’ profiles under a Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.

For this post, we create a lifecycle configuration from the install-codeserver script, and attach it to an existing Studio domain. The install is done for all the user profiles in the domain.

From a terminal configured with the AWS CLI and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz

cd amazon-sagemaker-codeserver/install-scripts/studio

LCC_CONTENT=`openssl base64 -A -in install-codeserver.sh`

aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name install-codeserver-on-jupyterserver \
    --studio-lifecycle-config-content $LCC_CONTENT \
    --studio-lifecycle-config-app-type JupyterServer \
    --query 'StudioLifecycleConfigArn'

aws sagemaker update-domain \
    --region <your_region> \
    --domain-id <your_domain_id> \
    --default-user-settings \
    '{
    "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
    "LifecycleConfigArn": "arn:aws:sagemaker:<your_region>:<your_account_id>:studio-lifecycle-config/install-codeserver-on-jupyterserver",
    "InstanceType": "system"
    },
    "LifecycleConfigArns": [
    "arn:aws:sagemaker:<your_region>:<your_account_id>:studio-lifecycle-config/install-codeserver-on-jupyterserver"
    ]
    }}'

# Make sure to replace <your_domain_id>, <your_region> and <your_account_id> in the previous commands with
# the Studio domain ID, the AWS region and AWS Account ID you are using respectively.

After Jupyter Server restarts, the Code Server button appears in your Studio launcher.

Option B: Host code-server on a SageMaker notebook instance

To host code-server on a SageMaker notebook instance, complete the following steps:

  1. Launch a terminal via Jupyter or JupyterLab for your notebook instance.
    If you use Jupyter, choose Terminal on the New menu.
  2.  To install the code-server solution, run the following commands in your terminal:
    curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
    tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz
    
    cd amazon-sagemaker-codeserver/install-scripts/notebook-instances
     
    chmod +x install-codeserver.sh
    chmod +x setup-codeserver.sh
    sudo ./install-codeserver.sh
    sudo ./setup-codeserver.sh

    The code-server and extensions installations are persistent on the notebook instance. However, if you stop or restart the instance, you need to run the following command to reconfigure code-server:

    sudo ./setup-codeserver.sh

    The commands should take a few seconds to run. You can close the terminal tab when you see the following.


  3. Now reload the Jupyter page and check the New menu again.
    The Code Server option should now be available.

You can also launch code-server from JupyterLab using a dedicated launcher button, as shown in the following screenshot.


Choosing Code Server will open a new browser tab, allowing you to access code-server from your browser. The Python and Docker extensions are already installed, and you can get to work in your ML project.


Automate the code-server install on a notebook instance

As an IT admin, you can automate the code-server install with a lifecycle configuration running on instance creation, and automate the setup with one running on instance start.

Here, we create an example notebook instance and lifecycle configuration using the AWS CLI. The on-create config runs install-codeserver, and on-start runs setup-codeserver.

From a terminal configured with the AWS CLI and appropriate permissions, run the following commands:

curl -LO https://github.com/aws-samples/amazon-sagemaker-codeserver/releases/download/v0.1.5/amazon-sagemaker-codeserver-0.1.5.tar.gz
tar -xvzf amazon-sagemaker-codeserver-0.1.5.tar.gz

cd amazon-sagemaker-codeserver/install-scripts/notebook-instances

aws sagemaker create-notebook-instance-lifecycle-config \
    --notebook-instance-lifecycle-config-name install-codeserver \
    --on-start Content=$((cat setup-codeserver.sh || echo "")| base64) \
    --on-create Content=$((cat install-codeserver.sh || echo "")| base64)

aws sagemaker create-notebook-instance \
    --notebook-instance-name <your_notebook_instance_name> \
    --instance-type <your_instance_type> \
    --role-arn <your_role_arn> \
    --lifecycle-config-name install-codeserver

# Make sure to replace <your_notebook_instance_name>, <your_instance_type>,
# and <your_role_arn> in the previous commands with the appropriate values.

The code-server install is now automated for the notebook instance.

Conclusion

With code-server hosted on SageMaker, ML teams can run VS Code on scalable cloud compute, code from anywhere, and speed up their ML project delivery. For IT admins, it allows them to standardize and expedite the provisioning of managed, secure IDEs in the cloud, to quickly onboard and enable ML teams in their projects.

In this post, we shared a solution you can use to quickly install code-server on both Studio and notebook instances. We shared a manual installation process that ML teams can run on their own, and an automated installation that IT admins can set up for them.

To go further in your learnings, visit AWSome SageMaker on GitHub to find all the relevant and up-to-date resources needed for working with SageMaker.


About the Authors

Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering and ML background, he works with customers of any size to deeply understand their business and technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing football.

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Eric Pena is a Senior Technical Product Manager in the AWS Artificial Intelligence Platforms team, working on Amazon SageMaker Interactive Machine Learning. He currently focuses on IDE integrations on SageMaker Studio. He holds an MBA degree from MIT Sloan and outside of work enjoys playing basketball and football.


Real estate brokerage firm John L. Scott uses Amazon Textract to strike racially restrictive language from property deeds for homeowners

Founded more than 91 years ago in Seattle, John L. Scott Real Estate’s core value is Living Life as a Contribution®. The firm helps homebuyers find and buy the home of their dreams, while also helping sellers move into the next chapter of their home ownership journey. John L. Scott currently operates over 100 offices with more than 3,000 agents throughout Washington, Oregon, Idaho, and California.

When company operating officer Phil McBride joined the company in 2007, one of his initial challenges was to shift the company’s public website from an on-premises environment to a cloud-hosted one. According to McBride, a world of resources opened up to John L. Scott once the company started working with AWS to build an easily controlled, cloud-enabled environment.

Today, McBride is taking on the challenge of uncovering and modifying decades-old discriminatory restrictions in home titles and deeds. What he didn’t expect was enlisting the help of AWS for the undertaking.

In this post, we share how John L. Scott uses Amazon Textract and Amazon Comprehend to identify racially restrictive language from such documents.

A problem rooted in historic discrimination

Racial covenants restrict who can buy, sell, lease, or occupy a property based on race (see the following example document). Although no longer enforceable since the Fair Housing Act of 1968, racial covenants became pervasive across the country during the post-World War II housing boom and are still present in the titles of millions of homes. Racial covenants are direct evidence of the real estate industry’s complicity and complacency when it came to the government’s racist policies of the past, including redlining.

In 2019, McBride spoke in support of Washington state legislation that served as the next step in correcting the historic injustice of racial language in covenants. In 2021, a bill was passed that required real estate agents to provide notice of any unlawful recorded covenant or deed restriction to purchasers at the time of sale. A year after the legislation passed and homeowners were notified, John L. Scott discovered that only five homeowners in the state of Washington acted on updating their own property deeds.

“The challenge lies in the sheer volume of properties in the state of Washington, and the current system to update your deeds,” McBride said. “The process to update still is very complicated, so only the most motivated homeowners would put in the research and legwork to modify their deed. This just wasn’t going to happen at scale.”

Initial efforts to find restrictive language have relied on university students and community volunteers manually reading documents and recording findings. But in Washington state alone, millions of documents needed to be analyzed; a manual approach wouldn't scale effectively.

Machine learning overcomes manual and complicated processes

With the support of AWS Global Impact Computing Specialists and Solutions Architects, John L. Scott has built an intelligent document processing solution that helps homeowners easily identify racially restrictive covenants in their property title documents. This intelligent document processing solution uses machine learning to scan titles, deeds, and other property documents, searching the text for racially restrictive language. The Washington State Association of County Auditors is also working with John L. Scott to provide digitized deeds, titles, and CC&Rs from their database, starting with King County, Washington.

Once these racial covenants are identified, John L. Scott team members guide homeowners through the process of modifying the discriminatory restrictions from their home’s title, with the support of online notary services such as Notarize.

With a goal of building a solution that the lean team at John L. Scott could manage, McBride’s team worked with AWS to evaluate different services and stitch them together in a modular, repeatable way that met the team’s vision and principles for speed and scale. To minimize management overhead and maximize scalability, the team worked together to build a serverless architecture for handling document ingestion and restrictive language identification using several key AWS services:

  • Amazon Simple Storage Service – Documents are stored in an Amazon S3 data lake for secure and highly available storage.
  • AWS Lambda – Documents are processed by Lambda as they arrive in the S3 data lake. Original document images are split into single-page files and analyzed with Amazon Textract (text detection) and Amazon Comprehend (text analysis).
  • Amazon Textract – Amazon Textract automatically converts raw images into text blocks, which are scanned using fuzzy string pattern matching for restrictive language. When restrictive language is identified, Lambda functions create new image files that highlight the language using the coordinates supplied by Amazon Textract. Finally, records of the restrictive findings are stored in an Amazon DynamoDB table.
  • Amazon Comprehend – Amazon Comprehend analyzes the text output from Amazon Textract and identifies useful data (entities) like dates and locations within the text. This information is key to identifying where and when restrictions were in effect.

The following diagram illustrates the architecture of the serverless ingestion and identification pipeline.
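
To make the pipeline concrete, here is a simplified, hypothetical sketch of the kind of Lambda handler this architecture describes. This is not John L. Scott's production code: the table name and phrase list are placeholders, and the fuzzy matching uses Python's standard difflib rather than whichever matcher the actual system uses.

import json
import boto3
from difflib import SequenceMatcher

textract = boto3.client("textract")
comprehend = boto3.client("comprehend")
table = boto3.resource("dynamodb").Table("restrictive-covenant-findings")  # placeholder table name

# Example phrases to screen for; a real deny list would be much larger
RESTRICTIVE_PHRASES = ["shall not be sold to", "occupied by any person of"]

def is_fuzzy_match(line, phrase, threshold=0.8):
    # Approximate substring match: slide a phrase-sized word window over the line
    words, n = line.lower().split(), len(phrase.split())
    windows = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return any(SequenceMatcher(None, w, phrase).ratio() >= threshold for w in windows)

def handler(event, context):
    # Triggered by S3 object creation; each record points at a single-page document image
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Detect text lines (with coordinates) on the page
        blocks = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )["Blocks"]

        for block in blocks:
            if block["BlockType"] != "LINE":
                continue
            for phrase in RESTRICTIVE_PHRASES:
                if is_fuzzy_match(block["Text"], phrase):
                    # Extract dates and locations to situate the restriction
                    entities = comprehend.detect_entities(
                        Text=block["Text"], LanguageCode="en"
                    )["Entities"]
                    # Record the finding, including the coordinates used to highlight the language
                    table.put_item(Item={
                        "document": key,
                        "matched_text": block["Text"],
                        "bounding_box": json.dumps(block["Geometry"]["BoundingBox"]),
                        "entities": json.dumps(entities),
                    })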

Building from this foundation, the team also incorporates parcel information (via GeoJSON and shapefiles) from county governments to identify affected property owners so they can be notified and begin the process of remediation. A forthcoming public website will also soon allow property owners to input their address to see if their property is affected by restrictive documents.

Setting a new example for the 21st Century

When asked about what’s next, McBride said working with Amazon Textract and Amazon Comprehend has helped his team serve as an example to other counties and real estate firms across the country who want to bring the project into their geographic area.

“Not all areas will have robust programs like we do in Washington state, with University of Washington volunteers indexing deeds and notifying the homeowners,” McBride said. “However, we hope offering this intelligent document processing solution in the public domain will help others drive change in their local communities.”



About the authors

Jeff Stockamp is a Senior Solutions Architect based in Seattle, Washington. Jeff helps guide customers as they build well-architected applications and migrate workloads to AWS. Jeff is a constant builder and spends his spare time building Legos with his son.

Jarman Hauser is a Business Development and Go-to-Market Strategy leader at AWS. He works with customers on leveraging technology in unique ways to solve some of the world's most challenging social, environmental, and economic problems globally.

Moussa Koulbou is a Senior Solutions Architecture leader at AWS. He helps customers shape their cloud strategy and accelerate their digital velocity by creating the connection between intent and action. He leads a high-performing Solutions Architects team to deliver enterprise-grade solutions that leverage AWS cutting-edge technology to enable growth and solve the most critical business and social problems.


Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints

Amazon SageMaker multi-model endpoint (MME) enables you to cost-effectively deploy and host multiple models behind a single endpoint and then horizontally scale the endpoint as traffic grows. As illustrated in the following figure, this is an effective technique to implement multi-tenancy of models within your machine learning (ML) infrastructure. We have seen software as a service (SaaS) businesses use this feature to apply hyper-personalization in their ML models while achieving lower costs.

For a high-level overview of how MME works, check out the AWS Summit video Scaling ML to the next level: Hosting thousands of models on SageMaker. To learn more about the hyper-personalized, multi-tenant use cases that MME enables, refer to How to scale machine learning inference for multi-tenant SaaS use cases.

Multi model endpoint architecture

In the rest of this post, we dive deeper into the technical architecture of SageMaker MME and share best practices for optimizing your multi-model endpoints.

Use cases best suited for MME

SageMaker multi-model endpoints are well suited for hosting a large number of models that you can serve through a shared serving container and that you don't need to access all at the same time. Depending on the size of the endpoint instance memory, a model may occasionally be unloaded from memory in favor of loading a new model to maximize efficient use of memory. Therefore, your application needs to be tolerant of occasional latency spikes on unloaded models.

MME is also designed for co-hosting models that use the same ML framework because they use the shared container to load multiple models. Therefore, if you have a mix of ML frameworks in your model fleet (such as PyTorch and TensorFlow), SageMaker dedicated endpoints or multi-container hosting is a better choice.

Finally, MME is suited for applications that can tolerate an occasional cold start latency penalty, because models are loaded on first invocation and infrequently used models can be offloaded from memory in favor of loading new models. Therefore, if you have a mix of frequently and infrequently accessed models, a multi-model endpoint can efficiently serve this traffic with fewer resources and higher cost savings.

We have also seen some scenarios where customers deploy an MME cluster with enough aggregate memory capacity to fit all their models, thereby avoiding model offloads altogether yet still achieving cost savings because of the shared inference infrastructure.

Model serving containers

When you use the SageMaker Inference Toolkit or a pre-built SageMaker model serving container compatible with MME, your container runs the Multi Model Server (a JVM process). The easiest way to incorporate Multi Model Server (MMS) into your model serving container is to use a SageMaker model serving container compatible with MME (look for those with Job Type=inference and CPU/GPU=CPU). MMS is an open source, easy-to-use tool for serving deep learning models. It provides a REST API with a web server to serve and manage multiple models on a single host. However, it’s not mandatory to use MMS; you can implement your own model server as long as it implements the APIs required by MME.

When used as part of the MME platform, all predict, load, and unload API calls to MMS or your own model server are channeled through the MME data plane controller. API calls from the data plane controller are made over localhost only, to prevent unauthorized access from outside the instance. One of the key benefits of MMS is that it provides a standardized interface for loading, unloading, and invoking models, with compatibility across a wide range of deep learning frameworks.
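
To make this concrete, the following is a minimal sketch, using the AWS SDK for Python (Boto3), of creating a multi-model endpoint and invoking one specific model artifact with the TargetModel parameter. The container image, S3 prefix, role, endpoint names, and model artifact name are placeholders you would replace with your own values.

import boto3

sm = boto3.client("sagemaker")
smr = boto3.client("sagemaker-runtime")

# Placeholder values -- replace with your own image, S3 prefix, and execution role
container_image = "<account>.dkr.ecr.<region>.amazonaws.com/<mme-compatible-image>:latest"
model_data_prefix = "s3://my-bucket/mme-models/"   # all model artifacts live under this prefix
role_arn = "arn:aws:iam::<account>:role/<sagemaker-execution-role>"

# Mode="MultiModel" tells SageMaker to load models on demand from the S3 prefix
sm.create_model(
    ModelName="my-mme-model",
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": container_image,
        "Mode": "MultiModel",
        "ModelDataUrl": model_data_prefix,
    },
)

sm.create_endpoint_config(
    EndpointConfigName="my-mme-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-mme-model",
        "InstanceType": "ml.c5.xlarge",
        "InitialInstanceCount": 2,
    }],
)
sm.create_endpoint(EndpointName="my-mme", EndpointConfigName="my-mme-config")

# Invoke a single model artifact stored under the S3 prefix
response = smr.invoke_endpoint(
    EndpointName="my-mme",
    TargetModel="model-42.tar.gz",   # path relative to model_data_prefix
    ContentType="text/csv",
    Body="1.0,2.0,3.0",
)
print(response["Body"].read())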

Advanced configuration of MMS

If you choose to use MMS for model serving, consider the following advanced configurations to optimize the scalability and throughput of your MME instances.

Increase inference parallelism per model

MMS creates one or more Python worker processes per model based on the value of the default_workers_per_model configuration parameter. These Python workers handle each individual inference request by running any preprocessing, prediction, and post processing functions you provide. For more information, see the custom service handler GitHub repo.

Having more than one model worker increases the parallelism of predictions that can be served by a given model. However, when a large number of models are being hosted on an instance with a large number of CPUs, you should perform a load test of your MME to find the optimum value for default_workers_per_model to prevent any memory or CPU resource exhaustion.

Design for traffic spikes

Each MMS process within an endpoint instance has a request queue that can be configured with the job_queue_size parameter (default is 100). This determines the number of requests MMS will queue when all worker processes are busy. Use this parameter to fine-tune the responsiveness of your endpoint instances after you’ve decided on the optimal number of workers per model.

With an optimal workers-per-model ratio, the default of 100 should suffice for most cases. However, for cases where request traffic to the endpoint spikes unusually, you can reduce the size of the queue if you want the endpoint to fail fast and pass control back to the application, or increase the queue size if you want the endpoint to absorb the spike.

Maximize memory resources per instance

When using multiple worker processes per model, by default each worker process loads its own copy of the model. This can reduce the available instance memory for other models. You can optimize memory utilization by sharing a single model between worker processes by setting the configuration parameter preload_model=true. Here you’re trading off reduced inference parallelism (due to a single model instance) for more memory efficiency. This setting, along with multiple worker processes, can be a good choice for use cases where model latency is low but you have heavier preprocessing and postprocessing (done by the worker processes) per inference request.

Set values for MMS advanced configurations

MMS uses a config.properties file to store configurations. MMS uses the following order to locate this config.properties file:

  1. If the MMS_CONFIG_FILE environment variable is set, MMS loads the configuration from the file that the environment variable points to.
  2. If the --mms-config parameter is passed to MMS, it loads the configuration from the parameter.
  3. If there is a config.properties file in the current folder where the user starts MMS, it loads that file from the current working directory.

If none of the above are specified, MMS loads the built-in configuration with default values.

The following is a command line example of starting MMS with an explicit configuration file:

multi-model-server --start --mms-config /home/mms/config.properties
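
As an illustration, a hypothetical config.properties that brings together the parameters discussed in this section might look like the following; the values are examples only, not recommendations, and should come from your own load testing.

# Example config.properties (illustrative values only)
default_workers_per_model=2
job_queue_size=100
preload_model=true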

Key metrics to monitor your endpoint performance

The key metrics that can help you optimize your MME are typically related to CPU and memory utilization and inference latency. The instance-level metrics are emitted by MMS, whereas the latency metrics come from the MME. In this section, we discuss the typical metrics that you can use to understand and optimize your MME.

Endpoint instance-level metrics (MMS metrics)

From the list of MMS metrics, CPUUtilization and MemoryUtilization can help you evaluate whether your instances or the MME cluster are right-sized. If both metrics are between 50–80%, then your MME is right-sized.

Typically, low CPUUtilization and high MemoryUtilization indicate an over-provisioned MME cluster, because infrequently invoked models aren’t being unloaded. This can happen when a higher-than-optimal number of endpoint instances is provisioned for the MME, so more aggregate memory than necessary is available for infrequently accessed models to remain in memory. Conversely, close to 100% utilization of these metrics means that your cluster is under-provisioned, so you need to adjust your cluster auto scaling policy.
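
If you manage scaling with Application Auto Scaling, the following Boto3 sketch registers the MME variant and attaches a target-tracking policy. The endpoint name, variant name, capacity bounds, and target value are placeholders that you should derive from your own load tests.

import boto3

aas = boto3.client("application-autoscaling")

# Placeholder endpoint and variant names
resource_id = "endpoint/my-mme/variant/AllTraffic"

aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=2,
    MaxCapacity=8,
)

aas.put_scaling_policy(
    PolicyName="mme-invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Scale out when average invocations per instance exceed the target
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 120,
        "ScaleInCooldown": 300,
    },
)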

Platform-level metrics (MME metrics)

From the full list of MME metrics, a key metric that can help you understand the latency of your inference requests is ModelCacheHit. This metric shows the average ratio of invoke requests for which the model was already loaded in memory. If this ratio is low, it indicates your MME cluster is under-provisioned, because there’s likely not enough aggregate memory capacity in the MME cluster for the number of unique model invocations, which causes models to be frequently unloaded from memory.
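
As a quick way to inspect this metric, the following sketch pulls the average ModelCacheHit over the last hour with Boto3 and Amazon CloudWatch. The endpoint and variant names are placeholders, and the namespace and dimensions shown are the ones typically used for SageMaker endpoint invocation metrics.

from datetime import datetime, timedelta
import boto3

cw = boto3.client("cloudwatch")

resp = cw.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelCacheHit",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-mme"},      # placeholder
        {"Name": "VariantName", "Value": "AllTraffic"},   # placeholder
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)

# A consistently low average suggests the cluster's aggregate memory is too small
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])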

Lessons from the field and strategies for optimizing MME

The following recommendations are drawn from some of the high-scale uses of MME that we have seen across a number of customers.

Horizontal scaling with smaller instances is better than vertical scaling with larger instances

You may experience throttling on model invocations when running a high number of requests per second (RPS) against fewer endpoint instances. There are internal limits to the number of invocations, loads, and unloads that can happen concurrently on an instance, so it’s always better to have a higher number of smaller instances. Running a higher number of smaller instances means a higher total aggregate capacity for these limits across the endpoint.

Another benefit of horizontally scaling with smaller instances is that you reduce the risk of exhausting instance CPU and memory resources when running MMS with higher levels of parallelism, along with a higher number of models in memory (as described earlier in this post).

Avoiding thrashing is a shared responsibility

Thrashing in MME is when models are frequently unloaded from memory and reloaded due to insufficient memory, either in an individual instance or on aggregate in the cluster.

From a usage perspective, you should right-size individual endpoint instances and the overall MME cluster so that enough memory capacity is available per instance, and on aggregate for the cluster, for your use case. The MME platform’s router fleet will also work to maximize the cache hit ratio.

Don’t aggressively bin pack too many models onto fewer, larger memory instances

Memory isn’t the only resource on the instance to be aware of. Other resources like CPU can be a constraining factor, as seen in the following load test results. In some other cases, we have also observed other kernel resources like process IDs being exhausted on an instance, due to a combination of too many models being loaded and the underlying ML framework (such as TensorFlow) spawning threads per model that were multiples of available vCPUs.

The following performance test demonstrates an example of a CPU constraint impacting model latency. In this test, a single-instance endpoint with a large instance, despite having more than enough memory to keep all four models in memory, produced comparatively worse model latencies under load than an endpoint with four smaller instances.

Figures: model latency and CPU and memory utilization for the single-instance endpoint compared with the four-instance endpoint

To achieve both performance and cost-efficiency, right-size your MME cluster with a higher number of smaller instances that, on aggregate, give you the optimum memory and CPU capacity while remaining roughly at par in cost with fewer, larger memory instances.

Mental model for optimizing MME

There are four key metrics that you should always consider when right-sizing your MME:

  • The number and size of the models
  • The number of unique models invoked at a given time
  • The instance type and size
  • The instance count behind the endpoint

Start with the first two points, because they inform the third and fourth. For example, if not enough instances are behind the endpoint for the number or size of unique models you have, the aggregate memory for the endpoint will be too low, and you’ll see a lower cache hit ratio and thrashing at the endpoint level because the MME will frequently load and unload models in and out of memory.

Similarly, if more unique models are being invoked than the aggregate memory of all instances behind the endpoint can hold, you’ll see a lower cache hit ratio. This can also happen if the size of the instances (especially their memory capacity) is too small.

Vertically scaling with very large memory instances could also lead to issues because, although the models may fit into memory, other resources like CPU, kernel process IDs, and thread limits could be exhausted. Load test horizontal scaling in pre-production to determine the optimum number and size of instances for your MME.

Summary

In this post, you got a deeper understanding of the MME platform. You learned which technical use cases MME is suited for, reviewed the architecture of the platform and the role of each component, and saw which components’ performance you can directly influence. Finally, you took a closer look at the configuration parameters that you can adjust to optimize MME for your use case and the metrics you should monitor to maintain optimum performance.

To get started with MME, review Amazon SageMaker Multi-Model Endpoints using XGBoost and Host multiple models in one container behind one endpoint.


About the Author

Syed Jaffry is a Principal Solutions Architect with AWS. He works with a range of companies, from mid-sized organizations to large enterprises, financial services firms, and ISVs, to help them build and operate cost-efficient and scalable AI/ML applications in the cloud.

Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and is motivated by the goal of democratizing machine learning. He focuses on core challenges related to deploying complex ML applications, multi-tenant ML models, cost optimizations, and making deployment of deep learning models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.


Testing approaches for Amazon SageMaker ML models

This post was co-written with Tobias Wenzel, Software Engineering Manager for the Intuit Machine Learning Platform.

We all appreciate the importance of a high-quality and reliable machine learning (ML) model when using autonomous driving or interacting with Alexa, for example. ML models also play an important role in less obvious ways: they’re used by business applications, healthcare providers, financial institutions, amazon.com, TurboTax, and more.

As ML-enabled applications become core to many businesses, models need to be developed with the same rigor and discipline as software applications. An important aspect of MLOps is to deliver a new version of a previously developed ML model to production by using established DevOps practices such as testing, versioning, continuous delivery, and monitoring.

There are several prescriptive guidelines around MLOps, and this post gives an overview of the process that you can follow and which tools to use for testing. This is based on collaborations between Intuit and AWS. We have been working together to implement the recommendations explained in this post in practice and at scale. Intuit’s goal of becoming an AI-driven expert platform is heavily dependent on a strategy of increasing velocity of initial model development as well as testing of new versions.

Requirements

The following are the main areas of consideration while deploying new model versions:

  1. Model accuracy performance – It’s important to keep track of model evaluation metrics like accuracy, precision, and recall, and ensure that the objective metrics remain relatively the same or improve with a new version of the model. In most cases, deploying a new version of the model doesn’t make sense if the experience of end-users won’t improve.
  2. Test data quality – Data in non-production environments, whether simulated or a point-in-time copy, should be representative of the data that the model will receive when fully deployed, in terms of volume and distribution. If not, your testing processes won’t be representative, and your model may behave differently in production.
  3. Feature importance and parity – Feature importance in the newer version of the model should be comparable to that of the older model, even though new features might be introduced. This helps ensure that the model isn’t becoming biased.
  4. Business process testing – It’s important that a new version of a model can fulfill your required business objectives within acceptable parameters. For example, one of the business metrics can be that the end-to-end latency for any service must not be more than 100 milliseconds, or the cost to host and retrain a particular model can’t be more than $10,000 per year.
  5. Cost – A simple approach to testing is to replicate the whole production environment as a test environment. This is a common practice in software development. However, such an approach in the case of ML models might not yield the right ROI depending upon the size of data and may impact the model in terms of the business problem it’s addressing.
  6. Security – Test environments are often expected to have sample data instead of real customer data and as a result, data handling and compliance rules can be less strict. Just like cost though, if you simply duplicate the production environment into a test environment, you could introduce security and compliance risks.
  7. Feature store scalability – If an organization decides to not create a separate test feature store because of cost or security reasons, then model testing needs to happen on the production feature store, which can cause scalability issues as traffic is doubled during the testing period.
  8. Online model performance – Online evaluations differ from offline evaluations and can be important in some cases, like recommendation models, because they measure user satisfaction in real time rather than perceived satisfaction. It’s hard to simulate real traffic patterns in non-production environments due to seasonality and other user behavior, so online model performance evaluation can only be done in production.
  9. Operational performance – As models get bigger and are increasingly deployed in a decentralized manner on different hardware, it’s important to test the model for your desired operational performance like latency, error rate, and more.

Most ML teams have a multi-pronged approach to model testing. In the following sections, we provide ways to address these challenges during various testing stages.

Offline model testing

The goal of this testing phase is to validate new versions of an existing model from an accuracy standpoint. This should be done in an offline fashion to not impact any predictions in the production system that are serving real-time predictions. By ensuring that the new model performs better for applicable evaluation metrics, this testing addresses challenge 1 (model accuracy performance). Also, by using the right dataset, this testing can address challenges 2 and 3 (test data quality, feature importance and parity), with the additional benefit of tackling challenge 5 (cost).

This phase is done in the staging environment.

You should capture production traffic, which you can use to replay in offline back testing. It’s preferable to use past production traffic instead of synthetic data. The Amazon SageMaker Model Monitor data capture feature allows you to capture production traffic for models hosted on Amazon SageMaker. This allows model developers to test their models with data from peak business days or other significant events. The captured data is then replayed against the new model version in a batch fashion using SageMaker batch transform. This means that a batch transform run can test with data that was collected over weeks or months in just a few hours. This can significantly speed up the model evaluation process compared to running two or more versions of a real-time model side by side and sending duplicate prediction requests to each endpoint. In addition to finding a better-performing version faster, this approach also uses the compute resources for a shorter amount of time, reducing the overall cost.
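
The following is a minimal sketch of this flow with the SageMaker Python SDK: it enables data capture on the live endpoint and later replays prepared data against a candidate model with batch transform. The image URIs, S3 paths, role, and endpoint name are placeholders, and the captured payloads would typically need to be converted into a batch input format first.

from sagemaker.model import Model
from sagemaker.model_monitor import DataCaptureConfig

# Capture request/response payloads from the production endpoint (placeholder values throughout)
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://my-bucket/prod-endpoint-capture/",
)

production_model = Model(
    image_uri="<serving-image-uri>",
    model_data="s3://my-bucket/models/prod/model.tar.gz",
    role="<execution-role-arn>",
)
production_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
    endpoint_name="prod-endpoint",
    data_capture_config=capture_config,
)

# Later, replay the prepared historical payloads against the candidate model as a batch job
candidate_model = Model(
    image_uri="<serving-image-uri>",
    model_data="s3://my-bucket/models/candidate/model.tar.gz",
    role="<execution-role-arn>",
)
transformer = candidate_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/offline-backtest-output/",
)
transformer.transform(
    data="s3://my-bucket/replay-data/",   # captured traffic converted into batch input files
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()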

A challenge with this approach to testing is that the feature set changes from one model version to another. In this scenario, we recommend creating a feature set with a superset of features for both versions so that all features can be queried at once and recorded through the data capture. Each prediction call can then work on only those features necessary for the current version of the model.

As an added bonus, by integrating Amazon SageMaker Clarify into your offline model testing, you can check the new version of the model for bias and also compare feature attribution with the previous version of the model. With SageMaker Pipelines, you can orchestrate the entire workflow such that, after training, a quality check step analyzes the model metrics and feature importance. These metrics are stored in the SageMaker model registry for comparison in the next training run.

Integration and performance testing

Integration testing is needed to validate end-to-end business processes from a functional as well as a runtime performance perspective. Within this process, the whole pipeline should be tested, including fetching and calculating features in the feature store and running the ML application. This should be done with a variety of different payloads to cover a variety of scenarios and requests and achieve high coverage for all possible code paths. This addresses challenges 4 and 9 (business process testing and operational performance) to ensure none of the business processes are broken with the new version of the model.

This testing should be done in a staging environment.

Both integration testing and performance testing need to be implemented by individual teams using their MLOps pipeline. For the integration testing, we recommend the tried and tested method of maintaining a functionally equivalent pre-production environment and testing with a few different payloads. The testing workflow can be automated as shown in this workshop. For the performance testing, you can use Amazon SageMaker Inference Recommender, which offers a great starting point to determine which instance type and how many of those instances to use. For this, you’ll need to use a load generator tool, such as the open-source projects perfsizesagemaker and perfsize that Intuit has developed. Perfsizesagemaker allows you to automatically test model endpoint configurations with a variety of payloads, response times, and peak transactions per second requirements. It generates detailed test results that compare different model versions. Perfsize is the companion tool that tries different configurations given only the peak transactions per second and the expected response time.
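
If you just need a quick sanity check before running those tools, a small script like the following (with a placeholder endpoint name and payload) can send concurrent requests and report rough p50/p95 latencies. It is only a sketch, not a substitute for Inference Recommender or the perfsize tools.

import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import boto3

smr = boto3.client("sagemaker-runtime")
ENDPOINT = "candidate-endpoint"   # placeholder endpoint name
PAYLOAD = "1.0,2.0,3.0"           # placeholder CSV record

def one_request(_):
    start = time.perf_counter()
    smr.invoke_endpoint(EndpointName=ENDPOINT, ContentType="text/csv", Body=PAYLOAD)
    return (time.perf_counter() - start) * 1000  # latency in milliseconds

with ThreadPoolExecutor(max_workers=16) as pool:
    latencies = sorted(pool.map(one_request, range(500)))

print("p50 ms:", statistics.median(latencies))
print("p95 ms:", latencies[int(len(latencies) * 0.95) - 1])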

A/B testing

In many cases where user reaction to the immediate output of the model is required, such as in ecommerce applications, offline functional evaluation of the model isn’t sufficient. In these scenarios, you need to A/B test models in production before making the decision to update them. A/B testing also has its risks because there could be real customer impact. This testing method serves as the final validation of ML performance as well as a lightweight engineering sanity check. It also addresses challenges 8 and 9 (online model performance and operational performance).

A/B testing should be performed in a production environment.

With SageMaker, you can easily perform A/B testing on ML models by running multiple production variants on an endpoint. Traffic can be routed in increments to the new version to reduce the impact that a badly behaving model could have on production. If the results of the A/B test look good, traffic is routed to the new version, eventually taking 100% of traffic. We recommend using deployment guardrails to transition from model A to model B. For a more complete discussion of A/B testing using Amazon Personalize models as an example, refer to Using A/B testing to measure the efficacy of recommendations generated by Amazon Personalize.
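
The following Boto3 sketch shows what such an endpoint configuration with two weighted production variants might look like, and how to shift traffic toward the new version without redeploying. Model names, instance types, and weights are placeholders.

import boto3

sm = boto3.client("sagemaker")

# 90% of traffic to the current model (A), 10% to the candidate (B) -- placeholder names and weights
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "model-a",
            "ModelName": "model-a",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.9,
        },
        {
            "VariantName": "model-b",
            "ModelName": "model-b",
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.1,
        },
    ],
)
sm.create_endpoint(EndpointName="ab-test-endpoint", EndpointConfigName="ab-test-config")

# If the candidate looks good, shift more traffic to it without redeploying
sm.update_endpoint_weights_and_capacities(
    EndpointName="ab-test-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "model-a", "DesiredWeight": 0.5},
        {"VariantName": "model-b", "DesiredWeight": 0.5},
    ],
)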

Online model testing

In this scenario, the new version of a model is significantly different from the one already serving live traffic in production, so the offline testing approach is no longer suitable to determine the efficacy of the new model version. The most prominent reason for this is a change in the features required to produce the prediction, so that previously recorded transactions can’t be used to test the model. In this scenario, we recommend using shadow deployments. Shadow deployments offer the capability to deploy a shadow (or challenger) model alongside the production (or champion) model that is currently serving predictions. This lets you evaluate how the shadow model performs on production traffic. The predictions of the shadow model aren’t served to the requesting application; they’re logged for offline evaluation. With the shadow approach to testing, we address challenges 4, 5, 6, and 7 (business process testing, cost, security, and feature store scalability).

Online model testing should be done in staging or production environments.

This method of testing new model versions should be used as a last resort if none of the other methods can be used. We recommend it as a last resort because duplexing calls to multiple models generates additional load on all downstream services in production, which can lead to performance bottlenecks as well as increased cost in production. The most obvious impact is on the feature serving layer. For use cases that share features from a common pool of physical data, you need to be able to simulate multiple use cases concurrently accessing the same data table to ensure no resource contention exists before transitioning to production. Wherever possible, duplicate queries to the feature store should be avoided, and features needed for both versions of the model should be reused for the second inference. Feature stores based on Amazon DynamoDB, such as the one Intuit has built, can implement Amazon DynamoDB Accelerator (DAX) to cache and avoid doubling the I/O to the database. These and other caching options can mitigate challenge 7 (feature store scalability).

To address challenge 5 (cost) as well as 7, we propose using shadow deployments to sample the incoming traffic. This gives model owners another layer of control to minimize impact on the production systems.
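
At the application layer, sampling can be as simple as duplexing a configurable fraction of requests to the shadow endpoint and logging its responses for offline comparison. The following is a rough sketch with placeholder endpoint names; in practice you would likely make the shadow call asynchronously so it doesn’t add latency to the production path.

import json
import random
import boto3

smr = boto3.client("sagemaker-runtime")
SHADOW_SAMPLE_RATE = 0.1   # send 10% of production traffic to the challenger (placeholder)

def predict(payload: str) -> str:
    # The champion model always serves the response to the caller
    champion = smr.invoke_endpoint(
        EndpointName="champion-endpoint", ContentType="text/csv", Body=payload
    )
    result = champion["Body"].read().decode()

    # Only a sample of requests is duplexed to the shadow model to limit downstream load
    if random.random() < SHADOW_SAMPLE_RATE:
        shadow = smr.invoke_endpoint(
            EndpointName="shadow-endpoint", ContentType="text/csv", Body=payload
        )
        # Log both predictions for offline comparison; never return the shadow result
        print(json.dumps({
            "payload": payload,
            "champion": result,
            "shadow": shadow["Body"].read().decode(),
        }))

    return result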

Shadow deployments should be onboarded to Model Monitor just like regular production deployments in order to observe the improvements of the challenger version.

Conclusion

This post illustrates the building blocks to create a comprehensive set of processes and tools to address various challenges with model testing. Although every organization is unique, this should help you get started and narrow down your considerations when implementing your own testing strategy.


About the authors

Tobias Wenzel is a Software Engineering Manager for the Intuit Machine Learning Platform in Mountain View, California. He has been working on the platform since its inception in 2016 and has helped design and build it from the ground up. In his job, he has focused on the operational excellence of the platform and bringing it successfully through Intuit’s seasonal business. In addition, he is passionate about continuously expanding the platform with the latest technologies.

Shivanshu Upadhyay is a Principal Solutions Architect in the AWS Business Development and Strategic Industries group. In this role, he helps the most advanced adopters of AWS transform their industries by effectively using data and AI.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.
