Create Amazon SageMaker model building pipelines and deploy R models using RStudio on Amazon SageMaker

In November 2021, in collaboration with RStudio PBC, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to Amazon SageMaker in just a few simple steps.

RStudio is one of the most popular IDEs among R developers for machine learning (ML) and data science projects. RStudio provides open-source tools for R and enterprise-ready professional software for data science teams to develop and share their work in the organization. Bringing RStudio on SageMaker not only gives you access to the AWS infrastructure in a fully managed way, but it also gives you native access to SageMaker.

In this post, we explore how you can use SageMaker features via RStudio on SageMaker to build a SageMaker pipeline that processes data, then trains, evaluates, and registers your R models. We also explore using SageMaker for model deployment, all using R.

Solution overview

The following diagram shows the architecture used in our solution. All code used in this example can be found in the GitHub repository.

Prerequisites

To follow this post, access to RStudio on SageMaker is required. If you’re new to using RStudio on SageMaker, review Get started with RStudio on Amazon SageMaker.

We also need to build custom Docker containers. We use AWS CodeBuild to build these containers, so you need a few extra AWS Identity and Access Management (IAM) permissions that you might not have by default. Before you proceed, make sure that the IAM role that you’re using has a trust policy with CodeBuild:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "codebuild.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following permissions are also required in the IAM role to run a build in CodeBuild and push the image to Amazon Elastic Container Registry (Amazon ECR):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codebuild:DeleteProject",
                "codebuild:CreateProject",
                "codebuild:BatchGetBuilds",
                "codebuild:StartBuild"
            ],
            "Resource": "arn:aws:codebuild:*:*:project/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogStream",
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:DescribeRepositories",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:InitiateLayerUpload", 
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/sagemaker-studio*"
        },
        {
            "Sid": "ReadAccessToPrebuiltAwsImages",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": [
                "arn:aws:ecr:*:763104351884:repository/*",
                "arn:aws:ecr:*:217643126080:repository/*",
                "arn:aws:ecr:*:727897471807:repository/*",
                "arn:aws:ecr:*:626614931356:repository/*",
                "arn:aws:ecr:*:683313688378:repository/*",
                "arn:aws:ecr:*:520713654638:repository/*",
                "arn:aws:ecr:*:462105765813:repository/*"
            ]
        },
        {
            "Sid": "EcrAuthorizationTokenRetrieval",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObject"
              ],
            "Resource": "arn:aws:s3:::sagemaker-*/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket"
            ],
            "Resource": "arn:aws:s3:::sagemaker*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:ListRoles"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLikeIfExists": {
                    "iam:PassedToService": "codebuild.amazonaws.com"
                }
            }
        }
    ]
}

Create baseline R containers

To use our R scripts for processing and training on SageMaker processing and training jobs, we need to create our own Docker containers containing the necessary runtime and packages. The ability to use your own container, which is part of the SageMaker offering, gives great flexibility to developers and data scientists to use the tools and frameworks of their choice, with virtually no limitations.

We create two R-enabled Docker containers: one for processing jobs and one for training and deployment of our models. Processing data typically requires different packages and libraries than modeling, so it makes sense here to separate the two stages and use different containers.

For more details about using containers with SageMaker, refer to Using Docker containers with SageMaker.

The container used for processing is defined as follows:

FROM public.ecr.aws/docker/library/r-base:4.1.2

# Install tidyverse
RUN apt update && apt-get install -y --no-install-recommends \
    r-cran-tidyverse
    
RUN R -e "install.packages(c('rjson'))"

ENTRYPOINT ["Rscript"]

For this post, we use a simple and relatively lightweight container. Depending on your or your organization’s needs, you may want to pre-install several more R packages.

The container used for training and deployment is defined as follows:

FROM public.ecr.aws/docker/library/r-base:4.1.2

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    wget \
    apt-transport-https \
    ca-certificates \
    libcurl4-openssl-dev \
    libsodium-dev
    
RUN apt-get update && apt-get install -y python3-dev python3-pip 
RUN pip3 install boto3
RUN R -e "install.packages(c('readr','plumber', 'reticulate'),dependencies=TRUE, repos='http://cran.rstudio.com/')"

ENV PATH="/opt/ml/code:${PATH}"

WORKDIR /opt/ml/code

COPY ./docker/run.sh /opt/ml/code/run.sh
COPY ./docker/entrypoint.R /opt/ml/entrypoint.R

RUN /bin/bash -c 'chmod +x /opt/ml/code/run.sh'

ENTRYPOINT ["/bin/bash", "run.sh"]

The RStudio kernel runs on a Docker container, so you won’t be able to build and deploy the containers using Docker commands directly on your Studio session. Instead, you can use the very useful library sagemaker-studio-image-build, which essentially outsources the task of building containers to CodeBuild.

With the following commands, we create two Amazon ECR repositories, sagemaker-r-processing and sagemaker-r-train-n-deploy, and build the respective containers that we use later:

if (!py_module_available("sagemaker-studio-image-build")){py_install("sagemaker-studio-image-build", pip=TRUE)}
system("cd pipeline-example ; sm-docker build . —file ./docker/Dockerfile-train-n-deploy —repository sagemaker-r-train-and-deploy:1.0")
system("cd pipeline-example ; sm-docker build . —file ./docker/Dockerfile-processing —repository sagemaker-r-processing:1.0")

Create the pipeline

Now that the containers are built and ready, we can create the SageMaker pipeline that orchestrates the model building workflow. The full code for this is in the file pipeline.R in the repository. The easiest way to create a SageMaker pipeline is by using the SageMaker SDK, a Python library that we can access through the reticulate library. This gives us access to all the functionality of SageMaker without leaving the R language environment.

The pipeline we build has the following components:

  • Preprocessing step – This is a SageMaker processing job (utilizing the sagemaker-r-processing container) responsible for preprocessing the data and splitting the data into train and test datasets.
  • Training step – This is a SageMaker training job (utilizing the sagemaker-r-train-n-deploy container) responsible for training the model. In this example, we train a simple linear model.
  • Evaluation step – This is a SageMaker processing job (utilizing the sagemaker-r-processing container) responsible for evaluating the model. Specifically in this example, we’re interested in the RMSE (root mean square error) on the test dataset, which we want to use in the next step as well as to associate with the model itself.
  • Conditional step – This is a conditional step, native to SageMaker pipelines, that allows us to branch the pipeline logic based on some parameter. In this case, the pipeline branches based on the value of RMSE calculated in the previous step (a minimal SDK sketch of this branch follows the list).
  • Register model step – If the preceding conditional step is True, and the performance of the model is acceptable, then the model is registered in the model registry. For more information, refer to Register and Deploy Models with Model Registry.
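
Under the hood, the R code in pipeline.R drives the SageMaker Python SDK through reticulate. As an illustration of the conditional branch, the following is a minimal Python sketch; the step name EvaluateModel, the placeholder step_register, the JSON path, and the RMSE threshold are assumptions for illustration, not the exact values used in the repository.

from sagemaker.workflow.condition_step import ConditionStep
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.properties import PropertyFile

# Property file written by the evaluation step (attach it to that step
# via property_files=[evaluation_report])
evaluation_report = PropertyFile(
    name="EvaluationReport", output_name="evaluation", path="evaluation.json"
)

# Read the RMSE recorded by the evaluation step (hypothetical JSON path)
rmse = JsonGet(
    step_name="EvaluateModel",
    property_file=evaluation_report,
    json_path="rmse",
)

# Register the model only if the RMSE is below an acceptable threshold
step_condition = ConditionStep(
    name="CheckRMSE",
    conditions=[ConditionLessThanOrEqualTo(left=rmse, right=3.0)],
    if_steps=[step_register],  # placeholder: the register model step
    else_steps=[],
)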

First, call the upsert function to create (or update) the pipeline, and then call the start function to start running the pipeline:

source("pipeline-example/pipeline.R")
my_pipeline <- get_pipeline(input_data_uri=s3_raw_data)

upserted <- my_pipeline$upsert(role_arn=role_arn)
started <- my_pipeline$start()

Inspect the pipeline and model registry

One of the great things about using RStudio on SageMaker is that by being on the SageMaker platform, you can use the right tool for the right job and swiftly switch between them based on what you need to do.

As soon as we start the pipeline run, we can switch to Amazon SageMaker Studio, which allows us to visualize the pipeline and monitor current and previous runs of it.

To view details about the pipeline we just created and ran, navigate to the Studio IDE interface, choose SageMaker resources, choose Pipelines on the drop-down menu, and choose the pipeline (in this case, AbalonePipelineUsingR).

This reveals details of the pipeline, including all current and previous runs. Choose the latest one to bring up a visual representation of the pipeline, as per the following screenshot.

The DAG of the pipeline is created automatically by the service based on the data dependencies between steps, as well as any custom dependencies you add (none are added in this example).

When the run is complete, if successful, you should see all the steps turn green.

Choosing any of the individual steps brings up details about the specific step, including inputs, outputs, logs, and initial configuration settings. This allows you to drill down in the pipeline and investigate any failed steps.

Similarly, when the pipeline has finished running, a model is saved in the model registry. To access it, in the SageMaker resources pane, choose Model registry on the drop-down and choose your model. This reveals the list of registered models, as shown in the following screenshot. Choose one to open the details page for that particular model version.

After you open a version of the model, choose Update Status and Approve to approve the model.

At this point, based on your use case, you can set up this approval to trigger further actions, including the deployment of the model as per your needs.

Serverless deployment of the model

After you’ve trained and registered a model on SageMaker, deploying the model on SageMaker is straightforward.

There are several options of how you can deploy a model, such as batch inference, real-time endpoints, or asynchronous endpoints. Each method comes with several required configurations, including choosing the instance type you want as well as the scaling mechanism.

For this example, we use the recently announced SageMaker feature, Serverless Inference (in preview at the time of writing), to deploy our R model on a serverless endpoint. For this type of endpoint, we only define the amount of RAM that we want allocated to the model for inference, as well as the maximum number of allowed concurrent invocations of the model. SageMaker takes care of hosting the model and auto scaling as needed. You’re charged only for the exact number of seconds and the amount of data used by the model, with no cost for idle time.

You can deploy the model to a serverless endpoint with the following code:

model_package_arn <- 'ENTER_MODEL_PACKAGE_ARN_HERE'
model <- sagemaker$ModelPackage(
                        role=role_arn, 
                        model_package_arn=model_package_arn, 
                        sagemaker_session=session)
serverless_config <- sagemaker$serverless$ServerlessInferenceConfig(
                        memory_size_in_mb=1024L, 
                        max_concurrency=5L)
model$deploy(serverless_inference_config=serverless_config, 
             endpoint_name="serverless-r-abalone-endpoint")

If you see the error ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Invalid approval status "PendingManualApproval", the model you want to deploy hasn’t been approved. Follow the steps from the previous section to approve your model.
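
If you prefer to approve the model programmatically rather than through the Studio UI, a minimal sketch using boto3 (with a placeholder model package ARN) could look like the following; the same call could also be made from R via reticulate and boto3.

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder: the ARN of the model package version you want to approve
model_package_arn = "ENTER_MODEL_PACKAGE_ARN_HERE"

sm_client.update_model_package(
    ModelPackageArn=model_package_arn,
    ModelApprovalStatus="Approved",
)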

You can invoke the endpoint by sending a request directly to the HTTP endpoint we deployed, or by using the SageMaker SDK. In the following code, we invoke the endpoint on some test data:

library(jsonlite)
x = list(features=format_csv(abalone_t[1:3,1:11]))
x = toJSON(x)

# test the endpoint
predictor <- sagemaker$predictor$Predictor(endpoint_name="serverless-r-abalone-endpoint", sagemaker_session=session)
predictor$predict(x)

The endpoint we invoked was a serverless endpoint, and as such we’re charged for the exact duration and data used. You might notice that the first time you invoke the endpoint it takes about a second to respond. This is due to the cold start time of the serverless endpoint. If you make another invocation soon after, the model returns the prediction in real time because it’s already warm.

When you finish experimenting with the endpoint, you can delete it with the following command:

predictor$delete_endpoint(delete_endpoint_config=TRUE)

Conclusion

In this post, we walked through the process of creating a SageMaker pipeline using R in our RStudio environment and showcased how to deploy our R model on a serverless endpoint on SageMaker using the SageMaker model registry.

With the combination of RStudio and SageMaker, you can now create and orchestrate complete end-to-end ML workflows on AWS using your preferred language, R.

To dive deeper into this solution, I encourage you to review the source code of this solution, as well as other examples, on GitHub.


About the Author

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Read More

MLOps at the edge with Amazon SageMaker Edge Manager and AWS IoT Greengrass

The Internet of Things (IoT) has enabled customers in multiple industries, such as manufacturing, automotive, and energy, to monitor and control real-world environments. By deploying a variety of edge IoT devices such as cameras, thermostats, and sensors, you can collect data, send it to the cloud, and build machine learning (ML) models to predict anomalies, failures, and more. However, if the use case requires real-time prediction, you need to enrich your IoT solution with ML at the edge (ML@Edge) capabilities. ML@Edge is a concept that decouples the ML model’s lifecycle from the app lifecycle and allows you to run an end-to-end ML pipeline that includes data preparation, model building, model compilation and optimization, model deployment (to a fleet of edge devices), model execution, and model monitoring and governing. You deploy the app once and run the ML pipeline as many times as you need.

As you can imagine, implementing all the steps proposed by the ML@Edge concept is not trivial. There are many questions that developers need to address in order to implement a complete ML@Edge solution, for example:

  • How do I operate ML models on a fleet (hundreds, thousands, or millions) of devices at the edge?
  • How do I secure my model while deploying and running it at the edge?
  • How do I monitor my model’s performance and retrain it, if needed?

In this post, you learn how to answer all these questions and build an end-to-end solution for automating your ML@Edge pipeline. You’ll see how to use Amazon SageMaker Edge Manager, Amazon SageMaker Studio, and AWS IoT Greengrass v2 to create an MLOps (ML Operations) environment that automates the process of building and deploying ML models to large fleets of edge devices.

In the next sections, we present a reference architecture that details all the components and workflows required to build a complete solution for MLOps focused on edge workloads. Then we dive deep into the steps this solution runs automatically to build and prepare a new model. We also show you how to prepare the edge devices to start deploying, running, and monitoring ML models, and demonstrate how to monitor and maintain the ML models deployed to your fleet of devices.

Solution overview

Productionization of robust ML models requires the collaboration of multiple personas, such as data scientists, ML engineers, data engineers, and business stakeholders, under a semi-automated infrastructure following specific operations (MLOps). Also, the modularization of the environment is important in order to give all these different personas the flexibility and agility to develop or improve (independently of the workflow) the component for which they are responsible. An example of such an infrastructure consists of multiple AWS accounts that enable this collaboration and productionization of the ML models both in the cloud and on the edge devices. In the following reference architecture, we show how we organized the multiple accounts and services that compose this end-to-end MLOps platform for building ML models and deploying them at the edge.

This solution consists of the following accounts:

  • Data lake account – Data engineers ingest, store, and prepare data from multiple data sources, including on-premises databases and IoT devices.
  • Tooling account – IT operators manage and check CI/CD pipelines for automated continuous delivery and deployment of ML model packages across the pre-production and production accounts for remote edge devices. Runs of CI/CD pipelines are automated through the usage of Amazon EventBridge, which monitors change status events of ML models and targets AWS CodePipeline.
  • Experimentation and development account – Data scientists can conduct research and experiment with multiple modeling techniques and algorithms to solve business problems based on ML, creating proof of concept solutions. ML engineers and data scientists collaborate to scale a proof of concept, creating automated workflows using Amazon SageMaker Pipelines to prepare data and build, train, and package ML models. The deployment of the pipelines is driven via CI/CD pipelines, while the version control of the models is achieved using the Amazon SageMaker model registry. Data scientists evaluate the metrics of multiple model versions and request the promotion of the best model to production by triggering the CI/CD pipeline.
  • Pre-production account – Before the promotion of the model to the production environment, the model needs to be tested to ensure robustness in a simulation environment. Therefore, the pre-production environment is a simulator of the production environment, in which SageMaker model endpoints are deployed and tested automatically. Test methods might include an integration test, stress test, or ML-specific tests on inference results. In this case, the production environment isn’t a SageMaker model endpoint but an edge device. To simulate an edge device in pre-production, two approaches are possible: use an Amazon Elastic Compute Cloud (Amazon EC2) instance with the same hardware characteristics, or use an in-lab testbed consisting of the actual devices. With this infrastructure, the CI/CD pipeline deploys the model to the corresponding simulator and conducts the multiple tests automatically. After the tests run successfully, the CI/CD pipeline requires manual approval (for example, from the IoT stakeholder) to promote the model to production.
  • Production account – In the case of hosting the model on the AWS Cloud, the CI/CD pipeline deploys a SageMaker model endpoint on the production account. In this case, the production environment consists of multiple fleets of edge devices. Therefore, the CI/CD pipeline uses Edge Manager to deploy the models to the corresponding fleet of devices.
  • Edge devices – Remote edge devices are hardware devices that can run ML models using Edge Manager. It allows the application on those devices to manage the models, run inference against the models, and capture data securely into Amazon Simple Storage Service (Amazon S3).

SageMaker projects help you to automate the process of provisioning resources inside each of these accounts. We don’t dive deep into this feature, but to learn more about how to build a SageMaker project template that deploys ML models across accounts, check out Multi-account model deployment with Amazon SageMaker Pipelines.

Pre-production account: Digital twin

After the training process, the resulting model needs to be evaluated. In the pre-production account, you have a simulated edge device. It represents the digital twin of the edge device on which the ML model runs in production. This environment has the dual purpose of performing the classic tests (such as unit, integration, and smoke tests) and being a playground for the development team. This device is simulated using an EC2 instance on which all the components needed to manage the ML model are deployed.

The involved services are as follows:

  • AWS IoT Core – We use AWS IoT Core to create AWS IoT thing objects, create a device fleet, register the device fleet so it can interact with the cloud, create X.509 certificates to authenticate edge devices to AWS IoT Core, associate the role alias with AWS IoT Core that was generated when the fleet was created, get an AWS account-specific endpoint for the credential provider, get an official Amazon Root CA file, and upload the Amazon CA file to Amazon S3.
  • Amazon SageMaker Neo – SageMaker Neo automatically optimizes machine learning models for inference to run faster with no loss in accuracy. It supports machine learning models already built with DarkNet, Keras, MXNet, PyTorch, TensorFlow, TensorFlow Lite, ONNX, or XGBoost and trained in Amazon SageMaker or anywhere else. You then choose your target hardware platform, which can be a SageMaker hosting instance or an edge device based on processors from Ambarella, Apple, ARM, Intel, MediaTek, NVIDIA, NXP, Qualcomm, Rockchip, Texas Instruments, or Xilinx.
  • Edge Manager – We use Edge Manager to register and manage the edge device within the SageMaker device fleets. Fleets are collections of logically grouped devices that you can use to collect and analyze data. In addition, the Edge Manager packager packages the optimized model and creates an AWS IoT Greengrass V2 component that can be deployed directly. You can use Edge Manager to operate ML models on a fleet of smart cameras, smart speakers, robots, and other SageMaker device fleets. A minimal sketch of the fleet creation and device registration calls follows this list.
  • AWS IoT Greengrass V2 – AWS IoT Greengrass allows you to deploy components onto the simulated devices using an EC2 instance. By using the AWS IoT Greengrass V2 agent on the EC2 instances, we can simplify the access, management, and deployment of the Edge Manager agent and model to devices. Without AWS IoT Greengrass V2, setting up devices and fleets to use Edge Manager requires you to manually copy the agent from an S3 release bucket. With the AWS IoT Greengrass V2 and Edge Manager integration, it’s possible to use AWS IoT Greengrass V2 components. Components are pre-built software modules that can connect edge devices to AWS services or third-party services via AWS IoT Greengrass.
  • Edge Manager agent – The Edge Manager agent is deployed via AWS IoT Greengrass V2 on the EC2 instance. The agent can load multiple models at a time and run inference with the loaded models on edge devices. The number of models the agent can load is determined by the available memory on the device.
  • Amazon S3 – We use an S3 bucket to store the inference data captured by the Edge Manager agent.
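
To give a concrete flavor of these calls, the following is a minimal boto3 sketch of creating an IoT thing, creating an Edge Manager device fleet, and registering the device in that fleet. The thing, fleet, role, and bucket names are placeholders, not values from this solution.

import boto3

iot_client = boto3.client("iot")
sm_client = boto3.client("sagemaker")

# Create the IoT thing that represents the (simulated) edge device
iot_client.create_thing(thingName="edge-device-01")

# Create an Edge Manager device fleet; captured data lands in the S3 location
sm_client.create_device_fleet(
    DeviceFleetName="my-edge-fleet",                         # placeholder fleet name
    RoleArn="arn:aws:iam::111122223333:role/EdgeFleetRole",  # placeholder role
    OutputConfig={"S3OutputLocation": "s3://my-bucket/edge-fleet-data/"},
)

# Register the device in the fleet, linking it to its IoT thing
sm_client.register_devices(
    DeviceFleetName="my-edge-fleet",
    Devices=[{"DeviceName": "edge-device-01", "IotThingName": "edge-device-01"}],
)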

We can define a pre-production account as a digital twin for testing ML models before moving them into real edge devices. This offers the following benefits:

  • Agility and flexibility – Data scientists and ML engineers need to quickly validate that the ML model and associated scripts (preprocessing and inference scripts) will work on the edge device. However, IoT and data science departments in large enterprises may be different entities. By identically replicating the technology stack in the cloud, data scientists and ML engineers can iterate and consolidate artifacts prior to deployment.
  • Accelerated risk assessment and production time – Deployment on the edge device is the final stage of the process. After you validate everything in an isolated and self-contained environment, make sure it adheres to the specifications required by the edge in terms of quality, performance, and integration. This helps avoid further involvement of other people in the IoT department to fix and iterate on artifact versions.
  • Improved team collaboration and enhanced quality and performance – The development team can immediately assess the impact of the ML model by analyzing edge hardware metrics and measuring the level of interaction with third-party tools (for example, I/O rate). The IoT team is then only responsible for deployment to the production environment, and can be confident that the artifacts are accurate for that environment.
  • Integrated playground for testing – Given the target of ML models, the pre-production environment in a traditional workflow should be represented by an edge device outside the cloud environment. This introduces another level of complexity. Integrations are needed to collect metrics and feedback. Instead, by using the digital twin simulated environment, interactions are reduced and time to market is shortened.

Production account and edge environment

After the tests are complete and artifact stability is achieved, you can proceed to production deployment through the pipelines. Artifact deployment occurs programmatically after an operator has approved the artifact. However, operators are granted read-only access to the AWS Management Console so they can monitor metadata associated with the fleets and therefore have insight into the version of the deployed ML model and other metrics associated with the lifecycle.

Edge device fleets belong to the AWS production account. This account has specific security and networking configurations to allow communication between the cloud and edge devices. The main AWS services deployed in the production account are Edge Manager, which is responsible for managing all the device fleets, collecting data, and operating ML models, and AWS IoT Core, which manages IoT thing objects, certificates, role alias, and endpoints.

At the same time, we need to configure an edge device with the services and components to manage ML models. The main components are as follows:

  • AWS IoT Greengrass V2
  • An Edge Manager agent
  • AWS IoT certificates
  • Application.py, which is responsible for orchestrating the inference process (retrieving information from the edge data source and performing inference using the Edge Manager agent and loaded ML model)
  • A connection to Amazon S3 or the data lake account to store inference data

Automated ML pipeline

Now that you know more about the organization and the components of the reference architecture, we can dive deeper into the ML pipeline that we use to build, train, and evaluate the ML model inside the development account.

A pipeline (built using Amazon SageMaker Model Building Pipelines) is a series of interconnected steps that is defined by a JSON pipeline definition. This pipeline definition encodes a pipeline using a Directed Acyclic Graph (DAG). This DAG gives information on the requirements for and relationships between each step of your pipeline. The structure of a pipeline’s DAG is determined by the data dependencies between steps. These data dependencies are created when the properties of a step’s output are passed as the input to another step.
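
As a brief illustration of such a data dependency, the following hedged Python SDK sketch passes the train split produced by a processing step into a training step; the step names, output channel name, and estimator are placeholders.

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# step_process is a previously defined ProcessingStep with an output named "train";
# estimator is a previously configured Estimator (both placeholders). Referencing the
# processing step's output property as the training input creates the DAG edge.
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig
                .Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)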

To enable data science teams to easily automate the creation of new versions of ML models, it’s important to introduce validation steps and automated data collection for continuously feeding and improving ML models, as well as model monitoring strategies to enable pipeline triggering. The following diagram shows an example pipeline.

To enable automation and MLOps capabilities, it’s important to create modular components and reusable code artifacts that can be shared across different steps and ML use cases. This enables you to quickly move the implementation from an experimentation phase to a production phase by automating the transition.

The steps for defining an ML pipeline for enabling the continuous training and versioning of ML models are as follows:

  • Preprocessing – The process of data cleaning, feature engineering, and dataset creation for training the ML algorithm
  • Training – The process of training the developed ML algorithm for generating a new version of the ML model artifact
  • Evaluation – The process of evaluating the generated ML model to extract key metrics related to the model’s behavior on new data not seen during the training phase
  • Registration – The process of versioning the newly trained ML model artifact by linking the extracted metrics with the generated artifact

You can see more details of how to build a SageMaker pipeline in the following notebook.

Trigger CI/CD pipelines using EventBridge

When you finish building the model, you can start the deployment process. The last step of the SageMaker pipeline defined in the previous section registers a new version of the model in the specific SageMaker model registry group. The deployment of a new version of the ML model is managed using the model registry status. When you manually approve or reject an ML model version, this step raises an event that is captured by EventBridge. This event can then start a new pipeline (CI/CD this time) for creating a new version of the AWS IoT Greengrass component that is then deployed to the pre-production and production accounts. The following screenshot shows our defined EventBridge rule.

This rule monitors the SageMaker model package group by looking for updates of model packages in the status Approved or Rejected.
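
As an illustration, a rule with this behavior could also be created programmatically; in the following hedged boto3 sketch, the event pattern fields follow the SageMaker model package state-change event format, while the rule and model package group names are placeholders.

import json

import boto3

events_client = boto3.client("events")

# Match state changes of model packages in a specific model package group
event_pattern = {
    "source": ["aws.sagemaker"],
    "detail-type": ["SageMaker Model Package State Change"],
    "detail": {
        "ModelPackageGroupName": ["mlops-edge-model-group"],   # placeholder group name
        "ModelApprovalStatus": ["Approved", "Rejected"],
    },
}

events_client.put_rule(
    Name="mlops-edge-model-package-state-change",              # placeholder rule name
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)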

The EventBridge rule is then configured to target CodePipeline, which starts the workflow of creating a new AWS IoT Greengrass component by using Amazon SageMaker Neo and Edge Manager.

Optimize ML models for the target architecture

Neo allows you to optimize ML models for performing inference on edge devices (and in the cloud). It automatically optimizes the ML models for better performance based on the target architecture, and decouples the model from the original framework, allowing you to run it on a lightweight runtime.

Refer to the following notebook for an example of how to compile a PyTorch Resnet18 model using Neo.
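
For reference, a compilation job can also be started directly with boto3. The following hedged sketch compiles a PyTorch model for an x86_64 Linux target; the job name, role, S3 paths, and input shape are placeholders.

import boto3

sm_client = boto3.client("sagemaker")

sm_client.create_compilation_job(
    CompilationJobName="resnet18-neo-compilation",               # placeholder name
    RoleArn="arn:aws:iam::111122223333:role/SageMakerRole",      # placeholder role
    InputConfig={
        "S3Uri": "s3://my-bucket/models/resnet18/model.tar.gz",  # placeholder model artifact
        "DataInputConfig": '{"input0": [1, 3, 224, 224]}',       # placeholder input name/shape
        "Framework": "PYTORCH",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/models/resnet18-compiled/",
        "TargetPlatform": {"Os": "LINUX", "Arch": "X86_64"},
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)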

Build the deployment package by including the AWS IoT Greengrass component

Edge Manager allows you to manage, secure, deploy, and monitor models to a fleet of edge devices. In the following notebook, you can see more details of how to build a minimalist fleet of edge devices and run some experiments with this feature.

After you configure the fleet and compile the model, you need to run an Edge Manager packaging job, which prepares the model to be deployed to the fleet. You can start a packaging job by using the Boto3 SDK. For our parameters, we use the optimized model and model metadata. By adding the following parameters to OutputConfig, the job also prepares an AWS IoT Greengrass V2 component with the model:

  • PresetDeploymentType
  • PresetDeploymentConfig

See the following code:

import boto3
import json
import time

sagemaker_client = boto3.client('sagemaker')

sagemaker_client.create_edge_packaging_job(
    EdgePackagingJobName="mlops-edge-packaging-{}".format(int(time.time()*1000)),
    CompilationJobName=compilation_job_name,
    ModelName="PytorchMLOpsEdgeModel",
    ModelVersion="1.0.0",
    RoleArn=role,
    OutputConfig={
        'S3OutputLocation': 's3://{}/model/'.format(bucket_name),
        "PresetDeploymentType": "GreengrassV2Component",
        "PresetDeploymentConfig": json.dumps(
            {"ComponentName": component_name, "ComponentVersion": component_version}
        ),
    }
)

Deploy ML models at the edge at scale

Now it’s time to deploy the model to your fleet of edge devices. First, we need to ensure that we have the necessary AWS Identity and Access Management (IAM) permissions to provision our IoT devices and to deploy components to them. We require two basic elements to start onboarding devices into our IoT platform:

  • IAM policy – This policy, attached to the user or role performing the provisioning, allows for the automatic provisioning of the devices. It should have IoT write permissions to create the IoT thing and group, as well as to attach the necessary policies to the device. For more information, refer to Minimal IAM policy for installer to provision resources.
  • IAM role – This role is attached to the IoT things and groups that we create. You can create this role at provisioning time with basic permissions, but it will lack features like access to Amazon S3 or AWS Key Management Service (AWS KMS) that might be needed later. You can create this role beforehand and reuse it when you provision the device. For more information, refer to Authorize core devices to interact with AWS.

AWS IoT Greengrass installation and provisioning

After we have the IAM policy and role in place, we’re ready to install the AWS IoT Greengrass Core software with automatic resource provisioning. Although it’s possible to provision the IoT resources following manual steps, there is a convenient procedure for automatically provisioning these resources during the installation of the AWS IoT Greengrass v2 nucleus. This is the preferred option to quickly onboard new devices onto the platform. Besides default-jdk, other packages need to be installed, such as curl, unzip, and python3.

When we provision our device, the IoT thing name must be exactly the same as the edge device defined in Edge Manager, otherwise data won’t be captured to the destination S3 bucket.

The installer can create the AWS IoT Greengrass role and alias during the installation if they don’t exist. However, they’ll be created with minimal permissions and will require manually adding more policies to interact with other services such as Amazon S3. We recommend creating these IAM resources beforehand as shown earlier, and then reuse them as you onboard new devices into the account.

Model and inference component packaging

After our code has been developed, we can deploy both the code (for inference) and our ML models as components into our devices.

After the ML model is trained in SageMaker, you can optimize the model with Neo using a SageMaker compilation job. The resulting compiled model artifacts can then be packaged into a Greengrass V2 component using the Edge Manager packager, and registered as a custom component in the My Components section on the AWS IoT Greengrass console. This component already contains the necessary lifecycle commands to download and decompress the model artifact on the device, so that the inference code can load the model and run inference on the captured images.

Regarding the inference code, we must create a component using the console or AWS Command Line Interface (AWS CLI). First, we pack our source inference code and necessary dependencies and upload them to Amazon S3. After we upload the code, we can create our component using a recipe in YAML or JSON like the following example:

---
RecipeFormatVersion: 2020-01-25
ComponentName: dummymodel.inference
ComponentVersion: 0.0.1
ComponentDescription: Deploys inference code to a client
ComponentPublisher: Amazon Web Services, Inc.
ComponentDependencies:
  aws.greengrass.TokenExchangeService:
    VersionRequirement: '>=0.0.0'
    DependencyType: HARD
  dummymodel:
    VersionRequirement: '>=0.0.0'
    DependencyType: HARD
Manifests:
  - Platform:
      os: linux
      architecture: "*"
    Lifecycle:
      install: |-
        apt-get install python3-pip
        pip3 install numpy
        pip3 install sysv_ipc
        pip3 install boto3
        pip3 install grpcio-tools
        pip3 install grpcio
        pip3 install protobuf
        pip3 install sagemaker
        tar xf {artifacts:path}/sourcedir.tar.gz
      run:
        script: |-
          sleep 5 && sudo python3 {work:path}/inference.py 
    Artifacts:
      - URI: s3://BUCKET-NAME/path/to/inference/sourcedir.tar.gz
        Permission:
          Execute: OWNER

This example recipe shows the name and description of our component, as well as the necessary prerequisites before our run script command. The recipe unpacks the artifact into a work folder environment on the device, and we use that path to run our inference code. The AWS CLI command to create such a recipe is:

aws greengrassv2 create-component-version --region $REGION \
                                          --inline-recipe fileb://path/to/recipe.yaml

You can now see this component created on the AWS IoT Greengrass console.

Note that the component version matters and must be specified in the recipe file. Repeating the same version number returns an error.

After our model and inference code have been set up as components, we’re ready to deploy them.

Deploy the application and model using AWS IoT Greengrass

In the previous sections, you learned how to package the inference code and the ML models. Now we can create a deployment with multiple components, which includes both the components and the configuration needed for our inference code to interact with the model on the edge device.

The Edge Manager agent is the component that should be installed on each edge device in order to enable all the Edge Manager capabilities. On the SageMaker console, we have a device fleet defined, which has an associated S3 bucket. All edge devices associated with the fleet capture and report their data to this S3 path. The agent can be deployed as a component in AWS IoT Greengrass v2, which makes it easier to install and configure than if the agent were deployed in standalone mode. When deploying the agent as a component, we need to specify its configuration parameters, namely the device fleet and the S3 path.

We create a deployment configuration with the custom components for the model and code we just created. This setup is defined in a JSON file that lists the deployment name and target, as well as the components in the deployment. We can add and update the configuration parameters of each component, such as in the Edge Manager agent, where we specify the fleet name and bucket.

{
    "targetArn": "targetArn",
    "deploymentName": "dummy-deployment",
    "components": {
        "aws.greengrass.Nucleus": {
            "version": "2.5.3"
        },
        "aws.greengrass.Cli": {
            "version": "2.5.3"
        },
        "aws.greengrass.SageMakerEdgeManager": {
            "version": "1.1.0",
            "configurationUpdate": {
                "merge": {
                    "DeviceFleetName": "FLEET-NAME",
                    "BucketName": "BUCKET-NAME-URI"
                }
            }
        },
        "dummymodel.inference": {
            "version": "0.0.1"
        },
        "dummymodel": {
            "version": "0.0.1"
        }
    }
}

It’s worth noting that we have added not only the model, inference components, and agent, but also the AWS IoT Greengrass CLI and nucleus as components. The former can help debug certain deployments locally on the device. The latter is added into the deployment to configure the necessary network access from the device itself if needed (for example, proxy settings), and also in case you want to perform an OTA upgrade of the AWS IoT Greengrass v2 nucleus. The nucleus isn’t deployed because it’s installed in the device, and only the configuration update will be applied (unless an upgrade is in place). To deploy, we simply need to run the following command over the preceding configuration. Remember to set up the target ARN to which the deployment will be applied (an IoT thing or IoT group). We can also deploy these components from the console.

aws greengrassv2 create-deployment --region $REGION \
                                   --cli-input-json file://path/to/deployment.json

Monitor and manage ML models deployed to the edge

Now that your application is running on the edge devices, it’s time to understand how to monitor the fleet to improve governance, maintenance, and visibility. On the SageMaker console, choose Edge Manager in the navigation pane, then choose Edge device fleets. From here, choose your fleet.

On the fleet’s detail page, you can see some metadata of the models that are running on each device of your fleet. The fleet report is generated every 24 hours.

Data captured by each device through the Edge Manager agent is sent to an S3 bucket in JSON Lines format (JSONL). The process of sending captured data is managed from an application standpoint, so you’re free to decide whether to send this data, and how and how often to send it.

You can use this data for many things, such as monitoring data drift and model quality, building a new dataset, enriching a data lake, and more. A simple example of how to use this data is when you identify data drift in the way users are interacting with your application and you need to train a new model. You then build a new dataset with the captured data and copy it back to the development account. This can automatically start a new run of your environment that builds a new model and redeploys it to the whole fleet to maintain the performance of the deployed solution.
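
As a small example of working with this data, the following sketch loads the captured JSON Lines files from the fleet's S3 location into a pandas DataFrame for analysis; the bucket and prefix are placeholders.

import boto3
import pandas as pd

s3_client = boto3.client("s3")

bucket = "my-edge-fleet-bucket"   # placeholder: the fleet's capture bucket
prefix = "edge-fleet-data/"       # placeholder: the fleet's capture prefix

frames = []
response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    if obj["Key"].endswith(".jsonl"):
        body = s3_client.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
        # Each file contains one JSON document per line
        frames.append(pd.read_json(body, lines=True))

if frames:
    captured = pd.concat(frames, ignore_index=True)
    print(captured.head())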

Conclusion

In this post, you learned how to build a complete solution that combines MLOps and ML@Edge using AWS services. Building such a solution is not trivial, but we hope the reference architecture presented in this post can inspire and help you build a solid architecture for your own business challenges. You can also use just the parts or modules of this architecture that integrate with your existing MLOps environment. By prototyping one single module at a time and using the appropriate AWS services to address each piece of this challenge, you can learn how to build a robust MLOps environment and also further simplify the final architecture.

As a next step, we encourage you to try out SageMaker Edge Manager to manage your ML at the edge lifecycle. For more information on how Edge Manager works, see Deploy models at the edge with SageMaker Edge Manager.


About the authors

Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His fields of expertise are machine learning end to end, machine learning industrialization, and MLOps. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.

Matteo Calabrese is an AI/ML Customer Delivery Architect in the AWS Professional Services team. He works with large enterprises in EMEA on AI/ML projects, helping them propose, design, deliver, scale, and optimize ML production workloads. His main areas of expertise are ML operations (MLOps) and machine learning at the edge. His goal is to shorten time to value and accelerate business outcomes by providing AWS best practices. In his spare time, he enjoys hiking and traveling.

Raúl Díaz García is a Sr Data Scientist in AWS Professional Services team. He works with large enterprise customers across EMEA, where he helps them enable solutions related to Computer Vision and Machine Learning in the IoT space.

Sokratis Kartakis is a Senior Machine Learning Specialist Solutions Architect for Amazon Web Services. Sokratis focuses on enabling enterprise customers to industrialize their machine learning (ML) solutions by exploiting AWS services and shaping their operating model (MLOps foundation) and transformation roadmap, leveraging best development practices. He has spent more than 15 years inventing, designing, leading, and implementing innovative end-to-end production-level ML and Internet of Things (IoT) solutions in the domains of energy, retail, health, finance/banking, motorsports, and more. Sokratis likes to spend his spare time with family and friends, or riding motorbikes.

Samir Araújo is an AI/ML Solutions Architect at AWS. He helps customers create AI/ML solutions that solve their business challenges using AWS. He has been working on several AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. He likes playing with hardware and automation projects in his free time, and he has a particular interest in robotics.

Read More

Optimal pricing for maximum profit using Amazon SageMaker

This is a guest post by Viktor Enrico Jeney, Senior Machine Learning Engineer at Adspert.

Adspert is a Berlin-based ISV that developed a bid management tool designed to automatically optimize performance marketing and advertising campaigns. The company’s core principle is to automate the maximization of profit in ecommerce advertising with the help of artificial intelligence. The continuous development of advertising platforms paves the way for new opportunities, which Adspert expertly utilizes for its customers’ success.

Adspert’s primary goal is to simplify the process for users while optimizing ad campaigns across different platforms. This includes using information gathered across the various platforms, balanced against the optimum budget set at a level above each platform. Adspert’s focus is to optimize a customer’s goal attainment, regardless of what platform is used. Adspert continues to add platforms as necessary to give its customers significant advantages.

In this post, we share how Adspert created the pricing tool from scratch using different AWS services like Amazon SageMaker and how Adspert collaborated with the AWS Data Lab to accelerate this project from design to build in record time.

The pricing tool reprices a seller-selected product on an ecommerce marketplace based on the visibility and profit margin to maximize profits on the product level.

As a seller, it’s essential that your products are always visible because this increases sales. The most important factor in ecommerce sales is simply whether your offer is visible to customers instead of a competitor’s offer.

Although it certainly depends on the specific ecommerce platform, we’ve found that product price is one of the most important key figures that can affect visibility. However, prices change often and fast; for this reason, the pricing tool needs to act in near-real time to increase visibility.

Overview of solution

The following diagram illustrates the solution architecture.

The solution contains the following components:

  1. Amazon Relational Database Service (Amazon RDS) for PostgreSQL is the main source of data, containing product information stored in an RDS for PostgreSQL database.
  2. Product listing changes information arrives in real time in an Amazon Simple Queue Service (Amazon SQS) queue.
  3. Product information stored in Amazon RDS is ingested in near-real time into the raw layer using the change data capture (CDC) pattern available in AWS Database Migration Service (AWS DMS).
  4. Product listing notifications coming from Amazon SQS are ingested in near-real time into the raw layer using an AWS Lambda function.
  5. The original source data is stored in the Amazon Simple Storage Service (Amazon S3) raw layer bucket using Parquet data format. This layer is the single source of truth for the data lake. The partitioning used on this storage supports the incremental processing of data.
  6. AWS Glue extract, transform, and load (ETL) jobs clean the product data, removing duplicates, and applying data consolidation and generic transformations not tied to a specific business case.
  7. The Amazon S3 stage layer receives prepared data that is stored in Apache Parquet format for further processing. The partitioning used on the stage store supports the incremental processing of data.
  8. The AWS Glue jobs created in this layer use the data available in the Amazon S3 stage layer. This includes application of use case-specific business rules and calculations required. The results data from these jobs are stored in the Amazon S3 analytics layer.
  9. The Amazon S3 analytics layer is used to store the data that is used by the ML models for training purposes. The partitioning used on the curated store is based on the expected data usage. This may be different from the partitioning used on the stage layer.
  10. The repricing ML model is a Scikit-Learn Random Forest implementation in SageMaker Script Mode, which is trained using data available in the S3 bucket (the analytics layer).
  11. An AWS Glue data processing job prepares data for the real-time inference. The job processes data ingested in the S3 bucket (stage layer) and invokes the SageMaker inference endpoint. The data is prepared to be used by the SageMaker repricing model. AWS Glue was preferred to Lambda, because the inference requires different complex data processing operations like joins and window functions on a high volume of data (billions of daily transactions). The result from the repricing model invocations are stored in the S3 bucket (inference layer).
  12. The trained SageMaker model is deployed to a SageMaker endpoint. This endpoint is invoked by the AWS Glue inference processor, generating near-real-time price recommendations to increase product visibility (a minimal invocation sketch follows this list).
  13. The predictions generated by the SageMaker inference endpoint are stored in the Amazon S3 inference layer.
  14. The Lambda predictions optimizer function processes the recommendations generated by the SageMaker inference endpoint and generates a new price recommendation that focuses on maximizing the seller profit, applying a trade-off between sales volume and sales margin.
  15. The price recommendations generated by the Lambda predictions optimizer are submitted to the repricing API, which updates the product price on the marketplace.
  16. The updated price recommendations generated by the Lambda predictions optimizer are stored in the Amazon S3 optimization layer.
  17. The AWS Glue prediction loader job reloads into the source RDS for Postgres SQL database the predictions generated by the ML model for auditing and reporting purposes. AWS Glue Studio was used to implement this component; it’s a graphical interface that makes it easy to create, run, and monitor ETL jobs in AWS Glue.
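
For illustration, invoking the repricing endpoint from the AWS Glue inference processor (or any other client) could look like the following hedged boto3 sketch. The endpoint name and feature values are placeholders; the column names mirror the features described later in this post, and the CSV payload matches the input_fn shown in the training script.

import boto3
import pandas as pd

runtime = boto3.client("sagemaker-runtime")

# Placeholder row with the feature columns the model expects (values are illustrative)
features = pd.DataFrame([{
    "price": 10.0, "competitor_price": 8.0, "external_price": 9.5,
    "fulfillment": 1, "competitor_fulfillment": 0,
    "feedback_count": 120, "competitor_feedback_count": 95,
    "feedback_rating": 4.8, "competitor_feedback_rating": 4.6,
    "availability_type": "NOW", "competitor_availability_type": "NOW",
    "minimum_shipping_hours": 24, "competitor_min_shipping_hours": 48,
    "maximum_shipping_hours": 48, "competitor_max_shipping_hours": 72,
}])

response = runtime.invoke_endpoint(
    EndpointName="repricing-model-endpoint",    # placeholder endpoint name
    ContentType="text/csv",
    Body=features.to_csv(index=False),
)
print(response["Body"].read().decode("utf-8"))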

Data preparation

The dataset for Adspert’s visibility model is created from an SQS queue and ingested into the raw layer of our data lake in real time with Lambda. Afterwards, the raw data is sanitized by performing simple transformations, like removing duplicates. This process is implemented in AWS Glue. The result is stored in the staging layer of our data lake. The notifications provide the competitors for a given product, with their prices, fulfilment channels, shipping times, and many more variables. They also provide a platform-dependent visibility measure, which can be expressed as a Boolean variable (visible or not visible). We receive a notification any time an offer change happens, which adds up to several million events per month over all our customers’ products.

From this dataset, we extract the training data as follows: for every notification, we pair the visible offers with every non-visible offer, and vice versa. Every data point represents a competition between two sellers, in which there is a clear winner and loser. This processing job is implemented in an AWS Glue job with Spark. The prepared training dataset is pushed to the analytics S3 bucket to be used by SageMaker.
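
The following is a hedged pandas sketch of this pairing logic; the real job runs in AWS Glue with Spark, and the column names (notification_id, is_visible) are illustrative.

import pandas as pd

def build_pairs(offers: pd.DataFrame) -> pd.DataFrame:
    """Pair visible with non-visible offers (and vice versa) per notification."""
    visible = offers[offers["is_visible"]].drop(columns=["is_visible"])
    hidden = offers[~offers["is_visible"]].drop(columns=["is_visible"])

    # Visible offer vs. a non-visible competitor: a positive example
    winners = visible.merge(hidden, on="notification_id",
                            suffixes=("", "_competitor"))
    winners["is_visible"] = True

    # Non-visible offer vs. a visible competitor: a negative example
    losers = hidden.merge(visible, on="notification_id",
                          suffixes=("", "_competitor"))
    losers["is_visible"] = False

    return pd.concat([winners, losers], ignore_index=True)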

Train the model

Our model classifies, for each pair of offers, whether a given offer will be visible. This model enables us to calculate the best price for our customers, increase visibility based on competition, and maximize their profit. On top of that, this classification model can give us deeper insights into the reasons why our listings are visible or not visible. We use the following features:

  • Ratio of our price to competitors’ prices
  • Difference in fulfilment channels
  • Amount of feedback for each seller
  • Feedback rating of each seller
  • Difference in minimum shipping times
  • Difference in maximum shipping times
  • Availability of each seller’s product

Adspert uses SageMaker to train and host the model. We use the Scikit-Learn Random Forest implementation in SageMaker Script Mode. We also include some feature preprocessing directly in the Scikit-Learn pipeline in the training script. See the following code:

import numpy as np

def transform_price(X):
    X = X.to_numpy()
    return np.log(
        X[:, 0] / np.nanmin([X[:, 1], X[:, 2]], axis=0),
    ).reshape(-1, 1)

def difference(X):
    X = X.to_numpy()
    return (X[:, 0] - X[:, 1]).reshape(-1, 1)

def fulfillment_difference(X):
    X = X.astype(int)
    return difference(X)

One of the most important preprocessing functions is transform_price, which divides the price by the minimum of the competitor price and an external price column. We’ve found that this feature has a relevant impact on the model accuracy. We also apply the logarithm to let the model decide based on relative price differences, not absolute price differences.
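
For intuition, here is a quick check of transform_price on a toy row (assuming the function defined previously is in scope); the values are made up for illustration.

import pandas as pd

# Our price 10.0, competitor price 8.0, external price 9.5:
# log(10.0 / min(8.0, 9.5)) = log(1.25) ≈ 0.223
sample = pd.DataFrame({
    "price": [10.0],
    "competitor_price": [8.0],
    "external_price": [9.5],
})
print(transform_price(sample))  # [[0.22314355]]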

In the training_script.py script, we first define how to build the Scikit-Learn ColumnTransformer to apply the specified transformers to the columns of a dataframe:

import argparse
import os
from io import StringIO

import joblib
import numpy as np
import pandas as pd
from custom_transformers import difference
from custom_transformers import fulfillment_difference
from custom_transformers import transform_price
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder

def make_preprocessor():
    return ColumnTransformer([
        ('price_by_smallest_cp', FunctionTransformer(transform_price),
         ['price', 'competitor_price', 'external_price']),
        ('fulfillment_difference', FunctionTransformer(fulfillment_difference),
         ['fulfillment', 'competitor_fulfillment']),
        ('feedback_count', 'passthrough',
         ['feedback_count', 'competitor_feedback_count']),
        ('feedback_rating', 'passthrough',
         ['feedback_rating', 'competitor_feedback_rating']),
        (
            'availability_type',
            OneHotEncoder(categories=[['NOW'], ['NOW']],
                          handle_unknown='ignore'),
            ['availability_type', 'competitor_availability_type'],
        ),
        ('min_shipping', FunctionTransformer(difference),
         ['minimum_shipping_hours', 'competitor_min_shipping_hours']),
        ('max_shipping', FunctionTransformer(difference),
         ['maximum_shipping_hours', 'competitor_max_shipping_hours']),
    ], remainder='drop')

In the training script, we load the data from Parquet into a Pandas dataframe, define the pipeline of the ColumnTransformer and the RandomForestClassifier, and train the model. Afterwards, the model is serialized using joblib:

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--output-data-dir', type=str,
                        default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str,
                        default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str,
                        default=os.environ['SM_CHANNEL_TRAIN'])

    args = parser.parse_args()

    # load training data
    input_files = [os.path.join(args.train, file)
                   for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(f'No training files found in {args.train}')
    raw_data = [pd.read_parquet(file) for file in input_files]
    train_data = pd.concat(raw_data)

    # split data set into x and y values
    train_y = train_data.loc[:, 'is_visible']

    if train_y.dtype != 'bool':
        raise ValueError(f"Label 'is_visible' has to be dtype bool but is"
                         f" {train_y.dtype}")

    train_X = train_data.drop('is_visible', axis=1)

    # fit the classifier pipeline and store the fitted model
    clf = Pipeline([
        ('preprocessor', make_preprocessor()),
        ('classifier', RandomForestClassifier(random_state=1)),
    ])
    clf.fit(train_X, train_y)
    joblib.dump(clf, os.path.join(args.model_dir, 'model.joblib'))

In the training script, we also have to implement functions for inference:

  • input_fn – Is responsible for parsing the data from the request body of the payload
  • model_fn – Loads and returns the model that has been dumped in the training section of the script
  • predict_fn – Contains our implementation to request a prediction from the model using the data from the payload
  • predict_proba – To draw predicted visibility curves, predict_fn returns the class probability via the model’s predict_proba method instead of the binary prediction of the classifier

See the following code:

def input_fn(request_body, request_content_type):
    """Parse input data payload"""
    if request_content_type == 'text/csv':
        df = pd.read_csv(StringIO(request_body))
        return df
    else:
        raise ValueError(f'{request_content_type} not supported by script!')


def predict_fn(input_data, model):
    """Predict the visibilities"""
    classes = model.classes_

    if len(classes) != 2:
        raise ValueError('Model must have exactly 2 classes!')

    # get the index of the winning class
    class_index = np.where(model.classes_ == 1)[0][0]

    output = model.predict_proba(input_data)
    return output[:, class_index]


def model_fn(model_dir):
    """Deserialize and return the fitted model.

    Note that the file name has to match the name used when the model was
    serialized in the main method.
    """
    clf = joblib.load(os.path.join(model_dir, 'model.joblib'))
    return clf

The following figure shows the impurity-based feature importances returned by the Random Forest Classifier.

With SageMaker, we were able to train the model on a large amount of data (up to 14 billion daily transactions) without putting load on our existing instances or having to set up a separate machine with enough resources. Moreover, because the instances are immediately shut down after the training job, training with SageMaker was extremely cost-efficient. Deploying the model with SageMaker required no additional work: a single function call in the Python SDK is sufficient to host our model as an inference endpoint, and it can be easily requested from other services using the SageMaker Python SDK as well. See the following code:

from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = 'training_script.py'
output_location = f's3://{bucket}/{folder}/output'
source_dir = 'source_dir'

sklearn = SKLearn(
    entry_point=script_path,
    source_dir=source_dir,
    framework_version=FRAMEWORK_VERSION,
    instance_type='ml.m5.large',
    role=role,
    sagemaker_session=sagemaker_session,
    output_path=output_location)

sklearn.fit({'train': training_path})

The model artefact is stored in Amazon S3 by the fit function. As seen in the following code, the model can be loaded as a SKLearnModel object using the model artefact, script path, and some other parameters. Afterwards, it can be deployed to the desired instance type and number of instances.

model = sagemaker.sklearn.model.SKLearnModel(
    model_data=f'{output_location}/sagemaker-scikit-learn-2021-02-23-11-13-30-036/output/model.tar.gz',
    source_dir=source_dir,
    entry_point=script_path,
    framework_version=FRAMEWORK_VERSION,
    sagemaker_session=sagemaker_session,
    role=role
)
ENDPOINT_NAME = 'visibility-model-v1'
model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.large',
    endpoint_name=ENDPOINT_NAME
)
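
After deployment, any service with AWS credentials can request predictions from the endpoint. The following sketch uses the boto3 SageMaker runtime client with a placeholder prediction dataframe; only a few of the feature columns are shown:

import boto3
import pandas as pd

runtime = boto3.client('sagemaker-runtime')

# hypothetical prediction dataset: our offer against one competitor over a price grid
prediction_df = pd.DataFrame({
    'price': [9.99, 10.49, 10.99],
    'competitor_price': [10.49, 10.49, 10.49],
    'external_price': [10.29, 10.29, 10.29],
    # ... remaining feature columns as used during training
})

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType='text/csv',                   # matches input_fn in training_script.py
    Accept='text/csv',
    Body=prediction_df.to_csv(index=False),   # header row is needed by the ColumnTransformer
)
predicted_visibilities = response['Body'].read().decode('utf-8')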

Evaluate the model in real time

Whenever a new notification is sent for one of our products, we want to calculate and submit the optimal price. To calculate optimal prices, we create a prediction dataset in which we compare our own offer with each competitor’s offer for a range of possible prices. These data points are passed to the SageMaker endpoint, which returns the predicted probability of being visible against each competitor for each given price. We call the probability of being visible the predicted visibility. The result can be visualized as a curve for each competitor, portraying the relationship between our price and the visibility, as shown in the following figure.

In this example, the visibility against Competitor 1 is almost a piecewise constant function, suggesting that we mainly have to decrease the price below a certain threshold, roughly the price of the competitor, to become visible. However, the visibility against Competitor 2 doesn’t decrease as steeply, and we still have a 50% chance of being visible even with a very high price. Analyzing the input data revealed that this competitor has only a small number of ratings, which happen to be very poor. Our model learned that this specific ecommerce platform penalizes sellers with poor feedback ratings. We discovered similar effects for the other features, like fulfilment channel and shipping times.

The necessary data transformations and inferences against the SageMaker endpoint are implemented in AWS Glue. The AWS Glue job works in micro-batches on the real-time data ingested from Lambda.

Calculate optimal prices using predicted visibilities

Finally, we want to calculate the aggregated visibility curve, which is the predicted visibility for each possible price. Our offer is visible if it’s better than all other sellers’ offers. Assuming independence between the probabilities of being visible against each seller given our price, the probability of being visible against all sellers is the product of the respective probabilities. That means the aggregated visibility curve can be calculated by multiplying all curves.
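
Under this independence assumption, the aggregation is just an element-wise product of the per-competitor curves evaluated on a common price grid. The following is a minimal sketch with placeholder values; in production, each curve comes from the SageMaker endpoint:

import numpy as np

# predicted visibility per competitor on a shared price grid (placeholder values)
price_grid = np.linspace(8.0, 15.0, num=50)
visibility_curves = {
    'competitor_1': np.random.uniform(0.0, 1.0, size=price_grid.shape),
    'competitor_2': np.random.uniform(0.0, 1.0, size=price_grid.shape),
}

# probability of being visible against all competitors at each candidate price
aggregated_visibility = np.prod(np.stack(list(visibility_curves.values())), axis=0)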

The following figures show the predicted visibilities returned from the SageMaker endpoint.

The following figure shows the aggregated visibility curve.

To calculate the optimal price, the visibility curve is first smoothed and then multiplied by the margin. To calculate the margin, we use the cost of goods and the fees, which are part of the static product information synced via AWS DMS. Based on the profit function, Adspert calculates the optimal price and submits it to the ecommerce platform through the platform’s API.

This is implemented in the AWS Lambda prediction optimizer.
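
A simplified sketch of this profit calculation might look like the following, reusing price_grid and aggregated_visibility from the aggregation sketch above. The smoothing window, cost, and fee values are placeholder assumptions, not Adspert’s actual implementation:

import numpy as np

def optimal_price(price_grid, aggregated_visibility, cost_of_goods, fees):
    """Return the price that maximizes expected profit = visibility * margin."""
    # smooth the aggregated visibility curve with a simple moving average
    kernel = np.ones(5) / 5
    smoothed_visibility = np.convolve(aggregated_visibility, kernel, mode='same')

    # margin at each candidate price, based on cost of goods and fees
    margin = price_grid - cost_of_goods - fees

    expected_profit = smoothed_visibility * margin
    return price_grid[np.argmax(expected_profit)]

best_price = optimal_price(price_grid, aggregated_visibility, cost_of_goods=6.50, fees=1.20)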

The following figure shows the relation between predicted visibility and price.

The following figure shows the relation between price and profit.

Conclusion

Adspert’s existing approach to profit maximization is focused on bid management to increase the returns from advertising. To achieve superior performance on ecommerce marketplaces, however, sellers have to consider both advertising and competitive pricing of their products. With this new ML model to predict visibility, we can extend our functionality to also adjust our customers’ prices.

The new pricing tool has to be capable of automated training of the ML model on a large amount of data, as well as real-time data transformations, predictions, and price optimizations. In this post, we walked through the main steps of our price optimization engine, and the AWS architecture we implemented in collaboration with the AWS Data Lab to achieve those goals.

Taking ML models from concept to production is typically complex and time-consuming. You have to manage large amounts of data to train the model, choose the best algorithm for training it, manage the compute capacity while training it, and then deploy the model into a production environment. SageMaker reduced this complexity by making it much more straightforward to build and deploy the ML model. After we chose the right algorithms and frameworks from the wide range of choices available, SageMaker managed all of the underlying infrastructure to train our model and deploy it to production.

If you’d like to start familiarizing yourself with SageMaker, the Immersion Day workshop can help you gain an end-to-end understanding of how to build ML use cases, from feature engineering and the various built-in algorithms to training, tuning, and deploying an ML model in a production-like scenario. It guides you to bring your own model and perform an on-premises ML workload lift-and-shift to the SageMaker platform. It further demonstrates advanced concepts like model debugging, model monitoring, and AutoML, and helps you evaluate your ML workload through the AWS ML Well-Architected lens.

If you’d like help accelerating the implementation of use cases that involve data, analytics, AI and ML, serverless, and container modernization, please contact the AWS Data Lab.


About the authors

Viktor Enrico Jeney is a Senior Machine Learning Engineer at Adspert based in Berlin, Germany. He creates solutions for prediction and optimization problems in order to increase customers’ profits. Viktor has a background in applied mathematics and loves working with data. In his free time, he enjoys learning Hungarian, practicing martial arts, and playing the guitar.

Ennio Pastore is a data architect on the AWS Data Lab team. He is an enthusiast of everything related to new technologies that have a positive impact on businesses and general livelihood. Ennio has over 9 years of experience in data analytics. He helps companies define and implement data platforms across industries, such as telecommunications, banking, gaming, retail, and insurance.


Amazon Comprehend announces lower annotation limits for custom entity recognition

Amazon Comprehend is a natural-language processing (NLP) service you can use to automatically extract entities, key phrases, language, sentiments, and other insights from documents. For example, you can immediately start detecting entities such as people, places, commercial items, dates, and quantities via the Amazon Comprehend console, AWS Command Line Interface (AWS CLI), or Amazon Comprehend APIs. In addition, if you need to extract entities that aren’t part of the Amazon Comprehend built-in entity types, you can create a custom entity recognition model (also known as a custom entity recognizer) to extract terms that are more relevant for your specific use case, like names of items from a catalog of products, domain-specific identifiers, and so on. Creating an accurate entity recognizer on your own using machine learning libraries and frameworks can be a complex and time-consuming process. Amazon Comprehend simplifies your model training work significantly. All you need to do is load your dataset of documents and annotations, and use the Amazon Comprehend console, AWS CLI, or APIs to create the model.

To train a custom entity recognizer, you can provide training data to Amazon Comprehend as annotations or entity lists. In the first case, you provide a collection of documents and a file with annotations that specify the location where entities occur within the set of documents. Alternatively, with entity lists, you provide a list of entities with their corresponding entity type label, and a set of unannotated documents in which you expect your entities to be present. Both approaches can be used to train a successful custom entity recognition model; however, there are situations in which one method may be a better choice. For example, when the meaning of specific entities could be ambiguous and context-dependent, providing annotations is recommended because this might help you create an Amazon Comprehend model that is capable of better using context when extracting entities.

Annotating documents can require quite a lot of effort and time, especially if you consider that both the quality and quantity of annotations have an impact on the resulting entity recognition model. Imprecise or too few annotations can lead to poor results. To help you set up a process for acquiring annotations, we provide tools such as Amazon SageMaker Ground Truth, which you can use to annotate your documents more quickly and generate an augmented manifest annotations file. However, even if you use Ground Truth, you still need to make sure that your training dataset is large enough to successfully build your entity recognizer.

Until today, to start training an Amazon Comprehend custom entity recognizer, you had to provide a collection of at least 250 documents and a minimum of 100 annotations per entity type. Today, we’re announcing that, thanks to recent improvements in the models underlying Amazon Comprehend, we’ve reduced the minimum requirements for training a recognizer with plain text CSV annotation files. You can now build a custom entity recognition model with as few as three documents and 25 annotations per entity type. You can find further details about new service limits in Guidelines and quotas.

To showcase how this reduction can help you get started with the creation of a custom entity recognizer, we ran some tests on a few open-source datasets and collected performance metrics. In this post, we walk you through the benchmarking process and the results we obtained while working on subsampled datasets.

Dataset preparation

In this post, we explain how we trained an Amazon Comprehend custom entity recognizer using annotated documents. In general, annotations can be provided as a CSV file, an augmented manifest file generated by Ground Truth, or a PDF file. Our focus is on CSV plain text annotations, because this is the type of annotation impacted by the new minimum requirements. CSV files should have the following structure:

File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 0, 13, ENTITY_TYPE_1
documents.txt, 1, 0, 7, ENTITY_TYPE_2

The relevant fields are as follows:

  • File – The name of the file containing the documents
  • Line – The number of the line containing the entity, starting with line 0
  • Begin Offset – The character offset in the input text (relative to the beginning of the line) that shows where the entity begins, considering that the first character is at position 0
  • End Offset – The character offset in the input text that shows where the entity ends
  • Type – The name of the entity type you want to define

Additionally, when using this approach, you have to provide a collection of training documents as .txt files with one document per line, or one document per file.
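
After the documents and annotations files are uploaded to Amazon S3, the recognizer can be created programmatically. The following boto3 sketch uses placeholder bucket, role, and entity type names:

import boto3

comprehend = boto3.client('comprehend')

response = comprehend.create_entity_recognizer(
    RecognizerName='my-custom-recognizer',  # placeholder
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendDataAccessRole',  # placeholder
    InputDataConfig={
        'EntityTypes': [{'Type': 'ENTITY_TYPE_1'}, {'Type': 'ENTITY_TYPE_2'}],
        'Documents': {'S3Uri': 's3://my-bucket/train/documents.txt'},
        'Annotations': {'S3Uri': 's3://my-bucket/train/annotations.csv'},
    },
)
recognizer_arn = response['EntityRecognizerArn']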

For our tests, we used the SNIPS Natural Language Understanding benchmark, a dataset of crowdsourced utterances distributed among seven user intents (AddToPlaylist, BookRestaurant, GetWeather, PlayMusic, RateBook, SearchCreativeWork, SearchScreeningEvent). The dataset was published in 2018 in the context of the paper Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces by Coucke et al.

The SNIPS dataset is made of a collection of JSON files condensing both annotations and raw text files. The following is a snippet from the dataset:

{
   "annotations":{
      "named_entity":[
         {
            "start":16,
            "end":36,
            "extent":"within the same area",
            "tag":"spatial_relation"
         },
         {
            "start":40,
            "end":51,
            "extent":"Lawrence St",
            "tag":"poi"
         },
         {
            "start":67,
            "end":70,
            "extent":"one",
            "tag":"party_size_number"
         }
      ],
      "intent":"BookRestaurant"
   },
   "raw_text":"I'd like to eat within the same area of Lawrence St for a party of one"
}

Before creating our entity recognizer, we transformed the SNIPS annotations and raw text files into a CSV annotations file and a .txt documents file.

The following is an excerpt from our annotations.csv file:

File, Line, Begin Offset, End Offset, Type
documents.txt, 0, 16, 36, spatial_relation
documents.txt, 0, 40, 51, poi
documents.txt, 0, 67, 70, party_size_number

The following is an excerpt from our documents.txt file:

I'd like to eat within the same area of Lawrence St for a party of one
Please book me a table for three at an american gastropub 
I would like to book a restaurant in Niagara Falls for 8 on June nineteenth
Can you book a table for a party of 6 close to DeKalb Av
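
The conversion itself is straightforward. The following sketch assumes the SNIPS JSON records have already been loaded into a list called snips_records:

import csv

# snips_records: list of dicts with 'annotations' and 'raw_text' keys,
# loaded from the SNIPS JSON files
with open('documents.txt', 'w') as docs_file, \
     open('annotations.csv', 'w', newline='') as annotations_file:
    writer = csv.writer(annotations_file)
    writer.writerow(['File', 'Line', 'Begin Offset', 'End Offset', 'Type'])

    for line_number, record in enumerate(snips_records):
        # one document (utterance) per line
        docs_file.write(record['raw_text'] + '\n')
        for entity in record['annotations']['named_entity']:
            writer.writerow([
                'documents.txt',
                line_number,
                entity['start'],
                entity['end'],
                entity['tag'],
            ])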

Sampling configuration and benchmarking process

For our experiments, we focused on a subset of entity types from the SNIPS dataset:

  • BookRestaurant – Entity types: spatial_relation, poi, party_size_number, restaurant_name, city, timeRange, restaurant_type, served_dish, party_size_description, country, facility, state, sort, cuisine
  • GetWeather – Entity types: condition_temperature, current_location, geographic_poi, timeRange, state, spatial_relation, condition_description, city, country
  • PlayMusic – Entity types: track, artist, music_item, service, genre, sort, playlist, album, year

Moreover, we subsampled each dataset to obtain different configurations in terms of the number of documents sampled for training and the number of annotations per entity type (also known as shots). This was done using a custom script designed to create subsampled datasets in which each entity type appears at least k times, within a minimum of n documents.
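
The following is a simplified illustration of such a subsampling routine; it is not the actual script, and the names and greedy selection strategy are our own simplifications:

import random
from collections import Counter

def subsample(documents, annotations_by_doc, entity_types, k, n, seed=42):
    """Greedily select shuffled documents until every entity type occurs at
    least k times and at least n documents have been selected.
    `annotations_by_doc[i]` lists the entity types annotated in documents[i]."""
    random.seed(seed)
    order = list(range(len(documents)))
    random.shuffle(order)

    selected, counts = [], Counter()
    for index in order:
        if len(selected) >= n and all(counts[t] >= k for t in entity_types):
            break
        selected.append(index)
        counts.update(annotations_by_doc[index])
    return [documents[i] for i in selected]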

Each model was trained using a specific subsample of the training datasets; the nine model configurations are illustrated in the following table.

Subsampled dataset name            Documents (training)   Documents (testing)   Avg. annotations per entity type (shots)
snips-BookRestaurant-subsample-A   132                    17                    33
snips-BookRestaurant-subsample-B   257                    33                    64
snips-BookRestaurant-subsample-C   508                    64                    128
snips-GetWeather-subsample-A       91                     12                    25
snips-GetWeather-subsample-B       185                    24                    49
snips-GetWeather-subsample-C       361                    46                    95
snips-PlayMusic-subsample-A        130                    17                    30
snips-PlayMusic-subsample-B        254                    32                    60
snips-PlayMusic-subsample-C        505                    64                    119

To measure the accuracy of our models, we collected evaluation metrics that Amazon Comprehend automatically computes when training an entity recognizer:

  • Precision – This indicates the fraction of entities detected by the recognizer that are correctly identified and labeled. From a different perspective, precision can be defined as tp / (tp + fp), where tp is the number of true positives (correct identifications) and fp is the number of false positives (incorrect identifications).
  • Recall – This indicates the fraction of entities present in the documents that are correctly identified and labeled. It’s calculated as tp / (tp + fn), where tp is the number of true positives and fn is the number of false negatives (missed identifications).
  • F1 score – This is a combination of the precision and recall metrics, which measures the overall accuracy of the model. The F1 score is the harmonic mean of the precision and recall metrics, and is calculated as 2 * Precision * Recall / (Precision + Recall).

For comparing performance of our entity recognizers, we focus on F1 scores.

Because a given dataset and subsample size (in terms of number of documents and shots) can yield many different subsamples, we generated 10 subsamples for each of the nine configurations, trained the entity recognition models, collected performance metrics, and averaged them using micro-averaging. This allowed us to get more stable results, especially for the few-shot subsamples.
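
In general, micro-averaging pools the underlying true positive, false positive, and false negative counts before computing the score, instead of averaging the individual F1 scores. A small sketch of that general idea:

def micro_averaged_f1(confusion_counts):
    """Micro-averaged F1 from a list of (tp, fp, fn) tuples,
    e.g. one tuple per evaluation run or per entity type."""
    tp = sum(c[0] for c in confusion_counts)
    fp = sum(c[1] for c in confusion_counts)
    fn = sum(c[2] for c in confusion_counts)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)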

Results

The following table shows the micro-averaged F1 scores computed on performance metrics returned by Amazon Comprehend after training each entity recognizer.

Subsampled dataset name            Entity recognizer micro-averaged F1 score (%)
snips-BookRestaurant-subsample-A   86.89
snips-BookRestaurant-subsample-B   90.18
snips-BookRestaurant-subsample-C   92.84
snips-GetWeather-subsample-A       84.73
snips-GetWeather-subsample-B       93.27
snips-GetWeather-subsample-C       93.43
snips-PlayMusic-subsample-A        80.61
snips-PlayMusic-subsample-B        81.80
snips-PlayMusic-subsample-C        85.04

The following column chart shows the distribution of F1 scores for the nine configurations we trained as described in the previous section.


We can observe that we were able to successfully train custom entity recognition models even with as few as 25 annotations per entity type. If we focus on the three smallest subsampled datasets (snips-BookRestaurant-subsample-A, snips-GetWeather-subsample-A, and snips-PlayMusic-subsample-A), we see that, on average, we achieved an F1 score of 84%, which is a pretty good result considering the limited number of documents and annotations we used. If we want to improve the performance of our model, we can collect additional documents and annotations and train a new model with more data. For example, with the medium-sized subsamples (snips-BookRestaurant-subsample-B, snips-GetWeather-subsample-B, and snips-PlayMusic-subsample-B), which contain twice as many documents and annotations, we obtained an average F1 score of 88% (a 5% improvement with respect to the subsample-A datasets). Finally, the larger subsampled datasets (snips-BookRestaurant-subsample-C, snips-GetWeather-subsample-C, and snips-PlayMusic-subsample-C), which contain even more annotated data (approximately four times the number of documents and annotations used for the subsample-A datasets), provided a further 2% improvement, raising the average F1 score to 90%.

Conclusion

In this post, we announced a reduction of the minimum requirements for training a custom entity recognizer with Amazon Comprehend, and ran some benchmarks on open-source datasets to show how this reduction can help you get started. Starting today, you can create an entity recognition model with as few as 25 annotations per entity type (instead of 100), and at least three documents (instead of 250). With this announcement, we’re lowering the barrier to entry for users interested in using Amazon Comprehend custom entity recognition technology. You can now start running your experiments with a very small collection of annotated documents, analyze preliminary results, and iterate by including additional annotations and documents if you need a more accurate entity recognition model for your use case.

To learn more and get started with a custom entity recognizer, refer to Custom entity recognition.

Special thanks to my colleagues Jyoti Bansal and Jie Ma for their precious help with data preparation and benchmarking.


About the author

Luca Guida is a Solutions Architect at AWS; he is based in Milan and supports Italian ISVs in their cloud journey. With an academic background in computer science and engineering, he started developing his AI/ML passion at university. As a member of the natural language processing (NLP) community within AWS, Luca helps customers be successful while adopting AI/ML services.
