Build MLOps workflows with Amazon SageMaker projects, GitLab, and GitLab pipelines

Machine learning operations (MLOps) is key to effectively transitioning from an experimentation phase to production. The practice provides you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects build on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Prerequisites

For this walkthrough, you should have the following prerequisites:

This post provides a detailed explanation of the SageMaker projects, GitLab, and GitLab pipelines integration. We review the code and discuss the components of the solution. To deploy the solution, reference the GitHub repo, which provides step-by-step instructions for implementing an MLOps workflow using a SageMaker project template with GitLab and GitLab pipelines.

Solution overview

The following diagram shows the architecture we build using a custom SageMaker project template.

Let’s review the components of this architecture to understand the end-to-end setup:

  • GitLab – Acts as our code repository and enables CI/CD using GitLab pipelines. The custom SageMaker project template creates two repositories (model build and model deploy) in your GitLab account.
    • The first repository (model build) provides code to create a multi-step model building pipeline. This includes steps for data processing, model training, model evaluation, and conditional model registration based on accuracy. It trains a regression model using the XGBoost algorithm on the well-known UCI Machine Learning Abalone dataset.
    • The second repository (model deploy) contains the code and configuration files for model deployment, as well as the test scripts required to pass the quality benchmark. These are code stubs that must be defined for your use case.
    • Each repository also has a GitLab CI pipeline. The model build pipeline automatically triggers and runs the pipeline from end to end whenever a new commit is made to the model build repository. The model deploy pipeline is triggered whenever a new model version is added to the model registry, and the model version status is marked as Approved.
  • SageMaker Pipelines – Contains the directed acyclic graph (DAG) that includes data preparation, model training, and model evaluation.
  • Amazon S3 – An Amazon Simple Storage Service (Amazon S3) bucket stores the output model artifacts that are generated from the pipeline.
  • AWS Lambda – Two AWS Lambda functions are created, which we review in more detail later in this post:
    • One function seeds the code into your two GitLab repositories.
    • One function triggers the model deployment pipeline after the new model is registered in the model registry.
  • SageMaker Model Registry – Tracks the model versions and respective artifacts, including the lineage and metadata. A model package group is created that contains the group of related model versions. The model registry also manages the approval status of the model version for downstream deployment.
  • Amazon EventBridge – Monitors all changes to the model registry. It contains a rule that triggers the Lambda function for the model deploy pipeline when the model package version state changes from PendingManualApproval to Approved in the model registry.
  • AWS CloudFormation – Deploys the model and creates the SageMaker endpoints when the model deploy pipeline is triggered by the approval of the trained model.
  • SageMaker hosting – Creates two HTTPS real-time endpoints to perform inference. The hosting option is configurable, for example, for batch transform or asynchronous inference. The staging endpoint is created when the model deploy pipeline is triggered by the approval of the trained model. This endpoint is used to evaluate the deployed model by confirming it’s generating predictions that meet our target accuracy requirements. When the model is ready to be deployed in production, a production endpoint is provisioned by manually starting the job in the GitLab model deploy pipeline.

Use the new MLOps project template with GitLab and GitLab pipelines

In this section, we review the parameters required for the MLOps project template (see the following screenshot). This template allows you to utilize GitLab pipelines as your orchestrator.

The template has the following parameters:

  • GitLab Server URL – The URL of the GitLab server in https:// format. GitLab accounts under your organization may use a different, customized server URL (domain). The server URL is required to authorize access to the python-gitlab API. You use the personal access token you created to grant the Lambda functions permission to push the seed code into your GitLab repositories. We discuss the Lambda function code in more detail in the next section.
  • Base URL for your GitLab Repositories – The URL for your GitLab account to create the model build and deploy repositories, in the format https://<gitlab server>/<username> or https://<gitlab server>/<group>/<project>. You must create a personal access token under your GitLab user account to authenticate with the GitLab API.
  • Model Build Repository Name – The name of the repository (mlops-gitlab-project-seedcode-model-build) that contains the model build and training seed code.
  • Model Deploy Repository Name – The name of the repository (mlops-gitlab-project-seedcode-model-deploy) that contains the model deploy seed code.
  • GitLab Group ID – GitLab groups are important for managing access and permissions for projects. Enter the ID of the group that repositories are created for. In this example, we enter None, because we’re using the root group.
  • GitLab Secret Name (Secrets Manager) – The secret in AWS Secrets Manager contains the value of the GitLab personal access token that is used by the Lambda function to populate the seed code in the repositories. Enter the name of the secret you created in Secrets Manager.

Lambda functions code overview

As discussed earlier, we create two Lambda functions. The first function seeds the code into your GitLab repositories. The second function triggers your model deployment. Let’s review these functions in more detail.

Seedcodecheckin Lambda function

This function helps create the GitLab projects and repositories and pushes the code files into these repositories. These files are needed to set up the ML CI/CD pipelines.

The Secrets Manager secret is created to allow the function to retrieve the stored GitLab personal access token. This token allows the function to communicate with GitLab to create repositories and push the seed code. It also allows the environment variables to be passed in through the project.yml file. See the following code:

import os

import boto3

def get_secret():
    '''Retrieve the GitLab personal access token from Secrets Manager.'''
    secret_name = os.environ['SecretName']
    region_name = os.environ['Region']

    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    # Retrieve and return the stored token (the full function in the repo
    # may process the secret value further)
    secret_value = client.get_secret_value(SecretId=secret_name)
    return secret_value['SecretString']

The Secrets Manager secret was created when you ran the init.sh file earlier as part of the code repo prerequisites.

The deployment package for the function contains several libraries, including python-gitlab and cfn-response. Because our function’s source code is packaged as a .zip file and interacts with AWS CloudFormation, we use cfn-response. We use the python-gitlab API and the AWS SDK for Python (Boto3) to download the seed code files from Amazon S3 so they can be pushed to our GitLab repositories. See the following code:

    # Configure SDKs for GitLab and S3
    gl = gitlab.Gitlab(gitlab_server_uri, private_token=gitlab_private_token)
    s3 = boto3.client('s3')
 
    model_build_filename = f'/tmp/{str(uuid.uuid4())}-model-build-seed-code.zip'
    model_deploy_filename = f'/tmp/{str(uuid.uuid4())}-model-deploy-seed-code.zip'
    model_build_directory = f'/tmp/{str(uuid.uuid4())}-model-build'
    model_deploy_directory = f'/tmp/{str(uuid.uuid4())}-model-deploy'

    # Get Model Build Seed Code from S3 for Gitlab Repo
    with open(model_build_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_build_sm_seed_code_object_name, f)

    # Get Model Deploy Seed Code from S3 for Gitlab Repo
    with open(model_deploy_filename, 'wb') as f:
        s3.download_fileobj(sm_seed_code_bucket, model_deploy_sm_seed_code_object_name, f)

Two projects (repositories) are created in GitLab, and the seed code files are pushed into the repositories (model build and model deploy) using the python-gitlab API:

# Create the GitLab Project
    try:
        if group_id is None:
            build_project = gl.projects.create({'name': gitlab_project_name_build})
        else:
            build_project = gl.projects.create({'name': gitlab_project_name_build, 'namespace_id': int(group_id)})
    ....
    try:
        if group_id is None:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy})
        else:
            deploy_project = gl.projects.create({'name': gitlab_project_name_deploy, 'namespace_id': int(group_id)})
    ....
    
    # Commit to the above created Repo all the files that were in the seed code Zip
    try:
        build_project.commits.create(build_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model build repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

    try:
        deploy_project.commits.create(deploy_data)
    except Exception as e:
        logging.error("Code could not be pushed to the model deploy repo.")
        logging.error(e)
        cfnresponse.send(event, context, cfnresponse.FAILED, response_data)
        return { 
            'message' : "GitLab seedcode checkin failed."
        }

The following screenshot shows the successful run of the Lambda function pushing the required seed code files into both projects in your GitLab account.

gitlab-trigger Lambda function

This Lambda function is triggered by EventBridge. The project.yml CloudFormation template contains an EventBridge rule that triggers the function when the model package state changes in the SageMaker model registry. See the following code:

ModelDeploySageMakerEventRule:
    Type: AWS::Events::Rule
    Properties:
      # Max length allowed: 64
      Name: !Sub sagemaker-${SageMakerProjectName}-${SageMakerProjectId}-event-rule # max: 10+33+15+5=63 chars
      Description: "Rule to trigger a deployment when SageMaker Model registry is updated with a new model package. For example, a new model package is registered with Registry"
      EventPattern:
        source:
          - "aws.sagemaker"
        detail-type:
          - "SageMaker Model Package State Change"
        detail:
          ModelPackageGroupName:
            - !Sub ${SageMakerProjectName}-${SageMakerProjectId}
      State: "ENABLED"
      Targets:
        -
          Arn: !GetAtt GitLabPipelineTriggerLambda.Arn
          Id: !Sub sagemaker-${SageMakerProjectName}-trigger

The following screenshot contains a subset of the function code that triggers the GitLab pipeline defined in the .gitlab-ci.yml file of your model deploy repository. That pipeline deploys the SageMaker model endpoints using the CloudFormation template endpoint-config-template.yml.

To better understand the solution, review the entire code for the functions as needed.
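
For orientation, a simplified sketch of such a trigger function, written with the python-gitlab API, might look like the following. The environment variable names, project lookup, and branch are illustrative rather than the exact values used in the template:

import os

import gitlab

def lambda_handler(event, context):
    """Start the model deploy GitLab pipeline when a model version is approved."""
    gl = gitlab.Gitlab(
        os.environ['GitLabServer'],                 # illustrative variable names
        private_token=os.environ['GitLabToken'],
    )

    # Look up the model deploy project and start its CI pipeline on the default branch
    project = gl.projects.get(os.environ['DeployProjectId'])
    pipeline = project.pipelines.create({'ref': 'main'})

    return {'pipeline_id': pipeline.id, 'status': pipeline.status}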

GitLab and GitLab pipelines overview

As described earlier, GitLab plays a key role in this solution as the source code repository and as the enabler of CI/CD pipelines. Let’s look into our GitLab account to understand the components.

After the project is successfully created using our custom template in SageMaker projects (following the steps in the code repo), navigate to your GitLab account to see two new repositories. Each repository has a GitLab CI pipeline associated with it that runs as soon as the project is created.

The first run of each pipeline fails because GitLab doesn’t have the AWS credentials. For each repository, navigate to Settings, CI/CD, Variables. Create two new variables, AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, with the AWS credentials that GitLab uses to deploy on your behalf.

Model build pipeline in GitLab

Let’s review the GitLab pipelines, starting with the model build pipeline. We define the pipelines in GitLab by creating the .gitlab-ci.yml file, where we define the various stages and related jobs. As shown in the following screenshot, this pipeline has only one stage (training) and the related script shows how a SageMaker pipeline file is triggered. (You can learn more about the SageMaker pipeline by exploring the pipeline.py file on GitHub.)

When this GitLab pipeline is triggered, it starts the Abalone SageMaker pipeline to build your model.
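
For illustration, the training job in the model build pipeline typically runs a short Python script that upserts and starts the SageMaker pipeline. The following is a minimal sketch using the SageMaker Python SDK; the get_pipeline function lives in the seed code's pipelines folder, and the role ARN, pipeline name, and argument names shown here are illustrative and may differ from the seed code:

import sagemaker

# get_pipeline is defined in the seed code (pipelines/abalone/pipeline.py)
from pipelines.abalone.pipeline import get_pipeline

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/SageMakerPipelineExecutionRole"  # illustrative

pipeline = get_pipeline(
    region=session.boto_region_name,
    role=role,
    default_bucket=session.default_bucket(),
    pipeline_name="gitlab-mlops-abalone",  # illustrative name
)

# Create or update the pipeline definition, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()
execution.wait()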

When the model build is complete, you can locate this model in the model registry in SageMaker Studio.

Use this template for your custom use case

The model build repository contains code for preprocessing, training, and evaluating the model for the UCI Abalone dataset. You need to modify the files to address your custom use case.

  1. Navigate to the pipelines folder in your model build repository.

  2. Upload your dataset to an S3 bucket. Replace the bucket URL in the input data section of your pipeline.py file (see the sketch after this list).

  3. Navigate to .gitlab-ci.yml and update it with the folder and file names for your use case.
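
For example, the default input data location in the seed code is typically defined as a pipeline parameter in pipeline.py. A minimal sketch of what to change is shown below; the parameter name follows the seed code convention, and the bucket and key are placeholders for your own dataset:

from sagemaker.workflow.parameters import ParameterString

# Point this default at your own dataset in Amazon S3 (placeholder values shown)
input_data = ParameterString(
    name="InputDataUrl",
    default_value="s3://your-bucket/your-prefix/your-dataset.csv",
)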

Model deployment pipeline in GitLab

When the SageMaker pipeline that trains the model is complete, a model is added to the SageMaker model registry. If that model is approved, the GitLab pipeline in the model deploy repository starts and the model deployment process begins.

To approve the model in the model registry, complete the following steps:

  1. Choose the Components and registries icon.
  2. Choose Model registry, and choose (right-click) the model version.
  3. Choose Update model version status.
  4. Change the status from Pending to Approved.

This triggers the deploy pipeline.
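
You can also approve a model version programmatically instead of using the Studio UI. The following is a minimal sketch using Boto3; the model package ARN is a placeholder for the version you want to approve:

import boto3

sm_client = boto3.client("sagemaker")

# Approving the model version triggers the EventBridge rule, which in turn
# invokes the Lambda function that starts the GitLab model deploy pipeline
sm_client.update_model_package(
    ModelPackageArn="arn:aws:sagemaker:<region>:<account-id>:model-package/<group-name>/1",  # placeholder
    ModelApprovalStatus="Approved",
)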

Now, let’s review the .gitlab-ci.yml file in the model deploy repository. As shown in the following screenshot, this model deploy pipeline has four stages: build, staging deploy, test staging, and production deploy. This pipeline uses AWS CloudFormation to deploy the model and create the SageMaker endpoints.

The GitLab pipeline includes a manual step for promoting the model from staging to production, which creates an endpoint with the suffix -prod. When you start this manual job, it runs and, upon completion, deploys the production SageMaker endpoint.

To verify that the endpoints were created, navigate to the Endpoints page on the SageMaker console. You should see two endpoints: <model_name>-staging and <model_name>-prod.

GitLab implementation patterns

In this section, we discuss two patterns for implementing GitLab: hosting with Amazon Virtual Private Cloud (Amazon VPC), or with two-factor authentication.

Hosting GitLab in an Amazon VPC

You may choose to deploy GitLab in an Amazon VPC to use a private network and provide access to AWS resources. In this scenario, the Lambda functions also must be deployed in a VPC to access the GitLab API. We accomplish this by updating the project.yml file and the AWS Identity and Access Management (IAM) role AmazonSageMakerServiceCatalogProductsUseRole.

The IAM user that you used to create the VPC requires the following user permissions for Lambda to verify network resources:

  • ec2:DescribeSecurityGroups
  • ec2:DescribeSubnets
  • ec2:DescribeVpcs

The Lambda functions’ execution role requires the following permissions to create and manage network interfaces:

  • ec2:CreateNetworkInterface
  • ec2:DescribeNetworkInterfaces
  • ec2:DeleteNetworkInterface

These permissions are included in the AWSLambdaVPCAccessExecutionRole managed policy, which you attach to the role:

  1. On the IAM console, search for AmazonSageMakerServiceCatalogProductsUseRole.
  2. Choose Attach policies.
  3. Search for the AWSLambdaVPCAccessExecutionRole managed policy.
  4. Choose Attach policy.

Next, we update project.yml to configure the functions to deploy in a VPC by providing the VPC security groups and subnets.

    1. Add the subnet IDs and security group IDs to the Parameters section, for example:
      SubnetId1:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SubnetId2:
        Type: AWS::EC2::Subnet::Id
        Description: Subnet Id for Lambda function

      SecurityGroupId:
        Type: AWS::EC2::SecurityGroup::Id
        Description: Security Group Id for Lambda function to Execute
      

    2. Add the VpcConfig information under Properties for the GitSeedCodeCheckinLambda and GitLabPipelineTriggerLambda functions, for example:
      GitSeedCodeCheckinLambda:
        Type: 'AWS::Lambda::Function'
        Properties:
          Description: To trigger the codebuild project for the seedcode checkin
          .....
          VpcConfig:
            SecurityGroupIds:
              - !Ref SecurityGroupId
            SubnetIds:
              - !Ref SubnetId1
              - !Ref SubnetId2
      

Two-factor authentication enabled

If you enabled two-factor authentication on your GitLab account, you need to use your personal access token to clone the repositories in SageMaker Studio. The token requires the read_repository and write_repository scopes. To clone the model build and model deploy repositories, enter the following commands:

git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-build-<project-id>
git clone https://oauth2:PERSONAL_ACCESS_TOKEN@gitlab.com/username/gitlab-project-seedcode-model-deploy-<project-id>

Because you previously created a secret for your personal access token, no changes are required to the code when two-factor authentication is enabled.

Summary

In this post, we walked through using a custom SageMaker MLOps project template to automatically build and configure a CI/CD pipeline. This pipeline incorporated your existing CI/CD tooling with SageMaker features for data preparation, model training, model evaluation, and model deployment. In our use case, we focused on using GitLab and GitLab pipelines with SageMaker projects and pipelines. For more detailed implementation information, review the GitHub repo. Try it out and let us know if you have any questions in the comments section!


About the Authors

Kirit Thadaka is an ML Solutions Architect working in the Amazon SageMaker Service SA team. Prior to joining AWS, Kirit spent time working in early stage AI startups followed by some time in consulting in various roles in AI research, MLOps, and technical leadership.

Lauren Mullennex is a Solutions Architect based in Denver, CO. She works with customers to help them architect solutions on AWS. In her spare time, she enjoys hiking and cooking Hawaiian cuisine.

Indrajit Ghosalkar is a Sr. Solutions Architect at Amazon Web Services based in Singapore. He loves helping customers achieve their business outcomes through cloud adoption and realize their data analytics and ML goals through adoption of DataOps / MLOps practices and solutions. In his spare time, he enjoys playing with his son, traveling and meeting new people.


Simplified MLOps with Deep Java Library

This is a guest post by Lucas Baker, Andrea Duque, and Viet Yen Nguyen of Hypefactors.  

At Hypefactors, we build tech for media intelligence and reputation management. The solution is a software as a service (SaaS) product that does large-scale media monitoring of social media, news sites, TV, radio, and reviews across the world. The tracked data is streamed continuously and enriched in real time. This yields insights that can reveal early business opportunities (for example, GameStop hype), track the success of product launches, and preempt disasters.

To this end, over a hundred million network requests are made daily from data pipelines for web crawling, social media firehoses, and other REST-based media data integrations. This yields millions of new articles and posts each day. This data can be segmented into three classes (as illustrated with the following examples):

  • Owned – Articles or posts written by a company and published on their own website or social media feed.
  • Paid – Information written by a company and published on third-party websites or social media. This is known colloquially as advertisement.
  • Earned – Information written by a third party and published on that party’s website or social media.
(Example images: owned media, earned media, and paid media.)

Differentiating between earned articles and owned or paid ones is of existential importance. Earned information is more independent and therefore interpreted as more trustworthy—no matter if it’s positive or negative for the company. Advertisement, on the other hand, is written by the company and portrays the best interests of the company. Therefore, to accurately track reputation, we must filter out advertisements.

This post goes deeper into our deep learning natural language processing (NLP) based advertisement predictor, how we integrated the predictor into one of our pipelines using Deep Java Library (DJL), and how that change made our architecture simpler and MLOps easier. DJL is an open source Java framework for deep learning created by AWS.

Printed newspapers and magazines: Challenges

We receive thousands of different magazines and newspapers directly from publishing houses in the form of digital files. One of the data teams within Hypefactors has developed a data pipeline, which we call the Print-ETL. The Print-ETL processes the raw data and ingests it into a database. The ingested data is made searchable in a user-friendly way by the Hypefactors web platform.

Processing and realigning data from different data providers is generally challenging. This is also the case with handling different publishing houses as data providers. The challenges are technical, organizational, and a combination thereof. That is partly because media houses rely on legacy approaches in both their data delivery and their data formats.

Organizational challenges include disagreement between different media houses on how media data should be delivered, and the lack of a common schema. A common strategy media houses use is to provide print data via an SFTP server. This can be consumed by periodically connecting and fetching the data. Most of the time we retrieve only the digital PDF files of the editions, but they can also arrive in other formats, such as XML or ZIP. On top of that, files often come with no relevant metadata about the publication. Such metadata is useful, for example, to identify the title of the newspaper or the magazine.

The technical challenges are various. However, when it comes to PDFs, one of the biggest challenges is that a PDF may or may not be vectorized. A vectorized PDF, as opposed to a bitmapped one, is one that contains all the raw data that appears on the page. When a PDF is vectorized, it’s easy to retrieve its text. But when it’s not, all we have are bitmapped images. To make articles searchable for users, the content of a bitmapped PDF needs to be transformed to a text format using optical character recognition (OCR) solutions.

Another big challenge is that PDFs can have any number of pages. Typically, there is no information telling us which pages constitute an article. There can be several articles sharing one PDF page, or several PDF pages containing a single article. Advertisements also appear anywhere—they can cover the whole page, several pages, or just a small section close to an article.

To mitigate these difficulties, we developed elaborate development and operations procedures. These are assisted by automated procedures, such as automated unit and end-to-end testing, as well as automated testing, staging, and production rollouts. Operations therefore play an essential role to keep the overall solution running.

Print-ETL architecture

The data pipeline processes events, in which each event contains a file retrieved from a media house. These events are processed in a distributed and concurrent manner by subscribing to a message topic. We use Monix, a Scala library for asynchronous computation, to process the events with high performance. Ideally, we process data as soon as it arrives, but we don’t have control over when data is released. Therefore, we have periodic peak loads of these events. At other times, there are no events at all. The whole system is deployed in the cloud to make use of its elasticity. Cloud instances are auto scaled proportionally to the number of events received, so naturally the more data we receive, the more resources we use to process that data.

The Print-ETL uses deep learning and other AI techniques to solve most print media challenges and extract the relevant information out of the raw print data. There are several AI and machine learning (ML) models in place. These include computer vision models (for page segmentation) and NLP models (for ad prediction, headline detection, and next sentence prediction).

With today’s practices, deploying deep learning models incurs complexity in itself. Correspondingly, new practices come into the spotlight for managing the ML lifecycle in production reliably and efficiently—the emerging field of MLOps. In our use case, we use Deep Java Library (DJL) to integrate ML models into our data pipelines written in Scala. We found that this strategy simplifies model deployment and maintenance alike. In this post, we focus on the model we use to filter paid advertisements: the ad predictor.

The following diagram illustrates the Print-ETL architecture.

First version ad predictor: Serverless inference

We approached the advertisement classification challenge as a supervised binary text classification problem. We fine-tuned a BERT (Bidirectional Encoder Representations from Transformers) pre-trained multilingual base model with a binary classification layer on top of the transformer output. For training, we used a custom-built dataset containing advertisement data that we collected. The input of the model is a sequence of tokens, and the output is a classification score from 0–1, which is the probability of being an ad. This score is calculated by applying a sigmoid function to the linear layer prediction outputs (logits).
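
As a rough illustration of this scoring setup (not Hypefactors' actual code), the following sketch uses the Hugging Face transformers library with the multilingual BERT base checkpoint and a single-logit classification head; the checkpoint name and example sentence are illustrative:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative base checkpoint; the production model is a privately fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1  # single logit for binary classification
)
model.eval()

inputs = tokenizer("Save 10% when you buy in the next ten minutes!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# A sigmoid turns the raw logit into a 0-1 probability of the text being an ad
ad_score = torch.sigmoid(logits).item()
print(f"Ad likelihood: {ad_score:.3f}")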

On our first iteration, we deployed a standalone ad predictor endpoint on an external service. This made operations harder. Predictions had higher latency because of network calls and boot-up times, which caused timeouts and issues resulting from predictor unavailability due to instance interruptions. We also had to auto scale both the data pipeline and the prediction service, which was non-trivial given the unpredictable load of events. However, this strategy also had a few benefits. The service was packaged separately as an API and developed in Python, a language more familiar to data scientists than Scala. Also, the predictor wasn’t integrated into the Print-ETL system, so it wasn’t necessary to be familiar with the system to maintain the predictor.

The following diagram illustrates our BERT model for text classification.

The following is an example of our ads data.

Second version with DJL

Our solution to these challenges centered on combining the benefits of two frameworks: Open Neural Network Exchange (ONNX) and Deep Java Library.

With ONNX and DJL, we deployed a new multilingual ad predictor model directly in our pipeline. This replaced our first solution, the serverless ad predictor. The new model was fine-tuned on a new, larger set of data that contained over 450,000 sentences in Danish, English, and Portuguese. They reflect a sample of the production data being processed at the moment.

When deploying the model, DJL enabled us to adopt an API-free strategy. This strategy improved our data processing in myriad ways. For instance, it helped us achieve our latency requirements and use ML inferences in real time. Also, by replacing our standalone ad predictor, we no longer needed to mock an external service API in our tests. That allowed us to simplify our test suite. This in turn led to more test stability. Following our successful deployment, DJL allowed us to integrate other ML models that improved data processing even further.

Let’s go into the details of ONNX and DJL.

ONNX

ONNX is an open-source ecosystem of AI and ML tools designed to provide extensive interoperability between different deep learning frameworks. It manages models from different languages and environments. Its tools and common file format enable us to train a model using one framework, dynamically quantize it using tools from another, and deploy that model using yet another framework. That increased interoperability, along with help from DJL, allowed us to easily integrate our model with the JVM—and consequently our Scala pipeline as well.

More specifically, we used a tool called ONNX Runtime. We converted our original PyTorch model to the standard ONNX file format, and then applied dynamic quantization techniques using ONNX Runtime. This shrank our original model size by about a factor of four with little to no loss in model performance. It also gave our model a speed boost on CPU-based inferences. In particular, prior rollouts had shown us simple yet cost-effective performance with 8-bit quantization when running on a CPU with AVX-512 instructions. We were confident that this strategy would give us the results we were looking for.
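
A minimal sketch of this export-and-quantize step might look like the following, assuming a fine-tuned Hugging Face classifier; the checkpoint name, file names, input shape, and opset version are illustrative:

import torch
from onnxruntime.quantization import QuantType, quantize_dynamic
from transformers import AutoModelForSequenceClassification

# Illustrative checkpoint standing in for the fine-tuned ad classifier
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=1
)
model.eval()

# Export the PyTorch model to the ONNX format with dynamic batch and sequence axes
dummy_input_ids = torch.ones(1, 128, dtype=torch.long)
dummy_attention_mask = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(
    model,
    (dummy_input_ids, dummy_attention_mask),
    "ad_predictor.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
    opset_version=13,
)

# Dynamically quantize the weights to 8 bits, shrinking the model roughly 4x
quantize_dynamic("ad_predictor.onnx", "ad_predictor_int8.onnx", weight_type=QuantType.QInt8)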

Deep Java Library

DJL presented the other half of our solution. DJL is an open-source library that defines a Java-based deep learning framework. DJL abstracts away complexities involved with deep learning deployments, making training and inference a breeze. It’s engine agnostic, and is therefore compatible with a wide variety of deep learning engines. Those engines include PyTorch, TensorFlow, and MXNet, among others. Most importantly for us, DJL supports the ONNX Runtime engine.

Our DJL-based deployment brought several advantages over our original ad predictor deployment. First and foremost, from an engineering perspective, it was simpler. The direct native integration of ad prediction with our Scala data pipeline streamlined our architecture considerably. It allowed us to avoid the computational overhead of serializing and deserializing data, as well as the latency of making network calls to an external service.

Additionally, this meant that there was no longer any need for complicated autoscaling of an external service—the pipeline’s existing autoscaling infrastructure was sufficient to meet all our data processing requirements. Moreover, DJL’s predictor architecture worked well with Monix’s concurrent data processing, allowing us to make multiple inferences simultaneously across different threads.

Those simplifications led us to eliminate our standalone ad predictor service entirely. This eliminated all operational costs associated with running and maintaining that service.

Another consequence of those simplifications was the further streamlining of our test suite. For example, we no longer needed to mock our ad predictor. We could instead directly ensure the correctness and performance of our model on every commit using our continuous integration (CI). Upon every new commit pushed to the Print-ETL, our CI would run our suite of tests, which included tests for the DJL-based ad predictor. This maintains our confidence that our deep learning model works properly whenever we change our code base.

The following screenshot is a snippet of our ad detection CI in action.

Our testing strategy is now twofold: first, we use tests to determine the validity of our ad predictor model’s output; namely, the model should detect the same ads with the same, or higher, level of accuracy as previous iterations of the model. Second, the model’s robustness is stressed by passing particularly long, short, strange, or fragmented text samples. End-to-end performance tests that take advantage of the ad predictor’s services add a second layer of accountability. This makes sure that current and future deployments of our ad predictor function as intended. If the ad predictor isn’t performing as expected, our tests immediately reflect that incapability. The following code is an example of some sample test cases:

  /** Some sample test cases */
  it should "detect ads in danish, english, and portuguese" in {
    val daAdSentence = "Lidt bedre end andre gode oste"
    val daAdLikelihood = AdDetector.predict(daAdSentence)
    daAdLikelihood.success.value should be > 0.9d

    val enAdSentence = "Save 10% when you buy in the next ten minutes!"
    val enAdLikelihood = AdDetector.predict(enAdSentence)
    enAdLikelihood.success.value should be > 0.9d

    val ptAdSentence = "Defenda a sua saúde, tomando YOGHURT"
    val ptAdLikelihood = AdDetector.predict(ptAdSentence)
    ptAdLikelihood.success.value should be > 0.9d
  }

This, in turn, simplified our operations strategy as well. It’s now easier to spot, track, and reproduce inference errors if and when they occur. Such an error immediately tells us which input the model failed to predict on, the exact error message given by ONNX Runtime, along with relevant information for reproducing the error. Also, because our ad predictor is now integrated with our data pipeline, we only need to consult one log stream when analyzing error messages. After the associated bug is reproduced and fixed, we can add a new test case to ensure the same bug doesn’t occur again.

Conclusion and next steps

We have been happy with our DJL-based deployment. Our success with DJL has empowered us to utilize the same strategy to deploy other deep learning models for other purposes, such as headline detection and next sentence prediction. In each of those cases, we experienced similar results as with our ad predictor—deployment was easy, simple, and economical.

In the future, one avenue we would be excited to explore with DJL is GPU-based inference. Our current DJL deployments are exclusively CPU based—partially due to its cost-effectiveness, and partially due to its simplicity when compared to a GPU-based alternative. Given our experiences with DJL, however, we believe that DJL could drastically streamline any GPU-based deployment that we pursue. To learn more and get started on DJL, visit the website. You can also visit the GitHub repo, demo repository, examples, Slack channel, and Twitter for more documentation and examples of DJL!

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Lukas Baker works at the intersection of data engineering and applied machine learning. At Hypefactors, he occasionally builds a data pipeline and designs and trains a model in between.

Andrea Duque is an all-round engineer and scientist with a history of connecting the dots with MLOps. At Hypefactors, she designs and rolls out ML-heavy data pipelines end-to-end.

Viet Yen Nguyen is the CTO of Hypefactors and leads the teams on data science, web app development and data engineering. Prior to Hypefactors, he developed technology for designing mission-critical systems, including the European Space Agency.


How Careem is detecting identity fraud using graph-based deep learning and Amazon Neptune

This post was co-written with Kevin O’Brien, Senior Data Scientist in Careem’s Integrity Team.

Dubai-based Careem became the Middle East’s first unicorn when it was acquired by Uber for $3.1 billion in 2019. A pioneer of the region’s ride-hailing economy, Careem is now expanding its services to include mass transportation, delivery, and payments as an everyday super app.

But its size and popularity—it has around 50 million customer accounts—have also made it a prime target for fraudsters constantly looking for new loopholes to exploit and different ways to hijack genuine accounts.

In this post, we share how Careem detects identity fraud using graph-based deep learning and Amazon Neptune.

The challenge

Due to Careem’s massive popularity, fraudsters are constantly looking for new loopholes to exploit, creating identity-faked accounts (first-party fraud), and finding different ways to hijack genuine accounts—also known as account takeover (third-party fraud). Careem’s data science and analytics backed Integrity team needed more advanced ways to detect and stop losses from fraud that could damage both their revenue and brand reputation. This solution would ideally cover both first- and third-party fraud.

Traditionally, tackling these different kinds of fraudulent activities was a never-ending game of cat and mouse. Careem’s Integrity team would often create rules or machine learning (ML) models for each specific type of fraud, but this was sometimes problematic on two levels:

  • It only allowed them to identify and block an account after the fraud had been committed and detected, which means the money had already been lost
  • Fraudsters were quickly able to find a new loophole to exploit once an existing fraud pattern had been detected

As a result, instead of continuously creating overly specific tools to detect very specific fraud patterns, they wanted to build an intelligent system that was almost a blanket detection mechanism over all users, wherever they were performing actions on the platform.

The new approach

Careem needed to be proactive rather than reactive. A smarter and faster way to detect fraudulent activities and stop them before the act was committed was required.

After much experimentation, Careem decided to focus on the identity of users, and came up with a powerful way to outsmart any efforts of identity fraud. They opted to use a graph structure as a way of mapping different aspects and data points of each user’s identity together, and more importantly, characteristics shared across the identities of different users. This would allow them to detect potentially fraudulent patterns in real time across user and account activity.

Architecture overview

Before we dive deep into how Careem used Neptune to build an identity graph for fraud detection, let’s look at the current architecture underpinning the solution. Careem chose AWS and its automated real-time analysis and monitoring capabilities because of the integrated cloud setup they already had.

Data ingestion

Data ingestion comprises two stages: a one-time extract, transform, and load (ETL) for all historical data, and a live streaming service of real-time data.

  • Historical data – Careem uses Apache Hive running on Amazon Simple Storage Service (Amazon S3) to extract data and push it to Amazon EMR with PySpark. Amazon EMR pushes this historical data to Neptune.
  • Real-time data – Careem uses their existing event processor to feed the data from all actions performed by users through Amazon Simple Queue Service (Amazon SQS). These events are consumed by a Python interface running on AWS Elastic Beanstalk, which takes these events and writes them to Neptune in real time.

Data querying

The data ingested from these sources is then queried, again using the Python interface running on Elastic Beanstalk. A simple set of logical rules is used to process the data returned for a query on a particular user, and a decision is made on whether the action performed was likely to be done by a fraudster. Based on the value of the user’s historical transactions, the fraudulent account is either blocked automatically (if it’s a low-value customer) or sent for manual review (if it’s a high-value customer).
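
As a purely illustrative sketch of this kind of rule-based decision (the actual rules, signals, and thresholds are Careem's own and not public), the logic might resemble the following:

def decide_action(suspicious_signals: int, historical_transaction_value: float) -> str:
    """Toy decision logic: block low-value accounts automatically and
    escalate high-value accounts for manual review.
    The signal count and thresholds below are illustrative placeholders."""
    SUSPICIOUS_SIGNAL_THRESHOLD = 3      # illustrative
    HIGH_VALUE_THRESHOLD = 1000.0        # illustrative

    if suspicious_signals < SUSPICIOUS_SIGNAL_THRESHOLD:
        return "allow"
    if historical_transaction_value >= HIGH_VALUE_THRESHOLD:
        return "manual_review"
    return "auto_block"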

Data consumption

The Integrity team at Careem developed a data consumption API that is used by the other teams at Careem to query users in the graph to retrieve data about their identities.

Implementing the graph data model on Neptune

The basic building blocks of any directed graph are vertices (or nodes) and edges. A vertex is an object that represents an entity in your data. For example, a customer can be a node, and the features and information about this customer are called node properties. An edge represents a connection between different nodes. For example, we may have an edge with a label called has_device that connects a customer node to a device node. A large collection of different nodes and edges are called a graph, as illustrated in the following diagram.

One type of graph architecture is called an identity graph. Identity graphs provide a single unified view of different identities by linking multiple node identifiers such as device IDs, IP addresses, emails, or credit cards to a known person or anonymous profile using privacy-compliant methods. Typically, identity graphs are part of a larger identity resolution architecture. Identity resolution is the process of matching a human identity across a set of devices used by the same person or a household of persons for the purposes of building a representative identity, or known attributes. We can then use this identity graph to find patterns in our data that could indicate fraud activities. We can evaluate identities in the context of other identities or transactions and determine if constellations of data in the graph represent fraudulent activity.
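
For illustration, nodes and edges like the has_device example can be written to Neptune with the Gremlin Python client. The endpoint, labels, and property values below are placeholders, not Careem's actual schema:

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

# Placeholder Neptune cluster endpoint
conn = DriverRemoteConnection("wss://your-neptune-endpoint:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Create a customer node, a device node, and a has_device edge between them
customer = g.addV("customer").property("customer_id", "c-123").property("is_fraud", 0).next()
device = g.addV("device").property("device_id", "d-456").next()
g.V(customer.id).addE("has_device").to(__.V(device.id)).iterate()

conn.close()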

The task we are solving in this case is called node classification. Node classification is a supervised ML approach whereby we predict a categorical property of a node. In this case, we decided to build a graph model to predict the is_fraud property of customer nodes using Amazon Neptune ML. Neptune ML is a feature in Neptune that makes it easy to build and train ML models on large graphs using graph neural networks (GNNs). It uses Amazon SageMaker and the Deep Graph Library (DGL) to scale the training and tuning of the graph model.

Data labeling strategy and maturity

In addition to building the graph from different data sources, we needed a robust data labeling and data maturity strategy for the supervised learning task. Data maturity is the process of making sure that the fraud labels have had sufficient time to mature. In other words, enough time has passed to ensure legitimate and fraud records have been correctly and accurately identified. The maturity period can vary depending on the business. For example, for chargeback fraud, it can take somewhere between 30 days and 2 months to accurately identify fraudulent events.

Careem’s customer nodes in the graph were labeled as fraudulent if they had historically been blocked for fraud, either manually or by one of Careem’s automated, rule-based fraud detection systems. These labels are added to the graph either in the historical ETL, for users who are already blocked, or in live streaming, which blocks users in real time. They ensured the maturity of these labels by only using fraud labels for blocked users who hadn’t contacted customer care to request a review of their block within a period of time after being blocked.

One issue that arose was that there were many fraudulent accounts that had gone undetected. The volume of these mislabeled customer nodes was substantial enough to affect the training performance of the model. To combat this, a strict set of heuristics based on domain knowledge of the platform was applied to the customers in the graph, which allowed a large number of these labels in the training dataset to be corrected with high confidence using a script. This reduction in noisy labels allowed the model to learn more accurately.

Collaboration with AWS on Neptune ML

Throughout this project, Careem’s Integrity team worked closely with the AWS ML Specialist and Neptune ML teams to develop this project with maximum efficiency and effectiveness. This included first-hand, on-call support and troubleshooting, as well as working together to build, scale, and optimize our graph.

In addition, Careem has a large volume of properties on the edges in their graph, which were previously not being used in the model’s training and predictions. Careem provided input on the development of a modified version of the RGCN architecture in Neptune ML that uses edge properties from the graph to learn representations, rather than node properties alone as the traditional RGCN model does. Throughout this process, the Neptune ML team also worked on critical features that enabled Careem to train and optimize the graph at scale. These features include multi-GPU training, custom performance metrics, training instance size estimation, scalable and parallel processing, and custom hyperparameter tuning. All of these features are available in the latest Neptune ML release, which became generally available as of July 2021.

Looking to the future

Careem is currently working with the AWS team to build and train a deep learning model to more accurately detect fraud on their user identity graph. Testing results for the initial phase are looking promising so far, with a precision of around 85% and a recall of over 50%. In other words, the model is able to correctly identify over 50% of all users that have ever historically been blocked for fraud on the platform, with a precision of 85%. All of this without knowing anything about the user’s transaction history, bookings, food and grocery orders, and other details—just data about their identity.

Work is now being done to deploy this trained model to production, allowing it to detect fraud in cases such as when a fraudster sets up a new account or compromises the account of an existing genuine user. This will all be done as users perform actions in real time.

In the future, Careem also plans to add Captains (what Careem’s drivers are known as) to the graph to similarly detect fraudulent Captains, or even fraudulent activity produced by collusion between users and Captains. To learn more about Amazon Neptune ML, visit the website.


About the Authors

Kevin O’Brien is a Senior Data Scientist at Careem. He is a member of the Integrity team, whose goal is to detect and prevent fraud on the platform, through data science and analytics. Kevin leads the Identity Risk squad of the Integrity team.

Waleed (Will) Badr is a Principal AI/ML Specialist Solutions Architect who works as part of the global Amazon Machine Learning team. Will has extensive experience in fraud detection and prevention systems and is passionate about using technology in innovative ways to positively impact the community.

Kamran Habib is a Senior Solutions Architect who works with our Digital Native Business (DNB) customers in the Middle East and North Africa (MENA) region. Kamran’s technical expertise focuses on containers, networking, and security, and he is passionate about solving customers’ business problems with innovative technical solutions. In his spare time, he enjoys travel, listening to podcasts, and cricket.


Bring Your Amazon SageMaker model into Amazon Redshift for remote inference

Amazon Redshift, a fast, fully managed, widely used cloud data warehouse, natively integrates with Amazon SageMaker for machine learning (ML). Tens of thousands of customers use Amazon Redshift to process exabytes of data every day to power their analytics workloads. Data analysts and database developers want to use this data to train ML models, which can then be used to generate insights for use cases such as forecasting revenue, predicting customer churn, and detecting anomalies.

Amazon Redshift ML makes it easy for SQL users to create, train, and deploy ML models using familiar SQL commands. In a previous post, we covered how Amazon Redshift ML allows you to use your data in Amazon Redshift with SageMaker, a fully managed ML service, without requiring you to become an expert in ML. We also discussed how Amazon Redshift ML enables ML experts to create XGBoost or MLP models in an earlier post. Additionally, Amazon Redshift ML allows data scientists to either import existing SageMaker models into Amazon Redshift for in-database inference or remotely invoke a SageMaker endpoint.

This post shows how you can enable your data warehouse users to use SQL to invoke a remote SageMaker endpoint for prediction. We first train and deploy a Random Cut Forest model in SageMaker, and demonstrate how you can create a model with SQL that invokes the SageMaker endpoint remotely for predictions. Then, we show how end users can invoke the model.

Prerequisites

To get started, we need an Amazon Redshift cluster with the Amazon Redshift ML feature enabled. For an introduction to Amazon Redshift ML and instructions on setting it up, see Create, train, and deploy machine learning models in Amazon Redshift using SQL with Amazon Redshift ML.

You also have to make sure that the SageMaker model is deployed and you have the endpoint. You can use the following AWS CloudFormation template to provision all the required resources in your AWS accounts automatically.

Solution overview

Amazon Redshift ML supports text and CSV inference formats. For more information about various SageMaker algorithms and their inference formats, see Random Cut Forest (RCF) Algorithm.

Amazon SageMaker Random Cut Forest (RCF) is an algorithm designed to detect anomalous data points within a dataset. Examples of anomalies that are important to detect include when website activity uncharacteristically spikes, when temperature data diverges from a periodic behavior, or when changes to public transit ridership reflect the occurrence of a special event.

In this post, we use the SageMaker RCF algorithm to train an RCF model using the Notebook generated by the CloudFormation template on the Numenta Anomaly Benchmark (NAB) NYC Taxi dataset.

We downloaded the data and stored it in an Amazon Simple Storage Service (Amazon S3) bucket. The data consists of the number of New York City taxi passengers over the course of 6 months aggregated into 30-minute buckets. We naturally expect to find anomalous events occurring during the NYC marathon, Thanksgiving, Christmas, New Year’s Day, and on the day of a snowstorm.

We then use this model to predict anomalous events by generating an anomaly score for each data point.
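
The notebook generated by the CloudFormation template takes care of training and deployment. For orientation, training and deploying RCF with the SageMaker Python SDK looks roughly like the following sketch; the role ARN, instance types, and hyperparameter values are illustrative:

import numpy as np
import sagemaker
from sagemaker import RandomCutForest

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/SageMakerExecutionRole"  # illustrative

# 1-D array of 30-minute passenger counts loaded from the NAB taxi CSV
taxi_data = np.loadtxt("NAB_nyc_taxi.csv", delimiter=",", skiprows=1, usecols=1)

rcf = RandomCutForest(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    num_samples_per_tree=512,
    num_trees=50,
    sagemaker_session=session,
)

# Train on the record set, then deploy a real-time endpoint for remote inference
rcf.fit(rcf.record_set(taxi_data.reshape(-1, 1).astype("float32")))
predictor = rcf.deploy(initial_instance_count=1, instance_type="ml.m5.large")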

The following figure illustrates how we use Amazon Redshift ML to create a model using the SageMaker endpoint.

Deploy the model

To deploy the model, go to the SageMaker console and open the notebook that was created by the CloudFormation template.

Then choose bring-your-own-model-remote-inference.ipynb.

Set up parameters as shown in the following screenshot and then run all cells.

Get the SageMaker model endpoint

On the Amazon SageMaker console, under Inference in the navigation pane, choose Endpoints to find your model name. You use this when you create the remote inference model in Amazon Redshift.

Prepare data to create a remote inference model using Amazon Redshift ML

Create the schema and load the data in Amazon Redshift using the following SQL:

DROP TABLE IF EXISTS public.rcf_taxi_data CASCADE;
CREATE TABLE public.rcf_taxi_data
(
ride_timestamp timestamp,
nbr_passengers int
);
COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
iam_role 'arn:aws:iam::<accountid>:role/RedshiftML' ignoreheader 1 csv delimiter ',';

Amazon Redshift now supports attaching the default IAM role. If you have enabled the default IAM role in your cluster, you can use the default IAM role as follows.

COPY public.rcf_taxi_data
FROM 's3://sagemaker-sample-files/datasets/tabular/anomaly_benchmark_taxi/NAB_nyc_taxi.csv'
iam_role default ignoreheader 1 csv delimiter ',';

You can use the Amazon Redshift query editor v2 to run these commands.

Create a model

Create a model in Amazon Redshift ML using the SageMaker endpoint you previously captured:

CREATE MODEL public.remote_random_cut_forest
FUNCTION remote_fn_rcf(int)
RETURNS decimal(10,6)
SAGEMAKER 'randomcutforest-xxxxxxxxx'
IAM_ROLE 'arn:aws:iam::<accountid>:role/RedshiftML';

You can also use the default IAM role with your CREATE MODEL command as follows:

CREATE MODEL public.remote_random_cut_forest
FUNCTION remote_fn_rcf(int)
RETURNS decimal(10,6)
SAGEMAKER 'randomcutforest-xxxxxxxxx'
IAM_ROLE default;

Check model status

You can use the show model command to view the status of the model:

show model public.remote_random_cut_forest

You get output like the following screenshot, which shows the endpoint and function name.

Compute anomaly scores across the entire taxi dataset

Now, run the inference query using the function name from the create model statement:

select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data;

The following screenshot shows our results.

Now that we have our anomaly scores, we need to check for higher-than-normal anomalies.

Amazon Redshift ML has batching optimizations to minimize the communication cost with SageMaker and offers high-performance remote inference.

Check for high anomalies

The following code runs a query for any data points with scores greater than three standard deviations (approximately 99.9th percentile) from the mean score:

with score_cutoff as
(select stddev(public.remote_fn_rcf(nbr_passengers)) as std, avg(public.remote_fn_rcf(nbr_passengers)) as mean, ( mean + 3 * std ) as score_cutoff_value
from public.rcf_taxi_data)

select ride_timestamp, nbr_passengers, public.remote_fn_rcf(nbr_passengers) as score
from public.rcf_taxi_data
where score > (select score_cutoff_value from score_cutoff)
order by 2 desc;

The data in the following screenshot shows that the biggest spike in ridership occurs on November 2, 2014, which was the annual NYC marathon. We also see spikes on Labor Day weekend, New Year’s Day and the July 4th holiday weekend.

Conclusion

In this post, we used SageMaker Random Cut Forest to detect anomalous data points in a taxi ridership dataset. In this data, the anomalies occurred when ridership was uncharacteristically high or low. However, the RCF algorithm is also capable of detecting when, for example, data breaks periodicity or uncharacteristically changes global behavior.

We then used Amazon Redshift ML to demonstrate how you can make inferences on unsupervised algorithms (such as Random Cut Forest). This allows you to democratize ML by making predictions with Amazon Redshift SQL commands.

For more information about building different models with Amazon Redshift ML, see the Amazon Redshift ML documentation.


About the Authors

Phil Bates is a Senior Analytics Specialist Solutions Architect at AWS with over 25 years of data warehouse experience.

Debu Panda, a principal product manager at AWS, is an industry leader in analytics, application platform, and database technologies and has more than 25 years of experience in the IT world.

Nikos Koulouris is a Software Development Engineer at AWS. He received his PhD from University of California, San Diego and he has been working in the areas of databases and analytics.

Murali Narayanaswamy is a principal machine learning scientist in AWS. He received his PhD from Carnegie Mellon University and works at the intersection of ML, AI, optimization, learning and inference to combat uncertainty in real-world applications including personalization, forecasting, supply chains and large scale systems.


Run distributed hyperparameter and neural architecture tuning jobs with Syne Tune

Today we announce the general availability of Syne Tune, an open-source Python library for large-scale distributed hyperparameter and neural architecture optimization. It provides implementations of several state-of-the-art global optimizers, such as Bayesian optimization, Hyperband, and population-based training. Additionally, it supports constrained and multi-objective optimization, and allows you to bring your own global optimization algorithm.

With Syne Tune, you can run hyperparameter and neural architecture tuning jobs locally on your machine or remotely on Amazon SageMaker by changing just one line of code. The former is a well-suited backend for smaller workloads and fast experimentation on local CPUs or GPUs. The latter is well suited for larger workloads, which would otherwise come with a substantial amount of implementation overhead. Syne Tune makes it easy to use SageMaker as a backend to reduce wall clock time by evaluating a large number of configurations on parallel Amazon Elastic Compute Cloud (Amazon EC2) instances, while taking advantage of SageMaker’s rich set of functionalities (including pre-built Docker deep learning framework images, EC2 Spot Instances, experiment tracking, and virtual private networks).

By open-sourcing Syne Tune, we hope to create a community that brings together academic and industrial researchers in machine learning (ML). Our goal is to create synergies between these two groups by enabling academics to easily validate small-scale experiments at larger scale and industrials to use a broader set of state-of-the-art optimizers.

In this post, we discuss hyperparameter and architecture optimization in ML, and show you how to launch tuning experiments on your local machine and also on SageMaker for large-scale experiments.

Hyperparameter and architecture optimization in machine learning

Every ML algorithm comes with a set of hyperparameters that control the training algorithm or the architecture of the underlying statistical model. Typical examples of such hyperparameters for deep neural networks are the learning rate or the number of units per layer. Setting these hyperparameters correctly is crucial to obtaining top-notch predictive performance.

To overcome the daunting process of trial and error, hyperparameter and architecture optimization aims to automatically find the specific configuration that maximizes the validation performance of our ML algorithm. Arguably, the easiest method to solve this global optimization problem is random search, where configurations are sampled from a predefined probability distribution. A more sample-efficient technique is Bayesian optimization, which maintains a probabilistic model of the objective function (here, the validation performance) to guide the search toward the global optimum in a sequential manner.
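To make random search concrete, the following is a minimal, library-agnostic sketch in plain Python (it is illustrative only and not Syne Tune’s implementation); the objective function and the search ranges are hypothetical stand-ins for training and evaluating a real model.

import math
import random

# Hypothetical stand-in for "train the model with this configuration and
# return its validation accuracy"; a real objective would train a network.
def validation_accuracy(config):
    return 1.0 - 0.1 * abs(math.log10(config["lr"]) + 3) - 0.2 * config["dropout_rate"]

def sample_config():
    # Sample each hyperparameter from a predefined probability distribution.
    return {
        "lr": 10 ** random.uniform(-5, -1),        # log-uniform over [1e-5, 1e-1]
        "dropout_rate": random.uniform(0.0, 0.9),  # uniform over [0, 0.9]
    }

# Random search: evaluate a fixed budget of sampled configurations and
# keep the one with the best validation performance.
best_config, best_score = None, float("-inf")
for _ in range(50):
    config = sample_config()
    score = validation_accuracy(config)
    if score > best_score:
        best_config, best_score = config, score

print(best_config, best_score)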

Unfortunately, with ever-increasing dataset sizes and ever-deeper models, training deep neural networks can be prohibitively slow to tune. Recent advances in hyperparameter optimization, such as Hyperband or MoBster, early stop the evaluation of configurations that are unlikely to achieve a good performance and reallocate the resources that would have been consumed to the evaluation of other candidate configurations. You can obtain further gains by using distributed resources to parallelize the tuning process. Because the time to train a deep neural network can vary widely across hyperparameter and architecture configurations, optimal resource allocation requires our optimizer to asynchronously decide which configuration to run next by taking the pending evaluation of other configurations into account. Next, we see how this works in practice and how we can run this either on a local machine or on SageMaker.
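The resource reallocation idea behind Hyperband can be sketched with the successive halving routine below. This is a simplified, synchronous toy in plain Python (Syne Tune’s schedulers are asynchronous and more sophisticated), and the evaluate function is a hypothetical stand-in for partial training.

import random

# Hypothetical stand-in for "train this configuration for `epochs` epochs and
# return the validation score observed so far".
def evaluate(config, epochs):
    return 1.0 - abs(config["lr"] - 0.01) * (30.0 / (1.0 + epochs))

def successive_halving(num_configs=27, min_epochs=1, eta=3, max_epochs=27):
    # Start with many randomly sampled configurations and a small epoch budget.
    configs = [{"lr": 10 ** random.uniform(-5, -1)} for _ in range(num_configs)]
    epochs = min_epochs
    while len(configs) > 1 and epochs <= max_epochs:
        # Evaluate all surviving configurations with the current budget ...
        ranked = sorted(configs, key=lambda c: evaluate(c, epochs), reverse=True)
        # ... keep only the top 1/eta of them, and give the survivors more epochs.
        configs = ranked[: max(1, len(ranked) // eta)]
        epochs *= eta
    return configs[0]

print(successive_halving())

Hyperband runs several such brackets with different trade-offs between the number of configurations and the budget per configuration, but the core idea is this reallocation of training resources from poor to promising candidates.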

Tune hyperparameters with Syne Tune

We now detail how to tune hyperparameters with Syne Tune. First, you need a script that takes hyperparameters as arguments and reports results as soon as they are observed. Let’s look at a simplified example of a script that exposes the learning rate, dropout rate, and momentum as hyperparameters, and reports the validation accuracy after each training epoch:

from argparse import ArgumentParser
from syne_tune.report import Reporter

if __name__ == '__main__':
    parser = ArgumentParser()
    parser.add_argument('--lr', type=float)
    parser.add_argument('--dropout_rate', type=float)
    parser.add_argument('--momentum', type=float)
    parser.add_argument('--epochs', type=int)  # number of epochs, passed via the config space

    args, _ = parser.parse_known_args()
    report = Reporter()

    for epoch in range(1, args.epochs + 1):
        # ... train model and get validation accuracy        
        val_acc = compute_accuracy()
        
        # Feed the score back to Syne Tune.
        report(epoch=epoch, val_acc=val_acc)

The important part is the call to report. It enables you to transmit results to a scheduler that decides whether to continue the evaluation of a configuration (a trial), and later potentially uses this data to select new configurations. In our example, we train a computer vision model adapted from the SageMaker examples on GitHub.

We define the search space for the hyperparameters (dropout, learning rate, momentum) that we want to optimize by specifying the ranges:

from syne_tune.search_space import loguniform, uniform

max_epochs = 27
config_space = {
    "epochs": max_epochs,
    "lr": loguniform(1e-5, 1e-1),
    "momentum": uniform(0.8, 1.0),
    "dropout_rate": loguniform(1e-5, 1.0),
}

We also specify the scheduler we want to use, Hyperband in our case:

from syne_tune.optimizer.schedulers.hyperband import HyperbandScheduler

scheduler = HyperbandScheduler(
    config_space,
    max_t=max_epochs,
    resource_attr='epoch',
    searcher='random',
    metric="val_acc",
    mode="max",
)

Hyperband is a method that randomly samples configurations and early stops evaluation trials if they’re not performing well enough after a few epochs. We use this particular scheduler for our example, but many others are available; for example, switching to searcher='bayesopt' enables us to use MoBster, which uses a surrogate model to sample new configurations to evaluate.

We’re now ready to define and launch a hyperparameter tuning job. First, we define the number of workers that evaluate trials concurrently and how long the optimization should run (in seconds). Importantly, we use the local backend to evaluate our training script "train_cifar100.py" (see the full code). This means that the tuning happens on the local machine with one Python subprocess per worker. See the following code:

from syne_tune.backend.local_backend import LocalBackend
from syne_tune.tuner import Tuner
from syne_tune.stopping_criterion import StoppingCriterion

tuner = Tuner(
    backend=LocalBackend(entry_point="train_cifar100.py"),
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=7200),
    n_workers=4,
)

tuner.run()

As soon as the tuning starts, Syne Tune outputs the following line:

INFO:syne_tune.tuner:results of trials will be saved on /home/ec2-user/syne-tune/train-cifar100-2021-11-05-13-29-01-468

The log of the trials is stored in the aforementioned folder for further analysis. At any time during the tuning job, we can easily get the results obtained so far by calling load_experiment("train-cifar100-2021-11-05-15-22-27-531") and plotting the best result obtained since the start of the tuning job:

from syne_tune.experiments import load_experiment
tuning_experiment = load_experiment("train-cifar100-2021-11-05-15-22-27-531")
tuning_experiment.plot()

The following graph shows our results.

More fine-grained information is available if desired; the results obtained during tuning are stored as well as the scheduler and tuner state—namely, the state of the optimization process. For instance, we can plot the metric obtained for each trial over time (recall that we run four trials asynchronously). In the following figure, each trace represents the evaluation of a configuration as a function of the wall clock time; a dot is a trial stopped after one epoch.

We clearly see the effect of early stopping—only the most promising configurations are evaluated fully and poor performing configurations are stopped early, often after just evaluating a single epoch.
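As a rough sketch of this kind of per-trial analysis, the snippet below groups the results dataframe returned by load_experiment by trial and plots the reported val_acc against elapsed time. It assumes the ExperimentResult object exposes the collected rows as a pandas DataFrame through its results attribute and that the columns trial_id and st_tuner_time are present; check the column names against your own experiment output.

import matplotlib.pyplot as plt
from syne_tune.experiments import load_experiment

tuning_experiment = load_experiment("train-cifar100-2021-11-05-15-22-27-531")
df = tuning_experiment.results  # one row per reported (trial, epoch) result

# Assumed columns: "trial_id", "st_tuner_time" (elapsed wall clock time in
# seconds), and the metric "val_acc" reported by the training script.
for trial_id, trial_df in df.groupby("trial_id"):
    plt.plot(trial_df["st_tuner_time"], trial_df["val_acc"], marker="o", label=f"trial {trial_id}")

plt.xlabel("wall clock time (s)")
plt.ylabel("validation accuracy")
plt.legend(loc="lower right", fontsize="small")
plt.show()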

We can also easily switch to another scheduler, for example, random search or MoBster:

from syne_tune.optimizer.schedulers.fifo import FIFOScheduler

# Random search: a FIFO scheduler with a random searcher (no early stopping)
scheduler = FIFOScheduler(
    config_space,
    searcher='random',
    metric="val_acc",
    mode="max",
)

# MoBster: Hyperband combined with a model-based (Bayesian optimization) searcher
scheduler = HyperbandScheduler(
    config_space,
    max_t=max_epochs,
    resource_attr='epoch',
    searcher='bayesopt',
    metric="val_acc",
    mode="max",
)

If we then run the same code with the new schedulers, we can compare all three methods. We see in the following figure that Hyperband only continues well-performing trials, and early stops poorly performing configurations.

Therefore, Hyperband evaluates many more configurations than random search (see the following figure), which uses resources to evaluate every configuration until the end. This can lead to drastic speedups of the tuning process in practice.

MoBster further improves over Hyperband by using a probabilistic surrogate model of the objective function.

The following figure shows all configurations that Hyperband samples during the tuning job.

In comparison, MoBster samples more promising configurations around the well-performing range (brighter color being better) of the search space instead of sampling them uniformly at random like Hyperband.

Run large-scale tuning jobs with Syne Tune and SageMaker

The previous example showed how to tune hyperparameters on a local machine. Sometimes, we need more powerful machines or a large number of workers, which motivates the use of a cloud infrastructure. Syne Tune provides a very simple way to run tuning jobs on SageMaker. Let’s look at how this can be achieved with Syne Tune.

We first upload the cifar100 dataset to Amazon Simple Storage Service (Amazon S3) so that it’s available on EC2 instances:

import sagemaker

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-cnn-cifar100"
role = sagemaker.get_execution_role()
inputs = sagemaker_session.upload_data(path="data", bucket=bucket, key_prefix="data/cifar100")

Next, we specify that we want trials to be run on the SageMaker backend. We use the SageMaker framework (PyTorch) in this particular example because we have a PyTorch training script, but you can use any SageMaker framework (such as XGBoost, TensorFlow, Scikit-learn, or Hugging Face).

A SageMaker framework is a Python wrapper that allows you to run ML code easily by providing a pre-made Docker image that works seamlessly on CPU and GPU for many framework versions. In this particular example, all we need to do is to instantiate the wrapper PyTorch with our training script:

from sagemaker.pytorch import PyTorch
from syne_tune.backend.sagemaker_backend.sagemaker_utils import get_execution_role
from syne_tune.backend.sagemaker_backend.sagemaker_backend import SagemakerBackend

backend = SagemakerBackend(
    sm_estimator=PyTorch(
        entry_point="./train_cifar100.py",
        instance_type="ml.g4dn.xlarge",
        instance_count=1,
        role=get_execution_role(),
        framework_version='1.7.1',
        py_version='py3',
    ),
    inputs=inputs,
)

We can now run our tuning job again, but this time we use 20 workers, each having their own GPU:

tuner = Tuner(
    backend=backend,
    scheduler=scheduler,
    stop_criterion=StoppingCriterion(max_wallclock_time=7200, max_cost=20.0),
    n_workers=20,
    tuner_name="cifar100-on-sagemaker"
)

tuner.run()

After each instance initiates a training job, you see the status updates as in the local case. An important difference from the local backend is that the total estimated dollar cost is displayed, as well as the cost of each worker.

trial_id      status  iter  dropout_rate  epochs        lr  momentum  epoch  val_acc  worker-time  worker-cost
        0  InProgress     1      0.003162      30  0.001000  0.900000    1.0   0.4518         50.0     0.010222
        1  InProgress     1      0.037723      30  0.000062  0.843500    1.0   0.1202         50.0     0.010222
        2  InProgress     1      0.000015      30  0.000865  0.821807    1.0   0.4121         50.0     0.010222
        3  InProgress     1      0.298864      30  0.006991  0.942469    1.0   0.2283         49.0     0.010018
        4  InProgress     0      0.000017      30  0.028001  0.911238      -        -            -            -
        5  InProgress     0      0.000144      30  0.000080  0.870546      -        -            -            -
6 trials running, 0 finished (0 until the end), 387.53s wallclock-time, 0.04068444444444444$ estimated cost

Because we specified max_wallclock_time=7200 and max_cost=20.0, the tuning job stops when the wall clock time or the estimated cost exceeds the specified bound. In addition to being estimated, the cost can also be optimized directly with our multi-objective optimizers (see the GitHub repo for an example). As shown in the following figures, the SageMaker backend allows you to evaluate many more hyperparameter and architecture configurations in the same wall clock time than the local backend and, as a result, increases the likelihood of finding a better configuration.

Conclusion

In this post, we saw how to use Syne Tune to launch tuning experiments on your local machine and also on SageMaker for large-scale experiments. To learn more about the library, check out our GitHub repo for documentation and examples that show, for instance, how to run model-based Hyperband, tune multiple objectives, or run with your own scheduler. We look forward to your contributions and seeing how this solution can address everyday tuning of ML pipelines and models.


About the Author

David Salinas is a Sr Applied Scientist at AWS.

Aaron Klein is an Applied Scientist at AWS.

Matthias Seeger is a Principal Applied Scientist at AWS.

Cedric Archambeau is a Principal Applied Scientist at AWS and Fellow of the European Lab for Learning and Intelligent Systems.

Read More

Your guide to AI and ML at AWS re:Invent 2021

It’s almost here! Only 9 days until AWS re:Invent 2021, and we’re very excited to share some highlights you might enjoy this year. The AI/ML team has been working hard to serve up some amazing content, and this year we have more session types for you to enjoy. Back in person, we now have chalk talks, workshops, builders’ sessions, and our traditional breakout sessions. Last year we hosted the first-ever machine learning (ML) keynote, and we are continuing the tradition. We also have more interactive and fun events happening with our AWS DeepRacer League and AWS BugBust Challenge. There are over 200 AI/ML sessions, including breakout sessions with customers such as Aon Corporation, Qualtrics, Shutterstock, and Bloomberg.

To help you plan your agenda for this year’s re:Invent, here are some highlights of the AI/ML track. You can also get the scoop from some of our AI/ML Community Heroes. So buckle up, and start registering for your favorite sessions.

Swami Sivasubramanian keynote

Wednesday, December 1, 8:30 am PT

Join Swami Sivasubramanian, Vice President of Machine Learning at AWS, for an exploration of what it takes to put data into action with an end-to-end data strategy, including the latest news on databases, analytics, and ML.

AI/ML leadership session with Bratin Saha

Wednesday, December 1, 4:00 pm PT

With the rise in compute power and data proliferation, ML has moved from the periphery to being a core part of businesses and organizations across industries. AWS customers use ML and AI services to make accurate predictions, get deeper insights from their data, reduce operational overhead, improve customer experiences, and create entirely new lines of business. In this session, hear from Bratin Saha, Vice President of Machine Learning at AWS, and explore how AWS services can help you move from idea to production with ML.

AI/ML session preview

Here’s a preview of some of the different sessions we’re offering this year by session type. You can always log in to the event portal to favorite or register for any of these sessions, or search the catalog for over 200 other sessions available.

Breakout sessions

Prepare data for ML with ease, speed, and accuracy (AIM319)

Join this session to learn how to prepare data for ML in minutes using Amazon SageMaker. SageMaker offers tools to simplify data preparation so that you can label, prepare, and understand your data. Walk through a complete data-preparation workflow, including how to label training datasets using SageMaker Ground Truth, as well as how to extract data from multiple data sources, transform it using the prebuilt visualization templates in SageMaker Data Wrangler, and create model features. Also, learn how to improve efficiency by using SageMaker Feature Store to create a repository to store, retrieve, and share features.

Achieve high performance and cost-effective model deployment (AIM408)

To maximize your ML investments, high-performance and cost-effective techniques are needed to scale model deployments. In this session, learn about the deployment options available in Amazon SageMaker, including optimized infrastructure choices; real-time, asynchronous, and batch inferences; multi-container endpoints; multi-model endpoints; auto scaling; model monitoring; and CI/CD integration for your ML workloads. Discover how to choose the right inference option for your ML use case. Then, hear from Goldman Sachs about how they use SageMaker for fast, low-latency, and scalable deployments to provide relevant research content recommendations for their clients.

Implementing MLOps practices with Amazon SageMaker, featuring Vanguard (AIM320)

Implementing MLOps practices helps data scientists and operations engineers collaborate to prepare, build, train, deploy, and manage models at scale. During this session, explore the breadth of MLOps features in Amazon SageMaker that help you provision consistent model development environments, automate ML workflows, implement CI/CD pipelines for ML, monitor models in production, and standardize model governance capabilities. Then, hear from Vanguard as they share their journey enabling MLOps to achieve ML at scale for their polyglot model development platforms using SageMaker features, including SageMaker projects, SageMaker Pipelines, SageMaker Model Registry, and SageMaker Model Monitor.

Enhancing the customer experience with Amazon Personalize (AIM204)

Personalizing content for a customer online is key to breaking through the noise. Yet, brands face challenges that often prevent them from providing these seamless, relevant experiences. Learn how easy it is to use Amazon Personalize to tailor product and content recommendations to ensure that your users are getting the content they want, leading to increased engagement and retention.

AI/ML for sustainability innovation: Insight at the edge (AIM207)

As climate change, wildlife conservation, public health, racial and economic equity, and new energy solutions become increasingly interdependent, scalable solutions are needed for actionable analysis at the intersection of these fields. In this session, learn how the power of AI/ML and IoT can be brought as close as possible to the challenging edge environments that provide data to create these insights. Also learn how AWS puts AI/ML in the hands of the largest-scale fisheries on the planet, and how organizations can leverage data to support more sustainable, resilient supply chains.

Get started with AWS computer vision services (AIM202)

This session provides an overview of AWS computer vision services and demonstrates how these pretrained and customizable ML capabilities can help you get started quickly—no ML expertise required. Learn how to deploy these models onto the device of your choice to run an inference locally or use cloud APIs for your specific computing needs. Learn first-hand how Shutterstock uses AWS computer vision services to create performance at scale for media analysis, content moderation, and quality inspection use cases.

Chalk talk sessions

Build an ML-powered demand planning system using Amazon Forecast (AIM310)

This chalk talk explores how you can use Amazon Forecast to build an ML-powered, fully automated demand planning system for your business or your multi-tenant SaaS platform without needing any ML expertise. Forecast automatically generates highly accurate forecasts using ML, explains the drivers behind those forecasts, and keeps your ML models always up to date to capture new trends.

Hello, is it conversational AI you’re looking for? (AIM305)

Customers calling in for support expect a personalized experience and a quick resolution to their issue. With chatbots, you can provide automated and human-like conversational experiences for your customers. In this chalk talk, discuss strategies to design personalized experiences using Amazon Lex and Amazon Polly. Explore how to design conversation paths, customize responses, integrate with your applications, and enable self-service use cases to scale your customer support functions.

Harness the power of ML to protect your business with Amazon Fraud Detector (AIM308)

How does more than 20 years of Amazon experience fighting fraud translate into an AI service that can help companies detect more online fraud faster? In this session, learn how Amazon Fraud Detector transforms raw data into highly accurate ML-based fraud detection models. Then, discover how the service does data preparation and validation, feature engineering, data enrichment, and model training and tuning. Finally, with actual customer examples across a wide range of industries and fraud use cases, find out how the service makes deployment easy.

Deep learning applications with PyTorch (AIM404)

By using PyTorch in Amazon SageMaker, you have a flexible deep learning framework combined with a fully managed ML solution that allows you to transition seamlessly from research prototyping to production deployment. In this session, hear from the PyTorch team on the latest features and library releases. Also, learn how to develop with PyTorch using SageMaker for key use cases, such as using a BERT model for natural language processing (NLP) and instance segmentation for fine-grained computer vision with distributed training and model parallelism.

Explore, analyze, and process data using Jupyter notebooks (AIM324)

Before using a dataset to train a model, you need to explore, analyze, and preprocess it. During this chalk talk, learn how to use Amazon SageMaker to complete these tasks in a Jupyter notebook environment.

Machine learning at the edge with Amazon SageMaker (AIM410)

More ML models are being deployed on edge devices such as robots and smart cameras. In this chalk talk, dive into building computer vision (CV) applications at the edge for predictive maintenance, industrial IoT, and more. Learn how to operate and monitor multiple models across a fleet of devices. Also walk through the process to build and train CV models with Amazon SageMaker and how to package, deploy, and manage them with SageMaker Edge Manager. The chalk talk also covers edge device setup and MLOps lifecycle with over-the-air model updates and data capture to the cloud.

Builders’ sessions

Build and deploy a custom computer vision model in 60 minutes (AIM314)

Amazon Rekognition Custom Labels is an automated ML feature that enables customers to quickly train their own custom models for detecting business-specific objects and scenes from images—no ML expertise is required. In this builders’ session, learn how to use Amazon Rekognition Custom Labels to build and deploy your own computer vision model and push it to an application to showcase inference on images from a camera feed. Bring your laptop and an AWS account.

Easily label training data for machine learning at scale (AIM406)

Join this session to learn how to create high-quality labels while also reducing your data labeling costs by up to 70%. This builders’ session walks through the different workflow options in Amazon SageMaker Ground Truth, such as automatic labeling and assistive labeling features like auto-segmentation and image label verification. It also details how to build highly accurate training datasets for company brand logos, so you can build an ML model for company brand protection.

Workshop sessions

Develop your ML project with Amazon SageMaker (AIM402)

In this workshop, learn how to develop a full ML project end to end with Amazon SageMaker. Start with data exploration and analysis, data cleansing, and feature engineering with SageMaker Data Wrangler. Then, store features in SageMaker Feature Store, extract features for training with SageMaker Processing, train a model with SageMaker training, and then deploy it with SageMaker hosting. Also, learn how to use SageMaker Studio as an IDE and SageMaker Pipelines for orchestrating the ML workflow.

End-to-end 3D machine learning on Amazon SageMaker (AIM414)

As lidar sensors become more accessible and cost-effective, customers increasingly use point cloud data in new spaces like autonomous driving, robotics, and augmented reality. The growing availability of lidar sensors has increased use of point cloud data for ML tasks like 3D object detection, segmentation, object synthesis, and reconstruction. This workshop features Amazon SageMaker Ground Truth and explains how to ingest raw 3D point cloud data, label it, train a 3D object detection model, and deploy the model. The model in this session will be trained on an autonomous vehicle dataset.

AI workflow automation for document processing (AIM316)

Mortgage packets have hundreds of documents in various layouts and formats. With ML, you can set up a document-processing pipeline to automate mortgage application workflows like extracting text from W2s, paystubs, and deeds; classifying documents; or using custom entity recognition to pull out specific data points. In this workshop, learn various ways to use optical character recognition (OCR), NLP, and human-in-the-loop services to build a document-processing pipeline to automate mortgage applications—saving time, reducing manual effort, and improving ROI for your organization.

Boost the value of your media content with ML-powered search (AIM315)

Consumers rely on content not only to entertain but also to educate and facilitate purchasing decisions. To meet this demand, media content production is exploding. However, the process of producing, distributing, and monetizing this content is often complex, expensive, and time-consuming. Applying artificial intelligence and ML capabilities like image and video analysis, audio transcription, machine translation, and text analytics can solve many of these problems. In this workshop, utilize ML to extract detailed metadata from content and make it available for search, discovery, and editing use cases.

Instantly detect and diagnose anomalies within your business data (AIM302)

Anomalies in business data often indicate potential issues or even opportunities. ML can help you detect anomalies and then act on them proactively. In this workshop, learn how Amazon Lookout for Metrics automatically detects anomalies across thousands of metrics in near-real time and reduces false alarms.

Join the first annual AWS BugBust re:Invent Challenge and help set a Guinness record

The largest code fixing challenge is here! Python and Java developers of all skill levels can compete to fix software bugs, earn points, and win an array of prizes including Amazon Echo Dots, hoodies, and the grand prize of $1,500 USD. As you bust bugs, you also become part of an attempt to set the record for the largest bug fixing challenge with the Guinness World Records. All registered participants who fix even one bug will receive exclusive prizes and a certificate from AWS and Guinness to commemorate their contribution. Let the bug busting begin! You can join the challenge virtually or in-person at the AWS BugBust Hub in the main expo. Register now for free.

AWS DeepRacer: The fastest way to get rolling with machine learning

Developers of all skill levels from beginners to experts can get hands-on with ML by using AWS DeepRacer to train models in a cloud-based 3D racing simulator. Racers from virtually anywhere in the world can compete in the AWS DeepRacer League, the first global autonomous racing league driven by reinforcement learning. The race is on now! Sign in to AWS DeepRacer and compete in the AWS re:Invent Open for prizes and glory now through December 31, 2021. Tune in to the AWS DeepRacer League Championships on Twitch November 19 and 22 to see the 40 fastest developers of the 2021 season compete live. Learn from the best as they vie for a chance to advance to the Championship Cup Finale during Swami Sivasubramanian’s keynote on December 1, where they will race for their shot at $20,000 USD in cash prizes and the right to hoist the Championship Cup!

For those attending re:Invent in Las Vegas, don’t miss out on the opportunity to take your model from Sim2Real (simulation to reality) on the AWS DeepRacer Speedway inside the content hub at Caesar’s Forum. Upload your model and race a 1/18th scale autonomous RC car on a physical track. Stop by Tuesday afternoon to participate in the livestreamed wildcard race for a chance to win a trip back for re:Invent 2022. No model? No problem! The all-new AWS DeepRacer Arcade is available in the expo, where you can literally get in the driver’s seat and take the wheel in this educational racing game. Take a spin on the virtual track and then compete against a featured AWS DeepRacer autonomous model in this arcade racing experience, with prizes and giveaways galore. Shift into the fast lane on your ML learning journey with AWS DeepRacer.

Head over to the re:Invent portal to build your schedule so you’re ready to hit the ground running. Be sure to stop by and talk to our experts at the AI/ML booth, or chat with the speakers after sessions. We can’t wait to see you in Las Vegas!


About the Authors

Andrea Youmans is a Product Marketing Manager on the AI Services team at AWS. Over the past 10 years she has worked in the technology and telecommunications industries, focused on developer storytelling and marketing campaigns. In her spare time, she enjoys heading to the lake with her husband and Aussie dog Oakley, tasting wine and enjoying a movie from time to time.

Read More

AWS AI/ML Community attendee guides to AWS re:Invent 2021

The AWS AI/ML Community has compiled a series of session guides to AWS re:Invent 2021 to help you get the most out of re:Invent this year. The guides cover four distinct categories relevant to AI/ML. With a number of our guide authors attending re:Invent virtually, you will find a balance between virtually accessible sessions and sessions available in-person.

The AWS AI/ML Community is a vibrant group of developers, data scientists, researchers, and business decision-makers that dive deep into artificial intelligence and machine learning (ML) concepts, contribute with real-world experiences, and collaborate on building projects together.

Community guides for developers new to machine learning

From AWS ML Hero Mike Chambers: AWS reInvent 2021: How To, tips, and my session selection (video). In this video—which should be required viewing for anyone new to re:Invent—Mike dives deep, beyond simply recommending sessions, with loads of tips and advice for how to make the most of your re:Invent experience—in-person or virtual.

AWS ML Hero Cyrus Wong’s top five sessions AI/ML newbies should attend! For folks new to ML on AWS, spend your time learning and making use of Amazon AI/ML services with Cyrus’s top five re:Invent sessions.

AWS re:Invent 2021: How to maximize your in-person learning experience as a new Machine Learning practitioner, from AWS ML Community Builder Martin Paradesi. For those attending re:Invent in-person this year, check out Martin’s guide for five sessions curated for new ML practitioners.

From our new Egypt-based AWS ML Hero Salah Elhossiny: Top 5 AWS ML Sessions to Attend at AWS re:Invent 2021. For those new to AWS ML, spend your time learning and using Amazon SageMaker with the best five AWS re:Invent sessions to help you get started quickly!

Community guides for AI/ML developers

AWS ML Hero Juv Chan’s top five recommendations for AI/ML builders and architects. Juv, a Sr. Cloud AI Engineer/Architect, ML Hero, and re:Invent Championship Cup 2019 finalist, shares his top five session picks and can’t miss photos from re:Invent 2019.

Top 5 Sessions for AI/ML Developers at AWS re:Invent 2021, from AWS ML Community Builder Brooke Jamieson. For those attending re:Invent virtually this year, check out Brooke’s guide.

AWS ML Hero Tomasz Ptak’s AWS re:Invent 2021 schedule. Tomasz shares his session picks plus tips and advice for making the most of your re:Invent experience.

Production-grade ML re:Invent 2021 sessions guide, from AWS ML Community Builder Kyle Gallatin. Builder Kyle Gallatin shares five ML talks skewed towards his interests in scalable, production-grade ML.

Community guides for MLOps developers

AWS ML Hero Rustem Feyzkhanov’s top MLOps breakout sessions to look forward to at re:Invent 2021. Rustem shares seven sessions to help you stay in the loop of MLOps in the AWS Cloud.

AWS ML Community Builder Phil Basford’s must-see sessions. For those interested in MLOps, ML architecture, edge computing, or data analytics, see Phil’s guide and his tips on how to have fun in Vegas and at home for those attending virtually.

Community guides for ML data scientists

AWS ML Hero Philipp Schmid’s remote guide for your virtual re:Invent 2021, focused on NLP and machine learning. Attending remotely from Germany, Hugging Face ML engineer and AWS ML Hero Philipp Schmid offers an in-depth guide.

AWS ML Community Builder Pier Paolo Ippolito’s top five suggestions for ML data scientists. Pier, a data scientist at SAS and editor at Towards Data Science, shares his top five picks curated for technical ML builders.

Other AWS ML Community guides worth exploring

AWS ML Hero Kesha Williams’s Machine Learning Attendee Guide 2021. The official AWS Hero guide from Kesha dives deep across all session categories. Check this guide out for a full walkthrough of how to build your schedule, and the ultimate deep dive into Kesha’s ML session picks.

Lastly, we have a unique in-depth guide from AWS ML Community Builder Janos Tolgyesi. Learn how to fight climate change with ML skills and make the Earth a better place with ML at re:Invent 2021. Janos shares his session picks and a bonus session suggestion for those interested in beer, plus personalized recommendations!

Whether you’re attending in-person or virtually this year, we hope these recommendations and advice from the AWS ML Community help you make the most of your re:Invent experience. Have a great re:Invent!


About the Author

Paxton Hall is a Marketing Program Manager for the AWS AI/ML Community on the AI/ML Education team at AWS. He has worked in retail and experiential marketing for the past 7 years, focused on developing communities and marketing campaigns. Out of the office, he’s passionate about public lands access and conservation, and enjoys backcountry skiing, climbing, biking, and hiking throughout Washington’s Cascade mountains.

Read More

Understand drivers that influence your forecasts with explainability impact scores in Amazon Forecast

We’re excited to launch explainability impact scores in Amazon Forecast, which help you understand the factors that impact your forecasts for specific items and time durations of interest. Forecast is a managed service for developers that uses machine learning (ML) to generate more accurate demand forecasts, without requiring any ML experience. To increase forecast model accuracy, you can add additional information or attributes such as price, promotion, category details, holidays, or weather information to your forecasting model, but you may not know how each attribute influences your forecast. With today’s launch, you can now understand how each attribute impacts your forecasted values using the explainability feature, which we discuss in this post.

ML-based forecasting models, which are more accurate than heuristic rules or human judgment, can drive significant improvement in revenue and customer experience. However, business leaders often lose trust in technology when they see forecasted numbers drastically differing from their intuition, and may find it hard to trust ML systems. Because demand planning decisions have a high impact on the business, business leaders may end up overriding forecasts rather than taking the model predictions at face value for critical business decisions, because they don’t understand why those forecasts were generated and what factors are influencing them to be higher or lower. This can compromise forecast accuracy, and you may lose the benefit of ML forecasting.

Amazon Forecast now provides explainability, which gives you item-level insights across your preferred time duration. Having a certain level of understanding of why a particular forecast value is high or low at a particular time is helpful for decision-making and for building trust and confidence in your ML solutions. Explainability reports include impact scores, which help you understand how each attribute in your training data contributes to either increasing or decreasing your forecasted values for specific items. In addition, you can choose to understand explainability for your entire forecast horizon or for specific time durations. Explainability removes the need to run multiple manual analyses to understand past sales and external variable trends in order to explain forecast results.

How to interpret explainability impact scores

Explainability helps you better understand how the attributes, such as price, category, or holidays, in your datasets impact your forecast values. Forecast uses a metric called impact scores to quantify the relative impact of each attribute and determine whether they generally increase or decrease forecast values.

Impact scores measure the relative impact attributes have on forecast values. For example, if the price attribute has an impact score that is twice as large as the brand_id attribute, you can conclude that the price of an item has twice as much impact on forecast values as the product brand. Impact scores also provide information on whether an attribute increases or decreases the forecasted value. A negative impact score reflects that the attribute tends to decrease the value of the forecast.

Impact scores measure the relative impact of attributes to each other, not the absolute impact. If an attribute has a low impact score, that doesn’t necessarily mean that it has a low impact on forecast values; it means that it has a lower impact on forecast values than other attributes used by the predictor. If you change attributes in your predictor, the impact scores may differ, and the attribute with the low impact score may have a higher score relative to other attributes. Also, you can’t use impact scores to determine whether particular attributes improve the model accuracy or not. You should use accuracy metrics such as weighted quantile loss and others provided by Forecast to assess predictor accuracy.

The following example explainability report graph shows the relative impact of different attributes on the forecasted value of item_id 1 across all the time points in the forecast horizon. We see that the relative impact is in the following order: Price has the highest impact, followed by StoreLocation, then Promo and Holiday_US. Price has the highest influence on item_id 1 and tends to increase the forecast value. StoreLocation has the second highest impact on item_id 1 but tends to decrease the forecast value. Because Promo has an impact score close to 0.2, Price has roughly five times more impact than Promo on the forecasted value of item_id 1, and both attributes tend to increase the forecast value. Holiday_US has an impact score of 0, which means that this attribute doesn’t increase or decrease the forecast value for item_id 1 relative to other attributes.

The following image shows an example of the explainability report export file with the impact scores for specific time series and time points as well as aggregated scores across those time series and time points.

Generate explainability impact scores

In this section, we walk through how to generate explainability impact scores for your forecasts using the Forecast console. To use the new CreateExplainability API, refer to the notebook in our GitHub repo or review Forecast Explainability.
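If you prefer to script the same steps, a minimal boto3 sketch might look like the following; the ARNs, S3 path, schema attributes, and time range are placeholders, and the exact request shape should be verified against the CreateExplainability API reference rather than taken as the notebook’s code.

import boto3

forecast = boto3.client("forecast")

# Placeholder resource names -- replace with your own forecast ARN, S3 path,
# and IAM role. The DataSource, Schema, and time range are only needed when
# the corresponding granularity is set to SPECIFIC.
response = forecast.create_explainability(
    ExplainabilityName="my-forecast-explainability",
    ResourceArn="arn:aws:forecast:us-east-1:123456789012:forecast/my_forecast",
    ExplainabilityConfig={
        "TimeSeriesGranularity": "SPECIFIC",  # score only the uploaded time series
        "TimePointGranularity": "SPECIFIC",   # score only the chosen time range
    },
    DataSource={
        "S3Config": {
            "Path": "s3://my-bucket/explainability/time_series_selection.csv",
            "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3AccessRole",
        }
    },
    Schema={
        "Attributes": [
            {"AttributeName": "item_id", "AttributeType": "string"},
            {"AttributeName": "store_location", "AttributeType": "string"},
        ]
    },
    StartDateTime="2021-09-01T00:00:00",
    EndDateTime="2021-09-21T00:00:00",
)
print(response["ExplainabilityArn"])

The console walkthrough below covers the same workflow step by step.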

  1. On the Forecast console, create a dataset group. Upload your historical demand dataset as the target time series, followed by any related time series or item metadata that you want to use for more accurate forecasting and for which you’re interested in seeing explainability impact scores.

  2. In the navigation pane, under your dataset, choose Predictors.
  3. Choose Train new predictor.

Forecast uses AutoPredictor as the default training option. No further action is needed from you, but remember that only forecasts generated from a predictor trained with AutoPredictor are eligible for generating explainability impact scores later.

  4. Now that your model is trained, choose Forecasts in the navigation pane.
  5. Choose Create a forecast.
  6. Select your trained predictor to create a forecast.
  7. Choose Insights in the navigation pane.
  8. Choose Create explainability.

  9. Choose the forecast that you want to generate explainability impact scores for.
  10. Choose whether you want to see impact scores for all the time points in the forecast horizon or only for a specific time duration.

You can specify up to 500 consecutive time points per explainability report.

  11. Upload the list of specific time series for which you want to see explainability impact scores.

A time series is a unique combination of item ID and dimension. You can specify up to 50 time series per Forecast explainability.

  12. Specify the schema of the CSV file that you have uploaded.
  13. Choose Create explainability.

It takes less than an hour to generate the explainability impact scores.

  14. When the job status is active, choose the explainability job to view the impact score.

Here you can review the explainability impact score graph. You can use the controls at the top of the graph to drill down to specific time series or time points or view at an aggregated level.

  15. To export all the impact scores, choose Create explainability export in the Explainability exports section.
  16. Provide the export details and choose Create explainability export.

The export is saved in an Amazon Simple Storage Service (Amazon S3) bucket that you specify.

  17. When the export is complete, navigate to your S3 bucket to review the explainability report CSV file.

The following is an example of your explainability export CSV file. Depending on how large your dataset is, multiple files may be exported.

Aggregate explainability impact scores for category level analysis

You may want to review explainability for a group of items together, and that group can contain more than 50 items. For example, a grocery retailer might be interested in understanding what is driving the forecasts for all their fruits and vegetables, and this category may consist of more than 50 SKUs in their data. However, Forecast lets you specify up to 50 time series per explainability job. If you have more than 50 time series, you need to run the explainability job multiple times with different items in each job and then combine the results.

The explainability export file provides two types of impact scores: normalized impact scores and raw impact scores. Raw impact scores are based on Shapley values and aren’t scaled or bounded. Normalized impact scores scale the raw scores to a value between -1 and 1. Raw impact scores are useful for combining and comparing scores across different explainability resources. To analyze a larger group, aggregate the raw impact scores of all the time series across multiple explainability jobs, then compare the aggregated scores to find the relative influence of each attribute. You can view an example of how to do so by following the notebook in our GitHub repo.
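As a rough sketch (not the notebook’s exact code), the pandas snippet below shows one way to do that aggregation; the file location and the column names (attribute and raw impact score) are assumptions about the export format, so adjust them to match the headers in your own export files.

import glob
import pandas as pd

# Load every export CSV produced by the individual explainability jobs.
# The column names used below are assumptions -- check your export headers.
frames = [pd.read_csv(path) for path in glob.glob("explainability-exports/*.csv")]
scores = pd.concat(frames, ignore_index=True)

# Aggregate the raw (unnormalized) impact scores per attribute across all
# time series and jobs, then rank attributes by mean absolute impact.
aggregated = (
    scores.groupby("attribute")["raw_impact_score"]
    .agg(["mean", "sum", "count"])
    .sort_values("mean", key=lambda s: s.abs(), ascending=False)
)
print(aggregated)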

Conclusion

Forecast now provides explainability for specific items and time durations of interest. With the explainability feature, you can understand how each attribute impacts your forecasted values. To learn more, review Forecast Explainability and the notebook in our GitHub repo. If you are interested in aggregated explainability for all your items at the predictor level, review our blog post on using the CreateAutoPredictor API. Explainability is available in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code/low-code ML services. On the side, she frequently advises startups and loves training her dog with new tricks.

Dima Fayyad is a Software Development Engineer on the Amazon Forecast team. She is passionate about machine learning and AI and is currently working on large-scale distributed systems in the forecasting space. In her free time, she enjoys exploring different cuisines, traveling, and skiing.

Youngsuk Park is a Machine Learning Scientist at AWS AI and Amazon Forecast. His research lies in the interplay between machine learning, optimization, and decision-making, with over 10 publications in top-notch ML/AI venues. Before joining AWS, he obtained a PhD from Stanford University.

Shannon Killingsworth is a UX Designer for Amazon Forecast. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.

Read More