Build an air quality anomaly detector using Amazon Lookout for Metrics

Today, air pollution is a familiar environmental issue that creates severe respiratory and heart conditions, which pose serious health threats. Acid rain, depletion of the ozone layer, and global warming are also adverse consequences of air pollution. There is a need for intelligent monitoring and automation in order to prevent severe health issues and in extreme cases life-threatening situations. Air quality is measured using the concentration of pollutants in the air. Identifying symptoms early and controlling the pollutant level before it’s dangerous is crucial. The process of identifying the air quality and the anomaly in the weight of pollutants, and quickly diagnosing the root cause, is difficult, costly, and error-prone.

The process of applying AI and machine learning (ML)-based solutions to find data anomalies involves a lot of complexity in ingesting, curating, and preparing data in the right format and then optimizing and maintaining the effectiveness of these ML models over long periods of time. This has been one of the barriers to quickly implementing and scaling the adoption of ML capabilities.

This post shows you how to use an integrated solution with Amazon Lookout for Metrics and Amazon Kinesis Data Firehose to break these barriers by quickly and easily ingesting streaming data, and subsequently detecting anomalies in the key performance indicators of your interest.

Lookout for Metrics automatically detects and diagnoses anomalies (outliers from the norm) in business and operational data. It’s a fully managed ML service that uses specialized ML models to detect anomalies based on the characteristics of your data. For example, trends and seasonality are two characteristics of time series metrics in which threshold-based anomaly detection doesn’t work. Trends are continuous variations (increases or decreases) in a metric’s value. On the other hand, seasonality is periodic patterns that occur in a system, usually rising above a baseline and then decreasing again. You don’t need ML experience to use Lookout for Metrics.

We demonstrate a common air quality monitoring scenario, in which we detect anomalies in the pollutant concentration in the air. By the end of this post, you’ll learn how to use these managed services from AWS to help prevent health issues and global warming. You can apply this solution to other use cases for better environment management, such as detecting anomalies in water quality, land quality, and power consumption patterns, to name a few.

Solution overview

The architecture consists of three functional blocks:

  • Wireless sensors placed at strategic locations to sense the concentration level of carbon monoxide (CO), sulfur dioxide (SO2), and nitrogen dioxide(NO2) in the air
  • Streaming data ingestion and storage
  • Anomaly detection and notification

The solution provides a fully automated data path from the sensors all the way to a notification being raised to the user. You can also interact with the solution using the Lookout for Metrics UI in order to analyze the identified anomalies.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites before you can proceed with solution. For this post, we use the us-east-1 Region.

  1. Download the Python script (publish.py) and data file from the GitHub repo.
  2. Open the live_data.csv file in your preferred editor and replace the dates to be today’s and tomorrow’s date. For example, if today’s date is July 8, 2022, then replace 2022-03-25 with 2022-07-08. Keep the format the same. This is required to simulate sensor data for the current date using the IoT simulator script.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket and a folder named air-quality. Create a subfolder inside air-quality named historical. For instructions, see Creating a folder.
  4. Upload the live_data.csv file in the root S3 bucket and historical_data.json in the historical folder.
  5. Create an AWS Cloud9 development environment, which we use to run the Python simulator program to create sensor data for this solution.

Ingest and transform data using AWS IoT Core and Kinesis Data Firehose

We use a Kinesis Data Firehose delivery stream to ingest the streaming data from AWS IoT Core and deliver it to Amazon S3. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, choose Direct PUT.
  3. For Destination, choose Amazon S3.
  4. For Delivery stream name, enter a name for your delivery stream.
  5. For S3 bucket, enter the bucket you created as a prerequisite.
  6. Enter values for S3 bucket prefix and S3 bucket error output prefix.One of the key points to note is the configuration of the custom prefix that is configured for the Amazon S3 destination. This prefix pattern makes sure that the data is created in the S3 bucket as per the prefix hierarchy expected by Lookout for Metrics. (More on this later in this post.) For more information about custom prefixes, see Custom Prefixes for Amazon S3 Objects.
  7. For Buffer interval, enter 60.
  8. Choose Create or update IAM role.
  9. Choose Create delivery stream.

    Now we configure AWS IoT Core and run the air quality simulator program.
  10. On the AWS IoT Core console, create an AWS IoT policy called admin.
  11. In the navigation pane under Message Routing, choose Rules.
  12. Choose Create rule.
  13. Create a rule with the Kinesis Data Firehose(firehose) action.
    This sends data from an MQTT message to a Kinesis Data Firehose delivery stream.
  14. Choose Create.
  15. Create an AWS IoT thing with name Test-Thing and attach the policy you created.
  16. Download the certificate, public key, private key, device certificate, and root CA for AWS IoT Core.
  17. Save each of the downloaded files to the certificates subdirectory that you created earlier.
  18. Upload publish.py to the iot-test-publish folder.
  19. On the AWS IoT Core console, in the navigation pane, choose Settings.
  20. Under Custom endpoint, copy the endpoint.
    This AWS IoT Core custom endpoint URL is personal to your AWS account and Region.
  21. Replace customEndpointUrl with your AWS IoT Core custom endpoint URL, certificates with the name of certificate, and Your_S3_Bucket_Name with your S3 bucket name.
    Next, you install pip and the AWS IoT SDK for Python.
  22. Log in to AWS Cloud9 and create a working directory in your development environment. For example: aq-iot-publish.
  23. Create a subdirectory for certificates in your new working directory. For example: certificates.
  24. Install the AWS IoT SDK for Python v2 by running the following from the command line.
    pip install awsiotsdk

  25. To test the data pipeline, run the following command:
    python3 publish.py

You can see the payload in the following screenshot.

Finally, the data is delivered to the specified S3 bucket in the prefix structure.

The data of the files is as follows:

  • {"TIMESTAMP":"2022-03-20 00:00","LOCATION_ID":"B-101","CO":2.6,"SO2":62,"NO2":57}
  • {"TIMESTAMP":"2022-03-20 00:05","LOCATION_ID":"B-101","CO":3.9,"SO2":60,"NO2":73}

The timestamps show that each file contains data for 5-minute intervals.

With minimal code, we have now ingested the sensor data, created an input stream from the ingested data, and stored the data in an S3 bucket based on the requirements for Lookout for Metrics.

In the following sections, we take a deeper look at the constructs within Lookout for Metrics, and how easy it is to configure these concepts using the Lookout for Metrics console.

Create a detector

A detector is a Lookout for Metrics resource that monitors a dataset and identifies anomalies at a predefined frequency. Detectors use ML to find patterns in data and distinguish between expected variations in data and legitimate anomalies. To improve its performance, a detector learns more about your data over time.

In our use case, the detector analyzes data from the sensor every 5 minutes.

To create the detector, navigate to the Lookout for Metrics console and choose Create detector. Provide the name and description (optional) for the detector, along with the interval of 5 minutes.

Your data is encrypted by default with a key that AWS owns and manages for you. You can also configure if you want to use a different encryption key from the one that is used by default.

Now let’s point this detector to the data that you want it to run anomaly detection on.

Create a dataset

A dataset tells the detector where to find your data and which metrics to analyze for anomalies. To create a dataset, complete the following steps:

  1. On the Amazon Lookout for Metrics console, navigate to your detector.
  2. Choose Add a dataset.
  3. For Name, enter a name (for example, air-quality-dataset).
  4. For Datasource, choose your data source (for this post, Amazon S3).
  5. For Detector mode, select your mode (for this post, Continuous).

With Amazon S3, you can create a detector in two modes:

    • Backtest – This mode is used to find anomalies in historical data. It needs all records to be consolidated in a single file.
    • Continuous – This mode is used to detect anomalies in live data. We use this mode with our use case because we want to detect anomalies as we receive air pollutant data from the air monitoring sensor.
  1. Enter the S3 path for the live S3 folder and path pattern.
  2. For Datasource interval, choose 5 minute intervals.If you have historical data from which the detector can learn patterns, you can provide it during this configuration. The data is expected to be in the same format that you use to perform a backtest. Providing historical data speeds up the ML model training process. If this isn’t available, the continuous detector waits for sufficient data to be available before making inferences.
  3. For this post, we already have historical data, so select Use historical data.
  4. Enter the S3 path of historical_data.json.
  5. For File format, select JSON lines.

At this point, Lookout for Metrics accesses the data source and validates whether it can parse the data. If the parsing is successful, it gives you a “Validation successful” message and takes you to the next page, where you configure measures, dimensions, and timestamps.

Configure measures, dimensions, and timestamps

Measures define KPIs that you want to track anomalies for. You can add up to five measures per detector. The fields that are used to create KPIs from your source data must be of numeric format. The KPIs can be currently defined by aggregating records within the time interval by doing a SUM or AVERAGE.

Dimensions give you the ability to slice and dice your data by defining categories or segments. This allows you to track anomalies for a subset of the whole set of data for which a particular measure is applicable.

In our use case, we add three measures, which calculate the AVG of the objects seen in the 5-minute interval, and have only one dimension, for which pollutants concentration is measured.

Every record in the dataset must have a timestamp. The following configuration allows you to choose the field that represents the timestamp value and also the format of the timestamp.

The next page allows you to review all the details you added and then save and activate the detector.

The detector then begins learning the data streaming into the data source. At this stage, the status of the detector changes to Initializing.

It’s important to note the minimum amount of data that is required before Lookout for Metrics can start detecting anomalies. For more information about requirements and limits, see Lookout for Metrics quotas.

With minimal configuration, you have created your detector, pointed it at a dataset, and defined the metrics that you want Lookout for Metrics to find anomalies in.

Visualize anomalies

Lookout for Metrics provides a rich UI experience for users who want to use the AWS Management Console to analyze the anomalies being detected. It also provides the capability to query the anomalies via APIs.

Let’s look at an example anomaly detected from our air quality data use case. The following screenshot shows an anomaly detected in CO concentration in the air at the designated time and date with a severity score of 93. It also shows the percentage contribution of the dimension towards the anomaly. In this case, 100% contribution comes from the location ID B-101 dimension.

Create alerts

Lookout for Metrics allows you to send alerts using a variety of channels. You can configure the anomaly severity score threshold at which the alerts must be triggered.

In our use case, we configure alerts to be sent to an Amazon Simple Notification Service (Amazon SNS) channel, which in turn sends an SMS. The following screenshots show the configuration details.

You can also use an alert to trigger automations using AWS Lambda functions in order to drive API-driven operations on AWS IoT Core.

Conclusion

In this post, we showed you how easy to use Lookout for Metrics and Kinesis Data Firehose to remove the undifferentiated heavy lifting involved in managing the end-to-end lifecycle of building ML-powered anomaly detection applications. This solution can help you accelerate your ability to find anomalies in key business metrics and allow you focus your efforts on growing and improving your business.

We encourage you to learn more by visiting the Amazon Lookout for Metrics Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Read More

Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library

Fraudulent activities severely impact many industries, such as e-commerce, social media, and financial services. Frauds could cause a significant loss for businesses and consumers. American consumers reported losing more than $5.8 billion to frauds in 2021, up more than 70% over 2020. Many techniques have been used to detect fraudsters—rule-based filters, anomaly detection, and machine learning (ML) models, to name a few.

In real-world data, entities often involve rich relationships with other entities. Such a graph structure can provide valuable information for anomaly detection. For example, in the following figure, users are connected via shared entities such as Wi-Fi IDs, physical locations, and phone numbers. Due to the large number of unique values of these entities, like phone numbers, it’s difficult to use them in the traditional feature-based models—for example, one-hot encoding all phone numbers wouldn’t be viable. But such relationships could help predict whether a user is a fraudster. If a user has shared several entities with a known fraudster, the user is more likely a fraudster.

Recently, graph neural network (GNN) has become a popular method for fraud detection. GNN models can combine both graph structure and attributes of nodes or edges, such as users or transactions, to learn meaningful representations to distinguish malicious users and events from legitimate ones. This capability is crucial for detecting frauds where fraudsters collude to hide their abnormal features but leave some traces of relations.

Current GNN solutions mainly rely on offline batch training and inference mode, which detect fraudsters after malicious events have happened and losses have occurred. However, catching fraudulent users and activities in real time is crucial for preventing losses. This is particularly true in business cases where there is only one chance to prevent fraudulent activities. For example, in some e-commerce platforms, account registration is wide open. Fraudsters can behave maliciously just once with an account and never use the same account again.

Predicting fraudsters in real time is important. Building such a solution, however, is challenging. Because GNNs are still new to the industry, there are limited online resources on converting GNN models from batch serving to real-time serving. Additionally, it’s challenging to construct a streaming data pipeline that can feed incoming events to a GNN real-time serving API. To the best of the authors’ knowledge, no reference architectures and examples are available for GNN-based real-time inference solutions as of this writing.

To help developers apply GNNs to real-time fraud detection, this post shows how to use Amazon Neptune, Amazon SageMaker, and the Deep Graph Library (DGL), among other AWS services, to construct an end-to-end solution for real-time fraud detection using GNN models.

We focus on four tasks:

  • Processing a tabular transaction dataset into a heterogeneous graph dataset
  • Training a GNN model using SageMaker
  • Deploying the trained GNN models as a SageMaker endpoint
  • Demonstrating real-time inference for incoming transactions

This post extends the previous work in Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library, which focuses on the first two tasks. You can refer to that post for more details on heterogeneous graphs, GNNs, and semi-supervised training of GNNs.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.

Solution overview

This solution contains two major parts.

The first part is a pipeline that processes the data, trains GNN models, and deploys the trained models. It uses AWS Glue to process the transaction data, and saves the processed data to both Amazon Neptune and Amazon Simple Storage Service (Amazon S3). Then, a SageMaker training job is triggered to train a GNN model on the data saved in Amazon S3 to predict whether a transaction is fraudulent. The trained model along with other assets are saved back to Amazon S3 upon the completion of the training job. Finally, the saved model is deployed as a SageMaker endpoint. The pipeline is orchestrated by AWS Step Functions, as shown in the following figure.

The second part of the solution implements real-time fraudulent transaction detection. It starts from a RESTful API that queries the graph database in Neptune to extract the subgraph related to an incoming transaction. It also has a web portal that can simulate business activities, generating online transactions with both fraudulent and legitimate ones. The web portal provides a live visualization of the fraud detection. This part uses Amazon CloudFront, AWS Amplify, AWS AppSync, Amazon API Gateway, Step Functions, and Amazon DocumentDB to rapidly build the web application. The following diagram illustrates the real-time inference process and web portal.

The implementation of this solution, along with an AWS CloudFormation template that can launch the architecture in your AWS account, is publicly available through the following GitHub repo.

Data processing

In this section, we briefly describe how to process an example dataset and convert it from raw tables into a graph with relations identified among different columns.

This solution uses the same dataset, the IEEE-CIS fraud dataset, as the previous post Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library. Therefore, the basic principle of the data process is the same. In brief, the fraud dataset includes a transactions table and an identities table, having nearly 500,000 anonymized transaction records along with contextual information (for example, devices used in transactions). Some transactions have a binary label, indicating whether a transaction is fraudulent. Our task is to predict which unlabeled transactions are fraudulent and which are legitimate.

The following figure illustrates the general process of how to convert the IEEE tables into a heterogeneous graph. We first extract two columns from each table. One column is always the transaction ID column, where we set each unique TransactionID as one node. Another column is picked from the categorical columns, such as the ProductCD and id_03 columns, where each unique category was set as a node. If a TransactionID and a unique category appear in the same row, we connect them with one edge. This way, we convert two columns in a table into one bipartite. Then we combine those bipartites along with the TransactionID nodes, where the same TransactionID nodes are merged into one unique node. After this step, we have a heterogeneous graph built from bipartites.

For the rest of the columns that aren’t used to build the graph, we join them together as the feature of the TransactionID nodes. TransactionID values that have the isFraud values are used as the label for model training. Based on this heterogeneous graph, our task becomes a node classification task of the TransactionID nodes. For more details on preparing the graph data for training GNNs, refer to the Feature extraction and Constructing the graph sections of the previous blog post.

The code used in this solution is available in src/scripts/glue-etl.py. You can also experiment with data processing through the Jupyter notebook src/sagemaker/01.FD_SL_Process_IEEE-CIS_Dataset.ipynb.

Instead of manually processing the data, as done in the previous post, this solution uses a fully automatic pipeline orchestrated by Step Functions and AWS Glue that supports processing huge datasets in parallel via Apache Spark. The Step Functions workflow is written in AWS Cloud Development Kit (AWS CDK). The following is a code snippet to create this workflow:

import { LambdaInvoke, GlueStartJobRun } from 'aws-cdk-lib/aws-stepfunctions-tasks';
    
    const parametersNormalizeTask = new LambdaInvoke(this, 'Parameters normalize', {
      lambdaFunction: parametersNormalizeFn,
      integrationPattern: IntegrationPattern.REQUEST_RESPONSE,
    });
    
    ...
    
    const dataProcessTask = new GlueStartJobRun(this, 'Data Process', {
      integrationPattern: IntegrationPattern.RUN_JOB,
      glueJobName: etlConstruct.jobName,
      timeout: Duration.hours(5),
      resultPath: '$.dataProcessOutput',
    });
    
    ...    
    
    const definition = parametersNormalizeTask
      .next(dataIngestTask)
      .next(dataCatalogCrawlerTask)
      .next(dataProcessTask)
      .next(hyperParaTask)
      .next(trainingJobTask)
      .next(runLoadGraphDataTask)
      .next(modelRepackagingTask)
      .next(createModelTask)
      .next(createEndpointConfigTask)
      .next(checkEndpointTask)
      .next(endpointChoice);

Besides constructing the graph data for GNN model training, this workflow also batch loads the graph data into Neptune to conduct real-time inference later on. This batch data loading process is demonstrated in the following code snippet:

from neptune_python_utils.endpoints import Endpoints
from neptune_python_utils.bulkload import BulkLoad

...

bulkload = BulkLoad(
        source=targetDataPath,
        endpoints=endpoints,
        role=args.neptune_iam_role_arn,
        region=args.region,
        update_single_cardinality_properties=True,
        fail_on_error=True)
        
load_status = bulkload.load_async()
status, json = load_status.status(details=True, errors=True)
load_status.wait()

GNN model training

After the graph data for model training is saved in Amazon S3, a SageMaker training job, which is only charged when the training job is running, is triggered to start the GNN model training process in the Bring Your Own Container (BYOC) mode. It allows you to pack your model training scripts and dependencies in a Docker image, which it uses to create SageMaker training instances. The BYOC method could save significant effort in setting up the training environment. In src/sagemaker/02.FD_SL_Build_Training_Container_Test_Local.ipynb, you can find details of the GNN model training.

Docker image

The first part of the Jupyter notebook file is the training Docker image generation (see the following code snippet):

*!* aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
image_name *=* 'fraud-detection-with-gnn-on-dgl/training'
*!* docker build -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

We used a PyTorch-based image for the model training. The Deep Graph Library (DGL) and other dependencies are installed when building the Docker image. The GNN model code in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder is copied to the image as well.

Because we process the transaction data into a heterogeneous graph, in this solution we choose the Relational Graph Convolutional Network (RGCN) model, which is specifically designed for heterogeneous graphs. Our RGCN model can train learnable embeddings for the nodes in heterogeneous graphs. Then, the learned embeddings are used as inputs of a fully connected layer for predicting the node labels.

Hyperparameters

To train the GNN, we need to define a few hyperparameters before the training process, such as the file names of the graph constructed, the number of layers of GNN models, the training epochs, the optimizer, the optimization parameters, and more. See the following code for a subset of the configurations:

edges *=* ","*.*join(map(*lambda* x: x*.*split("/")[*-*1], [file *for* file *in* processed_files *if* "relation" *in* file]))

params *=* {'nodes' : 'features.csv',
          'edges': edges,
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2}

For more information about all the hyperparameters and their default values, see estimator_fns.py in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder.

Model training with SageMaker

After the customized container Docker image is built, we use the preprocessed data to train our GNN model with the hyperparameters we defined. The training job uses the DGL, with PyTorch as the backend deep learning framework, to construct and train the GNN. SageMaker makes it easy to train GNN models with the customized Docker image, which is an input argument of the SageMaker estimator. For more information about training GNNs with the DGL on SageMaker, see Train a Deep Graph Network.

The SageMaker Python SDK uses Estimator to encapsulate training on SageMaker, which runs SageMaker-compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK. The following code snippet demonstrates training the model with SageMaker (either in a local environment or cloud instances):

from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.local import LocalSession

localSageMakerSession = LocalSession(boto_session=boto3.session.Session(region_name=current_region))
estimator = Estimator(image_uri=image_name,
                      role=sagemaker_exec_role,
                      instance_count=1,
                      instance_type='local',
                      hyperparameters=params,
                      output_path=output_path,
                      sagemaker_session=localSageMakerSession)

training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

After training, the GNN model’s performance on the test set is displayed like the following outputs. The RGCN model normally can achieve around 0.87 AUC and more than 95% accuracy. For a comparison of the RGCN model with other ML models, refer to the Results section of the previous blog post for more details.

Epoch 00099 | Time(s) 7.9413 | Loss 0.1023 | f1 0.3745
Metrics
Confusion Matrix:
                        labels positive labels negative
    predicted positive  4343            576
    predicted negative  13494           454019

    f1: 0.3817, precision: 0.8829, recall: 0.2435, acc: 0.9702, roc: 0.8704, pr: 0.4782, ap: 0.4782

Finished Model training

Upon the completion of model training, SageMaker packs the trained model along with other assets, including the trained node embeddings, into a ZIP file and then uploads it to a specified S3 location. Next, we discuss the deployment of the trained model for real-time fraudulent detection.

GNN model deployment

SageMaker makes the deployment of trained ML models simple. In this stage, we use the SageMaker PyTorchModel class to deploy the trained model, because our DGL model depends on PyTorch as the backend framework. You can find the deployment code in the src/sagemaker/03.FD_SL_Endpoint_Deployment.ipynb file.

Besides the trained model file and assets, SageMaker requires an entry point file for the deployment of a customized model. The entry point file is run and stored in the memory of an inference endpoint instance to respond to the inference request. In our case, the entry point file is the fd_sl_deployment_entry_point.py file in the src/sagemaker/FD_SL_DGL/code folder, which performs four major functions:

  • Receive requests and parse contents of requests to obtain the to-be-predicted nodes and their associated data
  • Convert the data to a DGL heterogeneous graph as input for the RGCN model
  • Perform the real-time inference via the trained RGCN model
  • Return the prediction results to the requester

Following SageMaker conventions, the first two functions are implemented in the input_fn method. See the following code (for simplicity, we delete some commentary code):

def input_fn(request_body, request_content_type='application/json'):

    # --------------------- receive request ------------------------------------------------ #
    input_data = json.loads(request_body)

    subgraph_dict = input_data['graph']
    n_feats = input_data['n_feats']
    target_id = input_data['target_id']

    graph, new_n_feats, new_pred_target_id = recreate_graph_data(subgraph_dict, n_feats, target_id)

    return (graph, new_n_feats, new_pred_target_id)

The constructed DGL graph and features are then passed to the predict_fn method to fulfill the third function. predict_fn takes two input arguments: the outputs of input_fn and the trained model. See the following code:

def predict_fn(input_data, model):

    # ---------------------  Inference ------------------------------------------------ #
    graph, new_n_feats, new_pred_target_id = input_data

    with th.no_grad():
        logits = model(graph, new_n_feats)
        res = logits[new_pred_target_id].cpu().detach().numpy()

    return res[1]

The model used in perdict_fn is created by the model_fn method when the endpoint is called the first time. The function model_fn loads the saved model file and associated assets from the model_dir argument and the SageMaker model folder. See the following code:

def model_fn(model_dir):

    # ------------------ Loading model -------------------
    ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size = 
    initialize_arguments(os.path.join(BASE_PATH, 'metadata.pkl'))

    rgcn_model = HeteroRGCN(ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size)

    stat_dict = th.load('model.pth')

    rgcn_model.load_state_dict(stat_dict)

    return rgcn_model

The output of the predict_fn method is a list of two numbers, indicating the logits for class 0 and class 1, where 0 means legitimate and 1 means fraudulent. SageMaker takes this list and passes it to an inner method called output_fn to complete the final function.

To deploy our GNN model, we first wrap the GNN model into a SageMaker PyTorchModel class with the entry point file and other parameters (the path of the saved ZIP file, the PyTorch framework version, the Python version, and so on). Then we call its deploy method with instance settings. See the following code:

env = {
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'
}

print(f'Use model {repackged_model_path}')

sagemakerSession = sm.session.Session(boto3.session.Session(region_name=current_region))
fd_sl_model = PyTorchModel(model_data=repackged_model_path, 
                           role=sagemaker_exec_role,
                           entry_point='./FD_SL_DGL/code/fd_sl_deployment_entry_point.py',
                           framework_version='1.6.0',
                           py_version='py3',
                           predictor_cls=JSONPredictor,
                           env=env,
                           sagemaker_session=sagemakerSession)
                           
fd_sl_predictor *=* fd_sl_model*.*deploy(instance_type*=*'ml.c5.4xlarge',
                                     initial_instance_count*=*1,)

The preceding procedures and code snippets demonstrate how to deploy your GNN model as an online inference endpoint from a Jupyter notebook. However, for production, we recommend using the previously mentioned MLOps pipeline orchestrated by Step Functions for the entire workflow, including processing data, training the model, and deploying an inference endpoint. The entire pipeline is implemented by an AWS CDK application, which can be easily replicated in different Regions and accounts.

Real-time inference

When a new transaction arrives, to perform real-time prediction, we need to complete four steps:

  1. Node and edge insertion – Extract the transaction’s information such as the TransactionID and ProductCD as nodes and edges, and insert the new nodes into the existing graph data stored at the Neptune database.
  2. Subgraph extraction – Set the to-be-predicted transaction node as the center node, and extract a n-hop subgraph according to the GNN model’s input requirements.
  3. Feature extraction – For the nodes and edges in the subgraph, extract their associated features.
  4. Call the inference endpoint – Pack the subgraph and features into the contents of a request, then send the request to the inference endpoint.

In this solution, we implement a RESTful API to achieve real-time fraudulent predication described in the preceding steps. See the following pseudo-code for real-time predictions. The full implementation is in the complete source code file.

For prediction in real time, the first three steps require lower latency. Therefore, a graph database is an optimal choice for these tasks, particularly for the subgraph extraction, which could be achieved efficiently with graph database queries. The underline functions that support the pseudo-code are based on Neptune’s gremlin queries.

def handler(event, context):
    
    graph_input = GraphModelClient(endpoints)
    
    # Step 1: node and edge insertion
    trans_dict, identity_dict, target_id, transaction_value_cols, union_li_cols = 
        load_data_from_event(event, transactions_id_cols, transactions_cat_cols, dummied_col)
    graph_input.insert_new_transaction_vertex_and_edge(trans_dict, identity_dict , target_id, vertex_type = 'Transaction')
    
    
    # Setp 2: subgraph extraction
    subgraph_dict, transaction_embed_value_dict = 
        graph_input.query_target_subgraph(target_id, trans_dict, transaction_value_cols, union_li_cols, dummied_col)
    

    # Step 3 & 4: feature extraction & call the inference endpoint
    transaction_id = int(target_id[(target_id.find('-')+1):])
    pred_prob = invoke_endpoint_with_idx(endpointname = ENDPOINT_NAME, target_id = transaction_id, subgraph_dict = subgraph_dict, n_feats = transaction_embed_value_dict)
       
    function_res = {
                    'id': event['transaction_data'][0]['TransactionID'],
                    'flag': pred_prob > MODEL_BTW,
                    'pred_prob': pred_prob
                    }
       
    return function_res

One caveat about real-time fraud detection using GNNs is the GNN inference mode. To fulfill real-time inference, we need to convert the GNN model inference from transductive mode to inductive mode. GNN models in transductive inference mode can’t make predictions for newly appeared nodes and edges, whereas in inductive mode, GNN models can handle new nodes and edges. A demonstration of the difference between transductive and inductive mode is shown in the following figure.

In transductive mode, predicted nodes and edges coexist with labeled nodes and edges during training. Models identify them before inference, and they could be inferred in training. Models in inductive mode are trained on the training graph but need to predict unseen nodes (those in red dotted circles on the right) with their associated neighbors, which might be new nodes, like the gray triangle node on the right.

Our RGCN model is trained and tested in transductive mode. It has access to all nodes in training, and also trained an embedding for each featureless node, such as IP address and card types. In the testing stage, the RGCN model uses these embeddings as node features to predict nodes in the test set. When we do real-time inference, however, some of the newly added featureless nodes have no such embeddings because they’re not in the training graph. One way to tackle this issue is to assign the mean of all embeddings in the same node type to the new nodes. In this solution, we adopt this method.

In addition, this solution provides a web portal (as seen in the following screenshot) to demonstrate real-time fraudulent predictions from business operators’ perspectives. It can generate the simulated online transactions, and provide a live visualization of detected fraudulent transaction information.

Clean up

When you’re finished exploring the solution, you can clean the resources to avoid incurring charges.

Conclusion

In this post, we showed how to build a GNN-based real-time fraud detection solution using SageMaker, Neptune, and the DGL. This solution has three major advantages:

  • It has good performance in terms of prediction accuracy and AUC metrics
  • It can perform real-time inference via a streaming MLOps pipeline and SageMaker endpoints
  • It automates the total deployment process with the provided CloudFormation template so that interested developers can easily test this solution with custom data in their account

For more details about the solution, see the GitHub repo.

After you deploy this solution, we recommend customizing the data processing code to fit your own data format and modify the real-time inference mechanism while keeping the GNN model unchanged. Note that we split the real-time inference into four steps without further optimization of the latency. These four steps take a few seconds to get a prediction on the demo dataset. We believe that optimizing the Neptune graph data schema design and queries for subgraph and feature extraction can significantly reduce the inference latency.


About the authors

Jian Zhang is an applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Mengxin Zhu is a manager of Solutions Architects at AWS, with a focus on designing and developing reusable AWS solutions. He has been engaged in software development for many years and has been responsible for several startup teams of various sizes. He also is an advocate of open-source software and was an Eclipse Committer.

Haozhu Wang is a research scientist at Amazon ML Solutions Lab, where he co-leads the Reinforcement Learning Vertical. He helps customers build advanced machine learning solutions with the latest research on graph learning, natural language processing, reinforcement learning, and AutoML. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Read More

Use computer vision to measure agriculture yield with Amazon Rekognition Custom Labels

In the agriculture sector, the problem of identifying and counting the amount of fruit on trees plays an important role in crop estimation. The concept of renting and leasing a tree is becoming popular, where a tree owner leases the tree every year before the harvest based on the estimated fruit yeild. The common practice of manually counting fruit is a time-consuming and labor-intensive process. It’s one of the hardest but most important tasks in order to obtain better results in your crop management system. This estimation of the amount of fruit and flowers helps farmers make better decisions—not only on only leasing prices, but also on cultivation practices and plant disease prevention.

This is where an automated machine learning (ML) solution for computer vision (CV) can help farmers. Amazon Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to your business.

Rekognition Custom Labels doesn’t require you to have any prior computer vision expertise. You can get started by simply uploading tens of images instead of thousands. If the images are already labeled, you can begin training a model in just a few clicks. If not, you can label them directly within the Rekognition Custom Labels console, or use Amazon SageMaker Ground Truth to label them. Rekognition Custom Labels uses transfer learning to automatically inspect the training data, select the right model framework and algorithm, optimize the hyperparameters, and train the model. When you’re satisfied with the model accuracy, you can start hosting the trained model with just one click.

In this post, we showcase how you can build an end-to-end solution using Rekognition Custom Labels to detect and count fruit to measure agriculture yield.

Solution overview

We create a custom model to detect fruit using the following steps:

  1. Label a dataset with images containing fruit using Amazon SageMaker Ground Truth.
  2. Create a project in Rekognition Custom Labels.
  3. Import your labeled dataset.
  4. Train the model.
  5. Test the new custom model using the automatically generated API endpoint.

Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end model development and inference process.

Prerequisites

To create an agriculture yield measuring model, you first need to prepare a dataset to train the model with. For this post, our dataset is composed of images of fruit. The following images show some examples.

We sourced our images from our own garden. You can download the image files from the GitHub repo.

For this post, we only use a handful of images to showcase the fruit yield use case. You can experiment further with more images.

To prepare your dataset, complete the following steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Create two folders inside this bucket, called raw_data and test_data, to store images for labeling and model testing.
  3. Choose Upload to upload the images to their respective folders from the GitHub repo.

The uploaded images aren’t labeled. You label the images in the following step.

Label your dataset using Ground Truth

To train the ML model, you need labeled images. Ground Truth provides an easy process to label the images. The labeling task is performed by a human workforce; in this post, you create a private workforce. You can use Amazon Mechanical Turk for labeling at scale.

Create a labeling workforce

Let’s first create our labeling workforce. Complete the following steps:

  1. On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling workforces.
  2. On the Private tab, choose Create private team.
  3. For Team name, enter a name for your workforce (for this post, labeling-team).
  4. Choose Create private team.
  5. Choose Invite new workers.

  6. In the Add workers by email address section, enter the email addresses of your workers. For this post, enter your own email address.
  7. Choose Invite new workers.

You have created a labeling workforce, which you use in the next step while creating a labeling job.

Create a Ground Truth labeling job

To great your labeling job, complete the following steps:

  1. On the SageMaker console, under Ground Truth, choose Labeling jobs.
  2. Choose Create labeling job.
  3. For Job name, enter fruits-detection.
  4. Select I want to specify a label attribute name different from the labeling job name.
  5. For Label attribute name¸ enter Labels.
  6. For Input data setup, select Automated data setup.
  7. For S3 location for input datasets, enter the S3 location of the images, using the bucket you created earlier (s3://{your-bucket-name}/raw-data/images/).
  8. For S3 location for output datasets, select Specify a new location and enter the output location for annotated data (s3://{your-bucket-name}/annotated-data/).
  9. For Data type, choose Image.
  10. Choose Complete data setup.
    This creates the image manifest file and updates the S3 input location path. Wait for the message “Input data connection successful.”
  11. Expand Additional configuration.
  12. Confirm that Full dataset is selected.
    This is used to specify whether you want to provide all the images to the labeling job or a subset of images based on filters or random sampling.
  13. For Task category, choose Image because this is a task for image annotation.
  14. Because this is an object detection use case, for Task selection, select Bounding box.
  15. Leave the other options as default and choose Next.
  16. Choose Next.

    Now you specify your workers and configure the labeling tool.
  17. For Worker types, select Private.For this post, you use an internal workforce to annotate the images. You also have the option to select a public contractual workforce (Amazon Mechanical Turk) or a partner workforce (Vendor managed) depending on your use case.
  18. For Private teams¸ choose the team you created earlier.
  19. Leave the other options as default and scroll down to Bounding box labeling tool.It’s essential to provide clear instructions here in the labeling tool for the private labeling team. These instructions acts as a guide for annotators while labeling. Good instructions are concise, so we recommend limiting the verbal or textual instructions to two sentences and focusing on visual instructions. In the case of image classification, we recommend providing one labeled image in each of the classes as part of the instructions.
  20. Add two labels: fruit and no_fruit.
  21. Enter detailed instructions in the Description field to provide instructions to the workers. For example: You need to label fruits in the provided image. Please ensure that you select label 'fruit' and draw the box around the fruit just to fit the fruit for better quality of label data. You also need to label other areas which look similar to fruit but are not fruit with label 'no_fruit'.You can also optionally provide examples of good and bad labeling images. You need to make sure that these images are publicly accessible.
  22. Choose Create to create the labeling job.

After the job is successfully created, the next step is to label the input images.

Start the labeling job

Once you have successfully created the job, the status of the job is InProgress. This means that the job is created and the private workforce is notified via email regarding the task assigned to them. Because you have assigned the task to yourself, you should receive an email with instructions to log in to the Ground Truth Labeling project.

  1. Open the email and choose the link provided.
  2. Enter the user name and password provided in the email.
    You may have to change the temporary password provided in the email to a new password after login.
  3. After you log in, select your job and choose Start working.

    You can use the provided tools to zoom in, zoom out, move, and draw bounding boxes in the images.
  4. Choose your label (fruit or no_fruit) and then draw a bounding box in the image to annotate it.
  5. When you’re finished, choose Submit.

Now you have correctly labeled images that will be used by the ML model for training.

Create your Amazon Rekognition project

To create your agriculture yield measuring project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter fruits_yield.
  4. Choose Create project.

You can also create a project on the Projects page. You can access the Projects page via the navigation pane. The next step is to provide images as input.

Import your dataset

To create your agriculture yield measuring model, you first need to import a dataset to train the model with. For this post, our dataset is already labeled using Ground Truth.

  1. For Import images, select Import images labeled by SageMaker Ground Truth.
  2. For Manifest file location, enter the S3 bucket location of your manifest file (s3://{your-bucket-name}/fruits_image/annotated_data/fruits-labels/manifests/output/output.manifest).
  3. Choose Create Dataset.

You can see your labeled dataset.

Now you have your input dataset for the ML model to start training on them.

Train your model

After you label your images, you’re ready to train your model.

  1. Choose Train model.
  2. For Choose project, choose your project fruits_yield.
  3. Choose Train Model.

Wait for the training to complete. Now you can start testing the performance for this trained model.

Test your model

Your agriculture yield measuring model is now ready for use and should be in the Running state. To test the model, complete the following steps:

Step 1 : Start the model

On your model details page, on the Use model tab, choose Start.

Rekognition Custom Labels also provides the API calls for starting, using, and stopping your model.

Step 2 : Test the model

When the model is in the Running state, you can use the sample testing script analyzeImage.py to count the amount of fruit in an image.

  1. Download this script from of the GitHub repo.
  2. Edit this file to replace the parameter bucket with your bucket name and model with your Amazon Rekognition model ARN.

We use the parameters photo and min_confidence as input for this Python script.

You can run this script locally using the AWS Command Line Interface (AWS CLI) or using AWS CloudShell. In our example, we ran the script via the CloudShell console. Note that CloudShell is free to use.

Make sure to install the required dependences using the command pip3 install boto3 PILLOW if not already installed.

  1. Upload the file analyzeImage.py to CloudShell using the Actions menu.

The following screenshot shows the output, which detected two fruits in the input image. We supplied 15.jpeg as the photo argument and 85 as the min_confidence value.

The following example shows image 15.jpeg with two bounding boxes.

You can run the same script with other images and experiment by changing the confidence score further.

Step 3:  Stop the model

When you’re done, remember to stop model to avoid incurring in unnecessary charges. On your model details page, on the Use model tab, choose Stop.

Clean up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when not in use. We need to delete the Amazon Rekognition project and the S3 bucket.

Delete the Amazon Rekognition project

To delete the Amazon Rekognition project, complete the following steps:

  1. On the Amazon Rekognition console, choose Use Custom Labels.
  2. Choose Get started.
  3. In the navigation pane, choose Projects.
  4. On the Projects page, select the project that you want to delete.
    1. Choose Delete.
      The Delete project dialog box appears.
  5. If the project has no associated models:
    1. Enter delete to delete the project.
    2. Choose Delete to delete the project.
  6. If the project has associated models or datasets:
    1. Enter delete to confirm that you want to delete the model and datasets.
    2. Choose either Delete associated models, Delete associated datasets, or Delete associated datasets and models, depending on whether the model has datasets, models, or both.

    Model deletion might take a while to complete. Note that the Amazon Rekognition console can’t delete models that are in training or running. Try again after stopping any running models that are listed, and wait until the models listed as training are complete. If you close the dialog box during model deletion, the models are still deleted. Later, you can delete the project by repeating this procedure.

  7. Enter delete to confirm that you want to delete the project.
  8. Choose Delete to delete the project.

Delete your S3 bucket

You first need to empty the bucket and then delete it.

  1. On the Amazon S3 console, choose Buckets.
  2. Select the bucket that you want to empty, then choose Empty.
  3. Confirm that you want to empty the bucket by entering the bucket name into the text field, then choose Empty.
  4. Choose Delete.
  5. Confirm that you want to delete the bucket by entering the bucket name into the text field, then choose Delete bucket.

Conclusion

In this post, we showed you how to create an object detection model with Rekognition Custom Labels. This feature makes it easy to train a custom model that can detect an object class without needing to specify other objects or losing accuracy in its results.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Sameer Goel is a Sr. Solutions Architect in the Netherlands, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from Boston, with a concentration in data science. He enjoys building and experimenting with AI/ML projects on Raspberry Pi. You can find him on LinkedIn.

Read More

Amazon SageMaker Automatic Model Tuning now supports SageMaker Training Instance Fallbacks

Today Amazon SageMaker announced the support of SageMaker training instance fallbacks for Amazon SageMaker Automatic Model Tuning (AMT) that allow users to specify alternative compute resource configurations.

SageMaker automatic model tuning finds the best version of a model by running many training jobs on your dataset using the ranges of hyperparameters that you specify for your algorithm. Then, it chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

Previously, users only had the option to specify a single instance configuration. This can lead to problems when the specified instance type isn’t available due to high utilization. In the past, your training jobs would fail with an InsufficientCapacityError (ICE). AMT used smart retries to avoid these failures in many cases, but it remained powerless in the face of sustained low capacity.

This new feature means that you can specify a list of instance configurations in the order of preference, such that your AMT job will automatically fallback to the next instance in the list in the event of low capacity.

In the following sections, we walk through these high-level steps for overcoming an ICE:

  1. Define HyperParameter Tuning Job Configuration
  2. Define the Training Job Parameters
  3. Create the Hyperparameter Tuning Job
  4. Describe training job

Define HyperParameter Tuning Job Configuration

The HyperParameterTuningJobConfig object describes the tuning job, including the search strategy, the objective metric used to evaluate training jobs, the ranges of the parameters to search, and the resource limits for the tuning job. This aspect wasn’t changed with today’s feature release. Nevertheless, we’ll go over it to give a complete example.

The ResourceLimits object specifies the maximum number of training jobs and parallel training jobs for this tuning job. In this example, we’re doing a random search strategy and specifying a maximum of 10 jobs (MaxNumberOfTrainingJobs) and 5 concurrent jobs (MaxParallelTrainingJobs) at a time.

The ParameterRanges object specifies the ranges of hyperparameters that this tuning job searches. We specify the name, as well as the minimum and maximum value of the hyperparameter to search. In this example, we define the minimum and maximum values for the Continuous and Integer parameter ranges and the name of the hyperparameter (“eta”, “max_depth”).

AmtTuningJobConfig={
            "Strategy": "Random",
            "ResourceLimits": {
              "MaxNumberOfTrainingJobs": 10,
              "MaxParallelTrainingJobs": 5
            },
            "HyperParameterTuningJobObjective": {
              "MetricName": "validation:rmse",
              "Type": "Minimize"
            },
            "ParameterRanges": {
              "CategoricalParameterRanges": [],
              "ContinuousParameterRanges": [
                {
                    "MaxValue": "1",
                    "MinValue": "0",
                    "Name": "eta"
                }
              ],
              "IntegerParameterRanges": [
                {
                  "MaxValue": "6",
                  "MinValue": "2",
                  "Name": "max_depth"
                }
              ]
            }
          }

Define the Training Job Parameters

In the training job definition, we define the input needed to run a training job using the algorithm that we specify. After the training completes, SageMaker saves the resulting model artifacts to an Amazon Simple Storage Service (Amazon S3) location that you specify.

Previously, we specified the instance type, count, and volume size under the ResourceConfig parameter. When the instance under this parameter was unavailable, an Insufficient Capacity Error (ICE) was thrown.

To avoid this, we now have the HyperParameterTuningResourceConfig parameter under the TrainingJobDefinition, where we specify a list of instances to fall back on. The format of these instances is the same as in the ResourceConfig. The job will traverse the list top-to-bottom to find an available instance configuration. If an instance is unavailable, then instead of an Insufficient Capacity Error (ICE), the next instance in the list is chosen, thereby overcoming the ICE.

TrainingJobDefinition={
            "HyperParameterTuningResourceConfig": {
      		"InstanceConfigs": [
            		{
                		"InstanceType": "ml.m4.xlarge",
                		"InstanceCount": 1,
                		"VolumeSizeInGB": 5
            		},
            		{
                		"InstanceType": "ml.m5.4xlarge",
                		"InstanceCount": 1,
                		"VolumeSizeInGB": 5
            		}
        		 ]
    		  },
            "AlgorithmSpecification": {
              "TrainingImage": "433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest",
              "TrainingInputMode": "File"
            },
            "InputDataConfig": [
              {
                "ChannelName": "train",
                "CompressionType": "None",
                "ContentType": "json",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<bucket>/test/"
                  }
                },
                "RecordWrapperType": "None"
              }
            ],
            "OutputDataConfig": {
              "S3OutputPath": "s3://<bucket>/output/"
            },
            "RoleArn": "arn:aws:iam::340308762637:role/service-role/AmazonSageMaker-ExecutionRole-20201117T142856",
            "StoppingCondition": {
              "MaxRuntimeInSeconds": 259200
            },
            "StaticHyperParameters": {
              "training_script_loc": "q2bn-sagemaker-test_6"
            },
          }

Run a Hyperparameter Tuning Job

In this step, we’re creating and running a hyperparameter tuning job with the hyperparameter tuning resource configuration defined above.

We initialize a SageMaker client and create the job by specifying the tuning config, training job definition, and a job name.

import boto3
sm = boto3.client('sagemaker')     
                    
sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-job-name",
    HyperParameterTuningJobConfig=AmtTuningJobConfig,
    TrainingJobDefinition=TrainingJobDefinition) 

Running an AMT job with the support of SageMaker training instance fallbacks empowers the user to overcome insufficient capacity by themselves, thereby reducing the chance of a job failure.

Describe training jobs

The following function lists all instance types used during the experiment and can be used to verify if an SageMaker training instance has automatically fallen back to the next instance in the list during resource allocation.

def list_instances(name):
    job_list = []
    instances = []
    def _get_training_jobs(name, next=None):
        if next:
            list = sm.list_training_jobs_for_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=name, NextToken=next)
        else:
            list = sm.list_training_jobs_for_hyper_parameter_tuning_job(
            HyperParameterTuningJobName=name)
        for jobs in list['TrainingJobSummaries']:
            job_list.append(jobs['TrainingJobName'])
        next = list.get('NextToken', None)
        if next:
            _get_training_jobs(name, next=next)
            pass
        else:
            pass
    _get_training_jobs(name)


    for job_name in job_list:
        ec2 = sm.describe_training_job(
        TrainingJobName=job_name
        )
        instances.append(ec2['ResourceConfig'])
    return instances

list_instances("my-job-name")  

The output of the function above displays all of the instances that the AMT job is using to run the experiment.

Conclusion

In this post, we demonstrated how you can now define a pool of instances on which your AMT experiment can fall back in the case of InsufficientCapacityError. We saw how to define a hyperparameter tuning job configuration, as well as specify the maximum number of training jobs and maximum parallel jobs. Finally, we saw how to overcome the InsufficientCapacityError by using the HyperParameterTuningResourceConfig parameter, which can be specified under the training job definition.

To learn more about AMT, visit Amazon SageMaker Automatic Model Tuning.


About the authors

Doug Mbaya is a Senior Partner Solution architect with a focus in data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solution in the cloud.

Kruthi Jayasimha Rao is a Partner Solutions Architect in the Scale-PSA team. Kruthi conducts technical validations for Partners enabling them progress in the Partner Path.

Bernard Jollans is a Software Development Engineer for Amazon SageMaker Automatic Model Tuning.

Read More

Create Amazon SageMaker model building pipelines and deploy R models using RStudio on Amazon SageMaker

In November 2021, in collaboration with RStudio PBC, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to Amazon SageMaker in just a few simple steps.

RStudio is one of the most popular IDEs among R developers for machine learning (ML) and data science projects. RStudio provides open-source tools for R and enterprise-ready professional software for data science teams to develop and share their work in the organization. Bringing RStudio on SageMaker not only gives you access to the AWS infrastructure in a fully managed way, but it also gives you native access to SageMaker.

In this post, we explore how you can use SageMaker features via RStudio on SageMaker to build a SageMaker pipeline that builds, processes, trains and registers your R models. We also explore using SageMaker for our model deployment, all using R.

Solution overview

The following diagram shows the architecture used in our solution. All code used in this example can be found in the GitHub repository.

Prerequisites

To follow this post, access to RStudio on SageMaker is required. If you’re new to using RStudio on SageMaker, review Get started with RStudio on Amazon SageMaker.

We also need to build custom Docker containers. We use AWS CodeBuild to build these containers, so you need a few extra AWS Identity and Access Management (IAM) permissions that you might not have by default. Before you proceed, make sure that the IAM role that you’re using has a trust policy with CodeBuild:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "codebuild.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The following permissions are also required in the IAM role to run a build in CodeBuild and push the image to Amazon Elastic Container Registry (Amazon ECR):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "codebuild:DeleteProject",
                "codebuild:CreateProject",
                "codebuild:BatchGetBuilds",
                "codebuild:StartBuild"
            ],
            "Resource": "arn:aws:codebuild:*:*:project/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogStream",
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:GetLogEvents",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:log-group:/aws/codebuild/sagemaker-studio*:log-stream:*"
        },
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ecr:CreateRepository",
                "ecr:BatchGetImage",
                "ecr:CompleteLayerUpload",
                "ecr:DescribeImages",
                "ecr:DescribeRepositories",
                "ecr:UploadLayerPart",
                "ecr:ListImages",
                "ecr:InitiateLayerUpload", 
                "ecr:BatchCheckLayerAvailability",
                "ecr:PutImage"
            ],
            "Resource": "arn:aws:ecr:*:*:repository/sagemaker-studio*"
        },
        {
            "Sid": "ReadAccessToPrebuiltAwsImages",
            "Effect": "Allow",
            "Action": [
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Resource": [
                "arn:aws:ecr:*:763104351884:repository/*",
                "arn:aws:ecr:*:217643126080:repository/*",
                "arn:aws:ecr:*:727897471807:repository/*",
                "arn:aws:ecr:*:626614931356:repository/*",
                "arn:aws:ecr:*:683313688378:repository/*",
                "arn:aws:ecr:*:520713654638:repository/*",
                "arn:aws:ecr:*:462105765813:repository/*"
            ]
        },
        {
            "Sid": "EcrAuthorizationTokenRetrieval",
            "Effect": "Allow",
            "Action": [
                "ecr:GetAuthorizationToken"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
              "s3:GetObject",
              "s3:DeleteObject",
              "s3:PutObject"
              ],
            "Resource": "arn:aws:s3:::sagemaker-*/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket"
            ],
            "Resource": "arn:aws:s3:::sagemaker*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:GetRole",
                "iam:ListRoles"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::*:role/*",
            "Condition": {
                "StringLikeIfExists": {
                    "iam:PassedToService": "codebuild.amazonaws.com"
                }
            }
        }
    ]
}

Create baseline R containers

To use our R scripts for processing and training on SageMaker processing and training jobs, we need to create our own Docker containers containing the necessary runtime and packages. The ability to use your own container, which is part of the SageMaker offering, gives great flexibility to developers and data scientists to use the tools and frameworks of their choice, with virtually no limitations.

We create two R-enabled Docker containers: one for processing jobs and one for training and deployment of our models. Processing data typically requires different packages and libraries than modeling, so it makes sense here to separate the two stages and use different containers.

For more details about using containers with SageMaker, refer to Using Docker containers with SageMaker.

The container used for processing is defined as follows:

FROM public.ecr.aws/docker/library/r-base:4.1.2

# Install tidyverse
RUN apt update && apt-get install -y --no-install-recommends 
    r-cran-tidyverse
    
RUN R -e "install.packages(c('rjson'))"

ENTRYPOINT ["Rscript"]

For this post, we use a simple and relatively lightweight container. Depending on your or your organization’s needs, you may want to pre-install several more R packages.

The container used for training and deployment is defined as follows:

FROM public.ecr.aws/docker/library/r-base:4.1.2

RUN apt-get -y update && apt-get install -y --no-install-recommends 
    wget 
    apt-transport-https 
    ca-certificates 
    libcurl4-openssl-dev 
    libsodium-dev
    
RUN apt-get update && apt-get install -y python3-dev python3-pip 
RUN pip3 install boto3
RUN R -e "install.packages(c('readr','plumber', 'reticulate'),dependencies=TRUE, repos='http://cran.rstudio.com/')"

ENV PATH="/opt/ml/code:${PATH}"

WORKDIR /opt/ml/code

COPY ./docker/run.sh /opt/ml/code/run.sh
COPY ./docker/entrypoint.R /opt/ml/entrypoint.R

RUN /bin/bash -c 'chmod +x /opt/ml/code/run.sh'

ENTRYPOINT ["/bin/bash", "run.sh"]

The RStudio kernel runs on a Docker container, so you won’t be able to build and deploy the containers using Docker commands directly on your Studio session. Instead, you can use the very useful library sagemaker-studio-image-build, which essentially outsources the task of building containers to CodeBuild.

With the following commands, we create two Amazon ECR registries: sagemaker-r-processing and sagemaker-r-train-n-deploy, and build the respective containers that we use later:

if (!py_module_available("sagemaker-studio-image-build")){py_install("sagemaker-studio-image-build", pip=TRUE)}
system("cd pipeline-example ; sm-docker build . —file ./docker/Dockerfile-train-n-deploy —repository sagemaker-r-train-and-deploy:1.0")
system("cd pipeline-example ; sm-docker build . —file ./docker/Dockerfile-processing —repository sagemaker-r-processing:1.0")

Create the pipeline

Now that the containers are built and ready, we can create the SageMaker pipeline that orchestrates the model building workflow. The full code of this is under the file pipeline.R in the repository. The easiest way to create a SageMaker pipeline is by using the SageMaker SDK, which is a Python library that we can access using the library reticulate. This gives us access to all functionalities of SageMaker without leaving the R language environment.

The pipeline we build has the following components:

  • Preprocessing step – This is a SageMaker processing job (utilizing the sagemaker-r-processing container) responsible for preprocessing the data and splitting the data into train and test datasets.
  • Training step – This is a SageMaker training job (utilizing the sagemaker-r-train-n-deploy container) responsible for training the model. In this example, we train a simple linear model.
  • Evaluation step – This is a SageMaker processing job (utilizing the sagemaker-r-processing container) responsible for performing evaluation of the model. Specifically in this example, we’re interested in the RMSE (root mean square error) on the test dataset, which we want to use in the next step as well as to associate with the model itself.
  • Conditional step – This is a conditional step, native to SageMaker pipelines, that allows us to branch the pipeline logic based on some parameter. In this case, the pipeline branches based on the value of RMSE that is calculated in the previous step.
  • Register model step – If the preceding conditional step is True, and the performance of the model is acceptable, then the model is registered in the model registry. For more information, refer to Register and Deploy Models with Model Registry.

First call the upsert function to create (or update) the pipeline and then call the start function to actually start running the pipeline:

source("pipeline-example/pipeline.R")
my_pipeline <- get_pipeline(input_data_uri=s3_raw_data)

upserted <- my_pipeline$upsert(role_arn=role_arn)
started <- my_pipeline$start()

Inspect the pipeline and model registry

One of the great things about using RStudio on SageMaker is that by being on the SageMaker platform, you can use the right tool for the right job and swiftly switch between them based on what you need to do.

As soon as we start the pipeline run, we can switch to Amazon SageMaker Studio, which allows us to visualize the pipeline and monitor current and previous runs of it.

To view details about the pipeline we just created and ran, navigate to the Studio IDE interface, choose SageMaker resources, choose Pipelines on the drop-down menu, and choose the pipeline (in this case, AbalonePipelineUsingR).

This reveals details of the pipeline, including all current and previous runs. Choose the latest one to bring up a visual representation of the pipeline, as per the following screenshot.

The DAG of the pipeline is created automatically by the service based on the data dependencies between steps, as well as based on custom added dependencies (not added any in this example).

When the run is complete, if successful, you should see all the steps turn green.

Choosing any of the individual steps brings up details about the specific step, including inputs, outputs, logs, and initial configuration settings. This allows you to drill down in the pipeline and investigate any failed steps.

Similarly, when the pipeline has finished running, a model is saved in the model registry. To access it, in the SageMaker resources pane, choose Model registry on the drop-down and choose your model. This reveals the list of registered models, as shown in the following screenshot. Choose one to open the details page for that particular model version.

After you open a version of the model, choose Update Status and Approve to approve the model.

At this point, based on your use case, you can set up this approval to trigger further actions, including the deployment of the model as per your needs.

Serverless deployment of the model

After you’ve trained and registered a model on SageMaker, deploying the model on SageMaker is straightforward.

There are several options of how you can deploy a model, such as batch inference, real-time endpoints, or asynchronous endpoints. Each method comes with several required configurations, including choosing the instance type you want as well as the scaling mechanism.

For this example, we use the recently announced feature of SageMaker, Serverless Inference (in preview mode as of the time of writing), to deploy our R model on a serverless endpoint. For this type of endpoint, we only define the amount of RAM that we want to be allocated to the model for inference, as well as the maximum number of allowed concurrent invocations of the model. SageMaker takes care of hosting the model and auto scaling as needed. You’re only charged for the exact number of seconds and data used by the model, with no cost for idle time.

You can deploy the model to a serverless endpoint with the following code:

model_package_arn <- 'ENTER_MODEL_PACKAGE_ARN_HERE'
model <- sagemaker$ModelPackage(
                        role=role_arn, 
                        model_package_arn=model_package_arn, 
                        sagemaker_session=session)
serverless_config <- sagemaker$serverless$ServerlessInferenceConfig(
                        memory_size_in_mb=1024L, 
                        max_concurrency=5L)
model$deploy(serverless_inference_config=serverless_config, 
             endpoint_name="serverless-r-abalone-endpoint")

If you see the error ClientError: An error occurred (ValidationException) when calling the CreateModel operation: Invalid approval status "PendingManualApproval" the model you want to deploy hasn’t been approved. Follow the steps from the previous section to approve your model.

Invoke the endpoint by sending a request to the HTTP endpoint we deployed, or instead use the SageMaker SDK. In the following code, we invoke the endpoint on some test data:

library(jsonlite)
x = list(features=format_csv(abalone_t[1:3,1:11]))
x = toJSON(x)

# test the endpoint
predictor <- sagemaker$predictor$Predictor(endpoint_name="serverless-r-abalone-endpoint", sagemaker_session=session)
predictor$predict(x)

The endpoint we invoked was a serverless endpoint, and as such we’re charged for the exact duration and data used. You might notice that the first time you invoke the endpoint it takes about a second to respond. This is due to the cold start time of the serverless endpoint. If you make another invocation soon after, the model returns the prediction in real time because it’s already warm.

When you finish experimenting with the endpoint, you can delete it with the following command:

predictor$delete_endpoint(delete_endpoint_config=TRUE)

Conclusion

In this post, we walked through the process of creating a SageMaker pipeline using R in our RStudio environment and showcased how to deploy our R model on a serverless endpoint on SageMaker using the SageMaker model registry.

With the combination of RStudio and SageMaker, you can now create and orchestrate complete end-to-end ML workflows on AWS using our preferred language of choice, R.

To dive deeper into this solution, I encourage you to review the source code of this solution, as well as other examples, on GitHub.


About the Author

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Read More