Build an air quality anomaly detector using Amazon Lookout for Metrics

Today, air pollution is a familiar environmental issue that creates severe respiratory and heart conditions, which pose serious health threats. Acid rain, depletion of the ozone layer, and global warming are also adverse consequences of air pollution. There is a need for intelligent monitoring and automation in order to prevent severe health issues and in extreme cases life-threatening situations. Air quality is measured using the concentration of pollutants in the air. Identifying symptoms early and controlling the pollutant level before it’s dangerous is crucial. The process of identifying the air quality and the anomaly in the weight of pollutants, and quickly diagnosing the root cause, is difficult, costly, and error-prone.

The process of applying AI and machine learning (ML)-based solutions to find data anomalies involves a lot of complexity in ingesting, curating, and preparing data in the right format and then optimizing and maintaining the effectiveness of these ML models over long periods of time. This has been one of the barriers to quickly implementing and scaling the adoption of ML capabilities.

This post shows you how to use an integrated solution with Amazon Lookout for Metrics and Amazon Kinesis Data Firehose to break these barriers by quickly and easily ingesting streaming data, and subsequently detecting anomalies in the key performance indicators of your interest.

Lookout for Metrics automatically detects and diagnoses anomalies (outliers from the norm) in business and operational data. It’s a fully managed ML service that uses specialized ML models to detect anomalies based on the characteristics of your data. For example, trends and seasonality are two characteristics of time series metrics in which threshold-based anomaly detection doesn’t work. Trends are continuous variations (increases or decreases) in a metric’s value. On the other hand, seasonality is periodic patterns that occur in a system, usually rising above a baseline and then decreasing again. You don’t need ML experience to use Lookout for Metrics.

We demonstrate a common air quality monitoring scenario, in which we detect anomalies in the pollutant concentration in the air. By the end of this post, you’ll learn how to use these managed services from AWS to help prevent health issues and global warming. You can apply this solution to other use cases for better environment management, such as detecting anomalies in water quality, land quality, and power consumption patterns, to name a few.

Solution overview

The architecture consists of three functional blocks:

  • Wireless sensors placed at strategic locations to sense the concentration level of carbon monoxide (CO), sulfur dioxide (SO2), and nitrogen dioxide(NO2) in the air
  • Streaming data ingestion and storage
  • Anomaly detection and notification

The solution provides a fully automated data path from the sensors all the way to a notification being raised to the user. You can also interact with the solution using the Lookout for Metrics UI in order to analyze the identified anomalies.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites before you can proceed with solution. For this post, we use the us-east-1 Region.

  1. Download the Python script (publish.py) and data file from the GitHub repo.
  2. Open the live_data.csv file in your preferred editor and replace the dates to be today’s and tomorrow’s date. For example, if today’s date is July 8, 2022, then replace 2022-03-25 with 2022-07-08. Keep the format the same. This is required to simulate sensor data for the current date using the IoT simulator script.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket and a folder named air-quality. Create a subfolder inside air-quality named historical. For instructions, see Creating a folder.
  4. Upload the live_data.csv file in the root S3 bucket and historical_data.json in the historical folder.
  5. Create an AWS Cloud9 development environment, which we use to run the Python simulator program to create sensor data for this solution.

Ingest and transform data using AWS IoT Core and Kinesis Data Firehose

We use a Kinesis Data Firehose delivery stream to ingest the streaming data from AWS IoT Core and deliver it to Amazon S3. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, choose Direct PUT.
  3. For Destination, choose Amazon S3.
  4. For Delivery stream name, enter a name for your delivery stream.
  5. For S3 bucket, enter the bucket you created as a prerequisite.
  6. Enter values for S3 bucket prefix and S3 bucket error output prefix.One of the key points to note is the configuration of the custom prefix that is configured for the Amazon S3 destination. This prefix pattern makes sure that the data is created in the S3 bucket as per the prefix hierarchy expected by Lookout for Metrics. (More on this later in this post.) For more information about custom prefixes, see Custom Prefixes for Amazon S3 Objects.
  7. For Buffer interval, enter 60.
  8. Choose Create or update IAM role.
  9. Choose Create delivery stream.

    Now we configure AWS IoT Core and run the air quality simulator program.
  10. On the AWS IoT Core console, create an AWS IoT policy called admin.
  11. In the navigation pane under Message Routing, choose Rules.
  12. Choose Create rule.
  13. Create a rule with the Kinesis Data Firehose(firehose) action.
    This sends data from an MQTT message to a Kinesis Data Firehose delivery stream.
  14. Choose Create.
  15. Create an AWS IoT thing with name Test-Thing and attach the policy you created.
  16. Download the certificate, public key, private key, device certificate, and root CA for AWS IoT Core.
  17. Save each of the downloaded files to the certificates subdirectory that you created earlier.
  18. Upload publish.py to the iot-test-publish folder.
  19. On the AWS IoT Core console, in the navigation pane, choose Settings.
  20. Under Custom endpoint, copy the endpoint.
    This AWS IoT Core custom endpoint URL is personal to your AWS account and Region.
  21. Replace customEndpointUrl with your AWS IoT Core custom endpoint URL, certificates with the name of certificate, and Your_S3_Bucket_Name with your S3 bucket name.
    Next, you install pip and the AWS IoT SDK for Python.
  22. Log in to AWS Cloud9 and create a working directory in your development environment. For example: aq-iot-publish.
  23. Create a subdirectory for certificates in your new working directory. For example: certificates.
  24. Install the AWS IoT SDK for Python v2 by running the following from the command line.
    pip install awsiotsdk

  25. To test the data pipeline, run the following command:
    python3 publish.py

You can see the payload in the following screenshot.

Finally, the data is delivered to the specified S3 bucket in the prefix structure.

The data of the files is as follows:

  • {"TIMESTAMP":"2022-03-20 00:00","LOCATION_ID":"B-101","CO":2.6,"SO2":62,"NO2":57}
  • {"TIMESTAMP":"2022-03-20 00:05","LOCATION_ID":"B-101","CO":3.9,"SO2":60,"NO2":73}

The timestamps show that each file contains data for 5-minute intervals.

With minimal code, we have now ingested the sensor data, created an input stream from the ingested data, and stored the data in an S3 bucket based on the requirements for Lookout for Metrics.

In the following sections, we take a deeper look at the constructs within Lookout for Metrics, and how easy it is to configure these concepts using the Lookout for Metrics console.

Create a detector

A detector is a Lookout for Metrics resource that monitors a dataset and identifies anomalies at a predefined frequency. Detectors use ML to find patterns in data and distinguish between expected variations in data and legitimate anomalies. To improve its performance, a detector learns more about your data over time.

In our use case, the detector analyzes data from the sensor every 5 minutes.

To create the detector, navigate to the Lookout for Metrics console and choose Create detector. Provide the name and description (optional) for the detector, along with the interval of 5 minutes.

Your data is encrypted by default with a key that AWS owns and manages for you. You can also configure if you want to use a different encryption key from the one that is used by default.

Now let’s point this detector to the data that you want it to run anomaly detection on.

Create a dataset

A dataset tells the detector where to find your data and which metrics to analyze for anomalies. To create a dataset, complete the following steps:

  1. On the Amazon Lookout for Metrics console, navigate to your detector.
  2. Choose Add a dataset.
  3. For Name, enter a name (for example, air-quality-dataset).
  4. For Datasource, choose your data source (for this post, Amazon S3).
  5. For Detector mode, select your mode (for this post, Continuous).

With Amazon S3, you can create a detector in two modes:

    • Backtest – This mode is used to find anomalies in historical data. It needs all records to be consolidated in a single file.
    • Continuous – This mode is used to detect anomalies in live data. We use this mode with our use case because we want to detect anomalies as we receive air pollutant data from the air monitoring sensor.
  1. Enter the S3 path for the live S3 folder and path pattern.
  2. For Datasource interval, choose 5 minute intervals.If you have historical data from which the detector can learn patterns, you can provide it during this configuration. The data is expected to be in the same format that you use to perform a backtest. Providing historical data speeds up the ML model training process. If this isn’t available, the continuous detector waits for sufficient data to be available before making inferences.
  3. For this post, we already have historical data, so select Use historical data.
  4. Enter the S3 path of historical_data.json.
  5. For File format, select JSON lines.

At this point, Lookout for Metrics accesses the data source and validates whether it can parse the data. If the parsing is successful, it gives you a “Validation successful” message and takes you to the next page, where you configure measures, dimensions, and timestamps.

Configure measures, dimensions, and timestamps

Measures define KPIs that you want to track anomalies for. You can add up to five measures per detector. The fields that are used to create KPIs from your source data must be of numeric format. The KPIs can be currently defined by aggregating records within the time interval by doing a SUM or AVERAGE.

Dimensions give you the ability to slice and dice your data by defining categories or segments. This allows you to track anomalies for a subset of the whole set of data for which a particular measure is applicable.

In our use case, we add three measures, which calculate the AVG of the objects seen in the 5-minute interval, and have only one dimension, for which pollutants concentration is measured.

Every record in the dataset must have a timestamp. The following configuration allows you to choose the field that represents the timestamp value and also the format of the timestamp.

The next page allows you to review all the details you added and then save and activate the detector.

The detector then begins learning the data streaming into the data source. At this stage, the status of the detector changes to Initializing.

It’s important to note the minimum amount of data that is required before Lookout for Metrics can start detecting anomalies. For more information about requirements and limits, see Lookout for Metrics quotas.

With minimal configuration, you have created your detector, pointed it at a dataset, and defined the metrics that you want Lookout for Metrics to find anomalies in.

Visualize anomalies

Lookout for Metrics provides a rich UI experience for users who want to use the AWS Management Console to analyze the anomalies being detected. It also provides the capability to query the anomalies via APIs.

Let’s look at an example anomaly detected from our air quality data use case. The following screenshot shows an anomaly detected in CO concentration in the air at the designated time and date with a severity score of 93. It also shows the percentage contribution of the dimension towards the anomaly. In this case, 100% contribution comes from the location ID B-101 dimension.

Create alerts

Lookout for Metrics allows you to send alerts using a variety of channels. You can configure the anomaly severity score threshold at which the alerts must be triggered.

In our use case, we configure alerts to be sent to an Amazon Simple Notification Service (Amazon SNS) channel, which in turn sends an SMS. The following screenshots show the configuration details.

You can also use an alert to trigger automations using AWS Lambda functions in order to drive API-driven operations on AWS IoT Core.

Conclusion

In this post, we showed you how easy to use Lookout for Metrics and Kinesis Data Firehose to remove the undifferentiated heavy lifting involved in managing the end-to-end lifecycle of building ML-powered anomaly detection applications. This solution can help you accelerate your ability to find anomalies in key business metrics and allow you focus your efforts on growing and improving your business.

We encourage you to learn more by visiting the Amazon Lookout for Metrics Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Read More

Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library

Fraudulent activities severely impact many industries, such as e-commerce, social media, and financial services. Frauds could cause a significant loss for businesses and consumers. American consumers reported losing more than $5.8 billion to frauds in 2021, up more than 70% over 2020. Many techniques have been used to detect fraudsters—rule-based filters, anomaly detection, and machine learning (ML) models, to name a few.

In real-world data, entities often involve rich relationships with other entities. Such a graph structure can provide valuable information for anomaly detection. For example, in the following figure, users are connected via shared entities such as Wi-Fi IDs, physical locations, and phone numbers. Due to the large number of unique values of these entities, like phone numbers, it’s difficult to use them in the traditional feature-based models—for example, one-hot encoding all phone numbers wouldn’t be viable. But such relationships could help predict whether a user is a fraudster. If a user has shared several entities with a known fraudster, the user is more likely a fraudster.

Recently, graph neural network (GNN) has become a popular method for fraud detection. GNN models can combine both graph structure and attributes of nodes or edges, such as users or transactions, to learn meaningful representations to distinguish malicious users and events from legitimate ones. This capability is crucial for detecting frauds where fraudsters collude to hide their abnormal features but leave some traces of relations.

Current GNN solutions mainly rely on offline batch training and inference mode, which detect fraudsters after malicious events have happened and losses have occurred. However, catching fraudulent users and activities in real time is crucial for preventing losses. This is particularly true in business cases where there is only one chance to prevent fraudulent activities. For example, in some e-commerce platforms, account registration is wide open. Fraudsters can behave maliciously just once with an account and never use the same account again.

Predicting fraudsters in real time is important. Building such a solution, however, is challenging. Because GNNs are still new to the industry, there are limited online resources on converting GNN models from batch serving to real-time serving. Additionally, it’s challenging to construct a streaming data pipeline that can feed incoming events to a GNN real-time serving API. To the best of the authors’ knowledge, no reference architectures and examples are available for GNN-based real-time inference solutions as of this writing.

To help developers apply GNNs to real-time fraud detection, this post shows how to use Amazon Neptune, Amazon SageMaker, and the Deep Graph Library (DGL), among other AWS services, to construct an end-to-end solution for real-time fraud detection using GNN models.

We focus on four tasks:

  • Processing a tabular transaction dataset into a heterogeneous graph dataset
  • Training a GNN model using SageMaker
  • Deploying the trained GNN models as a SageMaker endpoint
  • Demonstrating real-time inference for incoming transactions

This post extends the previous work in Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library, which focuses on the first two tasks. You can refer to that post for more details on heterogeneous graphs, GNNs, and semi-supervised training of GNNs.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.

Solution overview

This solution contains two major parts.

The first part is a pipeline that processes the data, trains GNN models, and deploys the trained models. It uses AWS Glue to process the transaction data, and saves the processed data to both Amazon Neptune and Amazon Simple Storage Service (Amazon S3). Then, a SageMaker training job is triggered to train a GNN model on the data saved in Amazon S3 to predict whether a transaction is fraudulent. The trained model along with other assets are saved back to Amazon S3 upon the completion of the training job. Finally, the saved model is deployed as a SageMaker endpoint. The pipeline is orchestrated by AWS Step Functions, as shown in the following figure.

The second part of the solution implements real-time fraudulent transaction detection. It starts from a RESTful API that queries the graph database in Neptune to extract the subgraph related to an incoming transaction. It also has a web portal that can simulate business activities, generating online transactions with both fraudulent and legitimate ones. The web portal provides a live visualization of the fraud detection. This part uses Amazon CloudFront, AWS Amplify, AWS AppSync, Amazon API Gateway, Step Functions, and Amazon DocumentDB to rapidly build the web application. The following diagram illustrates the real-time inference process and web portal.

The implementation of this solution, along with an AWS CloudFormation template that can launch the architecture in your AWS account, is publicly available through the following GitHub repo.

Data processing

In this section, we briefly describe how to process an example dataset and convert it from raw tables into a graph with relations identified among different columns.

This solution uses the same dataset, the IEEE-CIS fraud dataset, as the previous post Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library. Therefore, the basic principle of the data process is the same. In brief, the fraud dataset includes a transactions table and an identities table, having nearly 500,000 anonymized transaction records along with contextual information (for example, devices used in transactions). Some transactions have a binary label, indicating whether a transaction is fraudulent. Our task is to predict which unlabeled transactions are fraudulent and which are legitimate.

The following figure illustrates the general process of how to convert the IEEE tables into a heterogeneous graph. We first extract two columns from each table. One column is always the transaction ID column, where we set each unique TransactionID as one node. Another column is picked from the categorical columns, such as the ProductCD and id_03 columns, where each unique category was set as a node. If a TransactionID and a unique category appear in the same row, we connect them with one edge. This way, we convert two columns in a table into one bipartite. Then we combine those bipartites along with the TransactionID nodes, where the same TransactionID nodes are merged into one unique node. After this step, we have a heterogeneous graph built from bipartites.

For the rest of the columns that aren’t used to build the graph, we join them together as the feature of the TransactionID nodes. TransactionID values that have the isFraud values are used as the label for model training. Based on this heterogeneous graph, our task becomes a node classification task of the TransactionID nodes. For more details on preparing the graph data for training GNNs, refer to the Feature extraction and Constructing the graph sections of the previous blog post.

The code used in this solution is available in src/scripts/glue-etl.py. You can also experiment with data processing through the Jupyter notebook src/sagemaker/01.FD_SL_Process_IEEE-CIS_Dataset.ipynb.

Instead of manually processing the data, as done in the previous post, this solution uses a fully automatic pipeline orchestrated by Step Functions and AWS Glue that supports processing huge datasets in parallel via Apache Spark. The Step Functions workflow is written in AWS Cloud Development Kit (AWS CDK). The following is a code snippet to create this workflow:

import { LambdaInvoke, GlueStartJobRun } from 'aws-cdk-lib/aws-stepfunctions-tasks';
    
    const parametersNormalizeTask = new LambdaInvoke(this, 'Parameters normalize', {
      lambdaFunction: parametersNormalizeFn,
      integrationPattern: IntegrationPattern.REQUEST_RESPONSE,
    });
    
    ...
    
    const dataProcessTask = new GlueStartJobRun(this, 'Data Process', {
      integrationPattern: IntegrationPattern.RUN_JOB,
      glueJobName: etlConstruct.jobName,
      timeout: Duration.hours(5),
      resultPath: '$.dataProcessOutput',
    });
    
    ...    
    
    const definition = parametersNormalizeTask
      .next(dataIngestTask)
      .next(dataCatalogCrawlerTask)
      .next(dataProcessTask)
      .next(hyperParaTask)
      .next(trainingJobTask)
      .next(runLoadGraphDataTask)
      .next(modelRepackagingTask)
      .next(createModelTask)
      .next(createEndpointConfigTask)
      .next(checkEndpointTask)
      .next(endpointChoice);

Besides constructing the graph data for GNN model training, this workflow also batch loads the graph data into Neptune to conduct real-time inference later on. This batch data loading process is demonstrated in the following code snippet:

from neptune_python_utils.endpoints import Endpoints
from neptune_python_utils.bulkload import BulkLoad

...

bulkload = BulkLoad(
        source=targetDataPath,
        endpoints=endpoints,
        role=args.neptune_iam_role_arn,
        region=args.region,
        update_single_cardinality_properties=True,
        fail_on_error=True)
        
load_status = bulkload.load_async()
status, json = load_status.status(details=True, errors=True)
load_status.wait()

GNN model training

After the graph data for model training is saved in Amazon S3, a SageMaker training job, which is only charged when the training job is running, is triggered to start the GNN model training process in the Bring Your Own Container (BYOC) mode. It allows you to pack your model training scripts and dependencies in a Docker image, which it uses to create SageMaker training instances. The BYOC method could save significant effort in setting up the training environment. In src/sagemaker/02.FD_SL_Build_Training_Container_Test_Local.ipynb, you can find details of the GNN model training.

Docker image

The first part of the Jupyter notebook file is the training Docker image generation (see the following code snippet):

*!* aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
image_name *=* 'fraud-detection-with-gnn-on-dgl/training'
*!* docker build -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

We used a PyTorch-based image for the model training. The Deep Graph Library (DGL) and other dependencies are installed when building the Docker image. The GNN model code in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder is copied to the image as well.

Because we process the transaction data into a heterogeneous graph, in this solution we choose the Relational Graph Convolutional Network (RGCN) model, which is specifically designed for heterogeneous graphs. Our RGCN model can train learnable embeddings for the nodes in heterogeneous graphs. Then, the learned embeddings are used as inputs of a fully connected layer for predicting the node labels.

Hyperparameters

To train the GNN, we need to define a few hyperparameters before the training process, such as the file names of the graph constructed, the number of layers of GNN models, the training epochs, the optimizer, the optimization parameters, and more. See the following code for a subset of the configurations:

edges *=* ","*.*join(map(*lambda* x: x*.*split("/")[*-*1], [file *for* file *in* processed_files *if* "relation" *in* file]))

params *=* {'nodes' : 'features.csv',
          'edges': edges,
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2}

For more information about all the hyperparameters and their default values, see estimator_fns.py in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder.

Model training with SageMaker

After the customized container Docker image is built, we use the preprocessed data to train our GNN model with the hyperparameters we defined. The training job uses the DGL, with PyTorch as the backend deep learning framework, to construct and train the GNN. SageMaker makes it easy to train GNN models with the customized Docker image, which is an input argument of the SageMaker estimator. For more information about training GNNs with the DGL on SageMaker, see Train a Deep Graph Network.

The SageMaker Python SDK uses Estimator to encapsulate training on SageMaker, which runs SageMaker-compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK. The following code snippet demonstrates training the model with SageMaker (either in a local environment or cloud instances):

from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.local import LocalSession

localSageMakerSession = LocalSession(boto_session=boto3.session.Session(region_name=current_region))
estimator = Estimator(image_uri=image_name,
                      role=sagemaker_exec_role,
                      instance_count=1,
                      instance_type='local',
                      hyperparameters=params,
                      output_path=output_path,
                      sagemaker_session=localSageMakerSession)

training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

After training, the GNN model’s performance on the test set is displayed like the following outputs. The RGCN model normally can achieve around 0.87 AUC and more than 95% accuracy. For a comparison of the RGCN model with other ML models, refer to the Results section of the previous blog post for more details.

Epoch 00099 | Time(s) 7.9413 | Loss 0.1023 | f1 0.3745
Metrics
Confusion Matrix:
                        labels positive labels negative
    predicted positive  4343            576
    predicted negative  13494           454019

    f1: 0.3817, precision: 0.8829, recall: 0.2435, acc: 0.9702, roc: 0.8704, pr: 0.4782, ap: 0.4782

Finished Model training

Upon the completion of model training, SageMaker packs the trained model along with other assets, including the trained node embeddings, into a ZIP file and then uploads it to a specified S3 location. Next, we discuss the deployment of the trained model for real-time fraudulent detection.

GNN model deployment

SageMaker makes the deployment of trained ML models simple. In this stage, we use the SageMaker PyTorchModel class to deploy the trained model, because our DGL model depends on PyTorch as the backend framework. You can find the deployment code in the src/sagemaker/03.FD_SL_Endpoint_Deployment.ipynb file.

Besides the trained model file and assets, SageMaker requires an entry point file for the deployment of a customized model. The entry point file is run and stored in the memory of an inference endpoint instance to respond to the inference request. In our case, the entry point file is the fd_sl_deployment_entry_point.py file in the src/sagemaker/FD_SL_DGL/code folder, which performs four major functions:

  • Receive requests and parse contents of requests to obtain the to-be-predicted nodes and their associated data
  • Convert the data to a DGL heterogeneous graph as input for the RGCN model
  • Perform the real-time inference via the trained RGCN model
  • Return the prediction results to the requester

Following SageMaker conventions, the first two functions are implemented in the input_fn method. See the following code (for simplicity, we delete some commentary code):

def input_fn(request_body, request_content_type='application/json'):

    # --------------------- receive request ------------------------------------------------ #
    input_data = json.loads(request_body)

    subgraph_dict = input_data['graph']
    n_feats = input_data['n_feats']
    target_id = input_data['target_id']

    graph, new_n_feats, new_pred_target_id = recreate_graph_data(subgraph_dict, n_feats, target_id)

    return (graph, new_n_feats, new_pred_target_id)

The constructed DGL graph and features are then passed to the predict_fn method to fulfill the third function. predict_fn takes two input arguments: the outputs of input_fn and the trained model. See the following code:

def predict_fn(input_data, model):

    # ---------------------  Inference ------------------------------------------------ #
    graph, new_n_feats, new_pred_target_id = input_data

    with th.no_grad():
        logits = model(graph, new_n_feats)
        res = logits[new_pred_target_id].cpu().detach().numpy()

    return res[1]

The model used in perdict_fn is created by the model_fn method when the endpoint is called the first time. The function model_fn loads the saved model file and associated assets from the model_dir argument and the SageMaker model folder. See the following code:

def model_fn(model_dir):

    # ------------------ Loading model -------------------
    ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size = 
    initialize_arguments(os.path.join(BASE_PATH, 'metadata.pkl'))

    rgcn_model = HeteroRGCN(ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size)

    stat_dict = th.load('model.pth')

    rgcn_model.load_state_dict(stat_dict)

    return rgcn_model

The output of the predict_fn method is a list of two numbers, indicating the logits for class 0 and class 1, where 0 means legitimate and 1 means fraudulent. SageMaker takes this list and passes it to an inner method called output_fn to complete the final function.

To deploy our GNN model, we first wrap the GNN model into a SageMaker PyTorchModel class with the entry point file and other parameters (the path of the saved ZIP file, the PyTorch framework version, the Python version, and so on). Then we call its deploy method with instance settings. See the following code:

env = {
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'
}

print(f'Use model {repackged_model_path}')

sagemakerSession = sm.session.Session(boto3.session.Session(region_name=current_region))
fd_sl_model = PyTorchModel(model_data=repackged_model_path, 
                           role=sagemaker_exec_role,
                           entry_point='./FD_SL_DGL/code/fd_sl_deployment_entry_point.py',
                           framework_version='1.6.0',
                           py_version='py3',
                           predictor_cls=JSONPredictor,
                           env=env,
                           sagemaker_session=sagemakerSession)
                           
fd_sl_predictor *=* fd_sl_model*.*deploy(instance_type*=*'ml.c5.4xlarge',
                                     initial_instance_count*=*1,)

The preceding procedures and code snippets demonstrate how to deploy your GNN model as an online inference endpoint from a Jupyter notebook. However, for production, we recommend using the previously mentioned MLOps pipeline orchestrated by Step Functions for the entire workflow, including processing data, training the model, and deploying an inference endpoint. The entire pipeline is implemented by an AWS CDK application, which can be easily replicated in different Regions and accounts.

Real-time inference

When a new transaction arrives, to perform real-time prediction, we need to complete four steps:

  1. Node and edge insertion – Extract the transaction’s information such as the TransactionID and ProductCD as nodes and edges, and insert the new nodes into the existing graph data stored at the Neptune database.
  2. Subgraph extraction – Set the to-be-predicted transaction node as the center node, and extract a n-hop subgraph according to the GNN model’s input requirements.
  3. Feature extraction – For the nodes and edges in the subgraph, extract their associated features.
  4. Call the inference endpoint – Pack the subgraph and features into the contents of a request, then send the request to the inference endpoint.

In this solution, we implement a RESTful API to achieve real-time fraudulent predication described in the preceding steps. See the following pseudo-code for real-time predictions. The full implementation is in the complete source code file.

For prediction in real time, the first three steps require lower latency. Therefore, a graph database is an optimal choice for these tasks, particularly for the subgraph extraction, which could be achieved efficiently with graph database queries. The underline functions that support the pseudo-code are based on Neptune’s gremlin queries.

def handler(event, context):
    
    graph_input = GraphModelClient(endpoints)
    
    # Step 1: node and edge insertion
    trans_dict, identity_dict, target_id, transaction_value_cols, union_li_cols = 
        load_data_from_event(event, transactions_id_cols, transactions_cat_cols, dummied_col)
    graph_input.insert_new_transaction_vertex_and_edge(trans_dict, identity_dict , target_id, vertex_type = 'Transaction')
    
    
    # Setp 2: subgraph extraction
    subgraph_dict, transaction_embed_value_dict = 
        graph_input.query_target_subgraph(target_id, trans_dict, transaction_value_cols, union_li_cols, dummied_col)
    

    # Step 3 & 4: feature extraction & call the inference endpoint
    transaction_id = int(target_id[(target_id.find('-')+1):])
    pred_prob = invoke_endpoint_with_idx(endpointname = ENDPOINT_NAME, target_id = transaction_id, subgraph_dict = subgraph_dict, n_feats = transaction_embed_value_dict)
       
    function_res = {
                    'id': event['transaction_data'][0]['TransactionID'],
                    'flag': pred_prob > MODEL_BTW,
                    'pred_prob': pred_prob
                    }
       
    return function_res

One caveat about real-time fraud detection using GNNs is the GNN inference mode. To fulfill real-time inference, we need to convert the GNN model inference from transductive mode to inductive mode. GNN models in transductive inference mode can’t make predictions for newly appeared nodes and edges, whereas in inductive mode, GNN models can handle new nodes and edges. A demonstration of the difference between transductive and inductive mode is shown in the following figure.

In transductive mode, predicted nodes and edges coexist with labeled nodes and edges during training. Models identify them before inference, and they could be inferred in training. Models in inductive mode are trained on the training graph but need to predict unseen nodes (those in red dotted circles on the right) with their associated neighbors, which might be new nodes, like the gray triangle node on the right.

Our RGCN model is trained and tested in transductive mode. It has access to all nodes in training, and also trained an embedding for each featureless node, such as IP address and card types. In the testing stage, the RGCN model uses these embeddings as node features to predict nodes in the test set. When we do real-time inference, however, some of the newly added featureless nodes have no such embeddings because they’re not in the training graph. One way to tackle this issue is to assign the mean of all embeddings in the same node type to the new nodes. In this solution, we adopt this method.

In addition, this solution provides a web portal (as seen in the following screenshot) to demonstrate real-time fraudulent predictions from business operators’ perspectives. It can generate the simulated online transactions, and provide a live visualization of detected fraudulent transaction information.

Clean up

When you’re finished exploring the solution, you can clean the resources to avoid incurring charges.

Conclusion

In this post, we showed how to build a GNN-based real-time fraud detection solution using SageMaker, Neptune, and the DGL. This solution has three major advantages:

  • It has good performance in terms of prediction accuracy and AUC metrics
  • It can perform real-time inference via a streaming MLOps pipeline and SageMaker endpoints
  • It automates the total deployment process with the provided CloudFormation template so that interested developers can easily test this solution with custom data in their account

For more details about the solution, see the GitHub repo.

After you deploy this solution, we recommend customizing the data processing code to fit your own data format and modify the real-time inference mechanism while keeping the GNN model unchanged. Note that we split the real-time inference into four steps without further optimization of the latency. These four steps take a few seconds to get a prediction on the demo dataset. We believe that optimizing the Neptune graph data schema design and queries for subgraph and feature extraction can significantly reduce the inference latency.


About the authors

Jian Zhang is an applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Mengxin Zhu is a manager of Solutions Architects at AWS, with a focus on designing and developing reusable AWS solutions. He has been engaged in software development for many years and has been responsible for several startup teams of various sizes. He also is an advocate of open-source software and was an Eclipse Committer.

Haozhu Wang is a research scientist at Amazon ML Solutions Lab, where he co-leads the Reinforcement Learning Vertical. He helps customers build advanced machine learning solutions with the latest research on graph learning, natural language processing, reinforcement learning, and AutoML. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Read More

Rax: Composable Learning-to-Rank Using JAX

Ranking is a core problem across a variety of domains, such as search engines, recommendation systems, or question answering. As such, researchers often utilize learning-to-rank (LTR), a set of supervised machine learning techniques that optimize for the utility of an entire list of items (rather than a single item at a time). A noticeable recent focus is on combining LTR with deep learning. Existing libraries, most notably TF-Ranking, offer researchers and practitioners the necessary tools to use LTR in their work. However, none of the existing LTR libraries work natively with JAX, a new machine learning framework that provides an extensible system of function transformations that compose: automatic differentiation, JIT-compilation to GPU/TPU devices and more.

Today, we are excited to introduce Rax, a library for LTR in the JAX ecosystem. Rax brings decades of LTR research to the JAX ecosystem, making it possible to apply JAX to a variety of ranking problems and combine ranking techniques with recent advances in deep learning built upon JAX (e.g., T5X). Rax provides state-of-the-art ranking losses, a number of standard ranking metrics, and a set of function transformations to enable ranking metric optimization. All this functionality is provided with a well-documented and easy to use API that will look and feel familiar to JAX users. Please check out our paper for more technical details.

Learning-to-Rank Using Rax
Rax is designed to solve LTR problems. To this end, Rax provides loss and metric functions that operate on batches of lists, not batches of individual data points as is common in other machine learning problems. An example of such a list is the multiple potential results from a search engine query. The figure below illustrates how tools from Rax can be used to train neural networks on ranking tasks. In this example, the green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant. A neural network is used to predict a relevancy score for each item, then these items are sorted by these scores to produce a ranking. A Rax ranking loss incorporates the entire list of scores to optimize the neural network, improving the overall ranking of the items. After several iterations of stochastic gradient descent, the neural network learns to score the items such that the resulting ranking is optimal: relevant items are placed at the top of the list and non-relevant items at the bottom.

Using Rax to optimize a neural network for a ranking task. The green items (B, F) are very relevant, the yellow items (C, E) are somewhat relevant and the red items (A, D) are not relevant.

Approximate Metric Optimization
The quality of a ranking is commonly evaluated using ranking metrics, e.g., the normalized discounted cumulative gain (NDCG). An important objective of LTR is to optimize a neural network so that it scores highly on ranking metrics. However, ranking metrics like NDCG can present challenges because they are often discontinuous and flat, so stochastic gradient descent cannot directly be applied to these metrics. Rax provides state-of-the-art approximation techniques that make it possible to produce differentiable surrogates to ranking metrics that permit optimization via gradient descent. The figure below illustrates the use of rax.approx_t12n, a function transformation unique to Rax, which allows for the NDCG metric to be transformed into an approximate and differentiable form.

Using an approximation technique from Rax to transform the NDCG ranking metric into a differentiable and optimizable ranking loss (approx_t12n and gumbel_t12n).

First, notice how the NDCG metric (in green) is flat and discontinuous, making it hard to optimize using stochastic gradient descent. By applying the rax.approx_t12n transformation to the metric, we obtain ApproxNDCG, an approximate metric that is now differentiable with well-defined gradients (in red). However, it potentially has many local optima — points where the loss is locally optimal, but not globally optimal — in which the training process can get stuck. When the loss encounters such a local optimum, training procedures like stochastic gradient descent will have difficulty improving the neural network further.

To overcome this, we can obtain the gumbel-version of ApproxNDCG by using the rax.gumbel_t12n transformation. This gumbel version introduces noise in the ranking scores which causes the loss to sample many different rankings that may incur a non-zero cost (in blue). This stochastic treatment may help the loss escape local optima and often is a better choice when training a neural network on a ranking metric. Rax, by design, allows the approximate and gumbel transformations to be freely used with all metrics that are offered by the library, including metrics with a top-k cutoff value, like recall or precision. In fact, it is even possible to implement your own metrics and transform them to obtain gumbel-approximate versions that permit optimization without any extra effort.

Ranking in the JAX Ecosystem
Rax is designed to integrate well in the JAX ecosystem and we prioritize interoperability with other JAX-based libraries. For example, a common workflow for researchers that use JAX is to use TensorFlow Datasets to load a dataset, Flax to build a neural network, and Optax to optimize the parameters of the network. Each of these libraries composes well with the others and the composition of these tools is what makes working with JAX both flexible and powerful. For researchers and practitioners of ranking systems, the JAX ecosystem was previously missing LTR functionality, and Rax fills this gap by providing a collection of ranking losses and metrics. We have carefully constructed Rax to function natively with standard JAX transformations such as jax.jit and jax.grad and various libraries like Flax and Optax. This means that users can freely use their favorite JAX and Rax tools together.

Ranking with T5
While giant language models such as T5 have shown great performance on natural language tasks, how to leverage ranking losses to improve their performance on ranking tasks, such as search or question answering, is under-explored. With Rax, it is possible to fully tap this potential. Rax is written as a JAX-first library, thus it is easy to integrate it with other JAX libraries. Since T5X is an implementation of T5 in the JAX ecosystem, Rax can work with it seamlessly.

To this end, we have an example that demonstrates how Rax can be used in T5X. By incorporating ranking losses and metrics, it is now possible to fine-tune T5 for ranking problems, and our results indicate that enhancing T5 with ranking losses can offer significant performance improvements. For example, on the MS-MARCO QNA v2.1 benchmark we are able to achieve a +1.2% NDCG and +1.7% MRR by fine-tuning a T5-Base model using the Rax listwise softmax cross-entropy loss instead of a pointwise sigmoid cross-entropy loss.

Fine-tuning a T5-Base model on MS-MARCO QNA v2.1 with a ranking loss (softmax, in blue) versus a non-ranking loss (pointwise sigmoid, in red).

Conclusion
Overall, Rax is a new addition to the growing ecosystem of JAX libraries. Rax is entirely open source and available to everyone at github.com/google/rax. More technical details can also be found in our paper. We encourage everyone to explore the examples included in the github repository: (1) optimizing a neural network with Flax and Optax, (2) comparing different approximate metric optimization techniques, and (3) how to integrate Rax with T5X.

Acknowledgements
Many collaborators within Google made this project possible: Xuanhui Wang, Zhen Qin, Le Yan, Rama Kumar Pasumarthi, Michael Bendersky, Marc Najork, Fernando Diaz, Ryan Doherty, Afroz Mohiuddin, and Samer Hassan.

Read More

Top Israel Medical Center Partners with AI Startups to Help Detect Brain Bleeds, Other Critical Cases

Israel’s largest private medical center is working with startups and researchers to bring potentially life-saving AI solutions to real-world healthcare workflows.

With more than 1.5 million patients across eight medical centers, Assuta Medical Centers conduct over 100,000 surgeries, 800,000 imaging tests and hundreds of thousands of other health diagnostics and treatments each year. These create huge amounts of de-identified data that Assuta is securely sharing with more than 20 startups through its innovation arm, RISE, launched last year working in collaboration with NVIDIA.

One of the startups, Aidoc, is helping Assuta alert imaging technicians with AI-based insights of possible bleeding in the brain and other critical conditions in a patient’s scan within minutes. Another, Rhino Health, is using federated learning powered by NVIDIA FLARE to make AI development on diverse medical datasets from hospitals across the globe more accessible to Assuta’s collaborators.

Both companies are members of NVIDIA Inception, a global program designed to support cutting-edge startups with go-to-market support, expertise and technology.

“We’re building a hub to serve innovators with the infrastructure they need to develop, test and deploy new AI technology for image analysis and other data-heavy computations in radiology, pathology, genomics and more,” said Daniel Rabina, director of innovation at RISE. “We want to make collaboration with companies, research institutes, hospitals and universities possible while maintaining patient data privacy.”

To support AI development, testing and deployment, Assuta has installed NVIDIA DGX A100 systems on premises and adopted the NVIDIA Clara Holoscan platform, plus software libraries including MONAI for healthcare imaging and NVIDIA FLARE for federated learning.

NVIDIA and RISE are collaborating on RISE with US, a program built to introduce selected Israeli entrepreneurs and early-stage startups working on digital and computational health solutions to the U.S. market. Applications to join the program are open until August 28.

Aidoc Flags Urgent Cases for Radiologist Review

Aidoc, which is New York-based with a research branch in Israel, has developed FDA-cleared AI solutions to flag acute conditions including brain hemorrhages, pulmonary embolisms and strokes from imaging scans.

Aidoc desktop and mobile interfaceFounded in 2016 by a group of veterans from the Israel Defense Forces, the startup has deployed its AI to analyze millions of cases across more than 1,000 medical facilities, primarily in the U.S., Europe and Israel.

Its algorithms integrate seamlessly with the PACS imaging workflow used by radiologists worldwide, working behind the scenes to analyze each imaging study and flag urgent findings — bringing potentially critical cases to the radiologist’s attention for review.

Aidoc’s tools can help address the growing shortage of radiologists globally by reducing the time a radiologist needs to spend on each case, enabling care for more patients. And by pushing potentially critical cases to the top of a radiologist’s pile, the AI can help clinicians catch important findings sooner, improving patient outcomes.

The startup uses NVIDIA Tensor Core GPUs in the cloud through AWS for AI training and inference. Adopting NVIDIA GPUs helped reduce model training time from days to a couple hours.

Immediate Impact at Assuta Medical Centers 

Assuta is a private chain of hospitals that provides elective care — typically dealing with routine screenings rather than emergency room patients — but it adopted Aidoc’s solution to help imaging technicians spot critical cases that need urgent attention among its roughly 200,000 CT tests conducted annually.

When a radiology scan isn’t urgent, it may take a couple days for a doctor to review the case. Aidoc can shrink this time to minutes by identifying concerning cases as soon as the scans are captured by radiology staff. Assuta facilities

At Assuta, urgent findings are typically found among cancer patients, or people who have recently undergone surgery and need follow-up scans. The healthcare organization is using Aidoc’s AI tools to detect intracranial hemorrhages and two kinds of pulmonary embolism.

“We saw the impact right away,” said Dr. Michal Guindy, head of medical imaging and head of RISE at Assuta. “Just a couple days after installing Aidoc at Assuta, a patient came in for a follow-up scan after a brain procedure and had an intracranial hemorrhage. Because Aidoc alerted the imaging technician to flag it for further review, our doctors were able to call the patient while they were on their way home and immediately redirect them to the hospital for treatment.”

Rhino Health Fosters Collaboration With Federated Learning

In addition to deploying AI models in full-scale, real-world settings, Assuta is supporting innovators who are developing, testing or validating new medical AI solutions by sharing the healthcare organization’s data, while also using federated learning through Rhino Health.

Assuta has millions of radiology cases digitized — a desirable resource for researchers and startups looking for robust, diverse datasets to train or validate their AI models. But because of data privacy protection, it’s important that patient information stays safely within the firewall of medical centers like Assuta.

“Data diversity is necessary to develop AI models meant for the use of medical teams. Without optimal computing resources, it would be extremely difficult to use our data and make the magic happen,” said Rabina. “That’s why we need federated learning enabled by both NVIDIA and Rhino Health.”

Federated learning allows companies, healthcare institutions and universities to work together by training and validating AI models across multiple organizations’ datasets while maintaining each organization’s data privacy. Rhino Health provides a neutral platform — available through the NVIDIA AI Enterprise software suite — that enables secure collaboration, powered by NVIDIA A100 GPUs in the cloud and the NVIDIA FLARE federated learning framework.

With Rhino Health, Assuta aims to help its collaborators develop AI models across hospitals internationally, resulting in more generalizable algorithms that perform more accurately across different patient populations.

Register for NVIDIA GTC, running online Sept. 19-22, to hear more from leaders in healthcare AI.

Subscribe to NVIDIA healthcare news and watch on demand as Assuta, Aidoc and Rhino Health speak at an GTC panel.

The post Top Israel Medical Center Partners with AI Startups to Help Detect Brain Bleeds, Other Critical Cases appeared first on NVIDIA Blog.

Read More

GFN Thursday Brings Thunder to the Cloud With ‘Rumbleverse’ Arriving on GeForce NOW

It’s time to rumble in Grapital City with Rumbleverse launching today on GeForce NOW.

Punch your way into the all-new, free-to-play Brawler Royale from Iron Galaxy Studios and Epic Games Publishing, streaming from the cloud to nearly all devices.

That means gamers can tackle, uppercut, body slam and more from any GeForce NOW-compatible device, including mobile, at full PC quality. And GeForce NOW is the only way for Mac gamers to join the fray.

Plus, jump over to the list of seven new titles in the GeForce NOW library. Members will also notice a new “GET” button that makes accessing titles they’re interested in more seamless, so they can get right into gaming.

Drop in, Throw Down!

Drop into the chaotic world of Grapital City, where players must brawl it out to become the champion. Rumblers can create their own fighters using hundreds of unique items to stand out in the crowd — a 40-person melee crowd, to be exact.

Or, maybe, the strategy isn’t to stand out. With a massive city to run around in — including skyscrapers and an urban landscape — there are plenty of places to hide, duke it out and find crates full of weapons, like baseball bats or a stop sign, as well as upgrades to level up with.

Players can explore a ton of moves to take down other rumblers and discover perks with each round to come up with devious new ways to be the last person standing.

To learn the ways of the Rumble, Playground mode is available to explore Grapital City, in addition to various training modules scattered across the map. Players can also form a tag team and fight back to back with a friend.

Rumbleverse is free to play, so getting started is easy when paired with a free GeForce NOW membership. Thanks to the cloud, members don’t even have to wait for the game to download.

Level up to a RTX 3080 membership to stream at up to 1440p and 120 frames per second at ultra-low latency, plus dedicated access to RTX 3080 servers and eight-hour gaming sessions. It’s the best way to get the upper hand when duking it out with fellow rumblers.

Smash the ‘GET’ Button

The GeForce NOW apps on PC, Mac, iOS and browser now feature a “GET” button to link members directly to the digital store of their choice, making it even easier to quickly purchase titles or access free-to-play ones. And without having to wait for game downloads due to cloud streaming, members can dive into their new games as quickly as possible.

GeForce NOW GET Button
Get what you want, when you want it.

Hunted, the newest season of Apex Legends, is also available for members to stream. Hunt or be hunted in the cloud with a new Legend Vantage, an updated map of Kings Canyon, increased level cap and more in Apex Legends: Hunted.

Apex Legends Season 14 on GeForce NOW
Hunt or be hunted in the cloud in the newest season of “Apex Legends”

Plus, make sure to check out the seven new titles being added this week:

How do you plan to be a champion this week? We’ve got a couple of options for you to choose from. Let us know your answer on Twitter or in the comments below.

The post GFN Thursday Brings Thunder to the Cloud With ‘Rumbleverse’ Arriving on GeForce NOW appeared first on NVIDIA Blog.

Read More

FORML: Learning to Reweight Data for Fairness

Machine learning models are trained to minimize the mean loss for a single metric, and thus typically do not consider fairness and robustness. Neglecting such metrics in training can make these models prone to fairness violations when training data are imbalanced or test distributions differ. This work introduces Fairness Optimized Reweighting via Meta-Learning (FORML), a training algorithm that balances fairness and robustness with accuracy by jointly learning training sample weights and neural network parameters. The approach increases model fairness by learning to balance the contributions…Apple Machine Learning Research