Intelligent document processing with AWS AI services: Part 1

Organizations across industries such as healthcare, finance and lending, legal, retail, and manufacturing handle large volumes of documents in their day-to-day business processes. These documents contain critical information that is key to making timely decisions, maintaining high levels of customer satisfaction, speeding up customer onboarding, and lowering customer churn. In most cases, documents are processed manually to extract information and insights, which is time-consuming, error-prone, expensive, and difficult to scale. Limited automation is available today to process and extract information from these documents. Intelligent document processing (IDP) with AWS artificial intelligence (AI) services helps automate information extraction from documents of different types and formats, quickly and with high accuracy, without the need for machine learning (ML) skills. Faster, more accurate information extraction helps you make quality business decisions on time while reducing overall costs.

Although the stages in an IDP workflow may vary depending on the use case and business requirements, the following figure shows the stages that are typically part of an IDP workflow. Tax forms, claims, medical notes, new customer forms, invoices, and legal contracts are just a few of the document types processed with IDP.

Phases of intelligent document processing in AWS

In this two-part series, we discuss how you can automate and intelligently process documents at scale using AWS AI services. In this post, we discuss the first three phases of the IDP workflow. In part 2, we discuss the remaining workflow phases.

Solution overview

The following architecture diagram shows the stages of an IDP workflow. It starts with a data capture stage to securely store and aggregate different file formats (PDF, JPEG, PNG, TIFF) and layouts of documents. The next stage is classification, where you categorize your documents (such as contracts, claim forms, invoices, or receipts), followed by document extraction. In the extraction stage, you extract meaningful business information from your documents. This extracted data is often used to gather insights via data analysis, or sent to downstream systems such as databases or transactional systems. The next stage is enrichment, where documents can be enriched by redacting protected health information (PHI) or personally identifiable information (PII), extracting custom business terms, and so on. Finally, in the review and validation stage, you can include a human workforce to review the documents and ensure the outcome is accurate.

For the purposes of this post, we consider a set of sample documents such as bank statements, invoices, and store receipts. The document samples, along with sample code, can be found in our GitHub repository. In the following sections, we walk you through these code samples and show how to apply them in practice. We demonstrate how you can utilize ML capabilities with Amazon Textract, Amazon Comprehend, and Amazon Augmented AI (Amazon A2I) to process documents and validate the data extracted from them.

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.

Amazon Comprehend is a natural language processing (NLP) service that uses ML to extract insights from the content of documents. Amazon Comprehend can identify critical elements in documents, including references to language, people, and places, and classify them into relevant topics or clusters. It can perform sentiment analysis to determine the sentiment of a document in real time, using either single-document or batch detection. For example, it can analyze the comments on a blog post to determine whether your readers like the post. Amazon Comprehend also detects PII such as addresses, bank account numbers, and phone numbers in text documents, both in real time and in asynchronous batch jobs, and it can redact PII entities in asynchronous batch jobs.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether the underlying workload runs on AWS or not. Amazon A2I integrates with both Amazon Textract and Amazon Comprehend to let you introduce human review steps within your intelligent document processing workflow.

Data capture phase

You can store documents in a highly scalable and durable storage like Amazon Simple Storage Service (Amazon S3). Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. Amazon S3 is designed for 11 9’s of durability and stores data for millions of customers all around the world. Documents can come in various formats and layouts, and can come from different channels like web portals or email attachments.
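For example, a document received from one of these channels can be placed in the bucket with a few lines of code. The following is a minimal sketch; the bucket name and object key are placeholders, not part of the sample repository:

import boto3

s3 = boto3.client('s3')
# Upload a scanned document into the bucket used by the IDP workflow
# (bucket name and key are placeholders for illustration only)
s3.upload_file('bank_statement.pdf', 'idp-documents-bucket', 'incoming/bank_statement.pdf')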

Classification phase

In the previous step, we collected documents of various types and formats. In this step, we need to categorize the documents before we can do further extraction. For that, we use Amazon Comprehend custom classification. Document classification is a two-step process. First, you train an Amazon Comprehend custom classifier to recognize the classes that are of interest to you. Next, you deploy the model with a custom classifier real-time endpoint and send unlabeled documents to the real-time endpoint to be classified.

The following figure represents a typical document classification workflow.

Classification phase

To train the classifier, identify the classes you’re interested in and provide sample documents for each of the classes as training material. Based on the options you indicated, Amazon Comprehend creates a custom ML model that it trains based on the documents you provided. This custom model (the classifier) examines each document you submit. It returns either the specific class that best represents the content (if you’re using multi-class mode) or the set of classes that apply to it (if you’re using multi-label mode).

Prepare training data

The first step is to extract text from the documents that will be used to train the Amazon Comprehend custom classifier. To extract the raw text for all the documents in Amazon S3, we use the Amazon Textract detect_document_text() API. We also label each document according to its type so the data can be used to train the custom Amazon Comprehend classifier.

The following code has been trimmed down for simplification purposes. For the full code, refer to the GitHub sample code for textract_extract_text(). The function call_textract() is a wrapper function from the amazon-textract-caller library that calls the Amazon Textract API internally, and the parameters passed to the method abstract some of the configuration that the API needs to run the extraction task.

from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

def textract_extract_text(document, bucket=data_bucket):
    try:
        print(f'Processing document: {document}')
        row = []

        # using amazon-textract-caller to call Amazon Textract
        response = call_textract(input_document=f's3://{bucket}/{document}')
        # using the pretty printer to get all the LINE blocks as plain text
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])

        # label the row with the document type inferred from the file name
        label = [name for name in names if name in document]
        row.append(label[0])
        row.append(lines)
        return row
    except Exception as e:
        print(e)
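The rows returned by textract_extract_text() can then be written to a two-column (label, text) CSV file and uploaded to Amazon S3 as training data. The following is a sketch only; the documents list and file names are assumptions, not part of the original sample code:

import csv
import boto3

# Assemble one [label, text] row per document (assumed 'documents' list)
rows = [textract_extract_text(doc) for doc in documents]

# Write the rows in the two-column CSV format expected by Amazon Comprehend
with open('comprehend_train.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Upload the training file to the data bucket (key name is a placeholder)
boto3.client('s3').upload_file('comprehend_train.csv', data_bucket, 'comprehend_train.csv')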

Train a custom classifier

In this step, we use Amazon Comprehend custom classification to train our model for classifying the documents. We use the CreateDocumentClassifier API to create a classifier that trains a custom model using our labeled data. See the following code:

create_response = comprehend.create_document_classifier(
    InputDataConfig={
        'DataFormat': 'COMPREHEND_CSV',
        'S3Uri': f's3://{data_bucket}/{key}'
    },
    DataAccessRoleArn=role,
    DocumentClassifierName=document_classifier_name,
    VersionName=document_classifier_version,
    LanguageCode='en',
    Mode='MULTI_CLASS'
)
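Training runs asynchronously. One way to wait for the classifier, sketched below, is to poll the DescribeDocumentClassifier API until the status reaches TRAINED:

import time

classifier_arn = create_response['DocumentClassifierArn']

# Poll until training completes (TRAINED) or fails (IN_ERROR)
while True:
    status = comprehend.describe_document_classifier(
        DocumentClassifierArn=classifier_arn
    )['DocumentClassifierProperties']['Status']
    print(f'Classifier status: {status}')
    if status in ('TRAINED', 'IN_ERROR'):
        break
    time.sleep(60)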

Deploy a real-time endpoint

To use the Amazon Comprehend custom classifier, we create a real-time endpoint using the CreateEndpoint API:

endpoint_response = comprehend.create_endpoint(
    EndpointName=ep_name,
    ModelArn=model_arn,
    DesiredInferenceUnits=1,
    DataAccessRoleArn=role
)
ENDPOINT_ARN = endpoint_response['EndpointArn']
print(f'Endpoint created with ARN: {ENDPOINT_ARN}')
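Endpoint creation also takes a few minutes. A simple way to wait, sketched here, is to poll the DescribeEndpoint API until the endpoint status is IN_SERVICE:

import time

# Poll until the endpoint is ready to serve classification requests
while True:
    status = comprehend.describe_endpoint(EndpointArn=ENDPOINT_ARN)['EndpointProperties']['Status']
    print(f'Endpoint status: {status}')
    if status in ('IN_SERVICE', 'FAILED'):
        break
    time.sleep(60)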

Classify documents with the real-time endpoint

After the Amazon Comprehend endpoint is created, we can use the real-time endpoint to classify documents. We use the comprehend.classify_document() function with the extracted document text and inference endpoint as input parameters:

response = comprehend.classify_document(
    Text=document,
    EndpointArn=ENDPOINT_ARN
)

Amazon Comprehend returns all classes for a document, each with a confidence score, as an array of name-score pairs. We pick the document class with the highest confidence score. The following screenshot is a sample response.

Classify documents with the real-time endpoint
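Programmatically, you can pick the top class from the response as follows (a short sketch based on the Classes array returned by the ClassifyDocument API):

# The response contains a 'Classes' list of {'Name', 'Score'} entries;
# pick the class with the highest confidence score
predicted_class = max(response['Classes'], key=lambda c: c['Score'])
print(f"Predicted class: {predicted_class['Name']} (score: {predicted_class['Score']:.2f})")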

We recommend going through the detailed document classification sample code on GitHub.

Extraction phase

Amazon Textract lets you extract text and structured data using the Amazon Textract DetectDocumentText and AnalyzeDocument APIs, respectively. These APIs respond with JSON data that contains WORDS, LINES, FORMS, TABLES, geometry or bounding box information, relationships, and so on. Both DetectDocumentText and AnalyzeDocument are synchronous operations. To process documents asynchronously, use the StartDocumentTextDetection and StartDocumentAnalysis operations instead.
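For example, the following sketch shows asynchronous text detection with StartDocumentTextDetection and simple polling (in production, you would typically rely on the Amazon SNS completion notification instead of polling):

import time

# Start an asynchronous text detection job for a document stored in Amazon S3
job = textract.start_document_text_detection(
    DocumentLocation={'S3Object': {'Bucket': s3BucketName, 'Name': documentName}})

# Poll for completion (sketch only)
while True:
    result = textract.get_document_text_detection(JobId=job['JobId'])
    if result['JobStatus'] in ('SUCCEEDED', 'FAILED'):
        break
    time.sleep(5)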

Structured data extraction

You can extract structured data such as tables from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as TABLE to detect all tables in a document. The following figure illustrates this process.

Structured data extraction

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])

We run the analyze_document() method with TABLES as the feature type on the employee history document and obtain the table extraction shown in the following results.

Analyze document API response for tables extraction
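The TABLES response is plain JSON made up of TABLE, CELL, and WORD blocks linked through CHILD relationships. The following sketch (not part of the original sample code) shows one way to reassemble each detected table into rows and columns:

def get_tables(response):
    blocks = {b['Id']: b for b in response['Blocks']}
    tables = []
    for block in response['Blocks']:
        if block['BlockType'] != 'TABLE':
            continue
        rows = {}
        # A TABLE block points at its CELL blocks through CHILD relationships
        for rel in block.get('Relationships', []):
            if rel['Type'] != 'CHILD':
                continue
            for cell_id in rel['Ids']:
                cell = blocks[cell_id]
                if cell['BlockType'] != 'CELL':
                    continue
                # A CELL block points at its WORD blocks through CHILD relationships
                words = []
                for cell_rel in cell.get('Relationships', []):
                    if cell_rel['Type'] == 'CHILD':
                        words += [blocks[i]['Text'] for i in cell_rel['Ids']
                                  if blocks[i]['BlockType'] == 'WORD']
                rows.setdefault(cell['RowIndex'], {})[cell['ColumnIndex']] = ' '.join(words)
        tables.append(rows)
    return tables

for table in get_tables(response):
    for row_index in sorted(table):
        print([table[row_index][col] for col in sorted(table[row_index])])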

Semi-structured data extraction

You can extract semi-structured data such as forms or key-value pairs from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as FORMS to detect all forms in a document. The following diagram illustrates this process.

Semi-structured data extraction

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])

Here, we run the analyze_document() method with FORMS as the feature type on the employee application document and obtain the key-value pair extraction in the results.
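The FORMS response represents each detected pair as two linked KEY_VALUE_SET blocks. The following sketch (not part of the original sample code) collects the detected key-value pairs into a dictionary:

def get_kv_pairs(response):
    blocks = {b['Id']: b for b in response['Blocks']}

    def child_text(block):
        # Concatenate the WORD blocks referenced by a KEY or VALUE block
        words = []
        for rel in block.get('Relationships', []):
            if rel['Type'] == 'CHILD':
                words += [blocks[i]['Text'] for i in rel['Ids']
                          if blocks[i]['BlockType'] == 'WORD']
        return ' '.join(words)

    pairs = {}
    for block in response['Blocks']:
        if block['BlockType'] == 'KEY_VALUE_SET' and 'KEY' in block.get('EntityTypes', []):
            key_text = child_text(block)
            value_text = ''
            # The KEY block references its VALUE block(s) through a VALUE relationship
            for rel in block.get('Relationships', []):
                if rel['Type'] == 'VALUE':
                    value_text = ' '.join(child_text(blocks[i]) for i in rel['Ids'])
            pairs[key_text] = value_text
    return pairs

print(get_kv_pairs(response))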

Unstructured data extraction

Amazon Textract is optimal for dense text extraction with industry-leading OCR accuracy. You can use the DetectDocumentText API to detect lines of text and the words that make up a line of text, as illustrated in the following figure.

Unstructured data extraction

See the following code:

response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])

Now we run the detect_document_text() method on the sample image and obtain raw text extraction in the results.

Invoices and receipts

Amazon Textract provides specialized support to process invoices and receipts at scale. The AnalyzeExpense API can extract explicitly labeled data, implied data, and line items from an itemized list of goods or services from almost any invoice or receipt without any templates or configuration. The following figure illustrates this process.

Invoices and receipts extraction

See the following code:

response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo on the page without an explicit label called “vendor”. It can also find and extract expense items, quantity, and prices that aren’t labeled with column headers for line items.

Analyze expense API response
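The response groups the results into summary fields (such as vendor name and total) and line item groups. The following sketch (not part of the original sample code) prints both:

for expense_doc in response['ExpenseDocuments']:
    # Summary fields such as VENDOR_NAME, TOTAL, and INVOICE_RECEIPT_DATE
    for field in expense_doc['SummaryFields']:
        field_type = field['Type']['Text']
        value = field.get('ValueDetection', {}).get('Text', '')
        print(f'{field_type}: {value}')

    # Line items with quantity, description, price, and so on
    for group in expense_doc['LineItemGroups']:
        for line_item in group['LineItems']:
            item = {f['Type']['Text']: f['ValueDetection']['Text']
                    for f in line_item['LineItemExpenseFields']
                    if 'ValueDetection' in f}
            print(item)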

Identity documents

The Amazon Textract AnalyzeID API can help you automatically extract information from identification documents, such as driver’s licenses and passports, without the need for templates or configuration. We can extract specific information, such as date of expiry and date of birth, as well as intelligently identify and extract implied information, such as name and address. The following diagram illustrates this process.

Identity documents extraction

See the following code:

textract_client = boto3.client('textract')
# call_textract_analyzeid() is a helper from the amazon-textract-caller library
j = call_textract_analyzeid(
    document_pages=["s3://amazon-textract-public-content/analyzeid/driverlicense.png"],
    boto3_textract_client=textract_client)

We can use tabulate to get a pretty-printed output. Here, result refers to the list of identity fields parsed from the AnalyzeID response (see the document extraction sample code on GitHub):

from tabulate import tabulate

print(tabulate([x[1:3] for x in result]))
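If you prefer to call the API directly instead of using the call_textract_analyzeid() helper, the AnalyzeID response exposes each extracted field with its normalized type and value. The following is a sketch; the bucket and object names are placeholders:

response = textract_client.analyze_id(
    DocumentPages=[{'S3Object': {'Bucket': s3BucketName, 'Name': 'driverlicense.png'}}])

for identity_doc in response['IdentityDocuments']:
    for field in identity_doc['IdentityDocumentFields']:
        # Each field has a normalized type (e.g., FIRST_NAME, DATE_OF_BIRTH) and a detected value
        print(f"{field['Type']['Text']}: {field['ValueDetection']['Text']}")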

We recommend going through the detailed document extraction sample code on GitHub. For more information about the full code samples in this post, refer to the GitHub repo.

Conclusion

In this first post of a two-part series, we discussed the various stages of IDP and a solution architecture. We also discussed document classification using an Amazon Comprehend custom classifier. Next, we explored the ways you can use Amazon Textract to extract information from unstructured, semi-structured, structured, and specialized document types.

In part 2 of this series, we continue the discussion with the extraction and Queries features of Amazon Textract. We look at how to use Amazon Comprehend pre-defined entities and custom entities to extract key business terms from documents with dense text, and how to integrate an Amazon A2I human-in-the-loop review in your IDP processes.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. Also, take a moment to review and understand the pricing for Amazon Textract, Amazon Comprehend, and Amazon A2I.


About the authors

Suprakash Dutta is a Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning.

Sonali Sahu leads the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and data analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.


Build an air quality anomaly detector using Amazon Lookout for Metrics

Today, air pollution is a familiar environmental issue that creates severe respiratory and heart conditions and poses serious health threats. Acid rain, depletion of the ozone layer, and global warming are also adverse consequences of air pollution. Intelligent monitoring and automation are needed to prevent severe health issues and, in extreme cases, life-threatening situations. Air quality is measured using the concentration of pollutants in the air. Identifying symptoms early and controlling the pollutant level before it becomes dangerous is crucial. The process of measuring air quality, identifying anomalies in pollutant concentrations, and quickly diagnosing the root cause is difficult, costly, and error-prone.

The process of applying AI and machine learning (ML)-based solutions to find data anomalies involves a lot of complexity in ingesting, curating, and preparing data in the right format and then optimizing and maintaining the effectiveness of these ML models over long periods of time. This has been one of the barriers to quickly implementing and scaling the adoption of ML capabilities.

This post shows you how to use an integrated solution with Amazon Lookout for Metrics and Amazon Kinesis Data Firehose to break these barriers by quickly and easily ingesting streaming data, and subsequently detecting anomalies in the key performance indicators of interest.

Lookout for Metrics automatically detects and diagnoses anomalies (outliers from the norm) in business and operational data. It’s a fully managed ML service that uses specialized ML models to detect anomalies based on the characteristics of your data. For example, trends and seasonality are two characteristics of time series metrics in which threshold-based anomaly detection doesn’t work. Trends are continuous variations (increases or decreases) in a metric’s value. On the other hand, seasonality is periodic patterns that occur in a system, usually rising above a baseline and then decreasing again. You don’t need ML experience to use Lookout for Metrics.

We demonstrate a common air quality monitoring scenario, in which we detect anomalies in the pollutant concentration in the air. By the end of this post, you’ll learn how to use these managed services from AWS to help prevent health issues and global warming. You can apply this solution to other use cases for better environment management, such as detecting anomalies in water quality, land quality, and power consumption patterns, to name a few.

Solution overview

The architecture consists of three functional blocks:

  • Wireless sensors placed at strategic locations to sense the concentration level of carbon monoxide (CO), sulfur dioxide (SO2), and nitrogen dioxide (NO2) in the air
  • Streaming data ingestion and storage
  • Anomaly detection and notification

The solution provides a fully automated data path from the sensors all the way to a notification being raised to the user. You can also interact with the solution using the Lookout for Metrics UI in order to analyze the identified anomalies.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 Region.

  1. Download the Python script (publish.py) and data file from the GitHub repo.
  2. Open the live_data.csv file in your preferred editor and replace the dates with today's and tomorrow's dates. For example, if today's date is July 8, 2022, then replace 2022-03-25 with 2022-07-08. Keep the format the same. This is required to simulate sensor data for the current date using the IoT simulator script.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket and a folder named air-quality. Create a subfolder inside air-quality named historical. For instructions, see Creating a folder.
  4. Upload the live_data.csv file in the root S3 bucket and historical_data.json in the historical folder.
  5. Create an AWS Cloud9 development environment, which we use to run the Python simulator program to create sensor data for this solution.

Ingest and transform data using AWS IoT Core and Kinesis Data Firehose

We use a Kinesis Data Firehose delivery stream to ingest the streaming data from AWS IoT Core and deliver it to Amazon S3. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, choose Direct PUT.
  3. For Destination, choose Amazon S3.
  4. For Delivery stream name, enter a name for your delivery stream.
  5. For S3 bucket, enter the bucket you created as a prerequisite.
  6. Enter values for S3 bucket prefix and S3 bucket error output prefix.
    One of the key points to note is the custom prefix configured for the Amazon S3 destination. This prefix pattern makes sure that the data is created in the S3 bucket following the prefix hierarchy expected by Lookout for Metrics. (More on this later in this post.) For more information about custom prefixes, see Custom Prefixes for Amazon S3 Objects.
  7. For Buffer interval, enter 60.
  8. Choose Create or update IAM role.
  9. Choose Create delivery stream.

    Now we configure AWS IoT Core and run the air quality simulator program.
  10. On the AWS IoT Core console, create an AWS IoT policy called admin.
  11. In the navigation pane under Message Routing, choose Rules.
  12. Choose Create rule.
  13. Create a rule with the Kinesis Data Firehose (firehose) action.
    This sends data from an MQTT message to a Kinesis Data Firehose delivery stream.
  14. Choose Create.
  15. Create an AWS IoT thing with name Test-Thing and attach the policy you created.
  16. Download the certificate, public key, private key, device certificate, and root CA for AWS IoT Core.
  17. Save each of the downloaded files to the certificates subdirectory that you created earlier.
  18. Upload publish.py to the iot-test-publish folder.
  19. On the AWS IoT Core console, in the navigation pane, choose Settings.
  20. Under Custom endpoint, copy the endpoint.
    This AWS IoT Core custom endpoint URL is personal to your AWS account and Region.
  21. Replace customEndpointUrl with your AWS IoT Core custom endpoint URL, certificates with the name of your certificates subdirectory, and Your_S3_Bucket_Name with your S3 bucket name.
    Next, you install pip and the AWS IoT SDK for Python.
  22. Log in to AWS Cloud9 and create a working directory in your development environment. For example: aq-iot-publish.
  23. Create a subdirectory for certificates in your new working directory. For example: certificates.
  24. Install the AWS IoT SDK for Python v2 by running the following from the command line.
    pip install awsiotsdk

  25. To test the data pipeline, run the following command:
    python3 publish.py

You can see the payload in the following screenshot.
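The core of a publisher like publish.py connects to AWS IoT Core over mutual TLS and publishes one simulated reading at a time to an MQTT topic. The following is a minimal sketch using the AWS IoT SDK for Python v2; the endpoint, topic name, and certificate paths are placeholders:

import json
from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Connect to AWS IoT Core with the downloaded device certificates (paths are placeholders)
mqtt_connection = mqtt_connection_builder.mtls_from_path(
    endpoint='your-iot-endpoint-ats.iot.us-east-1.amazonaws.com',
    cert_filepath='certificates/device.pem.crt',
    pri_key_filepath='certificates/private.pem.key',
    ca_filepath='certificates/AmazonRootCA1.pem',
    client_id='air-quality-sensor-simulator')
mqtt_connection.connect().result()

# Publish one simulated sensor reading (topic name is a placeholder)
reading = {'TIMESTAMP': '2022-03-20 00:00', 'LOCATION_ID': 'B-101', 'CO': 2.6, 'SO2': 62, 'NO2': 57}
mqtt_connection.publish(topic='air-quality/data',
                        payload=json.dumps(reading),
                        qos=mqtt.QoS.AT_LEAST_ONCE)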

Finally, the data is delivered to the specified S3 bucket in the prefix structure.

The data in the files is as follows:

  • {"TIMESTAMP":"2022-03-20 00:00","LOCATION_ID":"B-101","CO":2.6,"SO2":62,"NO2":57}
  • {"TIMESTAMP":"2022-03-20 00:05","LOCATION_ID":"B-101","CO":3.9,"SO2":60,"NO2":73}

The timestamps show that each file contains data for 5-minute intervals.

With minimal code, we have now ingested the sensor data, created an input stream from the ingested data, and stored the data in an S3 bucket based on the requirements for Lookout for Metrics.

In the following sections, we take a deeper look at the constructs within Lookout for Metrics, and how easy it is to configure these concepts using the Lookout for Metrics console.

Create a detector

A detector is a Lookout for Metrics resource that monitors a dataset and identifies anomalies at a predefined frequency. Detectors use ML to find patterns in data and distinguish between expected variations in data and legitimate anomalies. To improve its performance, a detector learns more about your data over time.

In our use case, the detector analyzes data from the sensor every 5 minutes.

To create the detector, navigate to the Lookout for Metrics console and choose Create detector. Provide the name and description (optional) for the detector, along with the interval of 5 minutes.

Your data is encrypted by default with a key that AWS owns and manages for you. You can also configure if you want to use a different encryption key from the one that is used by default.
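You can also create the detector programmatically with the Lookout for Metrics API. The following is a sketch; the detector name and description are assumptions:

import boto3

lookoutmetrics = boto3.client('lookoutmetrics', region_name='us-east-1')

# Create a detector that runs anomaly detection every 5 minutes
detector = lookoutmetrics.create_anomaly_detector(
    AnomalyDetectorName='air-quality-detector',
    AnomalyDetectorDescription='Detects anomalies in air pollutant concentrations',
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT5M'})
print(detector['AnomalyDetectorArn'])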

Now let’s point this detector to the data that you want it to run anomaly detection on.

Create a dataset

A dataset tells the detector where to find your data and which metrics to analyze for anomalies. To create a dataset, complete the following steps:

  1. On the Amazon Lookout for Metrics console, navigate to your detector.
  2. Choose Add a dataset.
  3. For Name, enter a name (for example, air-quality-dataset).
  4. For Datasource, choose your data source (for this post, Amazon S3).
  5. For Detector mode, select your mode (for this post, Continuous).

With Amazon S3, you can create a detector in two modes:

    • Backtest – This mode is used to find anomalies in historical data. It needs all records to be consolidated in a single file.
    • Continuous – This mode is used to detect anomalies in live data. We use this mode with our use case because we want to detect anomalies as we receive air pollutant data from the air monitoring sensor.
  6. Enter the S3 path for the live S3 folder and path pattern.
  7. For Datasource interval, choose 5 minute intervals.
    If you have historical data from which the detector can learn patterns, you can provide it during this configuration. The data is expected to be in the same format that you use to perform a backtest. Providing historical data speeds up the ML model training process. If this isn't available, the continuous detector waits for sufficient data to be available before making inferences.
  8. For this post, we already have historical data, so select Use historical data.
  9. Enter the S3 path of historical_data.json.
  10. For File format, select JSON lines.

At this point, Lookout for Metrics accesses the data source and validates whether it can parse the data. If the parsing is successful, it gives you a “Validation successful” message and takes you to the next page, where you configure measures, dimensions, and timestamps.

Configure measures, dimensions, and timestamps

Measures define KPIs that you want to track anomalies for. You can add up to five measures per detector. The fields used to create KPIs from your source data must be in numeric format. KPIs are currently defined by aggregating records within the time interval using SUM or AVERAGE.

Dimensions give you the ability to slice and dice your data by defining categories or segments. This allows you to track anomalies for a subset of the whole set of data for which a particular measure is applicable.

In our use case, we add three measures, which calculate the AVG of each pollutant's readings within the 5-minute interval, and have only one dimension, the location (LOCATION_ID) at which the pollutant concentration is measured.

Every record in the dataset must have a timestamp. The following configuration allows you to choose the field that represents the timestamp value and also the format of the timestamp.
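If you automate this configuration with the SDK, the dataset, measures, dimension, and timestamp settings map to a single CreateMetricSet call against the detector created earlier. The following is a sketch only; the role ARN, bucket paths, and templated path format are placeholders you must adapt to your setup:

metric_set = lookoutmetrics.create_metric_set(
    AnomalyDetectorArn=detector['AnomalyDetectorArn'],
    MetricSetName='air-quality-dataset',
    MetricSetFrequency='PT5M',
    MetricList=[
        {'MetricName': 'CO', 'AggregationFunction': 'AVG'},
        {'MetricName': 'SO2', 'AggregationFunction': 'AVG'},
        {'MetricName': 'NO2', 'AggregationFunction': 'AVG'},
    ],
    DimensionList=['LOCATION_ID'],
    TimestampColumn={'ColumnName': 'TIMESTAMP', 'ColumnFormat': 'yyyy-MM-dd HH:mm'},
    MetricSource={
        'S3SourceConfig': {
            'RoleArn': 'arn:aws:iam::111122223333:role/L4MAccessRole',              # placeholder
            'TemplatedPathList': ['s3://your-bucket/air-quality/live/'],             # placeholder
            'HistoricalDataPathList': ['s3://your-bucket/air-quality/historical/'],  # placeholder
            'FileFormatDescriptor': {
                'JsonFormatDescriptor': {'Charset': 'UTF-8', 'FileCompression': 'NONE'}
            }
        }
    })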

The next page allows you to review all the details you added and then save and activate the detector.

The detector then begins learning the data streaming into the data source. At this stage, the status of the detector changes to Initializing.

It’s important to note the minimum amount of data that is required before Lookout for Metrics can start detecting anomalies. For more information about requirements and limits, see Lookout for Metrics quotas.

With minimal configuration, you have created your detector, pointed it at a dataset, and defined the metrics that you want Lookout for Metrics to find anomalies in.

Visualize anomalies

Lookout for Metrics provides a rich UI experience for users who want to use the AWS Management Console to analyze the anomalies being detected. It also provides the capability to query the anomalies via APIs.

Let’s look at an example anomaly detected from our air quality data use case. The following screenshot shows an anomaly detected in CO concentration in the air at the designated time and date with a severity score of 93. It also shows the percentage contribution of the dimension towards the anomaly. In this case, 100% contribution comes from the location ID B-101 dimension.

Create alerts

Lookout for Metrics allows you to send alerts using a variety of channels. You can configure the anomaly severity score threshold at which the alerts must be triggered.

In our use case, we configure alerts to be sent to an Amazon Simple Notification Service (Amazon SNS) channel, which in turn sends an SMS. The following screenshots show the configuration details.
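Programmatically, the same alert maps to a CreateAlert call with an SNS action. The following is a sketch; the threshold value, topic ARN, and role ARN are placeholders:

# Send an alert to an SNS topic when an anomaly's severity score is 70 or higher (example threshold)
alert = lookoutmetrics.create_alert(
    AlertName='air-quality-alert',
    AlertDescription='Notify on high-severity air quality anomalies',
    AlertSensitivityThreshold=70,
    AnomalyDetectorArn=detector['AnomalyDetectorArn'],
    Action={
        'SNSConfiguration': {
            'RoleArn': 'arn:aws:iam::111122223333:role/L4MAlertRole',               # placeholder
            'SnsTopicArn': 'arn:aws:sns:us-east-1:111122223333:air-quality-alerts'  # placeholder
        }
    })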

You can also use an alert to trigger automations using AWS Lambda functions in order to drive API-driven operations on AWS IoT Core.

Conclusion

In this post, we showed you how easy it is to use Lookout for Metrics and Kinesis Data Firehose to remove the undifferentiated heavy lifting involved in managing the end-to-end lifecycle of building ML-powered anomaly detection applications. This solution can help you accelerate your ability to find anomalies in key business metrics and allows you to focus your efforts on growing and improving your business.

We encourage you to learn more by visiting the Amazon Lookout for Metrics Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.


Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library

Fraudulent activities severely impact many industries, such as e-commerce, social media, and financial services. Fraud can cause significant losses for businesses and consumers. American consumers reported losing more than $5.8 billion to fraud in 2021, up more than 70% over 2020. Many techniques have been used to detect fraudsters, including rule-based filters, anomaly detection, and machine learning (ML) models, to name a few.

In real-world data, entities often involve rich relationships with other entities. Such a graph structure can provide valuable information for anomaly detection. For example, in the following figure, users are connected via shared entities such as Wi-Fi IDs, physical locations, and phone numbers. Due to the large number of unique values of these entities, like phone numbers, it’s difficult to use them in the traditional feature-based models—for example, one-hot encoding all phone numbers wouldn’t be viable. But such relationships could help predict whether a user is a fraudster. If a user has shared several entities with a known fraudster, the user is more likely a fraudster.

Recently, graph neural networks (GNNs) have become a popular method for fraud detection. GNN models can combine both graph structure and attributes of nodes or edges, such as users or transactions, to learn meaningful representations that distinguish malicious users and events from legitimate ones. This capability is crucial for detecting fraud in which fraudsters collude to hide their abnormal features but leave traces of their relations.

Current GNN solutions mainly rely on offline batch training and inference mode, which detect fraudsters after malicious events have happened and losses have occurred. However, catching fraudulent users and activities in real time is crucial for preventing losses. This is particularly true in business cases where there is only one chance to prevent fraudulent activities. For example, in some e-commerce platforms, account registration is wide open. Fraudsters can behave maliciously just once with an account and never use the same account again.

Predicting fraudsters in real time is important. Building such a solution, however, is challenging. Because GNNs are still new to the industry, there are limited online resources on converting GNN models from batch serving to real-time serving. Additionally, it's challenging to construct a streaming data pipeline that can feed incoming events to a GNN real-time serving API. To the best of the authors' knowledge, no reference architectures or examples are available for GNN-based real-time inference solutions as of this writing.

To help developers apply GNNs to real-time fraud detection, this post shows how to use Amazon Neptune, Amazon SageMaker, and the Deep Graph Library (DGL), among other AWS services, to construct an end-to-end solution for real-time fraud detection using GNN models.

We focus on four tasks:

  • Processing a tabular transaction dataset into a heterogeneous graph dataset
  • Training a GNN model using SageMaker
  • Deploying the trained GNN models as a SageMaker endpoint
  • Demonstrating real-time inference for incoming transactions

This post extends the previous work in Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library, which focuses on the first two tasks. You can refer to that post for more details on heterogeneous graphs, GNNs, and semi-supervised training of GNNs.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.

Solution overview

This solution contains two major parts.

The first part is a pipeline that processes the data, trains GNN models, and deploys the trained models. It uses AWS Glue to process the transaction data, and saves the processed data to both Amazon Neptune and Amazon Simple Storage Service (Amazon S3). Then, a SageMaker training job is triggered to train a GNN model on the data saved in Amazon S3 to predict whether a transaction is fraudulent. The trained model along with other assets are saved back to Amazon S3 upon the completion of the training job. Finally, the saved model is deployed as a SageMaker endpoint. The pipeline is orchestrated by AWS Step Functions, as shown in the following figure.

The second part of the solution implements real-time fraudulent transaction detection. It starts from a RESTful API that queries the graph database in Neptune to extract the subgraph related to an incoming transaction. It also has a web portal that can simulate business activities, generating online transactions that include both fraudulent and legitimate ones. The web portal provides a live visualization of the fraud detection. This part uses Amazon CloudFront, AWS Amplify, AWS AppSync, Amazon API Gateway, Step Functions, and Amazon DocumentDB to rapidly build the web application. The following diagram illustrates the real-time inference process and web portal.

The implementation of this solution, along with an AWS CloudFormation template that can launch the architecture in your AWS account, is publicly available through the following GitHub repo.

Data processing

In this section, we briefly describe how to process an example dataset and convert it from raw tables into a graph with relations identified among different columns.

This solution uses the same dataset, the IEEE-CIS fraud dataset, as the previous post Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library. Therefore, the basic principle of the data process is the same. In brief, the fraud dataset includes a transactions table and an identities table, having nearly 500,000 anonymized transaction records along with contextual information (for example, devices used in transactions). Some transactions have a binary label, indicating whether a transaction is fraudulent. Our task is to predict which unlabeled transactions are fraudulent and which are legitimate.

The following figure illustrates the general process of how to convert the IEEE tables into a heterogeneous graph. We first extract two columns from each table. One column is always the transaction ID column, where we set each unique TransactionID as one node. Another column is picked from the categorical columns, such as the ProductCD and id_03 columns, where each unique category is set as a node. If a TransactionID and a unique category appear in the same row, we connect them with one edge. This way, we convert two columns of a table into one bipartite graph. We then combine those bipartite graphs along the TransactionID nodes, where the same TransactionID nodes are merged into one unique node. After this step, we have a heterogeneous graph built from bipartite graphs.

For the rest of the columns that aren't used to build the graph, we join them together as the features of the TransactionID nodes. Transactions that have isFraud values are used as labels for model training. Based on this heterogeneous graph, our task becomes a node classification task on the TransactionID nodes. For more details on preparing the graph data for training GNNs, refer to the Feature extraction and Constructing the graph sections of the previous blog post.

The code used in this solution is available in src/scripts/glue-etl.py. You can also experiment with data processing through the Jupyter notebook src/sagemaker/01.FD_SL_Process_IEEE-CIS_Dataset.ipynb.

Instead of manually processing the data, as done in the previous post, this solution uses a fully automatic pipeline orchestrated by Step Functions and AWS Glue that supports processing huge datasets in parallel via Apache Spark. The Step Functions workflow is written in AWS Cloud Development Kit (AWS CDK). The following is a code snippet to create this workflow:

import { LambdaInvoke, GlueStartJobRun } from 'aws-cdk-lib/aws-stepfunctions-tasks';
    
    const parametersNormalizeTask = new LambdaInvoke(this, 'Parameters normalize', {
      lambdaFunction: parametersNormalizeFn,
      integrationPattern: IntegrationPattern.REQUEST_RESPONSE,
    });
    
    ...
    
    const dataProcessTask = new GlueStartJobRun(this, 'Data Process', {
      integrationPattern: IntegrationPattern.RUN_JOB,
      glueJobName: etlConstruct.jobName,
      timeout: Duration.hours(5),
      resultPath: '$.dataProcessOutput',
    });
    
    ...    
    
    const definition = parametersNormalizeTask
      .next(dataIngestTask)
      .next(dataCatalogCrawlerTask)
      .next(dataProcessTask)
      .next(hyperParaTask)
      .next(trainingJobTask)
      .next(runLoadGraphDataTask)
      .next(modelRepackagingTask)
      .next(createModelTask)
      .next(createEndpointConfigTask)
      .next(checkEndpointTask)
      .next(endpointChoice);

Besides constructing the graph data for GNN model training, this workflow also batch loads the graph data into Neptune to conduct real-time inference later on. This batch data loading process is demonstrated in the following code snippet:

from neptune_python_utils.endpoints import Endpoints
from neptune_python_utils.bulkload import BulkLoad

...

bulkload = BulkLoad(
        source=targetDataPath,
        endpoints=endpoints,
        role=args.neptune_iam_role_arn,
        region=args.region,
        update_single_cardinality_properties=True,
        fail_on_error=True)
        
load_status = bulkload.load_async()
status, json = load_status.status(details=True, errors=True)
load_status.wait()

GNN model training

After the graph data for model training is saved in Amazon S3, a SageMaker training job, which is only charged when the training job is running, is triggered to start the GNN model training process in the Bring Your Own Container (BYOC) mode. It allows you to pack your model training scripts and dependencies in a Docker image, which it uses to create SageMaker training instances. The BYOC method could save significant effort in setting up the training environment. In src/sagemaker/02.FD_SL_Build_Training_Container_Test_Local.ipynb, you can find details of the GNN model training.

Docker image

The first part of the Jupyter notebook file is the training Docker image generation (see the following code snippet):

!aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
image_name = 'fraud-detection-with-gnn-on-dgl/training'
!docker build -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

We used a PyTorch-based image for the model training. The Deep Graph Library (DGL) and other dependencies are installed when building the Docker image. The GNN model code in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder is copied to the image as well.

Because we process the transaction data into a heterogeneous graph, in this solution we choose the Relational Graph Convolutional Network (RGCN) model, which is specifically designed for heterogeneous graphs. Our RGCN model can train learnable embeddings for the nodes in heterogeneous graphs. Then, the learned embeddings are used as inputs of a fully connected layer for predicting the node labels.

Hyperparameters

To train the GNN, we need to define a few hyperparameters before the training process, such as the file names of the graph constructed, the number of layers of GNN models, the training epochs, the optimizer, the optimization parameters, and more. See the following code for a subset of the configurations:

edges *=* ","*.*join(map(*lambda* x: x*.*split("/")[*-*1], [file *for* file *in* processed_files *if* "relation" *in* file]))

params *=* {'nodes' : 'features.csv',
          'edges': edges,
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2}

For more information about all the hyperparameters and their default values, see estimator_fns.py in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder.

Model training with SageMaker

After the customized container Docker image is built, we use the preprocessed data to train our GNN model with the hyperparameters we defined. The training job uses the DGL, with PyTorch as the backend deep learning framework, to construct and train the GNN. SageMaker makes it easy to train GNN models with the customized Docker image, which is an input argument of the SageMaker estimator. For more information about training GNNs with the DGL on SageMaker, see Train a Deep Graph Network.

The SageMaker Python SDK uses Estimator to encapsulate training on SageMaker, which runs SageMaker-compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK. The following code snippet demonstrates training the model with SageMaker (either in a local environment or cloud instances):

from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.local import LocalSession

localSageMakerSession = LocalSession(boto_session=boto3.session.Session(region_name=current_region))
estimator = Estimator(image_uri=image_name,
                      role=sagemaker_exec_role,
                      instance_count=1,
                      instance_type='local',
                      hyperparameters=params,
                      output_path=output_path,
                      sagemaker_session=localSageMakerSession)

training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

After training, the GNN model’s performance on the test set is displayed like the following outputs. The RGCN model normally can achieve around 0.87 AUC and more than 95% accuracy. For a comparison of the RGCN model with other ML models, refer to the Results section of the previous blog post for more details.

Epoch 00099 | Time(s) 7.9413 | Loss 0.1023 | f1 0.3745
Metrics
Confusion Matrix:
                        labels positive labels negative
    predicted positive  4343            576
    predicted negative  13494           454019

    f1: 0.3817, precision: 0.8829, recall: 0.2435, acc: 0.9702, roc: 0.8704, pr: 0.4782, ap: 0.4782

Finished Model training

Upon the completion of model training, SageMaker packs the trained model along with other assets, including the trained node embeddings, into a ZIP file and then uploads it to a specified S3 location. Next, we discuss the deployment of the trained model for real-time fraudulent detection.

GNN model deployment

SageMaker makes the deployment of trained ML models simple. In this stage, we use the SageMaker PyTorchModel class to deploy the trained model, because our DGL model depends on PyTorch as the backend framework. You can find the deployment code in the src/sagemaker/03.FD_SL_Endpoint_Deployment.ipynb file.

Besides the trained model file and assets, SageMaker requires an entry point file for the deployment of a customized model. The entry point file is run and stored in the memory of an inference endpoint instance to respond to the inference request. In our case, the entry point file is the fd_sl_deployment_entry_point.py file in the src/sagemaker/FD_SL_DGL/code folder, which performs four major functions:

  • Receive requests and parse contents of requests to obtain the to-be-predicted nodes and their associated data
  • Convert the data to a DGL heterogeneous graph as input for the RGCN model
  • Perform the real-time inference via the trained RGCN model
  • Return the prediction results to the requester

Following SageMaker conventions, the first two functions are implemented in the input_fn method. See the following code (for simplicity, we delete some commentary code):

def input_fn(request_body, request_content_type='application/json'):

    # --------------------- receive request ------------------------------------------------ #
    input_data = json.loads(request_body)

    subgraph_dict = input_data['graph']
    n_feats = input_data['n_feats']
    target_id = input_data['target_id']

    graph, new_n_feats, new_pred_target_id = recreate_graph_data(subgraph_dict, n_feats, target_id)

    return (graph, new_n_feats, new_pred_target_id)

The constructed DGL graph and features are then passed to the predict_fn method to fulfill the third function. predict_fn takes two input arguments: the outputs of input_fn and the trained model. See the following code:

def predict_fn(input_data, model):

    # ---------------------  Inference ------------------------------------------------ #
    graph, new_n_feats, new_pred_target_id = input_data

    with th.no_grad():
        logits = model(graph, new_n_feats)
        res = logits[new_pred_target_id].cpu().detach().numpy()

    return res[1]

The model used in predict_fn is created by the model_fn method when the endpoint is called for the first time. The function model_fn loads the saved model file and associated assets from the model_dir argument and the SageMaker model folder. See the following code:

def model_fn(model_dir):

    # ------------------ Loading model -------------------
    ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size = \
        initialize_arguments(os.path.join(BASE_PATH, 'metadata.pkl'))

    rgcn_model = HeteroRGCN(ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size)

    stat_dict = th.load('model.pth')

    rgcn_model.load_state_dict(stat_dict)

    return rgcn_model

The logits computed in predict_fn correspond to class 0 and class 1, where 0 means legitimate and 1 means fraudulent, and the method returns the class 1 score. SageMaker takes this result and passes it to an inner method called output_fn to complete the final function.
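A minimal output_fn, sketched here under the assumption that the prediction is a single fraud score, serializes the result as JSON for the caller:

def output_fn(prediction, response_content_type='application/json'):
    # Sketch only: return the fraud score produced by predict_fn as a JSON body
    return json.dumps({'pred_proba': float(prediction)})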

To deploy our GNN model, we first wrap the GNN model into a SageMaker PyTorchModel class with the entry point file and other parameters (the path of the saved ZIP file, the PyTorch framework version, the Python version, and so on). Then we call its deploy method with instance settings. See the following code:

env = {
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'
}

print(f'Use model {repackged_model_path}')

sagemakerSession = sm.session.Session(boto3.session.Session(region_name=current_region))
fd_sl_model = PyTorchModel(model_data=repackged_model_path, 
                           role=sagemaker_exec_role,
                           entry_point='./FD_SL_DGL/code/fd_sl_deployment_entry_point.py',
                           framework_version='1.6.0',
                           py_version='py3',
                           predictor_cls=JSONPredictor,
                           env=env,
                           sagemaker_session=sagemakerSession)
                           
fd_sl_predictor = fd_sl_model.deploy(instance_type='ml.c5.4xlarge',
                                     initial_instance_count=1)

The preceding procedures and code snippets demonstrate how to deploy your GNN model as an online inference endpoint from a Jupyter notebook. However, for production, we recommend using the previously mentioned MLOps pipeline orchestrated by Step Functions for the entire workflow, including processing data, training the model, and deploying an inference endpoint. The entire pipeline is implemented by an AWS CDK application, which can be easily replicated in different Regions and accounts.

Real-time inference

When a new transaction arrives, to perform real-time prediction, we need to complete four steps:

  1. Node and edge insertion – Extract the transaction’s information such as the TransactionID and ProductCD as nodes and edges, and insert the new nodes into the existing graph data stored at the Neptune database.
  2. Subgraph extraction – Set the to-be-predicted transaction node as the center node, and extract an n-hop subgraph according to the GNN model's input requirements.
  3. Feature extraction – For the nodes and edges in the subgraph, extract their associated features.
  4. Call the inference endpoint – Pack the subgraph and features into the contents of a request, then send the request to the inference endpoint.

In this solution, we implement a RESTful API to achieve real-time fraud prediction as described in the preceding steps. See the following pseudo-code for real-time predictions. The full implementation is in the complete source code file.

For real-time prediction, the first three steps require low latency. Therefore, a graph database is an optimal choice for these tasks, particularly the subgraph extraction, which can be achieved efficiently with graph database queries. The underlying functions that support the pseudo-code are based on Neptune Gremlin queries.

def handler(event, context):
    
    graph_input = GraphModelClient(endpoints)
    
    # Step 1: node and edge insertion
    trans_dict, identity_dict, target_id, transaction_value_cols, union_li_cols = \
        load_data_from_event(event, transactions_id_cols, transactions_cat_cols, dummied_col)
    graph_input.insert_new_transaction_vertex_and_edge(trans_dict, identity_dict , target_id, vertex_type = 'Transaction')
    
    
    # Step 2: subgraph extraction
    subgraph_dict, transaction_embed_value_dict = \
        graph_input.query_target_subgraph(target_id, trans_dict, transaction_value_cols, union_li_cols, dummied_col)
    

    # Step 3 & 4: feature extraction & call the inference endpoint
    transaction_id = int(target_id[(target_id.find('-')+1):])
    pred_prob = invoke_endpoint_with_idx(endpointname = ENDPOINT_NAME, target_id = transaction_id, subgraph_dict = subgraph_dict, n_feats = transaction_embed_value_dict)
       
    function_res = {
                    'id': event['transaction_data'][0]['TransactionID'],
                    'flag': pred_prob > MODEL_BTW,
                    'pred_prob': pred_prob
                    }
       
    return function_res

One caveat about real-time fraud detection using GNNs is the GNN inference mode. To fulfill real-time inference, we need to convert the GNN model inference from transductive mode to inductive mode. GNN models in transductive inference mode can’t make predictions for newly appeared nodes and edges, whereas in inductive mode, GNN models can handle new nodes and edges. A demonstration of the difference between transductive and inductive mode is shown in the following figure.

In transductive mode, the nodes and edges to be predicted coexist with the labeled nodes and edges during training; the model has already seen them before inference, so they can be predicted as part of training. Models in inductive mode are trained on the training graph but must predict previously unseen nodes (those in red dotted circles on the right) together with their associated neighbors, which might also be new nodes, like the gray triangle node on the right.

Our RGCN model is trained and tested in transductive mode. It has access to all nodes in training, and also trained an embedding for each featureless node, such as IP address and card types. In the testing stage, the RGCN model uses these embeddings as node features to predict nodes in the test set. When we do real-time inference, however, some of the newly added featureless nodes have no such embeddings because they’re not in the training graph. One way to tackle this issue is to assign the mean of all embeddings in the same node type to the new nodes. In this solution, we adopt this method.
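The following sketch illustrates this idea for one featureless node type; trained_embed and num_new_nodes are assumed variables used for illustration, not part of the solution code:

import torch as th

# trained_embed: (num_trained_nodes, embedding_size) embedding matrix learned during training
mean_embed = trained_embed.mean(dim=0, keepdim=True)           # (1, embedding_size)
new_node_embed = mean_embed.repeat(num_new_nodes, 1)           # one row per unseen node
node_feats = th.cat([trained_embed, new_node_embed], dim=0)    # features used at inference time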

In addition, this solution provides a web portal (as seen in the following screenshot) to demonstrate real-time fraudulent predictions from business operators’ perspectives. It can generate the simulated online transactions, and provide a live visualization of detected fraudulent transaction information.

Clean up

When you’re finished exploring the solution, you can clean the resources to avoid incurring charges.

Conclusion

In this post, we showed how to build a GNN-based real-time fraud detection solution using SageMaker, Neptune, and the DGL. This solution has three major advantages:

  • It has good performance in terms of prediction accuracy and AUC metrics
  • It can perform real-time inference via a streaming MLOps pipeline and SageMaker endpoints
  • It automates the total deployment process with the provided CloudFormation template so that interested developers can easily test this solution with custom data in their account

For more details about the solution, see the GitHub repo.

After you deploy this solution, we recommend customizing the data processing code to fit your own data format and modifying the real-time inference mechanism while keeping the GNN model unchanged. Note that we split the real-time inference into four steps without further optimization of the latency. These four steps take a few seconds to get a prediction on the demo dataset. We believe that optimizing the Neptune graph data schema design and the queries for subgraph and feature extraction can significantly reduce the inference latency.


About the authors

Jian Zhang is an applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Mengxin Zhu is a manager of Solutions Architects at AWS, with a focus on designing and developing reusable AWS solutions. He has been engaged in software development for many years and has been responsible for several startup teams of various sizes. He also is an advocate of open-source software and was an Eclipse Committer.

Haozhu Wang is a research scientist at Amazon ML Solutions Lab, where he co-leads the Reinforcement Learning Vertical. He helps customers build advanced machine learning solutions with the latest research on graph learning, natural language processing, reinforcement learning, and AutoML. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Read More

Use computer vision to measure agriculture yield with Amazon Rekognition Custom Labels

In the agriculture sector, the problem of identifying and counting the amount of fruit on trees plays an important role in crop estimation. The concept of renting and leasing a tree is becoming popular, where a tree owner leases the tree every year before the harvest based on the estimated fruit yield. The common practice of manually counting fruit is a time-consuming and labor-intensive process. It's one of the hardest but most important tasks for obtaining better results in your crop management system. This estimation of the amount of fruit and flowers helps farmers make better decisions, not only on leasing prices but also on cultivation practices and plant disease prevention.

This is where an automated machine learning (ML) solution for computer vision (CV) can help farmers. Amazon Rekognition Custom Labels is a fully managed computer vision service that allows developers to build custom models to classify and identify objects in images that are specific and unique to your business.

Rekognition Custom Labels doesn’t require you to have any prior computer vision expertise. You can get started by simply uploading tens of images instead of thousands. If the images are already labeled, you can begin training a model in just a few clicks. If not, you can label them directly within the Rekognition Custom Labels console, or use Amazon SageMaker Ground Truth to label them. Rekognition Custom Labels uses transfer learning to automatically inspect the training data, select the right model framework and algorithm, optimize the hyperparameters, and train the model. When you’re satisfied with the model accuracy, you can start hosting the trained model with just one click.

In this post, we showcase how you can build an end-to-end solution using Rekognition Custom Labels to detect and count fruit to measure agriculture yield.

Solution overview

We create a custom model to detect fruit using the following steps:

  1. Label a dataset with images containing fruit using Amazon SageMaker Ground Truth.
  2. Create a project in Rekognition Custom Labels.
  3. Import your labeled dataset.
  4. Train the model.
  5. Test the new custom model using the automatically generated API endpoint.

Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end model development and inference process.

Prerequisites

To create an agriculture yield measuring model, you first need to prepare a dataset to train the model with. For this post, our dataset is composed of images of fruit. The following images show some examples.

We sourced our images from our own garden. You can download the image files from the GitHub repo.

For this post, we only use a handful of images to showcase the fruit yield use case. You can experiment further with more images.

To prepare your dataset, complete the following steps:

  1. Create an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Create two folders inside this bucket, called raw_data and test_data, to store images for labeling and model testing.
  3. Choose Upload to upload the images to their respective folders from the GitHub repo.

The uploaded images aren’t labeled. You label the images in the following step.
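
If you prefer to script these steps, the following boto3 sketch creates the bucket and uploads a couple of images; the bucket name and local file paths are placeholders you need to adapt to your environment.

import boto3

s3 = boto3.client("s3")
bucket = "your-bucket-name"  # placeholder: S3 bucket names must be globally unique

# Create the bucket (outside us-east-1, also pass a CreateBucketConfiguration with your Region)
s3.create_bucket(Bucket=bucket)

# Upload images from the GitHub repo into the two folders (local paths are placeholders)
s3.upload_file("local/path/to/image1.jpeg", bucket, "raw_data/image1.jpeg")
s3.upload_file("local/path/to/image15.jpeg", bucket, "test_data/image15.jpeg")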

Label your dataset using Ground Truth

To train the ML model, you need labeled images. Ground Truth provides an easy process to label the images. The labeling task is performed by a human workforce; in this post, you create a private workforce. Alternatively, you can use Amazon Mechanical Turk for labeling at scale.

Create a labeling workforce

Let’s first create our labeling workforce. Complete the following steps:

  1. On the SageMaker console, under Ground Truth in the navigation pane, choose Labeling workforces.
  2. On the Private tab, choose Create private team.
  3. For Team name, enter a name for your workforce (for this post, labeling-team).
  4. Choose Create private team.
  5. Choose Invite new workers.

  6. In the Add workers by email address section, enter the email addresses of your workers. For this post, enter your own email address.
  7. Choose Invite new workers.

You have created a labeling workforce, which you use in the next step while creating a labeling job.

Create a Ground Truth labeling job

To create your labeling job, complete the following steps:

  1. On the SageMaker console, under Ground Truth, choose Labeling jobs.
  2. Choose Create labeling job.
  3. For Job name, enter fruits-detection.
  4. Select I want to specify a label attribute name different from the labeling job name.
  5. For Label attribute name, enter Labels.
  6. For Input data setup, select Automated data setup.
  7. For S3 location for input datasets, enter the S3 location of the images, using the bucket you created earlier (s3://{your-bucket-name}/raw-data/images/).
  8. For S3 location for output datasets, select Specify a new location and enter the output location for annotated data (s3://{your-bucket-name}/annotated-data/).
  9. For Data type, choose Image.
  10. Choose Complete data setup.
    This creates the image manifest file and updates the S3 input location path. Wait for the message “Input data connection successful.”
  11. Expand Additional configuration.
  12. Confirm that Full dataset is selected.
    This is used to specify whether you want to provide all the images to the labeling job or a subset of images based on filters or random sampling.
  13. For Task category, choose Image because this is a task for image annotation.
  14. Because this is an object detection use case, for Task selection, select Bounding box.
  15. Leave the other options as default and choose Next.
  16. Choose Next.

    Now you specify your workers and configure the labeling tool.
  17. For Worker types, select Private. For this post, you use an internal workforce to annotate the images. You also have the option to select a public contractual workforce (Amazon Mechanical Turk) or a partner workforce (Vendor managed), depending on your use case.
  18. For Private teams, choose the team you created earlier.
  19. Leave the other options as default and scroll down to Bounding box labeling tool. It's essential to provide clear instructions in the labeling tool for the private labeling team. These instructions act as a guide for annotators while labeling. Good instructions are concise, so we recommend limiting the verbal or textual instructions to two sentences and focusing on visual instructions. In the case of image classification, we recommend providing one labeled image from each class as part of the instructions.
  20. Add two labels: fruit and no_fruit.
  21. Enter detailed instructions in the Description field to guide the workers. For example: You need to label fruits in the provided image. Please ensure that you select the label 'fruit' and draw the box tightly around each fruit for better-quality label data. You also need to label other areas that look similar to fruit but are not fruit with the label 'no_fruit'. You can also optionally provide examples of good and bad labeling images. Make sure that these example images are publicly accessible.
  22. Choose Create to create the labeling job.

After the job is successfully created, the next step is to label the input images.

Start the labeling job

Once you have successfully created the job, the status of the job is InProgress. This means that the job is created and the private workforce is notified via email regarding the task assigned to them. Because you have assigned the task to yourself, you should receive an email with instructions to log in to the Ground Truth Labeling project.

  1. Open the email and choose the link provided.
  2. Enter the user name and password provided in the email.
    You may have to change the temporary password provided in the email to a new password after login.
  3. After you log in, select your job and choose Start working.

    You can use the provided tools to zoom in, zoom out, move, and draw bounding boxes in the images.
  4. Choose your label (fruit or no_fruit) and then draw a bounding box in the image to annotate it.
  5. When you’re finished, choose Submit.

Now you have correctly labeled images that will be used by the ML model for training.

Create your Amazon Rekognition project

To create your agriculture yield measuring project, complete the following steps:

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get Started.
  3. For Project name, enter fruits_yield.
  4. Choose Create project.

You can also create a project from the Projects page, which you can access via the navigation pane. The next step is to provide images as input.

Import your dataset

To create your agriculture yield measuring model, you first need to import a dataset to train the model with. For this post, our dataset is already labeled using Ground Truth.

  1. For Import images, select Import images labeled by SageMaker Ground Truth.
  2. For Manifest file location, enter the S3 bucket location of your manifest file (s3://{your-bucket-name}/fruits_image/annotated_data/fruits-labels/manifests/output/output.manifest).
  3. Choose Create Dataset.

You can see your labeled dataset.

Now you have an input dataset for the ML model to start training on.
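
If you'd rather script this step, a rough boto3 equivalent uses the CreateDataset API with the Ground Truth output manifest; the project ARN, bucket, and manifest key below are placeholders.

import boto3

rekognition = boto3.client("rekognition")

# Create a training dataset from the Ground Truth output manifest (ARN and S3 keys are placeholders)
response = rekognition.create_dataset(
    ProjectArn="arn:aws:rekognition:us-east-1:111122223333:project/fruits_yield/1234567890123",
    DatasetType="TRAIN",
    DatasetSource={
        "GroundTruthManifest": {
            "S3Object": {
                "Bucket": "your-bucket-name",
                "Name": "annotated-data/fruits-detection/manifests/output/output.manifest",
            }
        }
    },
)
print(response["DatasetArn"])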

Train your model

After you label your images, you’re ready to train your model.

  1. Choose Train model.
  2. For Choose project, choose your project fruits_yield.
  3. Choose Train Model.

Wait for the training to complete. Then you can start testing the performance of the trained model.
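
You can also start training through the API. The following sketch uses CreateProjectVersion and waits for training to finish; the ARNs, version name, and bucket are placeholders.

import boto3

rekognition = boto3.client("rekognition")
project_arn = "arn:aws:rekognition:us-east-1:111122223333:project/fruits_yield/1234567890123"  # placeholder

# Kick off training of a new model version
response = rekognition.create_project_version(
    ProjectArn=project_arn,
    VersionName="fruits_yield.v1",
    OutputConfig={"S3Bucket": "your-bucket-name", "S3KeyPrefix": "model-output/"},
)
print(response["ProjectVersionArn"])

# Block until training completes (this can take a while)
waiter = rekognition.get_waiter("project_version_training_completed")
waiter.wait(ProjectArn=project_arn, VersionNames=["fruits_yield.v1"])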

Test your model

Your agriculture yield measuring model is now ready for use and should be in the Running state. To test the model, complete the following steps:

Step 1: Start the model

On your model details page, on the Use model tab, choose Start.

Rekognition Custom Labels also provides the API calls for starting, using, and stopping your model.
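
As an illustration, the following boto3 sketch starts and stops a model version; the model ARN is a placeholder.

import boto3

rekognition = boto3.client("rekognition")
model_arn = "arn:aws:rekognition:us-east-1:111122223333:project/fruits_yield/version/fruits_yield.v1/1234567890123"  # placeholder

# Start the model; MinInferenceUnits controls provisioned throughput (and cost)
rekognition.start_project_version(ProjectVersionArn=model_arn, MinInferenceUnits=1)

# ... call detect_custom_labels while the model is running ...

# Stop the model when you're finished to avoid charges
rekognition.stop_project_version(ProjectVersionArn=model_arn)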

Step 2: Test the model

When the model is in the Running state, you can use the sample testing script analyzeImage.py to count the amount of fruit in an image.

  1. Download this script from the GitHub repo.
  2. Edit this file to replace the parameter bucket with your bucket name and model with your Amazon Rekognition model ARN.

We use the parameters photo and min_confidence as input for this Python script.
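
At its core, such a script makes a single DetectCustomLabels call and counts the returned labels. The following simplified sketch conveys the idea; the bucket name and model ARN are placeholders, and the actual script in the repo may differ.

import boto3

rekognition = boto3.client("rekognition")

def count_fruit(photo, min_confidence, bucket="your-bucket-name", model_arn="<your-model-arn>"):
    """Count objects labeled 'fruit' in an S3 image using the custom model."""
    response = rekognition.detect_custom_labels(
        ProjectVersionArn=model_arn,
        Image={"S3Object": {"Bucket": bucket, "Name": photo}},
        MinConfidence=min_confidence,
    )
    fruit_labels = [label for label in response["CustomLabels"] if label["Name"] == "fruit"]
    return len(fruit_labels)

print(count_fruit("test_data/15.jpeg", 85))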

You can run this script locally (with the AWS Command Line Interface (AWS CLI) configured with credentials) or by using AWS CloudShell. In our example, we ran the script via the CloudShell console. Note that CloudShell is free to use.

Make sure to install the required dependencies with the command pip3 install boto3 pillow if they're not already installed.

  1. Upload the file analyzeImage.py to CloudShell using the Actions menu.

The following screenshot shows the output, which detected two fruits in the input image. We supplied 15.jpeg as the photo argument and 85 as the min_confidence value.

The following example shows image 15.jpeg with two bounding boxes.

You can run the same script with other images and experiment by changing the confidence score further.

Step 3: Stop the model

When you're done, remember to stop the model to avoid incurring unnecessary charges. On your model details page, on the Use model tab, choose Stop.

Clean up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when you're no longer using them. You need to delete the Amazon Rekognition project and the S3 bucket.

Delete the Amazon Rekognition project

To delete the Amazon Rekognition project, complete the following steps:

  1. On the Amazon Rekognition console, choose Use Custom Labels.
  2. Choose Get started.
  3. In the navigation pane, choose Projects.
  4. On the Projects page, select the project that you want to delete, then choose Delete.
    The Delete project dialog box appears.
  5. If the project has no associated models or datasets, enter delete and choose Delete to delete the project.
  6. If the project has associated models or datasets:
    1. Enter delete to confirm that you want to delete the models and datasets.
    2. Choose Delete associated models, Delete associated datasets, or Delete associated datasets and models, depending on whether the project has datasets, models, or both.

    Model deletion might take a while to complete. The Amazon Rekognition console can't delete models that are in training or running, so stop any running models, wait for any models in training to complete, and try again. If you close the dialog box during model deletion, the models are still deleted, and you can delete the project later by repeating this procedure.

  7. Enter delete to confirm that you want to delete the project.
  8. Choose Delete to delete the project.

Delete your S3 bucket

You first need to empty the bucket and then delete it.

  1. On the Amazon S3 console, choose Buckets.
  2. Select the bucket that you want to empty, then choose Empty.
  3. Confirm that you want to empty the bucket by entering the bucket name into the text field, then choose Empty.
  4. Choose Delete.
  5. Confirm that you want to delete the bucket by entering the bucket name into the text field, then choose Delete bucket.
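
Alternatively, the following boto3 sketch empties and deletes the bucket programmatically; the bucket name is a placeholder.

import boto3

bucket = boto3.resource("s3").Bucket("your-bucket-name")  # placeholder

# Delete all objects (and any object versions, if versioning was enabled), then the bucket itself
bucket.objects.all().delete()
bucket.object_versions.all().delete()
bucket.delete()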

Conclusion

In this post, we showed you how to create an object detection model with Rekognition Custom Labels. This feature makes it easy to train a custom model that can detect an object class without needing to specify other objects or losing accuracy in its results.

For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?


About the authors

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Sameer Goel is a Sr. Solutions Architect in the Netherlands, who drives customer success by building prototypes on cutting-edge initiatives. Prior to joining AWS, Sameer graduated with a master’s degree from Boston, with a concentration in data science. He enjoys building and experimenting with AI/ML projects on Raspberry Pi. You can find him on LinkedIn.

Read More

Amazon SageMaker Automatic Model Tuning now supports SageMaker Training Instance Fallbacks

Today, Amazon SageMaker announced support for SageMaker training instance fallbacks in Amazon SageMaker Automatic Model Tuning (AMT), which allows users to specify alternative compute resource configurations.

SageMaker automatic model tuning finds the best version of a model by running many training jobs on your dataset using the ranges of hyperparameters that you specify for your algorithm. Then, it chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

Previously, users could only specify a single instance configuration. This can lead to problems when the specified instance type isn't available due to high utilization: your training jobs would fail with an InsufficientCapacityError (ICE). AMT used smart retries to avoid these failures in many cases, but it remained powerless in the face of sustained low capacity.

This new feature means that you can specify a list of instance configurations in order of preference, so that your AMT job automatically falls back to the next instance in the list in the event of low capacity.

In the following sections, we walk through these high-level steps for overcoming an ICE:

  1. Define HyperParameter Tuning Job Configuration
  2. Define the Training Job Parameters
  3. Run a Hyperparameter Tuning Job
  4. Describe training jobs

Define HyperParameter Tuning Job Configuration

The HyperParameterTuningJobConfig object describes the tuning job, including the search strategy, the objective metric used to evaluate training jobs, the ranges of the parameters to search, and the resource limits for the tuning job. This aspect wasn’t changed with today’s feature release. Nevertheless, we’ll go over it to give a complete example.

The ResourceLimits object specifies the maximum number of training jobs and parallel training jobs for this tuning job. In this example, we’re doing a random search strategy and specifying a maximum of 10 jobs (MaxNumberOfTrainingJobs) and 5 concurrent jobs (MaxParallelTrainingJobs) at a time.

The ParameterRanges object specifies the ranges of hyperparameters that this tuning job searches. We specify the name, as well as the minimum and maximum value of the hyperparameter to search. In this example, we define the minimum and maximum values for the Continuous and Integer parameter ranges and the name of the hyperparameter (“eta”, “max_depth”).

AmtTuningJobConfig={
            "Strategy": "Random",
            "ResourceLimits": {
              "MaxNumberOfTrainingJobs": 10,
              "MaxParallelTrainingJobs": 5
            },
            "HyperParameterTuningJobObjective": {
              "MetricName": "validation:rmse",
              "Type": "Minimize"
            },
            "ParameterRanges": {
              "CategoricalParameterRanges": [],
              "ContinuousParameterRanges": [
                {
                    "MaxValue": "1",
                    "MinValue": "0",
                    "Name": "eta"
                }
              ],
              "IntegerParameterRanges": [
                {
                  "MaxValue": "6",
                  "MinValue": "2",
                  "Name": "max_depth"
                }
              ]
            }
          }

Define the Training Job Parameters

In the training job definition, we define the input needed to run a training job using the algorithm that we specify. After the training completes, SageMaker saves the resulting model artifacts to an Amazon Simple Storage Service (Amazon S3) location that you specify.

Previously, we specified the instance type, count, and volume size under the ResourceConfig parameter. When the instance under this parameter was unavailable, an Insufficient Capacity Error (ICE) was thrown.

To avoid this, we now have the HyperParameterTuningResourceConfig parameter under the TrainingJobDefinition, where we specify a list of instances to fall back on. The format of these instances is the same as in the ResourceConfig. The job will traverse the list top-to-bottom to find an available instance configuration. If an instance is unavailable, then instead of an Insufficient Capacity Error (ICE), the next instance in the list is chosen, thereby overcoming the ICE.

TrainingJobDefinition={
            "HyperParameterTuningResourceConfig": {
              "InstanceConfigs": [
                {
                  "InstanceType": "ml.m4.xlarge",
                  "InstanceCount": 1,
                  "VolumeSizeInGB": 5
                },
                {
                  "InstanceType": "ml.m5.4xlarge",
                  "InstanceCount": 1,
                  "VolumeSizeInGB": 5
                }
              ]
            },
            "AlgorithmSpecification": {
              "TrainingImage": "433757028032.dkr.ecr.us-west-2.amazonaws.com/xgboost:latest",
              "TrainingInputMode": "File"
            },
            "InputDataConfig": [
              {
                "ChannelName": "train",
                "CompressionType": "None",
                "ContentType": "json",
                "DataSource": {
                  "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://<bucket>/test/"
                  }
                },
                "RecordWrapperType": "None"
              }
            ],
            "OutputDataConfig": {
              "S3OutputPath": "s3://<bucket>/output/"
            },
            "RoleArn": "arn:aws:iam::340308762637:role/service-role/AmazonSageMaker-ExecutionRole-20201117T142856",
            "StoppingCondition": {
              "MaxRuntimeInSeconds": 259200
            },
            "StaticHyperParameters": {
              "training_script_loc": "q2bn-sagemaker-test_6"
            }
          }

Run a Hyperparameter Tuning Job

In this step, we’re creating and running a hyperparameter tuning job with the hyperparameter tuning resource configuration defined above.

We initialize a SageMaker client and create the job by specifying the tuning config, training job definition, and a job name.

import boto3
sm = boto3.client('sagemaker')     
                    
sm.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName="my-job-name",
    HyperParameterTuningJobConfig=AmtTuningJobConfig,
    TrainingJobDefinition=TrainingJobDefinition) 

Running an AMT job with the support of SageMaker training instance fallbacks empowers the user to overcome insufficient capacity by themselves, thereby reducing the chance of a job failure.

Describe training jobs

The following function lists all instance types used during the experiment and can be used to verify whether a SageMaker training job has automatically fallen back to the next instance in the list during resource allocation.

def list_instances(name):
    """Return the ResourceConfig of every training job launched by the tuning job."""
    job_list = []
    instances = []

    def _get_training_jobs(name, next_token=None):
        # Page through all training jobs that belong to the tuning job
        if next_token:
            response = sm.list_training_jobs_for_hyper_parameter_tuning_job(
                HyperParameterTuningJobName=name, NextToken=next_token)
        else:
            response = sm.list_training_jobs_for_hyper_parameter_tuning_job(
                HyperParameterTuningJobName=name)
        for job in response['TrainingJobSummaries']:
            job_list.append(job['TrainingJobName'])
        next_token = response.get('NextToken', None)
        if next_token:
            _get_training_jobs(name, next_token=next_token)

    _get_training_jobs(name)

    # Look up the instance configuration each training job actually ran on
    for job_name in job_list:
        details = sm.describe_training_job(TrainingJobName=job_name)
        instances.append(details['ResourceConfig'])
    return instances

list_instances("my-job-name")

The output of the preceding function lists all of the instance configurations that the AMT job used to run the experiment.

Conclusion

In this post, we demonstrated how you can now define a pool of instances for your AMT experiment to fall back on in the case of an InsufficientCapacityError. We saw how to define a hyperparameter tuning job configuration and specify the maximum number of training jobs and maximum parallel jobs. Finally, we saw how to overcome the InsufficientCapacityError by using the HyperParameterTuningResourceConfig parameter, which can be specified under the training job definition.

To learn more about AMT, visit Amazon SageMaker Automatic Model Tuning.


About the authors

Doug Mbaya is a Senior Partner Solutions Architect with a focus on data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud.

Kruthi Jayasimha Rao is a Partner Solutions Architect in the Scale-PSA team. Kruthi conducts technical validations for Partners, enabling them to progress in the Partner Path.

Bernard Jollans is a Software Development Engineer for Amazon SageMaker Automatic Model Tuning.

Read More