Cloud-based medical imaging reconstruction using deep neural networks

Medical imaging techniques like computed tomography (CT), magnetic resonance imaging (MRI), medical X-ray imaging, ultrasound imaging, and others are commonly used by doctors for various purposes, such as detecting changes in the appearance of organs, tissues, and vessels, and detecting abnormalities such as tumors and various other types of pathologies.

Before doctors can use the data from those techniques, the data needs to be transformed from its native raw form to a form that can be displayed as an image on a computer screen.

This process is known as image reconstruction, and it plays a crucial role in a medical imaging workflow—it’s the step that creates diagnostic images that can be then reviewed by doctors.

In this post, we discuss a use case of MRI reconstruction, but the architectural concepts can be applied to other types of image reconstruction.

Advances in the field of image reconstruction have led to the successful application of AI-based techniques within magnetic resonance (MR) imaging. These techniques aim to increase the accuracy of the reconstruction and, in the case of the MR modality, to decrease the time required for a full scan.

Within MR, applications using AI to work with under-sampled acquisitions have been successfully employed, achieving a nearly tenfold reduction in scan times.

Waiting times for tests like MRIs and CT scans have increased rapidly in the last couple of years, in some cases reaching 3 months. To ensure good patient care, the growing need for quick availability of reconstructed images, along with the pressure to reduce operational costs, calls for a solution capable of scaling according to storage and computational needs.

In addition to computational needs, data volumes have grown steadily in the last few years. For example, looking at the datasets made available by Medical Image Computing and Computer-Assisted Intervention (MICCAI), the annual growth is 21% for MRI, 24% for CT, and 31% for functional MRI (fMRI). (For more information, refer to Dataset Growth in Medical Image Analysis Research.)

In this post, we show you a solution architecture that addresses these challenges. This solution can enable research centers, medical institutions, and modality vendors to have access to unlimited storage capabilities, scalable GPU power, fast data access for machine learning (ML) training and reconstruction tasks, simple and fast ML development environments, and on-premises caching for fast, low-latency availability of image data.

Solution overview

This solution uses an MRI reconstruction technique known as Robust Artificial-neural-networks for k-space Interpolation (RAKI). This approach is advantageous because it’s scan-specific and doesn’t require prior data to train the neural network. The drawback to this technique is that it requires a lot of computational power to be effective.

The AWS architecture outlined shows how a cloud-based reconstruction approach can effectively perform computational-heavy tasks like the one required by the RAKI neural network, scaling according to the load and accelerating the reconstruction process. This opens the door to techniques that can’t realistically be implemented on premises.

Data layer

The data layer has been architected around the following principles:

  • Seamless integration with modalities that store the data they generate on an attached storage drive via a network share on a NAS device
  • Limitless and secure data storage capabilities that scale with the continuous demand for storage space
  • Fast storage availability for ML workloads such as deep neural network training and neural image reconstruction
  • The ability to archive historical data using a low-cost, scalable approach
  • Availability of the most frequently accessed reconstructed data while keeping less frequently accessed data archived at a lower cost

The following diagram illustrates this architecture.

This approach uses the following services:

  • AWS Storage Gateway for seamless integration with the on-premises modality, which exchanges information via a file share. This allows transparent access to the following AWS Cloud storage capabilities while maintaining how the modality exchanges data:

    • Fast cloud upload of the volumes generated by the MR modality.
    • Low-latency access to frequently used reconstructed MR studies via local caching offered by Storage Gateway.
  • Amazon Simple Storage Service (Amazon S3) for unlimited and scalable cloud storage. Amazon S3 also provides low-cost deep archiving of historical raw MRI data with Amazon S3 Glacier, and an intelligent storage tier for the reconstructed MRI data with Amazon S3 Intelligent-Tiering.
  • Amazon FSx for Lustre for fast and scalable intermediate storage used for ML training and reconstruction tasks.

The following figure shows a concise architecture describing the data exchange between the cloud environments.

Using Storage Gateway with its caching mechanism allows on-premises applications to quickly access data that’s available in the local cache, while simultaneously providing access to scalable storage space in the cloud.

With this approach, modalities can generate raw data from acquisition jobs and write the raw data to a network share handled by Storage Gateway.

If the modality generates multiple files that belong to the same scan, we recommend creating a single archive (for example, a .tar file) and performing a single transfer to the network share to accelerate the data transfer.
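
For example, a minimal sketch of this bundling step on the acquisition workstation could look like the following Python snippet. The scan folder, archive name, and mount point of the Storage Gateway file share are hypothetical and depend on how the share is mounted in your environment.

import tarfile
import shutil
from pathlib import Path

# Hypothetical paths: the modality's output folder for one scan and the
# mounted Storage Gateway file share (SMB/NFS)
scan_dir = Path("/data/mri/scan_0042")
archive_path = Path("/tmp/scan_0042.tar")
gateway_share = Path("/mnt/storage-gateway")

# Bundle all files belonging to the scan into a single .tar archive
with tarfile.open(archive_path, "w") as tar:
    tar.add(scan_dir, arcname=scan_dir.name)

# A single copy to the share results in a single upload through Storage Gateway
shutil.copy2(archive_path, gateway_share / archive_path.name)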

Data decompression and transformation layer

The data decompression layer receives the raw data, automatically performs decompression, and applies potential transformations to the raw data before submitting the preprocessed data to the reconstruction layer.

The adopted architecture is outlined in the following figure.

In this architecture, raw MRI data lands in the raw MRI S3 bucket, which triggers a new entry in an Amazon Simple Queue Service (Amazon SQS) queue.

An AWS Lambda function retrieves the raw MRI Amazon SQS queue depth, which represents the number of raw MRI acquisitions uploaded to the AWS Cloud. This value is used with AWS Fargate to automatically modulate the size of an Amazon Elastic Container Service (Amazon ECS) cluster.
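
A minimal sketch of such a scaling function is shown below. The queue URL and the ECS cluster and service names are hypothetical and passed as environment variables; the actual solution may apply a different scaling policy.

import os
import boto3

sqs = boto3.client("sqs")
ecs = boto3.client("ecs")

QUEUE_URL = os.environ["RAW_MRI_QUEUE_URL"]        # hypothetical SQS queue for raw MRI uploads
CLUSTER = os.environ["ECS_CLUSTER_NAME"]           # hypothetical ECS cluster running on Fargate
SERVICE = os.environ["ECS_SERVICE_NAME"]           # hypothetical decompression/preprocessing service
MAX_TASKS = int(os.environ.get("MAX_TASKS", "10"))

def handler(event, context):
    # The approximate number of messages reflects how many raw acquisitions are waiting
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Scale the Fargate service to one task per queued scan, capped at MAX_TASKS
    desired = min(backlog, MAX_TASKS)
    ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=desired)
    return {"backlog": backlog, "desiredCount": desired}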

This approach lets the cluster automatically scale up and down according to the number of raw scans landing in the raw input bucket.

After the raw MRI data is decompressed and preprocessed, it’s saved into another S3 bucket so that it can be reconstructed.

Neural model development layer

The neural model development layer consists of a RAKI implementation. This creates a neural network model to allow the fast image reconstruction of under-sampled magnetic resonance raw data.

The following figure shows the architecture that realizes the neural model development and container creation.

In this architecture, Amazon SageMaker is used to develop the RAKI neural model, and simultaneously to create the container that is later used to perform the MRI reconstruction.

Then, the created container is pushed to a fully managed Amazon Elastic Container Registry (Amazon ECR) repository so that it can later be used to spin up reconstruction tasks.

Fast data storage is guaranteed by the adoption of Amazon FSx for Lustre. It provides sub-millisecond latencies, up to hundreds of GBps of throughput, and up to millions of IOPS. This approach gives SageMaker access to a cost-effective, high-performance, and scalable storage solution.

MRI reconstruction layer

The MRI reconstruction based on the RAKI neural network is handled by the architecture shown in the following diagram.

With the same architectural pattern adopted in the decompression and preprocessing layer, the reconstruction layer automatically scales up and down by analyzing the depth of the queue that holds the reconstruction requests. In this case, to enable GPU support, AWS Batch is used to run the MRI reconstruction jobs.
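
As an illustration, a GPU-backed reconstruction job could be submitted to AWS Batch as in the following sketch. The job queue, job definition, and S3 locations are hypothetical placeholders.

import boto3

batch = boto3.client("batch")

response = batch.submit_job(
    jobName="raki-recon-scan-0042",
    jobQueue="mri-reconstruction-gpu-queue",   # hypothetical GPU-enabled job queue
    jobDefinition="raki-reconstruction:1",     # hypothetical job definition
    containerOverrides={
        # Request one GPU so the job lands on a GPU compute environment
        "resourceRequirements": [{"type": "GPU", "value": "1"}],
        "environment": [
            {"name": "INPUT_S3_URI", "value": "s3://preprocessed-mri-bucket/scan_0042.npy"},
            {"name": "OUTPUT_S3_URI", "value": "s3://reconstructed-mri-bucket/scan_0042/"},
        ],
    },
)
print(response["jobId"])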

Amazon FSx for Lustre is used to exchange the large amount of data involved in MRI acquisition. Furthermore, when a reconstruction job is complete and the reconstructed MRI data is stored in the target S3 bucket, the architecture automatically requests a refresh of the Storage Gateway cache. This makes the reconstructed data available to the on-premises facility.
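
The cache refresh itself can be requested through the Storage Gateway RefreshCache API, for example as in the following sketch; the file share ARN and folder are hypothetical.

import boto3

storagegateway = boto3.client("storagegateway")

# Refresh only the folder that holds the reconstructed studies
storagegateway.refresh_cache(
    FileShareARN="arn:aws:storagegateway:eu-west-1:111122223333:share/share-EXAMPLE",
    FolderList=["/reconstructed"],
    Recursive=True,
)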

Overall architecture and results

The overall architecture is shown in the following figure.

We applied the described architecture to MRI reconstruction tasks with datasets approximately 2.4 GB in size.

It took approximately 210 seconds to train on 221 datasets, for a total of 514 GB of raw data, on a single node equipped with an NVIDIA Tesla V100-SXM2-16GB GPU.

The reconstruction, after the RAKI network has been trained, took an average of 40 seconds on a single node equipped with an NVIDIA Tesla V100-SXM2-16GB GPU.

Applying the preceding architecture to a reconstruction job yields the results shown in the following figure.

The image shows that good results can be obtained with reconstruction techniques such as RAKI. Moreover, adopting cloud technology makes these computation-heavy approaches available without the limitations of on-premises solutions, where storage and computational resources are always limited.

Conclusions

With tools such as Amazon SageMaker, Amazon FSx for Lustre, AWS Batch, Fargate, and Lambda, we can create a managed environment that is scalable, secure, cost-effective, and capable of performing complex tasks such as image reconstruction at scale.

In this post, we explored a possible solution for image reconstruction from raw modality data using a computationally intensive technique known as RAKI: a database-free deep learning technique for fast image reconstruction.

To learn more about how AWS is accelerating innovation in healthcare, visit AWS for Health.


About the author

Benedetto Carollo is the Senior Solution Architect for medical imaging and healthcare at Amazon Web Services in Europe, Middle East, and Africa. His work focuses on helping medical imaging and healthcare customers solve business problems by leveraging technology. Benedetto has over 15 years of experience in technology and medical imaging and has worked for companies like Canon Medical Research and Vital Images. Benedetto received his summa cum laude MSc in Software Engineering from the University of Palermo, Italy.

Read More

Customize your recommendations by promoting specific items using business rules with Amazon Personalize

Today, we are excited to announce the Promotions feature in Amazon Personalize, which allows you to explicitly recommend specific items to your users based on rules that align with your business goals. For instance, you may have marketing partnerships that require you to promote certain brands, in-house content, or categories that you want to improve the visibility of. Promotions give you more control over recommended items. You can define business rules to identify promotional items and showcase them across your entire user base, without any extra cost. You also control the percentage of the promoted content in your recommendations. Amazon Personalize automatically finds the relevant items within the set of promotional items that meet your business rule and distributes them within each user’s recommendations.

Amazon Personalize enables you to improve customer engagement by powering personalized product and content recommendations in websites, applications, and targeted marketing campaigns. You can get started without any prior machine learning (ML) experience, using APIs to easily build sophisticated personalization capabilities in a few clicks. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

In this post, we demonstrate how to customize your recommendations with the new promotions feature for an ecommerce use case.

Solution overview

Different businesses can use promotions based on their individual goals for the type of content they want to increase engagement on. You can use promotions to have a percentage of your recommendations be of a particular type for any application regardless of the domain. For example, in ecommerce applications, you can use this feature to have 20% of recommended items be those marked as on sale, or from a certain brand, or category. For video-on-demand use cases, you can use this feature to fill 40% of a carousel with newly launched shows and movies that you want to highlight, or to promote live content. You can use promotions in domain dataset groups and custom dataset groups (User-Personalization and Similar-Items recipes).

Amazon Personalize makes configuring promotions simple: first, create a filter that selects the items you want to promote. You can use the Amazon Personalize console or API to create a filter with your logic using the Amazon Personalize domain-specific language (DSL). It only takes a few minutes. Then, when requesting recommendations, specify the promotion by providing the filter, the percentage of the recommendations that should match that filter, and, if required, the dynamic filter parameters. The promoted items are randomly distributed in the recommendations, but any existing recommendations aren’t removed.

The following diagram shows how you can use promotions in recommendations in Amazon Personalize.

You define the items to promote in the catalog system, load them into the Amazon Personalize items dataset, and then get recommendations. Getting recommendations without specifying a promotion returns the most relevant items, and in this example, only one item from the promoted items; there is no guarantee that promoted items will be returned. Getting recommendations with 50% promoted items returns a list in which half of the items belong to the promoted items.

This post walks you through the process of defining and applying promotions in your recommendations in Amazon Personalize to ensure the results from a campaign or recommender contain specific items that you want users to see. For this example, we create a retail recommender and promote items with CATEGORY_L2 as halloween, which corresponds to Halloween decorations. A code sample for this use case is available on GitHub.

Prerequisites

To use promotions, you first set up some Amazon Personalize resources on the Amazon Personalize console. Create your dataset group, load your data, and train a recommender. For full instructions, see Getting started.

  1. Create a dataset group.
  2. Create an Interactions dataset using the following schema:
    {
        "type": "record",
        "name": "Interactions",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {
                "name": "USER_ID",
                "type": "string"
            },
            {
                "name": "ITEM_ID",
                "type": "string"
            },
            {
                "name": "TIMESTAMP",
                "type": "long"
            },
            {
                "name": "EVENT_TYPE",
                "type": "string"
            }
        ],
        "version": "1.0"
    }

  3. Import the interaction data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3). For this example, we use the following data file. We generated the synthetic data based on the code in the Retail Demo Store project. Refer to the GitHub repo to learn more about the data and potential uses.
  4. Create an Items dataset using the following schema:
    {
        "type": "record",
        "name": "Items",
        "namespace": "com.amazonaws.personalize.schema",
        "fields": [
            {
                "name": "ITEM_ID",
                "type": "string"
            },
            {
                "name": "PRICE",
                "type": "float"
            },
            {
                "name": "CATEGORY_L1",
                "type": ["string"],
                "categorical": true
            },
            {
                "name": "CATEGORY_L2",
                "type": ["string"],
                "categorical": true
            },
            {
                "name": "GENDER",
                "type": ["string"],
                "categorical": true
            }
        ],
        "version": "1.0"
    }

  5. Import the item data to Amazon Personalize from Amazon S3. For this example, we use the following data file, based on the code in the Retail Demo Store project. For more information on formatting and importing your interactions and items data from Amazon S3, see Importing bulk records.
  6. Create a recommender. In this example, we create a “Recommended for you” recommender.

Create a filter for your promotions

Now that you have set up your Amazon Personalize resources, you can create a filter that selects the items for your promotion.

You can create a static filter where all variables are hardcoded at filter creation. For example, to add all items that have CATEGORY_L2 as halloween, use the following filter expression:

INCLUDE ItemID WHERE Items.CATEGORY_L2 IN ("halloween")

You can also create dynamic filters. Dynamic filters are customizable in real time when you request the recommendations. To create a dynamic filter, you define your filter expression criteria using a placeholder parameter instead of a fixed value. This allows you to choose the values to filter by applying a filter to a recommendation request, rather than when you create your expression. You provide a filter when you call the GetRecommendations or GetPersonalizedRanking API operations, or as a part of your input data when generating recommendations in batch mode through a batch inference job.

For example, to select all items in a category chosen when you make your inference call with a filter applied, use the following filter expression:

INCLUDE ItemID WHERE Items.CATEGORY_L2 IN ($CATEGORY)

You can use the preceding DSL to create a customizable filter on the Amazon Personalize console. Complete the following steps:

  1. On the Amazon Personalize console, on the Filters page, choose Create filter.
  2. For Filter name, enter the name for your filter (for this post, we enter category_filter).
  3. Select Build expression or add your expression manually to create your custom filter.
  4. Build the expression “Include ItemID WHERE Items.CATEGORY_L2 IN $CATEGORY”. For Value, enter $ plus a parameter name that is similar to your property name and easy to remember (for this example, $CATEGORY).
  5. Optionally, to chain additional expressions with your filter, choose the plus sign.
  6. To add additional filter expressions, choose Add expression.
  7. Choose Create filter.

You can also create filters via the CreateFilter API in Amazon Personalize, as shown in the following example. For more information, see CreateFilter.
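
The following is a minimal sketch of creating the same dynamic filter with the AWS SDK for Python (Boto3); the dataset group ARN is a hypothetical placeholder.

import boto3

personalize = boto3.client("personalize")

create_filter_response = personalize.create_filter(
    name="category_filter",
    datasetGroupArn="arn:aws:personalize:us-west-2:000000000000:dataset-group/retail-dataset-group",
    filterExpression='INCLUDE ItemID WHERE Items.CATEGORY_L2 IN ($CATEGORY)',
)
print(create_filter_response["filterArn"])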

Apply promotions to your recommendations

Applying a filter when getting recommendations is a good way to tailor your recommendations to specific criteria. However, using filters directly applies the filter to all the recommendations returned. When using promotions, you can select what percentage of the recommendations correspond to the promoted items, allowing you to mix and match personalized recommendations and the best items that match the promotion criteria for each user in the proportions that make sense for your business use case.

The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender:

{
    "recommenderArn": "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    "userId": "1",
    "numResults": 20
}

This request returns personalized recommendations for the specified user. Of the items in the catalog, these are the 20 most relevant items for the user.

We can do the same call and apply a filter to return only items that match the filter. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and applies a dynamic filter to only return relevant items that have CATEGORY_L2 as halloween:

{
    "recommenderArn": "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    "userId": "1",
    "numResults": 20,
    "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
    "filterValues": { "CATEGORY": "\"halloween\"" }
}

This request returns personalized recommendations for the specified user that have CATEGORY_L2 as halloween. Out of the items in the catalog, these are the 20 most relevant items with CATEGORY_L2 as halloween for the user.

You can use promotions if you want a certain percentage of items to be of an attribute you want to promote, and the rest to be items that are the most relevant for this user out of all items in the catalog. We can do the same call and apply a promotion. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and applies a promotion to include a certain percentage of relevant items that have CATEGORY_L2 as halloween:

{
    "recommenderArn": "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    "userId": "1",
    "numResults": 20,
    "promotions": [{
        "name": "halloween_promotion",
        "percentPromotedItems": 20,
        "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
        "filterValues": {
            "CATEGORY": "\"halloween\""
        }
    }]
}

This request returns 20% of recommendations that match the filter specified in the promotion (items with CATEGORY_L2 as halloween) and 80% personalized recommendations for the specified user that are the most relevant items for that user out of all the items in the catalog.
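
As a rough sketch, the same promotions request can be sent with the AWS SDK for Python (Boto3) through the personalize-runtime client. The ARNs are the same hypothetical placeholders used in the preceding request bodies, and promotionName is read defensively because it is only present on promoted items.

import boto3

personalize_runtime = boto3.client("personalize-runtime")

response = personalize_runtime.get_recommendations(
    recommenderArn="arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    userId="1",
    numResults=20,
    promotions=[{
        "name": "halloween_promotion",
        "percentPromotedItems": 20,
        "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
        # The filter value is itself a quoted string
        "filterValues": {"CATEGORY": '"halloween"'},
    }],
)

for item in response["itemList"]:
    print(item["itemId"], item.get("promotionName", ""))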

You can use a filter combined with promotions. The filter in the top-level parameter block applies only to the non-promoted items.

The filter to select the promoted items is specified in the promotions parameter block. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and uses the dynamic filter we have been using twice. The first filter applies to non-promoted items, selecting items with CATEGORY_L2 as decorative, and the second filter applies to the promotion, promoting items with CATEGORY_L2 as halloween:

{
    "recommenderArn": "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    "userId": "1",
    "numResults": 20,
    "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
    "filterValues": {
        "CATEGORY": "\"decorative\""
    },
    "promotions": [{
        "name": "halloween_promotion",
        "percentPromotedItems": 20,
        "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
        "filterValues": {
            "CATEGORY": "\"halloween\""
        }
    }]
}

This request returns 20% of recommendations that match the filter specified in the promotion: items with CATEGORY_L2 as halloween. The remaining 80% of recommended items are personalized recommendations for the specified user with CATEGORY_L2 as decorative. These are the most relevant items for the user out of the items in the catalog with CATEGORY_L2 as decorative.

Clean up

Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete filters, recommenders, datasets, and dataset groups via the AWS Management Console or using the Python SDK.

Summary

Adding promotions in Amazon Personalize allows you to customize your recommendations for each user by including items whose visibility and engagement you want to explicitly increase. Promotions also allow you to specify what percentage of the recommended items should be promoted items, which tailors the recommendations to meet your business objectives at no extra cost. You can use promotions for recommendations using the User-Personalization and Similar-Items recipes, as well as use case optimized recommenders.

For more information about Amazon Personalize, see What Is Amazon Personalize?


About the authors

Anna Gruebler is a Solutions Architect at AWS.

Alex Burkleaux is a Solutions Architect at AWS. She focuses on helping customers apply machine learning and data analytics to solve problems in the media and entertainment industry.  In her free time, she enjoys spending time with family and volunteering as a ski patroller at her local ski hill.

Liam Morrison is a Solutions Architect Manager at AWS. He leads a team focused on Marketing Intelligence services. He has spent the last 5 years focused on practical applications of Machine Learning in Media & Entertainment, helping customers implement personalization, natural language processing, computer vision and more.

Read More

Amazon SageMaker JumpStart solutions now support custom IAM role settings

Amazon SageMaker JumpStart solutions are a feature within Amazon SageMaker Studio that provide a one-click experience to set up your own machine learning (ML) workflows. When you launch a solution, various AWS resources are set up in your account to demonstrate how the business problem can be solved using the pre-built architecture. The solutions use AWS CloudFormation templates for quick deployment, which means the resources are fully customizable. As of today, there are up to 18 end-to-end solutions that cover different aspects of real-world business problems, such as demand forecasting, product defect detection, and document understanding.

Starting today, we’re excited to announce that JumpStart solutions now support passing custom AWS Identity and Access Management (IAM) roles into the services they deploy. This new feature enables you to take advantage of the rich security features offered by SageMaker and IAM.

In this post, we show you how to configure your SageMaker solution’s advanced parameters, and how this can benefit you when you use the pre-built solutions to start your ML journey.

New IAM advanced parameters

To allow JumpStart to create the AWS resources for you, IAM roles with AWS managed policies attached are automatically created in your account. For the services created by JumpStart to interact with each other, an IAM role needs to be passed into each service so it has the necessary permissions to call other services.

With the new Advanced Parameters option, you can select Default Roles, Find Roles, or Input Roles when you launch a solution. This means each service uses its own IAM role with a dedicated IAM policy attached, and is fully customizable. This allows you to follow the principle of least privilege, so that only the permissions required to perform a task are granted.

The policies attached to the default roles contain the least amount of permissions needed for the solution. In addition to the default roles, you can also select from a drop-down list, or input your own roles with the custom permissions you want to grant. This can greatly benefit you if you want to expand on the existing solution and perform even more tasks with these pre-built AWS services.

How to configure IAM advanced parameters

Before you use this feature, make sure you have the latest SageMaker domain enabled. You can create a new SageMaker domain if you haven’t done so, or update your SageMaker domain to create the default roles required for JumpStart solutions. Then complete the following steps:

  1. On the SageMaker console, choose Control Panel in the navigation pane.
  2. Choose the gear icon to edit your domain settings.
  3. In the General Settings section, choose Next.
  4. In the SageMaker Projects and JumpStart section, select Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for this account and Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for Studio users.
  5. Choose Next.
    Done! You should now be able to see the roles enabled on the SageMaker console, and you can use JumpStart solutions with this new feature enabled.
  6. On the Studio console, choose JumpStart in the navigation pane.
  7. Choose Solutions. In the Launch Solution section, you can see a new drop-down menu called Advanced Parameters. Each solution requires different resources. Based on the services that the solution interacts with, there’s a dynamic list of roles you can pass in when launching the solution.
  8. Select your preferred method to specify roles.
    If you select Default Role, the roles are pre-populated for you. You can then proceed to launch the solution with one click. Under the hood, AWS CloudFormation uses a built-in template to provision all appropriate AWS resources, and the default roles are used by each service. If you select Find Role, you can select an existing IAM role in your account from the drop-down menu for each required service. In order to let the services work as they are designed, we recommend choosing a role that has the minimum permissions required. For more information about the permissions required for each service, refer to AWS Managed Policies for SageMaker projects and JumpStart.

    You can have more flexibility by selecting Input Role, which allows you to enter a role name directly. This works best if you know which role you want to use, so you don’t need to choose it from the Find Role list.
  9. After you specify the role you want to use for each service, launch the solution by choosing Launch.

The roles are passed into each service and grant each service permission to interact with other services. The CloudFormation template deploys these services in your account. You can then explore the ML solution for the business problem. Keep in mind that each service now has exactly the permissions you granted when you configured the advanced parameters. This gives you a fully controlled and secured environment when using JumpStart solutions.

Conclusion

Today, we announced support for configuring IAM roles when you launch a JumpStart solution. We also showed you how to configure the Advanced Parameters options before launching a solution.

Try out any JumpStart solution on Studio with this new feature enabled. If you have any questions or feedback regarding JumpStart solutions, please speak to your AWS support contact or post a message in the Amazon SageMaker discussion forums.


About the authors

Haotian An is a Software Development Engineer at Amazon SageMaker Jumpstart. He focuses on building tools and products to make machine learning easier to access for customers.

Manan Shah is a Software Development Manager at Amazon Web Services. He is an ML enthusiast and focuses on building no-code/low-code AI/ML products. He thrives on empowering other talented, technical people to build great software.

Read More

Intelligent document processing with AWS AI services: Part 2

Amazon’s intelligent document processing (IDP) helps you speed up your business decision cycles and reduce costs. Across multiple industries, customers need to process millions of documents per year in the course of their business. For customers who process millions of documents, this is a critical aspect of the end-user experience and a top digital transformation priority. Because of the varied formats, most firms manually process documents such as W2s, claims, ID documents, invoices, and legal contracts, or use legacy OCR (optical character recognition) solutions that are time-consuming, error-prone, and costly. An IDP pipeline with AWS AI services empowers you to go beyond OCR with more accurate and versatile information extraction, process documents faster, save money, and shift resources to higher value tasks.

In this series, we give an overview of the IDP pipeline to reduce the amount of time and effort it takes to ingest a document and get the key information into downstream systems. The following figure shows the stages that are typically part of an IDP workflow.

Phases of intelligent document processing with AWS AI services.

In this two-part series, we discuss how you can automate and intelligently process documents at scale using AWS AI services. In part 1, we discussed the first three phases of the IDP workflow. In this post, we discuss the remaining workflow phases.

Solution overview

The following reference architecture shows how you can use AWS AI services like Amazon Textract and Amazon Comprehend, along with other AWS services to implement the IDP workflow. In part 1, we described the data capture and document classification stages, where we categorized and tagged documents such as bank statements, invoices, and receipt documents. We also discussed the extraction stage, where you can extract meaningful business information from your documents. In this post, we extend the IDP pipeline by looking at Amazon Comprehend default and custom entities in the extraction phase, perform document enrichment, and also briefly look at the capabilities of Amazon Augmented AI (Amazon A2I) to include a human review workforce in the review and validation stage.

We also use Amazon Comprehend Medical as part of this solution, which is a service that accurately and quickly extracts information from unstructured medical text, identifies relationships among the extracted health information, and links to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT.

Amazon A2I is a machine learning (ML) service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers whether it runs on AWS or not. Amazon A2I integrates with Amazon Textract and Amazon Comprehend to provide you the ability to introduce human review steps within your IDP workflow.

Prerequisites

Before you get started, refer to part 1 for a high-level overview of IDP and details about the data capture, classification, and extraction stages.

Extraction phase

In part 1 of this series, we discussed how we can use Amazon Textract features for accurate data extraction from any type of document. To extend this phase, we use Amazon Comprehend pre-trained entities and an Amazon Comprehend custom entity recognizer for further document extraction. The purpose of the custom entity recognizer is to identify specific entities and generate custom metadata about our documents in CSV or human-readable format, to be analyzed later by business users.

Named entity recognition

Named entity recognition (NER) is a natural language processing (NLP) sub-task that involves sifting through text data to locate noun phrases, called named entities, and categorizing each with a label, such as brand, date, event, location, organization, person, quantity, or title. For example, in the statement “I recently subscribed to Amazon Prime,” Amazon Prime is the named entity and can be categorized as a brand.

Amazon Comprehend enables you to detect such entities in your document. Each entity also has a confidence score that Amazon Comprehend returns for each entity type. The following diagram illustrates the entity recognition process.

Named entity recognition with Amazon Comprehend

To get entities from the text document, we call the comprehend.detect_entities() method and configure the language code and text as input parameters:

import boto3
import pandas as pd
from IPython.display import display, HTML

comprehend = boto3.client('comprehend')

def get_entities(text):
    try:
        # detect entities with the pre-trained Amazon Comprehend model
        entities = comprehend.detect_entities(LanguageCode="en", Text=text)
        df = pd.DataFrame(entities["Entities"], columns=['Text', 'Type'])
        display(HTML(df.to_html(index=False)))
    except Exception as e:
        print(e)

We run the get_entities() method on the bank document and obtain the entity list in the results.

Response from get_entities method from Comprehend.

Although entity extraction worked fairly well in identifying the default entity types for everything in the bank document, we want specific entities to be recognized for our use case. More specifically, we need to identify the customer’s savings and checking account numbers in the bank statement. We can extract these key business terms using Amazon Comprehend custom entity recognition.

Train an Amazon Comprehend custom entity recognition model

To detect the specific entities that we’re interested in from the customer’s bank statement, we train a custom entity recognizer with two custom entities: SAVINGS_AC and CHECKING_AC.

Then we train a custom entity recognition model. We can choose one of two ways to provide data to Amazon Comprehend: annotations or entity lists.

The annotations method can often lead to more refined results for image files, PDFs, or Word documents because you train a model by submitting more accurate context as annotations along with your documents. However, the annotations method can be time-consuming and work-intensive. For simplicity in this post, we use the entity lists method, which you can only use for plain text documents. This method requires a CSV file that contains the plain text and its corresponding entity type, as shown in the preceding example. The entities in this file are specific to our business needs (savings and checking account numbers).

For more details on how to prepare the training data for different use cases using annotations or entity lists methods, refer to Preparing the training data.
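
As a reference, training a recognizer with the entity lists method could look like the following sketch. The S3 locations of the training documents and the entity list CSV are hypothetical, and role and data_bucket are assumed to be defined as in the other snippets.

create_er_response = comprehend.create_entity_recognizer(
    RecognizerName="idp-bank-statement-er",
    DataAccessRoleArn=role,  # IAM role with read access to the training data
    LanguageCode="en",
    InputDataConfig={
        "EntityTypes": [{"Type": "SAVINGS_AC"}, {"Type": "CHECKING_AC"}],
        # Hypothetical S3 keys for the plain text documents and the entity list CSV
        "Documents": {"S3Uri": f"s3://{data_bucket}/entity-training/documents.txt"},
        "EntityList": {"S3Uri": f"s3://{data_bucket}/entity-training/entity_list.csv"},
    },
)
entity_recognizer_arn = create_er_response["EntityRecognizerArn"]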

The following screenshot shows an example of our entity list.

A snapshot of entity list.

Create an Amazon Comprehend custom NER real-time endpoint

Next, we create a custom entity recognizer real-time endpoint using the model that we trained. We use the CreateEndpoint API via the comprehend.create_endpoint() method to create the real-time endpoint:

#create comprehend endpoint
model_arn = entity_recognizer_arn
ep_name = 'idp-er-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ER_ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ER_ENDPOINT_ARN}')
    %store ER_ENDPOINT_ARN
except Exception as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ER_ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:entity-recognizer-endpoint/{ep_name}'
        print(f'The entity recognizer endpoint ARN is: "{ER_ENDPOINT_ARN}"')
        %store ER_ENDPOINT_ARN
    else:
        print(error)

After we train a custom entity recognizer, we use the custom real-time endpoint to extract some enriched information from the document and then perform document redaction with the help of the custom entities recognized by Amazon Comprehend and bounding box information from Amazon Textract.
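
For example, a minimal sketch of calling the custom endpoint could look like the following, where bank_statement_text is a hypothetical variable holding the plain text previously extracted from the bank statement with Amazon Textract.

# When EndpointArn is supplied, the custom model is used and LanguageCode isn't required
custom_entities = comprehend.detect_entities(
    Text=bank_statement_text,
    EndpointArn=ER_ENDPOINT_ARN,
)

for entity in custom_entities["Entities"]:
    print(entity["Type"], entity["Text"], round(entity["Score"], 3))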

Enrichment phase

In the document enrichment stage, we can perform enrichment such as redacting personally identifiable information (PII) data, extracting custom business terms, and so on. Our previous sample document (a bank statement) contains the customer’s savings and checking account numbers, which we want to redact. Because we already know these custom entities by means of our Amazon Comprehend custom NER model, we can easily use the Amazon Textract geometry data type to redact these PII entities wherever they appear in the document. In the following architecture, we redact key business terms (savings and checking account numbers) from the bank statement document.

Document enrichment phase.
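
The following sketch illustrates one way to perform such a redaction with Pillow, assuming response holds the Amazon Textract output for the document image and custom_entities holds the entities returned by the custom NER endpoint; the file names are hypothetical.

from PIL import Image, ImageDraw

def redact_document(image_path, blocks, entities, output_path):
    """Draw black boxes over every WORD block whose text matches a detected custom entity."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)
    width, height = image.size

    redact_texts = {e["Text"] for e in entities}  # e.g. the SAVINGS_AC / CHECKING_AC values

    for block in blocks:
        if block["BlockType"] == "WORD" and block["Text"] in redact_texts:
            # Textract bounding boxes are expressed as ratios of the page dimensions
            box = block["Geometry"]["BoundingBox"]
            left, top = box["Left"] * width, box["Top"] * height
            right, bottom = left + box["Width"] * width, top + box["Height"] * height
            draw.rectangle([left, top, right, bottom], fill="black")

    image.save(output_path)

redact_document("bank_statement.png", response["Blocks"],
                custom_entities["Entities"], "bank_statement_redacted.png")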

As you can see in the following example, the checking and savings account numbers are hidden in the bank statement now.

Redacted bank statement sample.

Traditional OCR solutions struggle to extract data accurately from most unstructured and semi-structured documents because of significant variations in how the data is laid out across multiple versions and formats of these documents. You may then need to implement custom preprocessing logic or even manually extract the information out of these documents. In this case, the IDP pipeline supports two features that you can use: Amazon Comprehend custom NER and Amazon Textract queries. Both these services use NLP to extract insights about the content of documents.

Extraction with Amazon Textract queries

When processing a document with Amazon Textract, you can add the new queries feature to your analysis to specify what information you need. This involves passing an NLP question, such as “What is the customer’s social security number?” to Amazon Textract. Amazon Textract finds the information in the document for that question and returns it in a response structure separate from the rest of the document’s information. Queries can be processed alone, or in combination with any other FeatureType, such as Tables or Forms.

Queries based extraction using Amazon Textract.

With Amazon Textract queries, you can extract information with high accuracy irrespective of how the data is laid out in the document structure, such as forms, tables, and checkboxes, or housed within nested sections in a document.

To demonstrate the queries feature, we extract valuable pieces of information like the patient’s first and last names, the dosage manufacturer, and so on from documents such as a COVID-19 vaccination card.

A sample vaccination card.

We use the textract.analyze_document() function and specify the FeatureType as QUERIES as well as add the queries in the form of natural language questions in the QueriesConfig.

The following code has been trimmed down for simplification purposes. For the full code, refer to the GitHub sample code for analyze_document().

response = None
with open(image_filename, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Textract
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
            "Queries": [{
                "Text": "What is the date for the 1st dose covid-19?",
                "Alias": "COVID_VACCINATION_FIRST_DOSE_DATE"
            },
# code trimmed down for simplification
#..
]
}) 

For the queries feature, the textract.analyze_document() function outputs all OCR WORDS and LINES, geometry information, and confidence scores in the response JSON. However, we can just print out the information that we queried for.

Document is a wrapper function used to help parse the JSON response from the API. It provides a high-level abstraction and makes the API output iterable and easy to get information out of. For more information, refer to the Textract Response Parser and Textractor GitHub repos. After we process the response, we get the following information as shown in the screenshot.

import trp.trp2 as t2
from tabulate import tabulate

d = t2.TDocumentSchema().load(response)
page = d.pages[0]

query_answers = d.get_query_answers(page=page)

print(tabulate(query_answers, tablefmt="github"))

Response from queries extraction.

Review and validation phase

This is the final stage of our IDP pipeline. In this stage, we can use our business rules to check a document for completeness. For example, we can verify that the claim ID was extracted accurately and successfully from an insurance claims document. We can use AWS serverless technologies such as AWS Lambda for further automation of these business rules. Moreover, we can include a human workforce for document reviews to ensure the predictions are accurate. Amazon A2I accelerates building the workflows required for human review of ML predictions.

With Amazon A2I, you can allow human reviewers to step in when a model is unable to make a high confidence prediction or to audit its predictions on an ongoing basis. The goal of the IDP pipeline is to reduce the amount of human input required to get accurate information into your decision systems. With IDP, you can reduce the amount of human input for your document processes as well as the total cost of document processing.

After you have all the accurate information extracted from the documents, you can further add business-specific rules using Lambda functions and finally integrate the solution with downstream databases or applications.

Human review and verification phase.

For more information on how to create an Amazon A2I workflow, follow the instructions from the Prep for Module 4 step at the end of 03-idp-document-enrichment.ipynb in our GitHub repo.

Clean up

To prevent incurring future charges to your AWS account, delete the resources that we provisioned in the setup of the repository by navigating to the Cleanup section in our repo.

Conclusion

In this two-part post, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We discussed the various stages of the pipeline and a hands-on solution with AWS AI services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I for designing and building industry-specific use cases. In the first post of the series, we demonstrated how to use Amazon Textract and Amazon Comprehend to extract information from various documents. In this post, we did a deep dive into how to train an Amazon Comprehend custom entity recognizer to extract custom entities from our documents. We also performed document enrichment techniques like redaction using Amazon Textract as well as the entity list from Amazon Comprehend. Finally, we saw how you can use an Amazon A2I human review workflow for Amazon Textract by including a private work team.

For more information about the full code samples in this post, refer to the GitHub repo.

We recommend you review the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and follow the guidelines provided. Also, take a moment to review and understand the pricing for Amazon Textract, Amazon Comprehend, and Amazon A2I.


About the authors

Chin Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Sonali Sahu is leading Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core areas of focus are artificial intelligence and machine learning for intelligent document processing.

Anjan Biswas is an AI/ML specialist Senior Solutions Architect. Anjan works with enterprise customers and is passionate about developing, deploying and explaining AI/ML, data analytics, and big data solutions. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS.

Suprakash Dutta is a Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Read More

Intelligent document processing with AWS AI services: Part 1

Organizations across industries such as healthcare, finance and lending, legal, retail, and manufacturing often have to deal with a lot of documents in their day-to-day business processes. These documents contain critical information that is key to making timely decisions in order to maintain the highest levels of customer satisfaction, faster customer onboarding, and lower customer churn. In most cases, documents are processed manually to extract information and insights, which is time-consuming, error-prone, expensive, and difficult to scale. There is limited automation available today to process and extract information from these documents. Intelligent document processing (IDP) with AWS artificial intelligence (AI) services helps automate information extraction from documents of different types and formats, quickly and with high accuracy, without the need for machine learning (ML) skills. Faster information extraction with high accuracy helps in making quality business decisions on time, while reducing overall costs.

Although the stages in an IDP workflow may vary and be influenced by use case and business requirements, the following figure shows the stages that are typically part of an IDP workflow. Processing documents such as tax forms, claims, medical notes, new customer forms, invoices, legal contracts, and more are just a few of the use cases for IDP.

Phases of intelligent document processing in AWS

In this two-part series, we discuss how you can automate and intelligently process documents at scale using AWS AI services. In this post, we discuss the first three phases of the IDP workflow. In part 2, we discuss the remaining workflow phases.

Solution overview

The following architecture diagram shows the stages of an IDP workflow. It starts with a data capture stage to securely store and aggregate different file formats (PDF, JPEG, PNG, TIFF) and layouts of documents. The next stage is classification, where you categorize your documents (such as contracts, claim forms, invoices, or receipts), followed by document extraction. In the extraction stage, you can extract meaningful business information from your documents. This extracted data is often used to gather insights via data analysis, or sent to downstream systems such as databases or transactional systems. The following stage is enrichment, where documents can be enriched by redacting protected health information (PHI) or personally identifiable information (PII) data, custom business term extraction, and so on. Finally, in the review and validation stage, you can include a human workforce for document reviews to ensure the outcome is accurate.

For the purposes of this post, we consider a set of sample documents such as bank statements, invoices, and store receipts. The document samples, along with sample code, can be found in our GitHub repository. In the following sections, we walk you through these code samples along with real practical application. We demonstrate how you can utilize ML capabilities with Amazon Textract, Amazon Comprehend, and Amazon Augmented AI (Amazon A2I) to process documents and validate the data extracted from them.

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.

Amazon Comprehend is a natural-language processing (NLP) service that uses ML to extract insights about the content of the documents. Amazon Comprehend can identify critical elements in documents, including references to language, people, and places, and classify them into relevant topics or clusters. It can perform sentiment analysis to determine the sentiment of a document in real time using single document or batch detection. For example, it can analyze the comments on a blog post to know if your readers like the post or not. Amazon Comprehend also detects PII like addresses, bank account numbers, and phone numbers in text documents in real time and asynchronous batch jobs. It can also redact PII entities in asynchronous batch jobs.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether it runs on AWS or not. Amazon A2I integrates both with Amazon Textract and Amazon Comprehend to provide you the ability to introduce human review steps within your intelligent document processing workflow.

Data capture phase

You can store documents in a highly scalable and durable storage like Amazon Simple Storage Service (Amazon S3). Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. Amazon S3 is designed for 11 9’s of durability and stores data for millions of customers all around the world. Documents can come in various formats and layouts, and can come from different channels like web portals or email attachments.

Classification phase

In the previous step, we collected documents of various types and formats. In this step, we need to categorize the documents before we can do further extraction. For that, we use Amazon Comprehend custom classification. Document classification is a two-step process. First, you train an Amazon Comprehend custom classifier to recognize the classes that are of interest to you. Next, you deploy the model with a custom classifier real-time endpoint and send unlabeled documents to the real-time endpoint to be classified.

The following figure represents a typical document classification workflow.

Classification phase

To train the classifier, identify the classes you’re interested in and provide sample documents for each of the classes as training material. Based on the options you indicated, Amazon Comprehend creates a custom ML model that it trains based on the documents you provided. This custom model (the classifier) examines each document you submit. It returns either the specific class that best represents the content (if you’re using multi-class mode) or the set of classes that apply to it (if you’re using multi-label mode).

Prepare training data

The first step is to extract text from documents required for the Amazon Comprehend custom classifier. To extract the raw text information for all the documents in Amazon S3, we use the Amazon Textract detect_document_text() API. We also label the data according to the document type to be used to train a custom Amazon Comprehend classifier.

The following code has been trimmed down for simplification purposes. For the full code, refer to the GitHub sample code for textract_extract_text(). The function call_textract() is a wrapper function that calls the AnalyzeDocument API internally, and the parameters passed to the method abstract some of the configurations that the API needs to run the extraction task.

# helper packages: amazon-textract-caller and amazon-textract-prettyprinter
from textractcaller.t_call import call_textract
from textractprettyprinter.t_pretty_print import Textract_Pretty_Print, get_string

def textract_extract_text(document, bucket=data_bucket):
    try:
        print(f'Processing document: {document}')
        lines = ""
        row = []

        # using amazon-textract-caller
        response = call_textract(input_document=f's3://{bucket}/{document}')
        # using pretty printer to get all the lines
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])

        # 'names' is the list of document type labels defined earlier in the notebook
        label = [name for name in names if(name in document)]
        row.append(label[0])
        row.append(lines)
        return row
    except Exception as e:
        print(e)

Train a custom classifier

In this step, we use Amazon Comprehend custom classification to train our model for classifying the documents. We use the CreateDocumentClassifier API to create a classifier that trains a custom model using our labeled data. See the following code:

create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )

Deploy a real-time endpoint

To use the Amazon Comprehend custom classifier, we create a real-time endpoint using the CreateEndpoint API:

endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
ENDPOINT_ARN = endpoint_response['EndpointArn']
print(f'Endpoint created with ARN: {ENDPOINT_ARN}')  

Classify documents with the real-time endpoint

After the Amazon Comprehend endpoint is created, we can use the real-time endpoint to classify documents. We use the comprehend.classify_document() function with the extracted document text and inference endpoint as input parameters:

response = comprehend.classify_document(
      Text= document,
      EndpointArn=ENDPOINT_ARN
      )

Amazon Comprehend returns all classes of documents with a confidence score linked to each class in an array of key-value pairs (name-score). We pick the document class with the highest confidence score. The following screenshot is a sample response.

Classify documents with the real-time endpoint
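
As a minimal sketch of how you might consume this response (the variable response is the output of classify_document() above), the following code picks the class with the highest confidence score:

classes = response.get('Classes', [])
if classes:
    # Each entry is a dict with 'Name' and 'Score'; keep the most confident class
    best = max(classes, key=lambda c: c['Score'])
    print(f"Predicted class: {best['Name']} (score {best['Score']:.2f})")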

We recommend going through the detailed document classification sample code on GitHub.

Extraction phase

Amazon Textract lets you extract text and structured data information using the Amazon Textract DetectDocumentText and AnalyzeDocument APIs, respectively. These APIs respond with JSON data, which contains WORDS, LINES, FORMS, TABLES, geometry or bounding box information, relationships, and so on. Both DetectDocumentText and AnalyzeDocument are synchronous operations. To analyze documents asynchronously, use StartDocumentTextDetection and StartDocumentAnalysis.

Structured data extraction

You can extract structured data such as tables from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as TABLE to detect all tables in a document. The following figure illustrates this process.

Structured data extraction

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])

We run the analyze_document() method with the FeatureType as TABLES on the employee history document and obtain the table extraction in the following results.

Analyze document API response for tables extraction
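
As an illustrative sketch (not part of the original sample), you can walk the returned TABLE blocks with the amazon-textract-response-parser (trp) library and print each cell:

from trp import Document

# Parse the AnalyzeDocument JSON response and iterate over detected tables
doc = Document(response)
for page in doc.pages:
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print(f"Table[{r}][{c}] = {cell.text}")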

Semi-structured data extraction

You can extract semi-structured data such as forms or key-value pairs from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as FORMS to detect all forms in a document. The following diagram illustrates this process.

Semi-structured data extraction

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])

Here, we run the analyze_document() method with the FeatureType as FORMS on the employee application document and obtain the key-value pair extraction in the results.
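
Similarly, as a short illustrative sketch, the detected key-value pairs can be walked with the amazon-textract-response-parser library:

from trp import Document

# Parse the AnalyzeDocument JSON response and iterate over detected form fields
doc = Document(response)
for page in doc.pages:
    for field in page.form.fields:
        key = field.key.text if field.key else ""
        value = field.value.text if field.value else ""
        print(f"{key}: {value}")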

Unstructured data extraction

Amazon Textract is optimal for dense text extraction with industry-leading OCR accuracy. You can use the DetectDocumentText API to detect lines of text and the words that make up a line of text, as illustrated in the following figure.

Unstructured data extraction

See the following code:

response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print detected text
for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        print(item["Text"])

Now we run the detect_document_text() method on the sample image and obtain raw text extraction in the results.

Invoices and receipts

Amazon Textract provides specialized support to process invoices and receipts at scale. The AnalyzeExpense API can extract explicitly labeled data, implied data, and line items from an itemized list of goods or services from almost any invoice or receipt without any templates or configuration. The following figure illustrates this process.

Invoices and receipts extraction

See the following code:

response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo on the page without an explicit label called “vendor”. It can also find and extract expense items, quantity, and prices that aren’t labeled with column headers for line items.

Analyze expense API response
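
The following snippet is an illustrative sketch of how to read this response: it prints the summary fields (such as VENDOR_NAME or TOTAL) and the line items detected in each expense document.

# Walk the AnalyzeExpense response returned above
for expense_doc in response['ExpenseDocuments']:
    # Summary fields, for example VENDOR_NAME, TOTAL, INVOICE_RECEIPT_DATE
    for field in expense_doc['SummaryFields']:
        field_type = field['Type']['Text']
        field_value = field.get('ValueDetection', {}).get('Text', '')
        print(f"{field_type}: {field_value}")
    # Line items (item, quantity, price, and so on)
    for group in expense_doc.get('LineItemGroups', []):
        for line_item in group['LineItems']:
            item = {f['Type']['Text']: f['ValueDetection']['Text']
                    for f in line_item['LineItemExpenseFields']}
            print(item)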

Identity documents

The Amazon Textract AnalyzeID API can help you automatically extract information from identification documents, such as driver’s licenses and passports, without the need for templates or configuration. We can extract specific information, such as date of expiry and date of birth, as well as intelligently identify and extract implied information, such as name and address. The following diagram illustrates this process.

Identity documents extraction

See the following code:

textract_client = boto3.client('textract')
j = call_textract_analyzeid(document_pages=["s3://amazon-textract-public-content/analyzeid/driverlicense.png"],boto3_textract_client=textract_client)

We can parse the response with the amazon-textract-response-parser library and use tabulate to get a pretty-printed output:

from tabulate import tabulate
import trp.trp2_analyzeid as t2id

# Convert the AnalyzeID JSON response into a flat list of values
doc = t2id.TAnalyzeIdDocumentSchema().load(j)
result = doc.get_values_as_list()

print(tabulate([x[1:3] for x in result]))

We recommend going through the detailed document extraction sample code on GitHub. For more information about the full code samples in this post, refer to the GitHub repo.

Conclusion

In this first post of a two-part series, we discussed the various stages of IDP and a solution architecture. We also discussed document classification using an Amazon Comprehend custom classifier. Next, we explored the ways you can use Amazon Textract to extract information from unstructured, semi-structured, structured, and specialized document types.

In part 2 of this series, we continue the discussion with the extract and queries features of Amazon Textract. We look at how to use Amazon Comprehend pre-defined entities and custom entities to extract key business terms from documents with dense text, and how to integrate an Amazon A2I human-in-the-loop review in your IDP processes.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. Also, take a moment to review and understand the pricing for Amazon Textract, Amazon Comprehend, and Amazon A2I.


About the authors

Suprakash Dutta is a Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning.

Sonali Sahu leads the Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and data analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Read More

Build an air quality anomaly detector using Amazon Lookout for Metrics

Today, air pollution is a familiar environmental issue that causes severe respiratory and heart conditions, which pose serious health threats. Acid rain, depletion of the ozone layer, and global warming are also adverse consequences of air pollution. There is a need for intelligent monitoring and automation in order to prevent severe health issues and, in extreme cases, life-threatening situations. Air quality is measured using the concentration of pollutants in the air. Identifying symptoms early and controlling pollutant levels before they become dangerous is crucial. The process of identifying air quality issues and anomalies in pollutant concentrations, and quickly diagnosing the root cause, is difficult, costly, and error-prone.

The process of applying AI and machine learning (ML)-based solutions to find data anomalies involves a lot of complexity in ingesting, curating, and preparing data in the right format and then optimizing and maintaining the effectiveness of these ML models over long periods of time. This has been one of the barriers to quickly implementing and scaling the adoption of ML capabilities.

This post shows you how to use an integrated solution with Amazon Lookout for Metrics and Amazon Kinesis Data Firehose to break these barriers by quickly and easily ingesting streaming data, and subsequently detecting anomalies in the key performance indicators of interest to you.

Lookout for Metrics automatically detects and diagnoses anomalies (outliers from the norm) in business and operational data. It’s a fully managed ML service that uses specialized ML models to detect anomalies based on the characteristics of your data. For example, trends and seasonality are two characteristics of time series metrics for which threshold-based anomaly detection doesn’t work. Trends are continuous variations (increases or decreases) in a metric’s value. Seasonality, on the other hand, consists of periodic patterns that occur in a system, usually rising above a baseline and then decreasing again. You don’t need ML experience to use Lookout for Metrics.

We demonstrate a common air quality monitoring scenario, in which we detect anomalies in the pollutant concentration in the air. By the end of this post, you’ll learn how to use these managed services from AWS to help prevent health issues and global warming. You can apply this solution to other use cases for better environment management, such as detecting anomalies in water quality, land quality, and power consumption patterns, to name a few.

Solution overview

The architecture consists of three functional blocks:

  • Wireless sensors placed at strategic locations to sense the concentration level of carbon monoxide (CO), sulfur dioxide (SO2), and nitrogen dioxide (NO2) in the air
  • Streaming data ingestion and storage
  • Anomaly detection and notification

The solution provides a fully automated data path from the sensors all the way to a notification being raised to the user. You can also interact with the solution using the Lookout for Metrics UI in order to analyze the identified anomalies.

The following diagram illustrates our solution architecture.

Prerequisites

You need the following prerequisites before you can proceed with the solution. For this post, we use the us-east-1 Region.

  1. Download the Python script (publish.py) and data file from the GitHub repo.
  2. Open the live_data.csv file in your preferred editor and replace the dates with today’s and tomorrow’s dates. For example, if today’s date is July 8, 2022, then replace 2022-03-25 with 2022-07-08. Keep the format the same. This is required to simulate sensor data for the current date using the IoT simulator script.
  3. Create an Amazon Simple Storage Service (Amazon S3) bucket and a folder named air-quality. Create a subfolder inside air-quality named historical. For instructions, see Creating a folder.
  4. Upload the live_data.csv file to the root of the S3 bucket and historical_data.json to the historical folder.
  5. Create an AWS Cloud9 development environment, which we use to run the Python simulator program to create sensor data for this solution.

Ingest and transform data using AWS IoT Core and Kinesis Data Firehose

We use a Kinesis Data Firehose delivery stream to ingest the streaming data from AWS IoT Core and deliver it to Amazon S3. Complete the following steps:

  1. On the Kinesis Data Firehose console, choose Create delivery stream.
  2. For Source, choose Direct PUT.
  3. For Destination, choose Amazon S3.
  4. For Delivery stream name, enter a name for your delivery stream.
  5. For S3 bucket, enter the bucket you created as a prerequisite.
  6. Enter values for S3 bucket prefix and S3 bucket error output prefix. One key point to note is the custom prefix configured for the Amazon S3 destination. This prefix pattern makes sure that the data is created in the S3 bucket as per the prefix hierarchy expected by Lookout for Metrics. (More on this later in this post.) For more information about custom prefixes, see Custom Prefixes for Amazon S3 Objects.
  7. For Buffer interval, enter 60.
  8. Choose Create or update IAM role.
  9. Choose Create delivery stream.

    Now we configure AWS IoT Core and run the air quality simulator program.
  10. On the AWS IoT Core console, create an AWS IoT policy called admin.
  11. In the navigation pane under Message Routing, choose Rules.
  12. Choose Create rule.
  13. Create a rule with the Kinesis Data Firehose (firehose) action.
    This sends data from an MQTT message to a Kinesis Data Firehose delivery stream.
  14. Choose Create.
  15. Create an AWS IoT thing with name Test-Thing and attach the policy you created.
  16. Download the certificate, public key, private key, device certificate, and root CA for AWS IoT Core.
  17. Save each of the downloaded files to the certificates subdirectory in your AWS Cloud9 environment (you create this directory in a later step).
  18. Upload publish.py to the iot-test-publish folder.
  19. On the AWS IoT Core console, in the navigation pane, choose Settings.
  20. Under Custom endpoint, copy the endpoint.
    This AWS IoT Core custom endpoint URL is personal to your AWS account and Region.
  21. In publish.py, replace customEndpointUrl with your AWS IoT Core custom endpoint URL, certificates with the names of your certificate files, and Your_S3_Bucket_Name with your S3 bucket name.
    Next, you install pip and the AWS IoT SDK for Python.
  22. Log in to AWS Cloud9 and create a working directory in your development environment. For example: aq-iot-publish.
  23. Create a subdirectory for certificates in your new working directory. For example: certificates.
  24. Install the AWS IoT SDK for Python v2 by running the following from the command line.
    pip install awsiotsdk

  25. To test the data pipeline, run the following command:
    python3 publish.py

You can see the payload in the following screenshot.

Finally, the data is delivered to the specified S3 bucket in the prefix structure.

The contents of the delivered files look like the following:

  • {"TIMESTAMP":"2022-03-20 00:00","LOCATION_ID":"B-101","CO":2.6,"SO2":62,"NO2":57}
  • {"TIMESTAMP":"2022-03-20 00:05","LOCATION_ID":"B-101","CO":3.9,"SO2":60,"NO2":73}

The timestamps show that each file contains data for 5-minute intervals.

With minimal code, we have now ingested the sensor data, created an input stream from the ingested data, and stored the data in an S3 bucket based on the requirements for Lookout for Metrics.

In the following sections, we take a deeper look at the constructs within Lookout for Metrics, and how easy it is to configure these concepts using the Lookout for Metrics console.

Create a detector

A detector is a Lookout for Metrics resource that monitors a dataset and identifies anomalies at a predefined frequency. Detectors use ML to find patterns in data and distinguish between expected variations in data and legitimate anomalies. To improve its performance, a detector learns more about your data over time.

In our use case, the detector analyzes data from the sensor every 5 minutes.

To create the detector, navigate to the Lookout for Metrics console and choose Create detector. Provide the name and description (optional) for the detector, along with the interval of 5 minutes.

Your data is encrypted by default with a key that AWS owns and manages for you. You can also configure the detector to use a different encryption key instead of the default one.
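
If you prefer to script this step instead of using the console, the following is a minimal sketch using the boto3 lookoutmetrics client; the detector name and description are illustrative.

import boto3

lookout = boto3.client('lookoutmetrics', region_name='us-east-1')

# Create a detector that runs every 5 minutes, matching the console setup above
detector = lookout.create_anomaly_detector(
    AnomalyDetectorName='air-quality-detector',
    AnomalyDetectorDescription='Detects anomalies in air pollutant concentrations',
    AnomalyDetectorConfig={'AnomalyDetectorFrequency': 'PT5M'}
)
print(detector['AnomalyDetectorArn'])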

Now let’s point this detector to the data that you want it to run anomaly detection on.

Create a dataset

A dataset tells the detector where to find your data and which metrics to analyze for anomalies. To create a dataset, complete the following steps:

  1. On the Amazon Lookout for Metrics console, navigate to your detector.
  2. Choose Add a dataset.
  3. For Name, enter a name (for example, air-quality-dataset).
  4. For Datasource, choose your data source (for this post, Amazon S3).
  5. For Detector mode, select your mode (for this post, Continuous).

With Amazon S3, you can create a detector in two modes:

    • Backtest – This mode is used to find anomalies in historical data. It needs all records to be consolidated in a single file.
    • Continuous – This mode is used to detect anomalies in live data. We use this mode with our use case because we want to detect anomalies as we receive air pollutant data from the air monitoring sensor.
  6. Enter the S3 path for the live S3 folder and path pattern.
  7. For Datasource interval, choose 5 minute intervals. If you have historical data from which the detector can learn patterns, you can provide it during this configuration. The data is expected to be in the same format that you use to perform a backtest. Providing historical data speeds up the ML model training process. If this isn’t available, the continuous detector waits for sufficient data to be available before making inferences.
  8. For this post, we already have historical data, so select Use historical data.
  9. Enter the S3 path of historical_data.json.
  10. For File format, select JSON lines.

At this point, Lookout for Metrics accesses the data source and validates whether it can parse the data. If the parsing is successful, it gives you a “Validation successful” message and takes you to the next page, where you configure measures, dimensions, and timestamps.

Configure measures, dimensions, and timestamps

Measures define KPIs that you want to track anomalies for. You can add up to five measures per detector. The fields that are used to create KPIs from your source data must be in numeric format. KPIs are currently defined by aggregating records within the time interval using SUM or AVERAGE.

Dimensions give you the ability to slice and dice your data by defining categories or segments. This allows you to track anomalies for a subset of the whole set of data for which a particular measure is applicable.

In our use case, we add three measures, which calculate the AVG of each pollutant concentration (CO, SO2, and NO2) over the 5-minute interval, and a single dimension: the location (LOCATION_ID) at which the pollutant concentrations are measured.

Every record in the dataset must have a timestamp. The following configuration allows you to choose the field that represents the timestamp value and also the format of the timestamp.
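
For reference, the console configuration described above corresponds roughly to the CreateMetricSet API. The following is only a rough sketch using the boto3 lookoutmetrics client; the ARNs, bucket name, and templated paths are placeholders you would adapt to your own environment.

import boto3

lookout = boto3.client('lookoutmetrics', region_name='us-east-1')

metric_set = lookout.create_metric_set(
    AnomalyDetectorArn='arn:aws:lookoutmetrics:us-east-1:123456789012:AnomalyDetector:air-quality-detector',
    MetricSetName='air-quality-dataset',
    MetricSetFrequency='PT5M',
    MetricList=[
        {'MetricName': 'CO', 'AggregationFunction': 'AVG'},
        {'MetricName': 'SO2', 'AggregationFunction': 'AVG'},
        {'MetricName': 'NO2', 'AggregationFunction': 'AVG'},
    ],
    DimensionList=['LOCATION_ID'],
    TimestampColumn={'ColumnName': 'TIMESTAMP', 'ColumnFormat': 'yyyy-MM-dd HH:mm'},
    MetricSource={
        'S3SourceConfig': {
            'RoleArn': 'arn:aws:iam::123456789012:role/LookoutMetricsS3AccessRole',
            'TemplatedPathList': ['s3://<your-bucket>/air-quality/live/{{yyyyMMdd}}/{{HHmm}}/'],
            'HistoricalDataPathList': ['s3://<your-bucket>/air-quality/historical/'],
            'FileFormatDescriptor': {
                'JsonFormatDescriptor': {'FileCompression': 'NONE', 'Charset': 'UTF-8'}
            }
        }
    }
)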

The next page allows you to review all the details you added and then save and activate the detector.

The detector then begins learning the data streaming into the data source. At this stage, the status of the detector changes to Initializing.

It’s important to note the minimum amount of data that is required before Lookout for Metrics can start detecting anomalies. For more information about requirements and limits, see Lookout for Metrics quotas.

With minimal configuration, you have created your detector, pointed it at a dataset, and defined the metrics that you want Lookout for Metrics to find anomalies in.

Visualize anomalies

Lookout for Metrics provides a rich UI experience for users who want to use the AWS Management Console to analyze the anomalies being detected. It also provides the capability to query the anomalies via APIs.

Let’s look at an example anomaly detected from our air quality data use case. The following screenshot shows an anomaly detected in CO concentration in the air at the designated time and date with a severity score of 93. It also shows the percentage contribution of the dimension towards the anomaly. In this case, 100% contribution comes from the location ID B-101 dimension.

Create alerts

Lookout for Metrics allows you to send alerts using a variety of channels. You can configure the anomaly severity score threshold at which the alerts must be triggered.

In our use case, we configure alerts to be sent to an Amazon Simple Notification Service (Amazon SNS) channel, which in turn sends an SMS. The following screenshots show the configuration details.

You can also use an alert to trigger automations using AWS Lambda functions in order to drive API-driven operations on AWS IoT Core.
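
For completeness, here is a sketch of the equivalent API call for the alert configuration, assuming an existing SNS topic and an IAM role that Lookout for Metrics can use to publish to it (all ARNs below are placeholders).

import boto3

lookout = boto3.client('lookoutmetrics', region_name='us-east-1')

# Notify an SNS topic when the anomaly severity score crosses the threshold
alert = lookout.create_alert(
    AlertName='air-quality-sns-alert',
    AlertSensitivityThreshold=70,
    AnomalyDetectorArn='arn:aws:lookoutmetrics:us-east-1:123456789012:AnomalyDetector:air-quality-detector',
    Action={
        'SNSConfiguration': {
            'RoleArn': 'arn:aws:iam::123456789012:role/LookoutMetricsSNSRole',
            'SnsTopicArn': 'arn:aws:sns:us-east-1:123456789012:air-quality-alerts'
        }
    }
)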

Conclusion

In this post, we showed you how easy it is to use Lookout for Metrics and Kinesis Data Firehose to remove the undifferentiated heavy lifting involved in managing the end-to-end lifecycle of building ML-powered anomaly detection applications. This solution can help you accelerate your ability to find anomalies in key business metrics and allow you to focus your efforts on growing and improving your business.

We encourage you to learn more by visiting the Amazon Lookout for Metrics Developer Guide and try out the end-to-end solution enabled by these services with a dataset relevant to your business KPIs.


About the author

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration, and strategy. He is passionate about technology and enjoys building and experimenting in the analytics and AI/ML space.

Read More

Build a GNN-based real-time fraud detection solution using Amazon SageMaker, Amazon Neptune, and the Deep Graph Library

Fraudulent activities severely impact many industries, such as e-commerce, social media, and financial services. Fraud can cause significant losses for businesses and consumers. American consumers reported losing more than $5.8 billion to fraud in 2021, up more than 70% over 2020. Many techniques have been used to detect fraudsters—rule-based filters, anomaly detection, and machine learning (ML) models, to name a few.

In real-world data, entities often involve rich relationships with other entities. Such a graph structure can provide valuable information for anomaly detection. For example, in the following figure, users are connected via shared entities such as Wi-Fi IDs, physical locations, and phone numbers. Due to the large number of unique values of these entities, like phone numbers, it’s difficult to use them in the traditional feature-based models—for example, one-hot encoding all phone numbers wouldn’t be viable. But such relationships could help predict whether a user is a fraudster. If a user has shared several entities with a known fraudster, the user is more likely a fraudster.

Recently, graph neural networks (GNNs) have become a popular method for fraud detection. GNN models can combine both graph structure and attributes of nodes or edges, such as users or transactions, to learn meaningful representations that distinguish malicious users and events from legitimate ones. This capability is crucial for detecting fraud where fraudsters collude to hide their abnormal features but leave some traces of relations.

Current GNN solutions mainly rely on offline batch training and inference mode, which detects fraudsters only after malicious events have happened and losses have occurred. However, catching fraudulent users and activities in real time is crucial for preventing losses. This is particularly true in business cases where there is only one chance to prevent fraudulent activities. For example, in some e-commerce platforms, account registration is wide open. Fraudsters can behave maliciously just once with an account and never use the same account again.

Predicting fraudsters in real time is important. Building such a solution, however, is challenging. Because GNNs are still new to the industry, there are limited online resources on converting GNN models from batch serving to real-time serving. Additionally, it’s challenging to construct a streaming data pipeline that can feed incoming events to a GNN real-time serving API. To the best of the authors’ knowledge, no reference architectures and examples are available for GNN-based real-time inference solutions as of this writing.

To help developers apply GNNs to real-time fraud detection, this post shows how to use Amazon Neptune, Amazon SageMaker, and the Deep Graph Library (DGL), among other AWS services, to construct an end-to-end solution for real-time fraud detection using GNN models.

We focus on four tasks:

  • Processing a tabular transaction dataset into a heterogeneous graph dataset
  • Training a GNN model using SageMaker
  • Deploying the trained GNN models as a SageMaker endpoint
  • Demonstrating real-time inference for incoming transactions

This post extends the previous work in Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library, which focuses on the first two tasks. You can refer to that post for more details on heterogeneous graphs, GNNs, and semi-supervised training of GNNs.

Businesses looking for a fully-managed AWS AI service for fraud detection can also use Amazon Fraud Detector, which makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud.

Solution overview

This solution contains two major parts.

The first part is a pipeline that processes the data, trains GNN models, and deploys the trained models. It uses AWS Glue to process the transaction data, and saves the processed data to both Amazon Neptune and Amazon Simple Storage Service (Amazon S3). Then, a SageMaker training job is triggered to train a GNN model on the data saved in Amazon S3 to predict whether a transaction is fraudulent. The trained model along with other assets are saved back to Amazon S3 upon the completion of the training job. Finally, the saved model is deployed as a SageMaker endpoint. The pipeline is orchestrated by AWS Step Functions, as shown in the following figure.

The second part of the solution implements real-time fraudulent transaction detection. It starts from a RESTful API that queries the graph database in Neptune to extract the subgraph related to an incoming transaction. It also has a web portal that can simulate business activities, generating online transactions with both fraudulent and legitimate ones. The web portal provides a live visualization of the fraud detection. This part uses Amazon CloudFront, AWS Amplify, AWS AppSync, Amazon API Gateway, Step Functions, and Amazon DocumentDB to rapidly build the web application. The following diagram illustrates the real-time inference process and web portal.

The implementation of this solution, along with an AWS CloudFormation template that can launch the architecture in your AWS account, is publicly available through the following GitHub repo.

Data processing

In this section, we briefly describe how to process an example dataset and convert it from raw tables into a graph with relations identified among different columns.

This solution uses the same dataset, the IEEE-CIS fraud dataset, as the previous post Detecting fraud in heterogeneous networks using Amazon SageMaker and Deep Graph Library. Therefore, the basic principle of the data process is the same. In brief, the fraud dataset includes a transactions table and an identities table, having nearly 500,000 anonymized transaction records along with contextual information (for example, devices used in transactions). Some transactions have a binary label, indicating whether a transaction is fraudulent. Our task is to predict which unlabeled transactions are fraudulent and which are legitimate.

The following figure illustrates the general process of how to convert the IEEE tables into a heterogeneous graph. We first extract two columns from each table. One column is always the transaction ID column, where we set each unique TransactionID as one node. The other column is picked from the categorical columns, such as the ProductCD and id_03 columns, where each unique category is set as a node. If a TransactionID and a unique category appear in the same row, we connect them with one edge. This way, we convert two columns in a table into one bipartite graph. Then we combine those bipartite graphs along with the TransactionID nodes, where the same TransactionID nodes are merged into one unique node. After this step, we have a heterogeneous graph built from bipartite graphs.

For the rest of the columns that aren’t used to build the graph, we join them together as the features of the TransactionID nodes. TransactionID nodes that have isFraud values are used as labels for model training. Based on this heterogeneous graph, our task becomes a node classification task on the TransactionID nodes. For more details on preparing the graph data for training GNNs, refer to the Feature extraction and Constructing the graph sections of the previous blog post.

The code used in this solution is available in src/scripts/glue-etl.py. You can also experiment with data processing through the Jupyter notebook src/sagemaker/01.FD_SL_Process_IEEE-CIS_Dataset.ipynb.
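
To make the bipartite construction concrete, the following toy sketch (not taken from the repo) shows how two integer-encoded columns become one relation of a DGL heterogeneous graph:

import dgl
import torch as th

# Toy example: 4 transactions and the ProductCD value (integer-encoded) of each
transaction_ids = th.tensor([0, 1, 2, 3])
product_ids = th.tensor([0, 1, 0, 2])

# Co-occurrence in a row becomes an edge in both directions between the two node types
graph_data = {
    ('transaction', 'uses_product', 'product'): (transaction_ids, product_ids),
    ('product', 'used_by', 'transaction'): (product_ids, transaction_ids),
}
g = dgl.heterograph(graph_data)
print(g)  # heterogeneous graph with 'transaction' and 'product' node types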

Instead of manually processing the data, as done in the previous post, this solution uses a fully automatic pipeline orchestrated by Step Functions and AWS Glue that supports processing huge datasets in parallel via Apache Spark. The Step Functions workflow is written in AWS Cloud Development Kit (AWS CDK). The following is a code snippet to create this workflow:

import { LambdaInvoke, GlueStartJobRun } from 'aws-cdk-lib/aws-stepfunctions-tasks';
    
    const parametersNormalizeTask = new LambdaInvoke(this, 'Parameters normalize', {
      lambdaFunction: parametersNormalizeFn,
      integrationPattern: IntegrationPattern.REQUEST_RESPONSE,
    });
    
    ...
    
    const dataProcessTask = new GlueStartJobRun(this, 'Data Process', {
      integrationPattern: IntegrationPattern.RUN_JOB,
      glueJobName: etlConstruct.jobName,
      timeout: Duration.hours(5),
      resultPath: '$.dataProcessOutput',
    });
    
    ...    
    
    const definition = parametersNormalizeTask
      .next(dataIngestTask)
      .next(dataCatalogCrawlerTask)
      .next(dataProcessTask)
      .next(hyperParaTask)
      .next(trainingJobTask)
      .next(runLoadGraphDataTask)
      .next(modelRepackagingTask)
      .next(createModelTask)
      .next(createEndpointConfigTask)
      .next(checkEndpointTask)
      .next(endpointChoice);

Besides constructing the graph data for GNN model training, this workflow also batch loads the graph data into Neptune to conduct real-time inference later on. This batch data loading process is demonstrated in the following code snippet:

from neptune_python_utils.endpoints import Endpoints
from neptune_python_utils.bulkload import BulkLoad

...

bulkload = BulkLoad(
        source=targetDataPath,
        endpoints=endpoints,
        role=args.neptune_iam_role_arn,
        region=args.region,
        update_single_cardinality_properties=True,
        fail_on_error=True)
        
load_status = bulkload.load_async()
status, json = load_status.status(details=True, errors=True)
load_status.wait()

GNN model training

After the graph data for model training is saved in Amazon S3, a SageMaker training job, which is only charged while the training job is running, is triggered to start the GNN model training process in Bring Your Own Container (BYOC) mode. BYOC mode allows you to pack your model training scripts and dependencies in a Docker image, which SageMaker uses to create training instances. The BYOC method can save significant effort in setting up the training environment. In src/sagemaker/02.FD_SL_Build_Training_Container_Test_Local.ipynb, you can find details of the GNN model training.

Docker image

The first part of the Jupyter notebook file is the training Docker image generation (see the following code snippet):

! aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
image_name = 'fraud-detection-with-gnn-on-dgl/training'
! docker build -t $image_name ./FD_SL_DGL/gnn_fraud_detection_dgl

We used a PyTorch-based image for the model training. The Deep Graph Library (DGL) and other dependencies are installed when building the Docker image. The GNN model code in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder is copied to the image as well.

Because we process the transaction data into a heterogeneous graph, in this solution we choose the Relational Graph Convolutional Network (RGCN) model, which is specifically designed for heterogeneous graphs. Our RGCN model can train learnable embeddings for the nodes in heterogeneous graphs. Then, the learned embeddings are used as inputs of a fully connected layer for predicting the node labels.
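
As a minimal sketch of the idea (the full HeteroRGCN implementation lives in the repo), one relational graph convolution layer can be expressed with DGL's HeteroGraphConv, applying one GraphConv per edge type and summing the results:

import torch.nn as nn
import dgl.nn as dglnn

class SimpleRGCNLayer(nn.Module):
    def __init__(self, in_size, out_size, rel_names):
        super().__init__()
        # One graph convolution per relation (edge type), aggregated by summation
        self.conv = dglnn.HeteroGraphConv(
            {rel: dglnn.GraphConv(in_size, out_size) for rel in rel_names},
            aggregate='sum')

    def forward(self, graph, node_feats):
        # node_feats maps node type -> feature tensor; the output has the same structure
        return self.conv(graph, node_feats)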

Hyperparameters

To train the GNN, we need to define a few hyperparameters before the training process, such as the file names of the graph constructed, the number of layers of GNN models, the training epochs, the optimizer, the optimization parameters, and more. See the following code for a subset of the configurations:

edges *=* ","*.*join(map(*lambda* x: x*.*split("/")[*-*1], [file *for* file *in* processed_files *if* "relation" *in* file]))

params *=* {'nodes' : 'features.csv',
          'edges': edges,
          'labels': 'tags.csv',
          'embedding-size': 64,
          'n-layers': 2,
          'n-epochs': 10,
          'optimizer': 'adam',
          'lr': 1e-2}

For more information about all the hyperparameters and their default values, see estimator_fns.py in the src/sagemaker/FD_SL_DGL/gnn_fraud_detection_dgl folder.

Model training with SageMaker

After the customized container Docker image is built, we use the preprocessed data to train our GNN model with the hyperparameters we defined. The training job uses the DGL, with PyTorch as the backend deep learning framework, to construct and train the GNN. SageMaker makes it easy to train GNN models with the customized Docker image, which is an input argument of the SageMaker estimator. For more information about training GNNs with the DGL on SageMaker, see Train a Deep Graph Network.

The SageMaker Python SDK uses Estimator to encapsulate training on SageMaker. An Estimator can run SageMaker-compatible custom Docker containers, enabling you to run your own ML algorithms. The following code snippet demonstrates training the model with SageMaker (either in a local environment or on cloud instances):

from sagemaker.estimator import Estimator
from time import strftime, gmtime
from sagemaker.local import LocalSession

localSageMakerSession = LocalSession(boto_session=boto3.session.Session(region_name=current_region))
estimator = Estimator(image_uri=image_name,
                      role=sagemaker_exec_role,
                      instance_count=1,
                      instance_type='local',
                      hyperparameters=params,
                      output_path=output_path,
                      sagemaker_session=localSageMakerSession)

training_job_name = "{}-{}".format('GNN-FD-SL-DGL-Train', strftime("%Y-%m-%d-%H-%M-%S", gmtime()))
print(training_job_name)

estimator.fit({'train': processed_data}, job_name=training_job_name)

After training, the GNN model’s performance on the test set is displayed in output like the following. The RGCN model can normally achieve around 0.87 AUC and more than 95% accuracy. For a comparison of the RGCN model with other ML models, refer to the Results section of the previous blog post.

Epoch 00099 | Time(s) 7.9413 | Loss 0.1023 | f1 0.3745
Metrics
Confusion Matrix:
                        labels positive labels negative
    predicted positive  4343            576
    predicted negative  13494           454019

    f1: 0.3817, precision: 0.8829, recall: 0.2435, acc: 0.9702, roc: 0.8704, pr: 0.4782, ap: 0.4782

Finished Model training

Upon the completion of model training, SageMaker packs the trained model along with other assets, including the trained node embeddings, into a ZIP file and then uploads it to a specified S3 location. Next, we discuss the deployment of the trained model for real-time fraud detection.

GNN model deployment

SageMaker makes the deployment of trained ML models simple. In this stage, we use the SageMaker PyTorchModel class to deploy the trained model, because our DGL model depends on PyTorch as the backend framework. You can find the deployment code in the src/sagemaker/03.FD_SL_Endpoint_Deployment.ipynb file.

Besides the trained model file and assets, SageMaker requires an entry point file for the deployment of a customized model. The entry point file is run and stored in the memory of an inference endpoint instance to respond to the inference request. In our case, the entry point file is the fd_sl_deployment_entry_point.py file in the src/sagemaker/FD_SL_DGL/code folder, which performs four major functions:

  • Receive requests and parse contents of requests to obtain the to-be-predicted nodes and their associated data
  • Convert the data to a DGL heterogeneous graph as input for the RGCN model
  • Perform the real-time inference via the trained RGCN model
  • Return the prediction results to the requester

Following SageMaker conventions, the first two functions are implemented in the input_fn method. See the following code (for simplicity, we removed some comments):

def input_fn(request_body, request_content_type='application/json'):

    # --------------------- receive request ------------------------------------------------ #
    input_data = json.loads(request_body)

    subgraph_dict = input_data['graph']
    n_feats = input_data['n_feats']
    target_id = input_data['target_id']

    graph, new_n_feats, new_pred_target_id = recreate_graph_data(subgraph_dict, n_feats, target_id)

    return (graph, new_n_feats, new_pred_target_id)

The constructed DGL graph and features are then passed to the predict_fn method to fulfill the third function. predict_fn takes two input arguments: the outputs of input_fn and the trained model. See the following code:

def predict_fn(input_data, model):

    # ---------------------  Inference ------------------------------------------------ #
    graph, new_n_feats, new_pred_target_id = input_data

    with th.no_grad():
        logits = model(graph, new_n_feats)
        res = logits[new_pred_target_id].cpu().detach().numpy()

    return res[1]

The model used in predict_fn is created by the model_fn method when the endpoint is invoked for the first time. The function model_fn loads the saved model file and associated assets from the model_dir argument and the SageMaker model folder. See the following code:

def model_fn(model_dir):

    # ------------------ Loading model -------------------
    ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size = \
        initialize_arguments(os.path.join(BASE_PATH, 'metadata.pkl'))

    rgcn_model = HeteroRGCN(ntype_dict, etypes, in_size, hidden_size, out_size, n_layers, embedding_size)

    stat_dict = th.load('model.pth')

    rgcn_model.load_state_dict(stat_dict)

    return rgcn_model

The output of the predict_fn method is a list of two numbers, indicating the logits for class 0 and class 1, where 0 means legitimate and 1 means fraudulent. SageMaker takes this list and passes it to an inner method called output_fn to complete the final function.
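
The entry point can also override output_fn; a minimal illustrative version (not the repo's exact implementation) simply serializes the prediction as JSON:

import json

def output_fn(prediction, content_type='application/json'):
    # Convert tensors or arrays to plain Python types before serializing
    body = prediction.tolist() if hasattr(prediction, 'tolist') else prediction
    return json.dumps(body)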

To deploy our GNN model, we first wrap the GNN model into a SageMaker PyTorchModel class with the entry point file and other parameters (the path of the saved ZIP file, the PyTorch framework version, the Python version, and so on). Then we call its deploy method with instance settings. See the following code:

env = {
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'
}

print(f'Use model {repackged_model_path}')

sagemakerSession = sm.session.Session(boto3.session.Session(region_name=current_region))
fd_sl_model = PyTorchModel(model_data=repackged_model_path, 
                           role=sagemaker_exec_role,
                           entry_point='./FD_SL_DGL/code/fd_sl_deployment_entry_point.py',
                           framework_version='1.6.0',
                           py_version='py3',
                           predictor_cls=JSONPredictor,
                           env=env,
                           sagemaker_session=sagemakerSession)
                           
fd_sl_predictor = fd_sl_model.deploy(instance_type='ml.c5.4xlarge',
                                     initial_instance_count=1)

The preceding procedures and code snippets demonstrate how to deploy your GNN model as an online inference endpoint from a Jupyter notebook. However, for production, we recommend using the previously mentioned MLOps pipeline orchestrated by Step Functions for the entire workflow, including processing data, training the model, and deploying an inference endpoint. The entire pipeline is implemented by an AWS CDK application, which can be easily replicated in different Regions and accounts.

Real-time inference

When a new transaction arrives, to perform real-time prediction, we need to complete four steps:

  1. Node and edge insertion – Extract the transaction’s information such as the TransactionID and ProductCD as nodes and edges, and insert the new nodes into the existing graph data stored at the Neptune database.
  2. Subgraph extraction – Set the to-be-predicted transaction node as the center node, and extract an n-hop subgraph according to the GNN model’s input requirements.
  3. Feature extraction – For the nodes and edges in the subgraph, extract their associated features.
  4. Call the inference endpoint – Pack the subgraph and features into the contents of a request, then send the request to the inference endpoint.

In this solution, we implement a RESTful API to achieve the real-time fraud prediction described in the preceding steps. See the following pseudo-code for real-time predictions. The full implementation is in the complete source code file.

For prediction in real time, the first three steps require low latency. Therefore, a graph database is an optimal choice for these tasks, particularly for the subgraph extraction, which can be achieved efficiently with graph database queries. The underlying functions that support the pseudo-code are based on Gremlin queries against Neptune.

def handler(event, context):
    
    graph_input = GraphModelClient(endpoints)
    
    # Step 1: node and edge insertion
    trans_dict, identity_dict, target_id, transaction_value_cols, union_li_cols = \
        load_data_from_event(event, transactions_id_cols, transactions_cat_cols, dummied_col)
    graph_input.insert_new_transaction_vertex_and_edge(trans_dict, identity_dict , target_id, vertex_type = 'Transaction')
    
    
    # Step 2: subgraph extraction
    subgraph_dict, transaction_embed_value_dict = \
        graph_input.query_target_subgraph(target_id, trans_dict, transaction_value_cols, union_li_cols, dummied_col)
    

    # Step 3 & 4: feature extraction & call the inference endpoint
    transaction_id = int(target_id[(target_id.find('-')+1):])
    pred_prob = invoke_endpoint_with_idx(endpointname = ENDPOINT_NAME, target_id = transaction_id, subgraph_dict = subgraph_dict, n_feats = transaction_embed_value_dict)
       
    function_res = {
                    'id': event['transaction_data'][0]['TransactionID'],
                    'flag': pred_prob > MODEL_BTW,
                    'pred_prob': pred_prob
                    }
       
    return function_res
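
To give a flavor of what such a query looks like, the following is a hypothetical 2-hop neighborhood traversal written with gremlin_python; the actual queries used by query_target_subgraph() are in the GitHub repo, and the endpoint and vertex ID below are placeholders.

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.graph_traversal import __

# Placeholder Neptune endpoint and target vertex ID
conn = DriverRemoteConnection('wss://<your-neptune-endpoint>:8182/gremlin', 'g')
g = traversal().withRemote(conn)
target_id = 'Transaction-2990000'  # hypothetical vertex ID

# Collect paths within 2 hops of the target transaction vertex
paths = (g.V(target_id)
          .repeat(__.bothE().otherV().simplePath())
          .times(2)
          .path()
          .toList())
conn.close()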

One caveat about real-time fraud detection using GNNs is the GNN inference mode. To fulfill real-time inference, we need to convert the GNN model inference from transductive mode to inductive mode. GNN models in transductive inference mode can’t make predictions for newly appeared nodes and edges, whereas in inductive mode, GNN models can handle new nodes and edges. A demonstration of the difference between transductive and inductive mode is shown in the following figure.

In transductive mode, the nodes and edges to be predicted coexist with the labeled nodes and edges during training, so the model has already seen them and can make predictions for them at training time. Models in inductive mode are trained on the training graph but need to predict unseen nodes (those in red dotted circles on the right) along with their associated neighbors, which might be new nodes, like the gray triangle node on the right.

Our RGCN model is trained and tested in transductive mode. It has access to all nodes during training, and also trains an embedding for each featureless node, such as IP addresses and card types. In the testing stage, the RGCN model uses these embeddings as node features to predict nodes in the test set. When we do real-time inference, however, some of the newly added featureless nodes have no such embeddings because they’re not in the training graph. One way to tackle this issue is to assign the mean of all embeddings of the same node type to the new nodes. In this solution, we adopt this method.
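
The workaround itself is simple; the following toy sketch (names and shapes are illustrative) shows the mean-embedding assignment for newly seen featureless nodes of a given type:

import torch as th

# Stand-in for the embedding table learned for one featureless node type
existing_embeddings = th.randn(1000, 64)   # (num_trained_nodes, embedding_size)
n_new_nodes = 3

# New nodes of this type receive the mean of the trained embeddings
mean_embedding = existing_embeddings.mean(dim=0, keepdim=True)   # shape (1, 64)
new_node_embeddings = mean_embedding.repeat(n_new_nodes, 1)      # shape (3, 64)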

In addition, this solution provides a web portal (as seen in the following screenshot) to demonstrate real-time fraudulent predictions from business operators’ perspectives. It can generate the simulated online transactions, and provide a live visualization of detected fraudulent transaction information.

Clean up

When you’re finished exploring the solution, you can clean up the resources to avoid incurring charges.

Conclusion

In this post, we showed how to build a GNN-based real-time fraud detection solution using SageMaker, Neptune, and the DGL. This solution has three major advantages:

  • It has good performance in terms of prediction accuracy and AUC metrics
  • It can perform real-time inference via a streaming MLOps pipeline and SageMaker endpoints
  • It automates the total deployment process with the provided CloudFormation template so that interested developers can easily test this solution with custom data in their account

For more details about the solution, see the GitHub repo.

After you deploy this solution, we recommend customizing the data processing code to fit your own data format and modifying the real-time inference mechanism while keeping the GNN model unchanged. Note that we split the real-time inference into four steps without further optimization of the latency. These four steps take a few seconds to get a prediction on the demo dataset. We believe that optimizing the Neptune graph data schema design and the queries for subgraph and feature extraction can significantly reduce the inference latency.


About the authors

Jian Zhang is an applied scientist who has been using machine learning techniques to help customers solve various problems, such as fraud detection, decoration image generation, and more. He has successfully developed graph-based machine learning, particularly graph neural network, solutions for customers in China, USA, and Singapore. As an enlightener of AWS’s graph capabilities, Zhang has given many public presentations about the GNN, the Deep Graph Library (DGL), Amazon Neptune, and other AWS services.

Mengxin Zhu is a manager of Solutions Architects at AWS, with a focus on designing and developing reusable AWS solutions. He has been engaged in software development for many years and has been responsible for several startup teams of various sizes. He also is an advocate of open-source software and was an Eclipse Committer.

Haozhu Wang is a research scientist at Amazon ML Solutions Lab, where he co-leads the Reinforcement Learning Vertical. He helps customers build advanced machine learning solutions with the latest research on graph learning, natural language processing, reinforcement learning, and AutoML. Haozhu received his PhD in Electrical and Computer Engineering from the University of Michigan.

Read More