Improve the streaming transcription experience with Amazon Transcribe partial results stabilization

Whether you’re watching a live broadcast of your favorite soccer team, having a video chat with a vendor, or calling your bank about a loan payment, streaming speech content is everywhere. You can apply a streaming transcription service to generate subtitles for content understanding and accessibility, to create metadata to enable search, or to extract insights for call analytics. These transcription services process streaming audio content and generate partial transcription results until they produce a final transcription for a segment of continuous speech. However, some words or phrases in these partial results might change as the service further understands the context of the audio.

We’re happy to announce that Amazon Transcribe now allows you to enable and configure partial results stabilization for streaming audio transcriptions. Amazon Transcribe is an automatic speech recognition (ASR) service that enables developers to add real-time speech-to-text capabilities into their applications for on-demand and streaming content. Instead of waiting for an entire sentence to be transcribed, you can now control the stabilization level of partial results. Amazon Transcribe offers three settings: High, Medium, and Low. Setting the stabilization level to High fixes a greater portion of the partial results, with only the last few words changing during the transcription process. This feature gives you more flexibility in your streaming transcription workflows based on the user experience you want to create.

In this post, we walk through the benefits of this feature and how to enable it via the Amazon Transcribe console or the API.

How partial results stabilization works

Let’s dive deeper into this with an example.

During your daily conversations, you may think you hear a certain word or phrase, but later realize that it was incorrect based on additional context. Let’s say you were talking to someone about food, and you heard them say “Tonight, I will eat a pear…” However, when the speaker finishes, you realize they actually said “Tonight, I will eat a pair of pancakes.” Just as humans may change their understanding based on the information at hand, Amazon Transcribe uses machine learning (ML) to self-correct the transcription of streaming audio based on the context it receives. To enable this, Amazon Transcribe uses partial results.

During the streaming transcription process, Amazon Transcribe outputs chunks of the results with an IsPartial flag. Results with this flag marked as true are the ones that Amazon Transcribe may change in the future depending on the additional context received. After Amazon Transcribe determines that it has sufficient context to exceed a certain confidence threshold, the results are stabilized and the IsPartial flag for that specific partial result is marked false. The window size of these partial results can range from a few words to multiple sentences depending on the stream context.

The following image displays how the partial results are generated (and edited) in Amazon Transcribe for streaming transcription.

Results stabilization enables more control over the latency and accuracy of transcription results. Depending on the use case, you may prioritize one over the other. For example, when providing live subtitles, high stabilization of results may be preferred because speed is more important than accuracy. On the other hand, for use cases like content moderation, lower stabilization is preferred because accuracy may be more important than latency.

A high stability level enables quicker stabilization of transcription results by limiting the window of context for stabilizing results, but can lead to lower overall accuracy. On the other hand, a low stability level leads to more accurate transcription results, but the partial transcription results are more likely to change.

With the streaming transcription API, you can now control the stability of the partial results in your transcription stream.

Now let’s look at how to use the feature.

Access partial results stabilization via the Amazon Transcribe console

To start using partial results stabilization on the Amazon Transcribe console, complete the following steps:

  1. On the Amazon Transcribe console, make sure you’re in a Region that supports Amazon Transcribe Streaming.

For this post, we use us-east-1.

  2. In the navigation pane, choose Real-time transcription.
  3. Under Additional settings, enable Partial results stabilization.

  4. Select your stability level.

You can choose between three levels:

  • High – Provides the most stable partial transcription results with lower accuracy compared to the Medium and Low settings. Results are less likely to change as additional context is gathered.
  • Medium – Provides partial transcription results with a balance between stability and accuracy.
  • Low – Provides relatively less stable partial transcription results with higher accuracy compared to the High and Medium settings. Results get updated as additional context is gathered and utilized.

  5. Choose Start streaming to play a stream and check the results.

Access partial results stabilization via the API

In this section, we demonstrate streaming with HTTP/2. You can enable your preferred level of partial results stabilization in an API request.

You enable this feature via the enable-partial-results-stabilization flag and the partial-results-stability input parameters:

POST /stream-transcription HTTP/2 
x-amzn-transcribe-language-code: LanguageCode 
x-amzn-transcribe-sample-rate: MediaSampleRateHertz 
x-amzn-transcribe-media-encoding: MediaEncoding 
x-amzn-transcribe-session-id: SessionId 
x-amzn-transcribe-enable-partial-results-stabilization: true
x-amzn-transcribe-partial-results-stability: low | medium | high

Enabling partial results stabilization introduces the additional parameter flag Stable in the API response at the item level in the transcription results. If a partial results item in the streaming transcription result has the Stable flag marked as true, the corresponding item transcription in the partial results doesn’t change, irrespective of any subsequent context identified by Amazon Transcribe. If the Stable flag is marked as false, there is still a chance that the corresponding item may change in the future, until the IsPartial flag is marked as false.

The following code shows our API response:

{
    "Alternatives": [
        {
            "Items": [
                {
                    "Confidence": 0,
                    "Content": "Amazon",
                    "EndTime": 1.22,
                    "Stable": true,
                    "StartTime": 0.78,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "is",
                    "EndTime": 1.63,
                    "Stable": true,
                    "StartTime": 1.46,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "the",
                    "EndTime": 1.76,
                    "Stable": true,
                    "StartTime": 1.64,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "largest",
                    "EndTime": 2.31,
                    "Stable": true,
                    "StartTime": 1.77,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 1,
                    "Content": "rainforest",
                    "EndTime": 3.34,
                    "Stable": true,
                    "StartTime": 2.4,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                }
            ],
            "Transcript": "Amazon is the largest rainforest "
        }
    ],
    "EndTime": 4.33,
    "IsPartial": false,
    "ResultId": "f4b5d4dd-b685-4736-b883-795dc3f7f636",
    "StartTime": 0.78
}
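
If you consume the stream programmatically, you can separate stable words from still-changing words as events arrive. The following is a minimal sketch using the open-source amazon-transcribe streaming SDK for Python (pip install amazon-transcribe); the audio file name, Region, and chunk size are illustrative, and the parameter names follow the SDK at the time of writing:

import asyncio

from amazon_transcribe.client import TranscribeStreamingClient
from amazon_transcribe.handlers import TranscriptResultStreamHandler
from amazon_transcribe.model import TranscriptEvent

AUDIO_PATH = "example.pcm"  # illustrative: 16 kHz, 16-bit, mono PCM audio
CHUNK_SIZE = 1024 * 8


class StabilityAwareHandler(TranscriptResultStreamHandler):
    async def handle_transcript_event(self, transcript_event: TranscriptEvent):
        for result in transcript_event.transcript.results:
            for alternative in result.alternatives:
                # With stabilization enabled, each item carries a stable flag.
                stable = [i.content for i in alternative.items if i.stable]
                pending = [i.content for i in alternative.items if not i.stable]
                print("stable:", " ".join(stable), "| pending:", " ".join(pending))


async def transcribe():
    client = TranscribeStreamingClient(region="us-east-1")
    stream = await client.start_stream_transcription(
        language_code="en-US",
        media_sample_rate_hz=16000,
        media_encoding="pcm",
        enable_partial_results_stabilization=True,
        partial_results_stability="high",
    )

    async def send_audio():
        # Feed audio to the service in small chunks, then close the stream.
        with open(AUDIO_PATH, "rb") as audio:
            while chunk := audio.read(CHUNK_SIZE):
                await stream.input_stream.send_audio_event(audio_chunk=chunk)
        await stream.input_stream.end_stream()

    handler = StabilityAwareHandler(stream.output_stream)
    await asyncio.gather(send_audio(), handler.handle_events())


asyncio.run(transcribe())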

Conclusion

This post introduces the recently launched partial results stabilization feature in Amazon Transcribe. For more information, see the Amazon Transcribe Partial results stabilization documentation.

To learn more about the Amazon Transcribe Streaming Transcription API, check out Using Amazon Transcribe streaming with HTTP/2 and Using Amazon Transcribe streaming with WebSockets.


About the Author

Alex Chirayath is an SDE in the Amazon Machine Learning Solutions Lab. He helps customers adopt AWS AI services by building solutions to address common business problems.


The Washington Post Launches Audio Articles Voiced by Amazon Polly 

AWS is excited to announce that The Washington Post is integrating Amazon Polly to provide their readers with audio access to stories across The Post’s entire spectrum of web and mobile platforms, starting with technology stories. Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Post subscribers live busy lives with limited time to read the news. The goal is to unlock the Post’s world-class written journalism in audio form and give readers a convenient way to stay up to date on the news, like listening while doing other things.

In The Post’s announcement, Kat Down Mulder, managing editor, says, “Whether you’re listening to a story while multitasking or absorbing a compelling narrative while on a walk, audio unlocks new opportunities to engage with our journalism in more convenient ways. We saw that trend throughout last year as readers who listened to audio articles on our apps engaged more than three times longer with our content. We’re doubling-down on our commitment to audio and will be experimenting rapidly and boldly in this space. The full integration of Amazon Polly within our publishing ecosystem is a big step that offers readers this powerful convenience feature at scale, while ensuring a high-quality and consistent audio experience across all our platforms for our subscribers and readers.”

Integrating Amazon Polly into The Post’s publishing workflow has been easy and straightforward. When an article is ready for publication, the written content management system (CMS) publishes the text article and simultaneously sends the text to the audio CMS, where the article text is processed by Amazon Polly to produce an audio recording of the article. The audio is delivered as an mp3 and published in conjunction with the written portion of the article.

Figure 1: High-level architecture of The Washington Post article creation workflow
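
To make the workflow concrete, the following is a minimal sketch of this kind of text-to-audio step using Amazon Polly with boto3. It is not The Post’s actual implementation; the voice, engine, and file names are illustrative:

import boto3

polly = boto3.client("polly")

article_text = "Did you get enough steps in today?"  # article body from the CMS

response = polly.synthesize_speech(
    Text=article_text,
    OutputFormat="mp3",
    VoiceId="Joanna",  # illustrative voice choice
    Engine="neural",   # assumes a Region where neural voices are available
)

# Publish the MP3 alongside the written article.
with open("article.mp3", "wb") as audio_file:
    audio_file.write(response["AudioStream"].read())

For full-length articles that exceed the synchronous API’s character limit, the asynchronous StartSpeechSynthesisTask operation can write the MP3 directly to an S3 bucket instead.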

Last year, The Post began testing article narration using the text-to-speech accessibility capabilities built into the iOS and Android operating systems. While there were promising signs around engagement, some readers noted that the voices sounded robotic. The Post started testing other options and ended up choosing Amazon Polly because of its high-quality automated voices. “We’ve tested users’ perceptions to both human and automated voices and found high levels of satisfaction with Amazon Polly’s offering. Integrating Amazon Polly into our publishing workflow also gives us the ability to offer a consistent listening experience across platforms and experiment with new functions that we believe our subscribers will enjoy.” says Ryan Luu, senior product manager at The Post.

Over the coming months, The Post will be adding voice support for new sections, new languages and better usability. “We plan to introduce new features like more playback controls, text highlighting as you listen, and audio versions of Spanish articles,” said Luu. “We also hope to give readers the ability to create audio playlists to make it easy for subscribers to queue up stories they’re interested in and enjoy that content on the go.”

Amazon Polly is a text-to-speech service that powers audio access to news articles for media publishers like Gannett (the publisher of USA Today), The Globe and Mail (the biggest newspaper in Canada), and leading publishing companies such as BlueToad and Trinity Audio. In addition, Amazon Polly provides natural sounding voices in a variety of languages and personas to give content a voice in other sectors such as education, healthcare, and gaming.

For more information, see What Is Amazon Polly? and log in to the Amazon Polly console to try it out for free. To experience The Post’s new audio articles, listen to the story “Did you get enough steps in today? Maybe one day you’ll ask your ‘smart’ shirt.”


About the Author

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.


Build an anomaly detection model from scratch with Amazon Lookout for Vision

A common problem in manufacturing is verifying that products meet quality standards. You can use manual inspection on a subset of the products, but it’s usually not scalable enough to meet demand as production grows. In this post, I go through the steps of creating an end-to-end machine vision solution that identifies visual anomalies in products using Amazon Lookout for Vision. I show you how to train an anomaly detection model, use it in real time, update it when new data is available, and monitor it.

Solution overview

Imagine a factory producing Lego bricks. The bricks are transported on a conveyor belt in front of a camera that determines if they meet the factory’s quality standards. When a brick on the belt breaks a light beam, the device takes a photo and sends it to Amazon Lookout for Vision for anomaly detection. If a defective brick is identified, it’s pushed off the belt by a pusher.

The following diagram illustrates the architecture of our anomaly detection solution, which uses Amazon Lookout for Vision, Amazon Simple Storage Service (Amazon S3), and a Raspberry Pi.

Amazon Lookout for Vision is a machine learning (ML) service that uses machine vision to help you identify visual defects in products without needing any ML experience. It uses deep learning to remove the need for carefully calibrated environments in terms of lighting and camera angle, which many existing machine vision techniques require.

To get started with Amazon Lookout for Vision, you need to provide data for the service to use when training the underlying deep learning models. The dataset used in this post consists of 289 normal and 116 anomalous images of a Lego brick, which are hosted in an S3 bucket that I have made public so you can download the dataset.

To make the scenario more realistic, I’ve varied the lighting and camera position between images. Additionally, I use 20 test images for evaluation and 9 new images, both normal and anomalous, to update the model later on. The anomalous images were created by drawing on and scratching the brick, changing the brick color, adding other bricks, and breaking off small pieces to simulate production defects. The following image shows the physical setup used when collecting training images.

Prerequisites

To follow along with this post, you’ll need the following:

  • An AWS account to train and use Amazon Lookout for Vision
  • A camera (for this post, I use a Pi camera)
  • A device that can run code (I use a Raspberry Pi 4)

Train the model

To use the dataset when training a model, you first upload the training data to Amazon S3 and create an Amazon Lookout for Vision project. A project is an abstraction around the training dataset and multiple model versions. You can think of a project as a collection of the resources that relate to a specific machine vision use case. For instance, in this post, I use one dataset but create multiple model versions as I gradually optimize the model for the use case with new data, all within the boundaries of one project.

You can use the SDK, AWS Command Line Interface (AWS CLI), and AWS Management Console to perform all the steps required to create and train a model. For this post, I use a combination of the AWS CLI and the console to train and start the model, and use the SDK to send images for anomaly detection from the Raspberry Pi.

To train the model, we complete the following high-level steps:

  1. Upload the training data to Amazon S3.
  2. Create an Amazon Lookout for Vision project.
  3. Create an Amazon Lookout for Vision dataset.
  4. Train the model.

Upload the training data to Amazon S3

To get started, complete the following steps:

  1. Download the dataset to your computer.
  2. Create an S3 bucket and upload the training data.

I named my bucket l4vdemo, but bucket names need to be globally unique, so make sure to change it if you copy the following code. Make sure to keep the folder structure in the dataset, because Amazon Lookout for Vision uses it to label normal and anomalous images automatically based on folder name. You could use the integrated labeling tool on the Amazon Lookout for Vision console or Amazon SageMaker Ground Truth to label the data, but the automatic labeler allows you to keep the folder structure and save some time.

aws s3 sync s3://aws-ml-blog/artifacts/Build-an-anomaly-detection-model-from-scratch-with-L4V/ data/

aws s3 mb s3://l4vdemo

aws s3 sync data s3://l4vdemo

Create an Amazon Lookout for Vision project

You’re now ready to create your project.

  1. On the Amazon Lookout for Vision console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter a name.
  4. Choose Create project.

Create the dataset

For this post, I create a single dataset and import the training data from the S3 bucket I uploaded the data to in Step 1.

  1. Choose Create dataset.
  2. Select Import images from S3 bucket.
  3. For S3 URI, enter the URI for your bucket (for this post, s3://l4vdemo/, but make sure to use the unique bucket name you created).

  4. For Automatic labeling, select Automatically attach labels to images based on the folder name.

This allows you to use the existing folder structure to infer whether your images are normal or anomalous.

  5. Choose Create dataset.

Train the model

After we create the dataset, the number of labeled and unlabeled images should be visible in the Filters pane, as well as the number of normal and anomalous images.

  1. To start training a deep learning model, choose Train model.


Model training can take a few hours depending on the number of images in the training dataset.

  2. When training is complete, in the navigation pane, choose Models under your project.

You should see the newly created model listed with a status of Training complete.

  3. Choose the model to see performance metrics like precision, recall and F1 score, training duration, and more model metadata.

Use the model

Now that a model is trained, let’s test it on data it hasn’t seen before. To use the model, you must first start hosting it to provision all backend resources required to perform real-time inference.

aws lookoutvision start-model \
--project-name lego-demo \
--model-version 1 \
--min-inference-units 1

When starting the model hosting, you pass both project name and model version as arguments to identify the model. You also need to specify the number of inference units to use; each unit enables approximately five requests per second.

To use the hosted model, use the detect-anomalies command and pass in the project and model version along with the image to perform inference on:

aws lookoutvision detect-anomalies \
--project-name lego-demo \
--model-version 1 \
--content-type image/jpeg \
--body test/test-1611853160.2488434.jpg

The test set in the dataset contains 20 images, and I encourage you to test the model with different images.

When performing inference on an anomalous brick, the response could look like the following:

{
    "DetectAnomalyResult": {
        "Source": {
            "Type": "direct"
        },
        "IsAnomalous": true,
        "Confidence": 0.9754859209060669
    }
}

The flag IsAnomalous is true and Amazon Lookout for Vision also provides a confidence score that tells you how sure the model is of its classification. The service always provides a binary classification, but you can use the confidence score to make more well-informed decisions, such as whether to scrap the brick directly or send it for manual inspection. You could also persist images with lower confidence scores and use them to update the model, which I show you how to do in the next section.
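
For example, the following sketch calls the hosted model with boto3 and only scraps a brick automatically when the model is confidently anomalous. The threshold and the routing functions are illustrative placeholders, not part of the service:

import boto3

lookoutvision = boto3.client("lookoutvision")
CONFIDENCE_THRESHOLD = 0.9  # illustrative; tune for your use case

with open("test/test-1611853160.2488434.jpg", "rb") as image:
    response = lookoutvision.detect_anomalies(
        ProjectName="lego-demo",
        ModelVersion="1",
        ContentType="image/jpeg",
        Body=image.read(),
    )

result = response["DetectAnomalyResult"]
if result["IsAnomalous"] and result["Confidence"] >= CONFIDENCE_THRESHOLD:
    scrap_brick()  # hypothetical: confidently anomalous, discard directly
elif result["IsAnomalous"]:
    send_to_manual_inspection()  # hypothetical: anomalous but uncertain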

Keep in mind that you’re charged for the model as long as it’s running, so stop it when you no longer need it:

aws lookoutvision stop-model \
--project-name lego-demo \
--model-version 1

Update the model

As new data becomes available, you may want to maintain or update the model to accommodate new types of defects and increase the model’s overall performance. The dataset contains nine images in the new-data folder, which I use to update the model. To update an Amazon Lookout for Vision model, you run a trial detection, verify the machine predictions to correct any that are wrong, and add the verified images to your training dataset.

Run a trial detection

To run a trial detection, complete the following steps:

  1. On the Amazon Lookout for Vision console, under your model in the navigation pane, choose Trial detections.
  2. Choose Run trial detection.
  3. For Trial name, enter a name.
  4. For Import images, select Import images from S3 bucket.
  5. For S3 URI, enter the URI of the new-data folder that you uploaded in Step 1 of training the model.

  6. Choose Run trial.

Verify machine predictions

When the trial detection is complete, you can verify the machine predictions.

  1. Choose Verify machine predictions.
  2. Select either Correct or Incorrect to label the images.


  3. When all the images have been labeled, choose Add verified images to dataset.

This updates your training dataset with the new data.

Retrain the model

After you update your training dataset with the new data, you can see that the number of labeled images in your dataset has increased, along with the number of verified images.

  1. Choose Train model to train a new version of the model.


  2. On the Models page, verify that a new model version is being trained. When the training is complete, you can view model performance metrics on the Models page and start using the new version.


Anomaly detection application

Now that I’ve trained my model, let’s use it with the Raspberry Pi to sort Lego bricks. In this use case, I’ve set up a Raspberry Pi with a camera that gets triggered whenever a break beam sensor senses a Lego brick. We use the following code:

import boto3
from picamera import PiCamera
import my_break_beam_sensor
import my_pusher

l4v_client = boto3.client('lookoutvision')
image_path = '/home/pi/Desktop/my_image.jpg'

with PiCamera() as camera:
    while True:
        if my_break_beam_sensor.isBroken():  # Replace with your own sensor.
            camera.capture(image_path)
            with open(image_path, 'rb') as image:
                response = l4v_client.detect_anomalies(ProjectName='lego-demo',
                                                       ContentType='image/jpeg',
                                                       Body=image.read(),
                                                       ModelVersion='2')

            is_anomalous = response['DetectAnomalyResult']['IsAnomalous']
            if is_anomalous:
                my_pusher.push()  # Replace with your own pusher.

Monitor the model

When the system is up and running, you can use the Amazon Lookout for Vision dashboard to visualize metadata from the projects you have running, such as the number of detected anomalies during a specific time period. The dashboard provides an overview of all current projects, as well as aggregated information like total anomaly ratio.

Pricing

The cost of the solution is based on the time to train the model and the time the model is running. You can divide the cost across all analyzed products to get a per-product cost. Assuming one brick is analyzed per second nonstop for a month, the cost of the solution, excluding hardware and training, is around $0.001 per brick, assuming we’re using 1 inference unit. However, if you increase production speed and analyze five bricks per second, the cost is around $0.0002 per brick.
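
As a rough sanity check of these numbers, the following back-of-the-envelope calculation assumes an illustrative price of $4.00 per inference unit hour; check the Amazon Lookout for Vision pricing page for current rates:

# Back-of-the-envelope per-brick cost.
PRICE_PER_UNIT_HOUR = 4.00  # assumption; verify against current pricing
HOURS_PER_MONTH = 730
monthly_cost = PRICE_PER_UNIT_HOUR * HOURS_PER_MONTH  # ~$2,920 for 1 unit

bricks_per_month = 1 * 3600 * HOURS_PER_MONTH  # 1 brick/second, ~2.6 million
print(monthly_cost / bricks_per_month)  # ~$0.0011 per brick

bricks_per_month_fast = 5 * 3600 * HOURS_PER_MONTH  # 5 bricks/second
print(monthly_cost / bricks_per_month_fast)  # ~$0.0002 per brick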

Conclusion

Now you know how to use Amazon Lookout for Vision to train, run, update, and monitor an anomaly detection application. The use case in this post is of course simplified; you will have other requirements specific to your needs. Many factors affect the total end-to-end latency when performing inference on an image. The Amazon Lookout for Vision model runs in the cloud, which means that you need to evaluate and test network availability and bandwidth to ensure that the requirements can be met. To avoid creating bottlenecks, you can use a circuit breaker in your application to manage timeouts and prevent congestion in case of network issues.

Now that you know how to train, test, use, and update an ML model for anomaly detection, try it out with your own data! To get further details about Amazon Lookout for Vision, visit the webpage!


About the Authors

Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.


Build an intelligent search solution with automated content enrichment

Unstructured data in the enterprise continues to grow, making it a challenge for customers and employees to get the information they need. Amazon Kendra is a highly accurate intelligent search service powered by machine learning (ML). It helps you easily find the content you’re looking for, even when it’s scattered across multiple locations and content repositories.

Amazon Kendra leverages deep learning and reading comprehension to deliver precise answers. It offers natural language search for a user experience that’s like interacting with a human expert. When documents don’t have a clear answer or if the question is ambiguous, Amazon Kendra returns a list of the most relevant documents for the user to choose from.

To help narrow down a list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities, for an experience similar to the Amazon.com retail site where you’re presented with filtering options on the left side of the webpage. But what if the original documents have no metadata, or users have a preference for how this information is categorized? You can automatically generate metadata using ML in order to enrich the content and make it easier to search and discover.

This post outlines how you can automate and simplify metadata generation using Amazon Comprehend Medical, a natural language processing (NLP) service that uses ML to find insights related to healthcare and life sciences (HCLS) such as medical entities and relationships in unstructured medical text. The metadata generated is then ingested as custom attributes alongside documents into an Amazon Kendra index. For repositories with documents containing generic information or information related to domains other than HCLS, you can use a similar approach with Amazon Comprehend to automate metadata generation.

To demonstrate an intelligent search solution with enriched data, we use Wikipedia pages of the medicines listed in the World Health Organization (WHO) Model List of Essential Medicines. We combine this content with metadata automatically generated using Amazon Comprehend Medical, into a unified Amazon Kendra index to make it searchable. You can visit the search application and try asking it some questions of your own, such as “What is the recommended paracetamol dose for an adult?” The following screenshot shows the results.

Solution overview

We take a two-step approach to custom content enrichment during the content ingestion process:

  1. Identify the metadata for each document using Amazon Comprehend Medical.
  2. Ingest the document along with the metadata in the search solution based on an Amazon Kendra index.

Amazon Comprehend Medical uses NLP to extract medical insights about the content of documents by extracting medical entities such as medication, medical condition, anatomical location, the relationships between entities such as route and medication, and traits such as negation. In this example, for the Wikipedia page of each medicine from the WHO Model List of Essential Medicines, we use the DetectEntitiesV2 operation of Amazon Comprehend Medical to detect the entities in the categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. We use these entities to generate the document metadata.

We prepare the Amazon Kendra index by defining custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. For each document, the DetectEntitiesV2 operation of Amazon Comprehend Medical returns a categorized list of entities. Each entity from this list with a sufficiently high confidence score (for this use case, greater than 0.97) is added to the custom attribute corresponding to its category. After all the detected entities are processed in this way, the populated attributes are used to generate the metadata JSON file corresponding to that document. Amazon Kendra has an upper limit of 10 strings for an attribute of STRING_LIST type. In this example, we take the top 10 entities with the highest frequency of occurrence in the processed document.
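
The following sketch shows this metadata generation step with boto3. It is a simplified illustration of the logic just described, not the solution’s actual script: the file names are hypothetical, and it assumes the document text fits within the detect_entities_v2 size limit (the script described later also chunks larger documents):

import json
from collections import Counter

import boto3

CATEGORIES = ["ANATOMY", "MEDICAL_CONDITION", "MEDICATION",
              "PROTECTED_HEALTH_INFORMATION", "TEST_TREATMENT_PROCEDURE",
              "TIME_EXPRESSION"]
THRESHOLD = 0.97  # minimum confidence score for an entity to be kept
TOP_N = 10        # a STRING_LIST attribute holds at most 10 strings

comprehend_medical = boto3.client("comprehendmedical")

def make_metadata(text):
    # Count each unique entity per category, keeping only confident detections.
    counts = {category: Counter() for category in CATEGORIES}
    for entity in comprehend_medical.detect_entities_v2(Text=text)["Entities"]:
        if entity["Score"] > THRESHOLD and entity["Category"] in counts:
            counts[entity["Category"]][entity["Text"].strip()] += 1
    # Keep the 10 most frequent entities in each category.
    attributes = {category: [e for e, _ in counter.most_common(TOP_N)]
                  for category, counter in counts.items() if counter}
    return {"Attributes": attributes}

# The S3 data source reads metadata from a sidecar file per document,
# for example paracetamol.txt -> paracetamol.txt.metadata.json.
with open("paracetamol.txt") as document:  # hypothetical file name
    metadata = make_metadata(document.read())
with open("paracetamol.txt.metadata.json", "w") as metadata_file:
    json.dump(metadata, metadata_file, indent=2)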

After the metadata JSON files for all the documents are created, they’re copied to the Amazon Simple Storage Service (Amazon S3) bucket configured as a data source to the Amazon Kendra index, and a data source sync is performed to ingest the documents in the index along with the metadata.

Prerequisites

To deploy and work with the solution in this post, make sure you have the following:

Architecture

We use the AWS CloudFormation template medkendratemplate.yaml to deploy an Amazon Kendra index with the custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION.

The following diagram illustrates our solution architecture.

Based on this architecture, the steps to build and use the solution are as follows:

  1. On CloudShell, a Bash script called getpages.sh downloads Wikipedia pages of the medicines and stores them as text files.
  2. A Python script called meds.py, which contains the core logic of the automation of the metadata generation, makes the detect_entities_v2 API call to Amazon Comprehend Medical to detect entities for each of the Wikipedia pages and generate metadata based on the entities returned. The steps used in this script are as follows:
    1. Split the Wikipedia page text into chunks smaller than the maximum text size allowed by the detect_entities_v2 API call.
    2. Make the detect_entities_v2 call.
    3. Filter the entities detected by the detect_entities_v2 call using a threshold confidence score (0.97 for this example).
    4. Keep track of each unique entity corresponding to its category and the frequency of occurrence of that entity.
    5. For each entity category, sort the entities in that category from highest to lowest frequency of occurrence and select the top 10 entities.
    6. Create a metadata object based on the selected entities and output it in JSON format.
  3. We use the AWS CLI to copy the text data and the metadata to the S3 bucket that is configured as a data source to the Amazon Kendra index using the S3 connector.
  4. We perform a data source sync using the Amazon Kendra console to ingest the contents of the documents along with the metadata in the Amazon Kendra index.
  5. Finally, we use the Amazon Kendra search console to make queries to the index.

Create an Amazon S3 bucket to be used as a data source

Create an Amazon S3 bucket that you will use as a data source for the Amazon Kendra index.

Deploy the infrastructure as a CloudFormation stack

To deploy the infrastructure and resources for this solution, complete the following steps:

In a separate browser tab, open the AWS Management Console, and make sure that you’re logged in to your AWS account. Click the following button to launch the CloudFormation stack to deploy the infrastructure.

After that, you should see a page similar to the following image:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. You can monitor the stack creation status on the Stack info tab. You can also look at the different tabs, such as Events, Resources, and Template. While the stack is being created, you can work on getting the data and generating the metadata in the next few steps.

Get the data and generate the metadata

To fetch your data and start generating metadata, complete the following steps:

  1. On the AWS Management Console, choose the AWS CloudShell icon (circled in red in the following image) to start AWS CloudShell.

  2. Copy the file code-data.tgz and extract its contents by using the following commands in AWS CloudShell:
aws s3 cp s3://aws-ml-blog/artifacts/build-an-intelligent-search-solution-with-automated-content-enrichment/code-data.tgz .
tar zxvf code-data.tgz

  3. Change the working directory to code-data:
cd code-data

At this point, you can choose to run the end-to-end workflow of getting the data, creating the metadata using Amazon Comprehend Medical (which takes about 35–40 minutes), and then ingesting the data along with the metadata in the Amazon Kendra index, or just complete the last step to ingest the data with the metadata that has been generated using Amazon Comprehend Medical and supplied in the package for convenience.

  4. To use the metadata supplied in the package, enter the following code and then jump to Step 6:
tar zxvf med-data-meta.tgz
  5. Perform this step to get a hands-on experience of building the end-to-end solution. The following command runs a bash script called main.sh, which calls the following scripts:
    1. prereq.sh to install prerequisites and create subdirectories to store data and metadata
    2. getpages.sh to get the Wikipedia pages of medicines in the list
    3. getmetapar.sh to call the meds.py Python script for each document
./main.sh

The Python script meds.py contains the logic to make the detect_entities_v2 call to Amazon Comprehend Medical and then process the output to produce the JSON metadata file. It takes about 30–40 minutes for this to complete.

While performing Step 5, if CloudShell times out, security tokens get refreshed, or the script stops before all the data is processed, start the CloudShell session again and run getmetapar.sh, which resumes the data processing from the point where it stopped:

./getmetapar.sh
  6. Upload the data and metadata to the S3 bucket being used as the data source for the Amazon Kendra index using the following AWS CLI commands:
aws s3 cp Data/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Data/ --recursive
aws s3 cp Meta/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Meta/ --recursive

Review Amazon Kendra configuration and start the data source sync

Before starting this step, make sure that the CloudFormation stack creation is complete. In the following steps, we start the data source sync to begin crawling and indexing documents.

  1. On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.

  1. In the navigation pane, choose Data sources.
  2. On the Settings tab, you can see the data source bucket being configured.
  3. Choose the data source and choose Sync now.

The data source sync can take 10–15 minutes to complete.

Observe Amazon Kendra index facet definition

In the navigation pane, choose Facet definition. The following screenshot shows the entries for ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. These are the categories of the entities detected by Amazon Comprehend Medical. These are defined as custom attributes in the CloudFormation template that we used to create the Amazon Kendra index. The facetable check boxes for PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION aren’t selected; therefore, these aren’t shown in the facets of the search user interface.

Query the repository of WHO Model List of Essential Medicines

We’re now ready to make queries to our search solution.

  1. On the Amazon Kendra console, navigate to your index and choose Search console.
  2. In the search field, enter What is the treatment for diabetes?

The following screenshot shows the results.

  1. Choose Filter search results to see the facets.

The headings of MEDICATION, ANATOMY, MEDICAL_CONDITION, and TEST_TREATMENT_PROCEDURE are the categories defined as Amazon Kendra facets, and the list of items underneath them are the entities of these categories as detected by Amazon Comprehend Medical in the documents being searched. PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION are not shown.

  1. Under MEDICAL_CONDITION, select pregnancy to refine the search results.

You can go back to the Facet definition page and make PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION facetable and save the configuration. Go back to the search console, make a new query, and observe the facets again. Experiment with these facets to see what suits your needs best.

Make additional queries and use the facets to refine the search results. You can use the following queries to get started, but you can also experiment with your own:

  • What is a common painkiller?
  • Is paracetamol safe for children?
  • How to manage high blood pressure?
  • When should BCG vaccine be administered?

You can observe how domain-specific facets improve the search experience.

Infrastructure cleanup

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking on the Amazon Kendra console that the index is deleted.

You must delete your data source bucket separately because it wasn’t created as part of the CloudFormation stack.

Conclusion

In this post, we demonstrated how to automate the process to enrich the content by generating domain-specific metadata for an Amazon Kendra index using Amazon Comprehend or Amazon Comprehend Medical, thereby improving the user experience for the search solution.

This example used the entities detected by Amazon Comprehend Medical to generate the Amazon Kendra metadata. Depending on the domain of the content repository, you can use a similar approach with the pretrained model or custom trained models of Amazon Comprehend. Try out our solution and let us know what you think! You can further enhance the metadata by using other elements such as protected health information (PHI) for Amazon Comprehend Medical and events, key phrases, personally identifiable information (PII), dominant language, sentiment, and syntax for Amazon Comprehend.


About the Authors

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Udi Hershkovich has been a Principal WW AI/ML Service Specialist at AWS since 2018. Prior to AWS, Udi held multiple leadership positions with AI startups and Enterprise initiatives including co-founder and CEO at LeanFM Technologies, offering ML-powered predictive maintenance in facilities management, CEO of Safaba Translation Solutions, a machine translation startup acquired by Amazon in 2015, and Head of Professional Services for Contact Center Intelligence at Amdocs. Udi holds Law and Business degrees from the Interdisciplinary Center in Herzliya, Israel, and lives in Pittsburgh, Pennsylvania, USA.


Create a serverless pipeline to translate large documents with Amazon Translate

In our previous post, we described how to translate documents using the real-time translation API from Amazon Translate and AWS Lambda. However, this method may not work for larger files: they can take too long and trigger the 15-minute timeout of Lambda functions. You can use the batch translation API instead, but it’s available in only seven AWS Regions (as of this blog’s publication). To enable translation of large files in Regions where batch translation isn’t supported, we created the following solution.

In this post, we walk you through performing translation of large documents.

Architecture overview

Compared to the architecture featured in the post Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API, our architecture has one key difference: the presence of AWS Step Functions, a serverless function orchestrator that makes it easy to sequence Lambda functions and multiple services into business-critical applications. Step Functions allows us to keep track of the translation run, manage retries in case of errors or timeouts, and orchestrate event-driven workflows.

The following diagram illustrates our solution architecture.

This event-driven architecture shows the flow of actions when a new document lands in the input Amazon Simple Storage Service (Amazon S3) bucket. This event triggers the first Lambda function, which acts as the starting point of the Step Functions workflow.

The following diagram illustrates the state machine and the flow of actions.

The Process Document Lambda function is triggered when the state machine starts; this function performs all the activities required to translate the documents. It accesses the file from the S3 bucket, downloads it locally in the environment in which the function is run, reads the file contents, extracts short segments from the document that can be passed through the real-time translation API, and uses the API’s output to create the translated document.

Other mechanisms are implemented within the code to avoid failures, such as handling Amazon Translate throttling errors and Lambda function timeouts by checkpointing the progress made to the /tmp folder 30 seconds before the function times out. These mechanisms are critical for handling large text documents.

When the function has successfully finished processing, it uploads the translated text document in the output S3 bucket inside a folder for the target language code, such as en for English. The Step Functions workflow ends when the Lambda function moves the input file from the /drop folder to the /processed folder within the input S3 bucket.
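
The move itself is a copy-then-delete, because Amazon S3 has no native move operation. A minimal sketch of that final step (with the bucket name and prefixes as described above) could look like the following:

import boto3

s3 = boto3.client("s3")

def move_to_processed(bucket, key):
    # Copy the object to the processed/ prefix, then delete the original.
    destination_key = key.replace("drop/", "processed/", 1)
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": key},
                   Key=destination_key)
    s3.delete_object(Bucket=bucket, Key=key)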

We now have all the pieces in place to try this in action.

Deploy the solution using AWS CloudFormation

You can deploy this solution in your AWS account by launching the provided AWS CloudFormation stack. The CloudFormation template provisions the necessary resources needed for the solution. The template creates the stack in the us-east-1 Region, but you can use the template to create your stack in any Region where Amazon Translate is available. As of this writing, Amazon Translate is available in 16 commercial Regions and AWS GovCloud (US-West). For the latest list of Regions, see the AWS Regional Services List.

To deploy the application, complete the following steps:

  1. Launch the CloudFormation template by choosing Launch Stack:

  2. Choose Next.

Alternatively, on the AWS CloudFormation console, choose Create stack with new resources (standard), choose Amazon S3 URL as the template source, enter https://s3.amazonaws.com/aws-ml-blog/artifacts/create-a-serverless-pipeline-to-translate-large-docs-amazon-translate/translate.yml, and choose Next.

  3. For Stack name, enter a unique stack name for this account; for example, serverless-document-translation.
  4. For InputBucketName, enter a unique name for the S3 bucket the stack creates; for example, serverless-translation-input-bucket.

The documents are uploaded to this bucket before they are translated. Use only lower-case characters and no spaces when you provide the name of the input S3 bucket. This operation creates a new bucket, so don’t use the name of an existing bucket. For more information, see Bucket naming rules.

  5. For OutputBucketName, enter a unique name for your output S3 bucket; for example, serverless-translation-output-bucket.

This bucket stores the documents after they are translated. Follow the same naming rules as your input bucket.

  6. For SourceLanguageCode, enter the language code that your input documents are in; for this post we enter auto to detect the dominant language.
  7. For TargetLanguageCode, enter the language code that you want your translated documents in; for example, en for English.

For more information about supported language codes, see Supported Languages and Language Codes.

  8. Choose Next.

  9. On the Configure stack options page, set any additional parameters for the stack, including tags.
  10. Choose Next.
  11. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  12. Choose Create stack.

Stack creation takes about a minute to complete.

Translate your documents

You can now upload a text document that you want to translate into the input S3 bucket, under the drop/ folder.

The following screenshot shows our sample document, which contains a sentence in Greek.

This action starts the workflow, and the translated document automatically shows up in the output S3 bucket, in the folder for the target language (for this example, en). The length of time for the file to appear depends on the size of the input document.

Our translated file looks like the following screenshot.

You can also track the state machine’s progress on the Step Functions console, or with the relevant API calls.

Let’s try the solution with a larger file. The test_large.txt file contains content from multiple AWS blog posts and other content written in German (for example, we use all the text from the post AWS DeepLens (Version 2019) kommt nach Deutschland und in weitere Länder).

This file is much bigger than the file in the previous test. We upload the file to the drop/ folder of the input bucket.

On the Step Functions console, you can confirm that the pipeline is running by checking the status of the state machine.

On the Graph inspector page, you can get more insights on the status of the state machine at any given point. When you choose a step, the Step output tab shows the completion percentage.

When the state machine is complete, you can retrieve the translated file from the output bucket.

The following screenshot shows that our file is translated in English.

Troubleshooting

If you don’t see the translated document in the output S3 bucket, check Amazon CloudWatch Logs for the corresponding Lambda function and look for potential errors. For cost-optimization, by default, the solution uses 256 MB of memory for the Process Document Lambda function. While processing a large document, if you see Runtime.ExitError for the function in the CloudWatch Logs, increase the function memory.

Other considerations

It’s worth highlighting the power of the automatic language detection feature of Amazon Translate, captured as auto in the SourceLanguageCode field that we specified when deploying the CloudFormation stack. In the previous examples, we submitted a file containing text in Greek and another file in German, and they were both successfully translated into English. With our solution, you don’t have to redeploy the stack (or manually change the source language code in the Lambda function) every time you upload a source file with a different language. Amazon Translate detects the source language and starts the translation process. Post deployment, if you need to change the target language code, you can either deploy a new CloudFormation stack or update the existing stack.

This solution uses the Amazon Translate synchronous real-time API. It handles the maximum document size limit (5,000 bytes) by splitting the document into paragraphs (ending with a newline character). If needed, it further splits each paragraph into sentences (ending with a period). You can modify these delimiters based on your source text. This solution can support a maximum of 5,000 bytes for a single sentence and it only handles UTF-8 formatted text documents with .txt or .text file extensions. You can modify the Python code in the Process Document Lambda function to handle different file formats.
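
The following sketch captures the core of that splitting-and-translating logic. It is a simplified illustration, not the solution’s actual Lambda code; in particular, it rejoins everything with newlines and omits the retry and checkpointing mechanisms described earlier:

import boto3

translate = boto3.client("translate")
MAX_BYTES = 5000  # maximum document size handled per real-time request

def split_text(text):
    # Split on newlines first; split oversized paragraphs into sentences.
    for paragraph in text.split("\n"):
        if len(paragraph.encode("utf-8")) <= MAX_BYTES:
            yield paragraph
        else:
            for sentence in paragraph.split("."):
                yield sentence

def translate_document(text, target_language="en"):
    translated = []
    for chunk in split_text(text):
        if not chunk.strip():
            continue  # skip empty segments
        response = translate.translate_text(
            Text=chunk,
            SourceLanguageCode="auto",  # let Amazon Translate detect the language
            TargetLanguageCode=target_language,
        )
        translated.append(response["TranslatedText"])
    return "\n".join(translated)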

In addition to Amazon S3 costs, the solution incurs usage costs from Amazon Translate, Lambda, and Step Functions. For more information, see Amazon Translate pricing, Amazon S3 pricing, AWS Lambda pricing, and AWS Step Functions pricing.

Conclusion

In this post, we showed the implementation of a serverless pipeline that can translate documents in real time using the real-time translation feature of Amazon Translate and the power of Step Functions as orchestrators of individual Lambda functions. This solution allows for more control and for adding sophisticated functionality to your applications. Come build your advanced document translation pipeline with Amazon Translate!

For more information, see the Amazon Translate Developer Guide and Amazon Translate resources. If you’re new to Amazon Translate, try it out using our Free Tier, which offers 2 million characters per month for free for the first 12 months, starting from your first translation request.


About the Authors

Jay Rao is a Senior Solutions Architect at AWS. He enjoys providing technical guidance to customers and helping them design and implement solutions on AWS.

 Seb Kasprzak is a Solutions Architect at AWS. He spends his days at Amazon helping customers solve their complex business problems through use of Amazon technologies.

Nikiforos Botis is a Solutions Architect at AWS. He enjoys helping his customers succeed in their cloud journey, and is particularly interested in AI/ML technologies.

Bobbie Couhbor is a Senior Solutions Architect for Digital Innovation at AWS, helping customers solve challenging problems with emerging technology, such as machine learning, robotics, and IoT.


How Genworth built a serverless ML pipeline on AWS using Amazon SageMaker and AWS Glue

This post is co-written with Liam Pearson, a Data Scientist at Genworth Mortgage Insurance Australia Limited.

Genworth Mortgage Insurance Australia Limited is a leading provider of lenders mortgage insurance (LMI) in Australia; their shares are traded on the Australian Stock Exchange as ASX: GMA.

Genworth Mortgage Insurance Australia Limited is a lenders mortgage insurer with over 50 years of experience and volumes of data collected, including data on dependencies between mortgage repayment patterns and insurance claims. Genworth wanted to use this historical information to train Predictive Analytics for Loss Mitigation (PALM) machine learning (ML) models. With the ML models, Genworth could analyze recent repayment patterns for each of the insurance policies to prioritize them in descending order of likelihood (chance of a claim) and impact (amount insured). Genworth wanted to run batch inference on ML models in parallel and on schedule while keeping the amount of effort to build and operate the solution to the minimum. Therefore, Genworth and AWS chose Amazon SageMaker batch transform jobs and serverless building blocks to ingest and transform data, perform ML inference, and process and publish the results of the analysis.

Genworth’s Advanced Analytics team engaged in an AWS Data Lab program led by Data Lab engineers and solutions architects. In a pre-lab phase, they created a solution architecture to fit specific requirements Genworth had, especially around security controls, given the nature of the financial services industry. After the architecture was approved and all AWS building blocks identified, training needs were determined. AWS Solutions Architects conducted a series of hands-on workshops to provide the builders at Genworth with the skills required to build the new solution. In a 4-day intensive collaboration, called a build phase, the Genworth Advanced Analytics team used the architecture and learnings to build an ML pipeline that fits their functional requirements. The pipeline is fully automated and serverless, meaning that there are no maintenance tasks, scaling issues, or downtime. Post-lab activities were focused on productizing the pipeline and adopting it as a blueprint for other ML use cases.

In this post, we (the joint team of Genworth and AWS Architects) explain how we approached the design and implementation of the solution, the best practices we followed, the AWS services we used, and the key components of the solution architecture.

Solution overview

We followed the modern ML pipeline pattern to implement a PALM solution for Genworth. The pattern allows ingestion of data from various sources, followed by transformation, enrichment, and cleaning of the data, then ML prediction steps, finishing up with the results made available for consumption with or without data wrangling of the output.

In short, the solution implemented has three components:

  • Data ingestion and preparation
  • ML batch inference using three custom developed ML models
  • Data post processing and publishing for consumption

The following is the architecture diagram of the implemented solution.

Let’s discuss the three components in more detail.

Component 1: Data ingestion and preparation

Genworth source data is published weekly into a staging table in their Oracle on-premises database. The ML pipeline starts with an AWS Glue job (Step 1, Data Ingestion, in the diagram) connecting to the Oracle database over an AWS Direct Connect connection secured with VPN to ingest raw data and store it in an encrypted Amazon Simple Storage Service (Amazon S3) bucket. Then a Python shell job runs using AWS Glue (Step 2, Data Preparation) to select, clean, and transform the features used later in the ML inference steps. The results are stored in another encrypted S3 bucket used for curated datasets that are ready for ML consumption.

Component 2: ML batch inference

Genworth’s Advanced Analytics team has already been using ML on premises. They wanted to reuse pretrained model artifacts to implement a fully automated ML inference pipeline on AWS. Furthermore, the team wanted to establish an architectural pattern for future ML experiments and implementations, allowing them to iterate and test ideas quickly in a controlled environment.

The three existing ML artifacts forming the PALM model were implemented as a hierarchical TensorFlow neural network model using Keras. The models seek to predict the probability of an insurance policy submitting a claim, the estimated probability of a claim being paid, and the magnitude of that possible claim.

Because each ML model is trained on different data, the input data needs to be standardized accordingly. Individual AWS Glue Python shell jobs perform this data standardization specific to each model. Three ML models are invoked in parallel using SageMaker batch transform jobs (Step 3, ML Batch Prediction) to perform the ML inference and store the prediction results in the model outputs S3 bucket. SageMaker batch transform manages the compute resources, installs the ML model, handles data transfer between Amazon S3 and the ML model, and easily scales out to perform inference on the entire dataset.
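
To illustrate what starting one of these parallel jobs involves, the following is a sketch using boto3; the job and model names, instance type, and S3 paths are illustrative, not Genworth’s actual configuration:

import boto3

sagemaker = boto3.client("sagemaker")

# Illustrative: each of the three PALM models gets its own transform job.
sagemaker.create_transform_job(
    TransformJobName="palm-claim-probability-batch-001",
    ModelName="palm-claim-probability",  # a model already created in SageMaker
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://curated-data-bucket/palm/claim-probability/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # send the dataset to the model record by record
    },
    TransformOutput={
        "S3OutputPath": "s3://model-outputs-bucket/palm/claim-probability/",
        "AssembleWith": "Line",
    },
    TransformResources={
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
    },
)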

Component 3: Data postprocessing and publishing

Before the prediction results from the three ML models are ready for use, they require a series of postprocessing steps, which were performed using AWS Glue Python shell jobs. The results are aggregated and scored (Step 4, PALM Scoring), business rules are applied (Step 5, Business Rules), output files are generated (Step 6, User Files Generation), and the data in the files is validated (Step 7, Validation) before the output is published back to a table in the on-premises Oracle database (Step 8, Delivering the Results). The solution uses Amazon Simple Notification Service (Amazon SNS) and Amazon CloudWatch Events to notify users via email when new data becomes available or any issues occur (Step 10, Alerts & Notifications).
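The notification step itself can be as simple as an SNS publish call; the following minimal sketch uses a placeholder topic ARN and message:

```python
# Illustrative alert/notification step; the topic ARN is a placeholder.
import boto3

sns = boto3.client("sns")
sns.publish(
    TopicArn="arn:aws:sns:ap-southeast-2:123456789012:palm-pipeline-alerts",
    Subject="PALM pipeline run complete",
    Message="New PALM scoring results have been published to the results table.",
)
```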

All of the steps in the ML pipeline are decoupled and orchestrated using AWS Step Functions, giving Genworth the ease of implementation, the ability to focus on the business logic instead of the scaffolding, and the flexibility they need for future experiments and other ML use cases. The following diagram shows the ML pipeline orchestration using a Step Functions state machine.
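The following is a simplified sketch of how such a state machine could be defined and created with boto3. The job names, ARNs, and state graph are illustrative rather than Genworth’s actual machine, and only one of the three parallel inference branches is shown:

```python
# Simplified, illustrative Step Functions definition; job names, ARNs,
# and the state graph are placeholders.
import json
import boto3

definition = {
    "Comment": "Illustrative PALM ML pipeline orchestration",
    "StartAt": "DataIngestion",
    "States": {
        "DataIngestion": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "palm-data-ingestion"},
            "Next": "DataPreparation",
        },
        "DataPreparation": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "palm-data-preparation"},
            "Next": "MLBatchPrediction",
        },
        "MLBatchPrediction": {
            # The three model inferences run in parallel (Step 3);
            # the other two branches are analogous to this one.
            "Type": "Parallel",
            "Branches": [
                {
                    "StartAt": "ClaimProbabilityTransform",
                    "States": {
                        "ClaimProbabilityTransform": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
                            "Parameters": {
                                # In practice, make the job name unique per execution.
                                "TransformJobName": "palm-claim-probability-run",
                                "ModelName": "palm-claim-probability",
                                "TransformInput": {
                                    "DataSource": {
                                        "S3DataSource": {
                                            "S3DataType": "S3Prefix",
                                            "S3Uri": "s3://example-curated-zone/claim-probability/",
                                        }
                                    }
                                },
                                "TransformOutput": {
                                    "S3OutputPath": "s3://example-model-outputs/claim-probability/"
                                },
                                "TransformResources": {
                                    "InstanceType": "ml.m5.xlarge",
                                    "InstanceCount": 1,
                                },
                            },
                            "End": True,
                        }
                    },
                }
            ],
            "Next": "PostProcessing",
        },
        "PostProcessing": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "palm-scoring"},
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="palm-ml-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/example-stepfunctions-role",
)
```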

Business benefit and what’s next

By building a modern ML platform, Genworth was able to automate an end-to-end ML inference process, which ingests data from an Oracle database on premises, performs ML operations, and helps the business make data-driven decisions. Machine learning helps Genworth simplify high-value manual work performed by the Loss Mitigation team.

This Data Lab engagement has demonstrated the importance of making modern ML and analytics tools available to teams within an organization. It has been a remarkable experience witnessing how quickly an idea can be piloted and, if successful, productionized.

In this post, we showed you how easy it is to build a serverless ML pipeline at scale with AWS data analytics and ML services. As we discussed, you can use AWS Glue for serverless, managed ETL processing and SageMaker for all your ML needs. All the best on your build!

Genworth, Genworth Financial, and the Genworth logo are registered service marks of Genworth Financial, Inc. and used pursuant to license.


About the Authors

Liam Pearson is a Data Scientist at Genworth Mortgage Insurance Australia Limited who builds and deploys ML models for various teams within the business. In his spare time, Liam enjoys seeing live music, swimming, and, like a true millennial, some smashed avocado.


Maria Sokolova is a Solutions Architect at Amazon Web Services. She helps enterprise customers modernize legacy systems and accelerates critical projects by providing technical expertise and transformation guidance where they’re needed most.


Vamshi Krishna Enabothala is a Data Lab Solutions Architect at AWS. Vamshi works with customers on their use cases, architects a solution to solve their business problems, and helps them build a scalable prototype. Outside of work, Vamshi is an RC enthusiast, building and playing with RC equipment (cars, boats, and drones), and also enjoys gardening.


Perform batch fraud predictions with Amazon Fraud Detector without writing code or integrating an API

Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud. Unlike general-purpose machine learning (ML) packages, Amazon Fraud Detector is designed specifically to detect fraud. Amazon Fraud Detector combines your data, the latest in ML science, and more than 20 years of fraud detection experience from Amazon.com and AWS to build ML models tailor-made to detect fraud in your business.

After you train a fraud detection model that is customized to your business, you create rules to interpret the model’s outputs and create a detector to contain both the model and rules. You can then evaluate online activities for fraud in real time by calling your detector through the GetEventPrediction API and passing details about a single event in each request. But what if you don’t have the engineering support to integrate the API, or you want to quickly evaluate many events at once? Previously, you needed to create a custom solution using AWS Lambda and Amazon Simple Storage Service (Amazon S3). This required you to write and maintain code, and it could only evaluate a maximum of 4,000 events at once. Now, you can generate batch predictions in Amazon Fraud Detector to quickly and easily evaluate a large number of events for fraud.
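For reference, a single real-time evaluation through the GetEventPrediction API looks roughly like the following sketch; the detector ID, event type, entity, and variables are placeholders for your own setup:

```python
# Illustrative single-event, real-time prediction call; the detector ID,
# event type, entity, and variables below are placeholders.
import boto3

fd = boto3.client("frauddetector")

response = fd.get_event_prediction(
    detectorId="new_account_detector",
    eventId="802454d3-f7d8-482d-97e8-c4b6db9a0428",
    eventTypeName="new_account_registration",
    entities=[{"entityType": "customer", "entityId": "12345"}],
    eventTimestamp="2021-06-01T03:45:29Z",
    eventVariables={
        "email_address": "user@example.com",
        "ip_address": "192.0.2.100",
    },
)
print(response["ruleResults"])   # outcomes returned by your rules
print(response["modelScores"])   # risk scores from any models called
```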

Solution overview

To use the batch predictions feature, you must complete the following high-level steps:

  1. Create and publish a detector that contains your fraud prediction model and rules, or simply a ruleset.
  2. Create an input S3 bucket to upload your file to and, optionally, an output bucket to store your results.
  3. Create a CSV file that contains all the events you want to evaluate.
  4. Perform a batch prediction job through the Amazon Fraud Detector console.
  5. Review your results in the CSV file that is generated and stored to Amazon S3.

Create and publish a detector

You can create and publish a detector version using the Amazon Fraud Detector console or via the APIs. For console instructions, see Get started (console).

Create the input and output S3 buckets

Create an S3 bucket on the Amazon S3 console where you upload your CSV files. This is your input bucket. Optionally, you can create a second output bucket where Amazon Fraud Detector stores the results of your batch predictions as CSV files. If you don’t specify an output bucket, Amazon Fraud Detector stores both your input and output files in the same bucket.

Make sure you create your buckets in the same Region as your detector. For more information, see Creating a bucket.
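A minimal boto3 sketch might look like the following; the bucket names and Region are placeholders:

```python
# Sketch of creating the input and (optional) output buckets in the
# same Region as the detector; bucket names are placeholders.
import boto3

region = "us-east-1"  # must match your detector's Region
s3 = boto3.client("s3", region_name=region)

s3.create_bucket(Bucket="example-fraud-batch-input")
s3.create_bucket(Bucket="example-fraud-batch-output")
# Outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": region}.
```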

Create a sample CSV file of event records

Prepare a CSV file that contains the events you want to evaluate. In this file, include a column for each variable in the event type associated with your detector. In addition, include columns for:

  • EVENT_ID – An identifier for the event, such as a transaction number. The field values must satisfy the following regular expression pattern: ^[0-9a-z_-]+$.
  • ENTITY_ID – An identifier for the entity performing the event, such as an account number. The field values must also satisfy the following regular expression pattern: ^[0-9a-z_-]+$.
  • EVENT_TIMESTAMP – A timestamp, in ISO 8601 format, for when the event occurred.
  • ENTITY_TYPE – The entity that performs the event, such as a customer or a merchant.

Column header names must match their corresponding Amazon Fraud Detector variable names exactly. The preceding four required column header names must be uppercase, and the column header names for the variables associated with your event type must be lowercase. You receive an error for any events in your file that have missing values.

In your CSV file, each row corresponds to one event for which you want to generate a prediction. The CSV file can be up to 50 MB, which allows for about 50,000 to 100,000 events depending on your event size. The following screenshot shows an example of an input CSV file.

For more information about Amazon Fraud Detector variable data types and formatting, see Create a variable.
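As an illustration, the following snippet writes a small input file in this format; the two lowercase variables (ip_address and email_address) are hypothetical and should be replaced with the variables of your own event type:

```python
# Hypothetical snippet that writes an input file in the required format;
# the event variables shown are examples only.
import csv

rows = [
    {
        "EVENT_ID": "event-0001",          # must match ^[0-9a-z_-]+$
        "ENTITY_ID": "account-123",        # must match ^[0-9a-z_-]+$
        "EVENT_TIMESTAMP": "2021-05-30T13:45:00Z",  # ISO 8601
        "ENTITY_TYPE": "customer",
        "ip_address": "192.0.2.50",        # lowercase: matches a variable name
        "email_address": "user@example.com",
    },
]

with open("batch_input.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
```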

Perform a batch prediction

Upload your CSV file to your input bucket. Now it’s time to start a batch prediction job.

  1. On the Amazon Fraud Detector console, choose Batch predictions in the navigation pane.

This page contains a summary of past batch prediction jobs.

  2. Choose New batch prediction.

  3. For Job name, you can enter a name for your job or let Amazon Fraud Detector assign a random name.
  4. For Detector and Detector version, choose the detector and version you want to use for your batch prediction.
  5. For IAM role, if you already have an AWS Identity and Access Management (IAM) role, you can choose it from the drop-down menu. Alternatively, you can create one by choosing Create IAM role.

When creating a new IAM role, you can specify different buckets for the input and output files or enter the same bucket name for both.

If you use an existing IAM role, such as the one that you use for accessing datasets for model training, you need to ensure the role has the s3:PutObject permission attached before starting a batch prediction job.
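For example, you could attach an inline policy granting that permission with boto3, as in the following sketch; the role, policy, and bucket names are placeholders:

```python
# One way to attach the required permission to an existing role; the
# role, policy, and bucket names are placeholders.
import json
import boto3

iam = boto3.client("iam")

iam.put_role_policy(
    RoleName="existing-fraud-detector-role",
    PolicyName="allow-batch-prediction-output",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::example-fraud-batch-output/*",
        }],
    }),
)
```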

  6. After you choose your IAM role, for Data Location, enter the S3 URI for your input file.
  7. Choose Start.

You’re returned to the Batch predictions page, where you can see the job you just created. Batch prediction job processing times vary based on how many events you’re evaluating. For example, a 20 MB file (about 20,000 events) takes about 12 minutes. You can view the status of the job at any time on the Amazon Fraud Detector console. Choosing the job name opens a job detail page with additional information like the input and output data locations.
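If you prefer to check job status programmatically, the GetBatchPredictionJobs API returns the same information; the following is a minimal sketch with a placeholder job ID:

```python
# Programmatic alternative to the console for checking job status; the
# job ID is a placeholder.
import boto3

fd = boto3.client("frauddetector")

jobs = fd.get_batch_prediction_jobs(jobId="my-batch-job")
for job in jobs["batchPredictions"]:
    print(job["jobId"], job["status"])
```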

Review your batch prediction results

After the job is complete, you can download your output file from the S3 bucket you designated. To find the file quickly, choose the link under Output data location on the job detail page.

The output file has all the columns you provided in your input file, plus three additional columns:

  • STATUS – Shows Success if the event was successfully evaluated or an error code if the event couldn’t be evaluated
  • OUTCOMES – Denotes which outcomes were returned by your ruleset
  • MODEL_SCORES – Denotes the risk scores that were returned by any models called by your ruleset

The following screenshot shows an example of an output CSV file.
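As a quick illustration of working with these columns, the following sketch downloads the output file and filters for a hypothetical "review" outcome; the bucket, key, outcome name, and threshold logic are placeholders:

```python
# Sketch of pulling the output file and flagging high-risk events; the
# bucket, object key, and outcome name are illustrative.
import boto3
import pandas as pd

s3 = boto3.client("s3")
s3.download_file("example-fraud-batch-output", "batch_input.csv.out", "results.csv")

results = pd.read_csv("results.csv")

# Keep only rows that were evaluated successfully.
ok = results[results["STATUS"] == "Success"]

# Flag events where the rules returned a hypothetical "review" outcome.
high_risk = ok[ok["OUTCOMES"].str.contains("review", na=False)]
print(high_risk[["EVENT_ID", "OUTCOMES", "MODEL_SCORES"]])
```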

Conclusion

Congrats! You have successfully performed a batch of fraud predictions. You can use the batch predictions feature to test changes to your fraud detection logic, such as a new model version or updated rules. You can also use batch predictions to perform asynchronous fraud evaluations, like a daily check of all accounts created in the past 24 hours.

Depending on your use case, you may want to use your prediction results in other AWS services. For example, you can analyze the prediction results in Amazon QuickSight or send results that are high risk to Amazon Augmented AI (Amazon A2I) for a human review of the prediction. You may also want to use Amazon CloudWatch to schedule recurring batch predictions.
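For instance, a scheduled rule could invoke a Lambda function that starts a new job through the CreateBatchPredictionJob API; the following sketch uses placeholder names, paths, and detector details:

```python
# Illustrative Lambda handler that a scheduled CloudWatch Events rule
# could invoke to start a recurring batch prediction; all names, paths,
# and the detector configuration are placeholders.
import time
import boto3

fd = boto3.client("frauddetector")

def handler(event, context):
    # Job IDs must be unique, so include a timestamp.
    job_id = f"daily-account-check-{int(time.time())}"
    fd.create_batch_prediction_job(
        jobId=job_id,
        inputPath="s3://example-fraud-batch-input/daily/batch_input.csv",
        outputPath="s3://example-fraud-batch-output/daily/",
        eventTypeName="new_account_registration",
        detectorName="new_account_detector",
        detectorVersion="1",
        iamRoleArn="arn:aws:iam::123456789012:role/example-fraud-batch-role",
    )
    return {"jobId": job_id}
```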

Amazon Fraud Detector has a 2-month free trial that includes 30,000 predictions per month. After that, pricing starts at $0.005 per prediction for rules-only predictions and $0.03 for ML-based predictions. For more information, see Amazon Fraud Detector pricing. For more information about Amazon Fraud Detector, including links to additional blog posts, sample notebooks, user guide, and API documentation, see Amazon Fraud Detector.

If you have any questions or comments, let us know in the comments!


About the Author

Bilal Ali is a Sr. Product Manager working on Amazon Fraud Detector. He listens to customers’ problems and finds ways to help them better fight fraud and abuse. He spends his free time watching old Jeopardy episodes and searching for the best tacos in Austin, TX.
