Amazon AWS – Page 278

From pure mathematician to Amazon applied scientist

May 10, 2021

by admin Amazon AWS

Early on, Giovanni Paolini knew little about machine learning — now he’s leading new science on artificial intelligence that could inform AWS products.Read More

Build an intelligent search solution with automated content enrichment

May 7, 2021

by Abhinav Jawadekar Amazon AWS

Unstructured data belonging to the enterprise continues to grow, making it a challenge for customers and employees to get the information they need. Amazon Kendra is a highly accurate intelligent search service powered by machine learning (ML). It helps you easily find the content you’re looking for, even when it’s scattered across multiple locations and content repositories.

Amazon Kendra leverages deep learning and reading comprehension to deliver precise answers. It offers natural language search for a user experience that’s like interacting with a human expert. When documents don’t have a clear answer or if the question is ambiguous, Amazon Kendra returns a list of the most relevant documents for the user to choose from.

To help narrow down a list of relevant documents, you can assign metadata at the time of document ingestion to provide filtering and faceting capabilities, for an experience similar to the Amazon.com retail site where you’re presented with filtering options on the left side of the webpage. But what if the original documents have no metadata, or users have a preference for how this information is categorized? You can automatically generate metadata using ML in order to enrich the content and make it easier to search and discover.

This post outlines how you can automate and simplify metadata generation using Amazon Comprehend Medical, a natural language processing (NLP) service that uses ML to find insights related to healthcare and life sciences (HCLS) such as medical entities and relationships in unstructured medical text. The metadata generated is then ingested as custom attributes alongside documents into an Amazon Kendra index. For repositories with documents containing generic information or information related to domains other than HCLS, you can use a similar approach with Amazon Comprehend to automate metadata generation.

To demonstrate an intelligent search solution with enriched data, we use Wikipedia pages of the medicines listed in the World Health Organization (WHO) Model List of Essential Medicines. We combine this content with metadata automatically generated using Amazon Comprehend Medical, into a unified Amazon Kendra index to make it searchable. You can visit the search application and try asking it some questions of your own, such as “What is the recommended paracetamol dose for an adult?” The following screenshot shows the results.

Solution overview

We take a two-step approach to custom content enrichment during the content ingestion process:

Identify the metadata for each document using Amazon Comprehend Medical.
Ingest the document along with the metadata in the search solution based on an Amazon Kendra index.

Amazon Comprehend Medical uses NLP to extract medical insights about the content of documents by extracting medical entities such as medication, medical condition, anatomical location, the relationships between entities such as route and medication, and traits such as negation. In this example, for the Wikipedia page of each medicine from the WHO Model List of Essential Medicines, we use the DetectEntitiesV2 operation of Amazon Comprehend Medical to detect the entities in the categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. We use these entities to generate the document metadata.

We prepare the Amazon Kendra index by defining custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. For each document, the DetectEntitiesV2 operation of Amazon Comprehend Medical returns a categorized list of entities. Each entity from this list with a sufficiently high confidence score (for this use case, greater than 0.97) is added to the custom attribute corresponding to its category. After all the detected entities are processed in this way, the populated attributes are used to generate the metadata JSON file corresponding to that document. Amazon Kendra has an upper limit of 10 strings for an attribute of STRING_LIST type. In this example, we take the top 10 entities with the highest frequency of occurrence in the processed document.

After the metadata JSON files for all the documents are created, they’re copied to the Amazon Simple Storage Service (Amazon S3) bucket configured as a data source to the Amazon Kendra index, and a data source sync is performed to ingest the documents in the index along with the metadata.

Prerequisites

To deploy and work with the solution in this post, make sure you have the following:

An AWS account with privileges to create AWS Identity and Access Management (IAM) roles and policies. For more information, see Overview of access management: Permissions and policies.
Basic knowledge of AWS and the AWS Command Line Interface (AWS CLI). For more information about the AWS CLI, see AWS CLI Command Reference.
An S3 bucket to store the documents and metadata. For more information, see Creating a bucket and What is Amazon S3?
Access to AWS CloudShell, Amazon Kendra, and Amazon Comprehend Medical.

Architecture

We use the AWS CloudFormation template medkendratemplate.yaml to deploy an Amazon Kendra index with the custom attributes of type STRING_LIST corresponding to the entity categories ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION.

The following diagram illustrates our solution architecture.

Based on this architecture, the steps to build and use the solution are as follows:

On CloudShell, a Bash script called getpages.sh downloads Wikipedia pages of the medicines and store them as text files.
A Python script called meds.py, which contains the core logic of the automation of the metadata generation, makes the detect_entities_v2 API call to Amazon Comprehend Medical to detect entities for each of the Wikipedia pages and generate metadata based on the entities returned. The steps used in this script are as follows:
1. Split the Wikipedia page text into chunks smaller than the maximum text size allowed by the detect_entities_v2 API call.
2. Make the detect_entities_v2 call.
3. Filter the entities detected by the detect_entities_v2 call using a threshold confidence score (0.97 for this example).
4. Keep track of each unique entity corresponding to its category and the frequency of occurrence of that entity.
5. For each entity category, sort the entities in that category from highest to lowest frequency of occurrence and select the top 10 entities.
6. Create a metadata object based on the selected entities and output it in JSON format.
We use the AWS CLI to copy the text data and the metadata to the S3 bucket that is configured as a data source to the Amazon Kendra index using the S3 connector.
We perform a data source sync using the Amazon Kendra console to ingest the contents of the documents along with the metadata in the Amazon Kendra index.
Finally, we use the Amazon Kendra search console to make queries to the index.

Create an Amazon S3 bucket to be used as a data source

Create an Amazon S3 bucket that you will use as a data source for the Amazon Kendra index.

Deploy the infrastructure as a CloudFormation stack

To deploy the infrastructure and resources for this solution, complete the following steps:

In a separate browser tab, open the AWS Management Console, and make sure that you’re logged in to your AWS account. Click the following button to launch the CloudFormation stack to deploy the infrastructure.

After that you should see a page similar to the following image:

For S3DataSourceBucket, enter your data source bucket name without the s3:// prefix, select I acknowledge that AWS CloudFormation might create IAM resources with custom names, and then choose Create stack.

Stack creation can take 30–45 minutes to complete. You can monitor the stack creation status on the Stack info tab. You can also look at the different tabs, such as Events, Resources, and Template. While the stack is being created, you can work on getting the data and generating the metadata in the next few steps.

Get the data and generate the metadata

To fetch your data and start generating metadata, complete the following steps:

On the AWS Management Console, click icon shown by a red circle in the following picture to start AWS CloudShell.

Copy the filecode-data.tgz and extract the contents by using the following commands on AWS CloudShell:

aws s3 cp s3://aws-ml-blog/artifacts/build-an-intelligent-search-solution-with-automated-content-enrichment/code-data.tgz .
tar zxvf code-data.tgz

Change the working directory to code-data:

cd code-data

At this point, you can choose to run the end-to-end workflow of getting the data, creating the metadata using Amazon Comprehend Medical (which takes about 35–40 minutes), and then ingesting the data along with the metadata in the Amazon Kendra index, or just complete the last step to ingest the data with the metadata that has been generated using Amazon Comprehend Medical and supplied in the package for convenience.

To use the metadata supplied in the package, enter the following code and then jump to Step 6:

tar zxvf med-data-meta.tgz

Perform this step to get a hands-on experience of building the end-to-end solution. The following command runs a bash script called main.sh, which calls the following scripts:
1. prereq.sh to install prerequisites and create subdirectories to store data and metadata
2. getpages.sh to get the Wikipedia pages of medicines in the list
3. getmetapar.sh to call the meds.py Python script for each document

./main.sh

The Python script meds.py contains the logic to make the get_entities_v2 call to Amazon Comprehend Medical and then process the output to produce the JSON metadata file. It takes about 30–40 minutes for this to complete.

While performing Step 5, if CloudShell times out, security tokens get refreshed, or the script stops before all the data is processed, start the CloudShell session again and run getmetapar.sh, which starts the data processing from the point it was stopped:

./getmetapar.sh

Upload the data and metadata to the S3 bucket being used as the data source for the Amazon Kendra index using the following AWS CLI commands:

aws cp Data/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Data/ —recursive
aws cp Meta/ s3://<REPLACE-WITH-NAME-OF-YOUR-S3-BUCKET>/Meta/ —recursive

Review Amazon Kendra configuration and start the data source sync

Before starting this step, make sure that the CloudFormation stack creation is complete. In the following steps, we start the data source sync to begin crawling and indexing documents.

On the Amazon Kendra console, choose the index AuthKendraIndex, which was created as part of the CloudFormation stack.

In the navigation pane, choose Data sources.
On the Settings tab, you can see the data source bucket being configured.
Choose the data source and choose Sync now.

The data source sync can take 10–15 minutes to complete.

Observe Amazon Kendra index facet definition

In the navigation pane, choose Facet definition. The following screenshot shows the entries for ANATOMY, MEDICAL_CONDITION, MEDICATION, PROTECTED_HEALTH_INFORMATION, TEST_TREATMENT_PROCEDURE, and TIME_EXPRESSION. These are the categories of the entities detected by Amazon Comprehend Medical. These are defined as custom attributes in the CloudFormation template that we used to create the Amazon Kendra index. The facetable check boxes for PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION aren’t selected, therefore these aren’t shown in the facets of the search user interface.

Query the repository of WHO Model List of Essential Medicines

We’re now ready to make queries to our search solution.

On the Amazon Kendra console, navigate to your index and choose Search console.
In the search field, enter What is the treatment for diabetes?

The following screenshot shows the results.

Choose Filter search results to see the facets.

The headings of MEDICATION, ANATOMY, MEDICAL_CONDITION, and TEST_TREATMENT_PROCEDURE are the categories defined as Amazon Kendra facets, and the list of items underneath them are the entities of these categories as detected by Amazon Comprehend Medical in the documents being searched. PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION are not shown.

Under MEDICAL_CONDITION, select pregnancy to refine the search results.

You can go back to the Facet definition page and make PROTECTED_HEALTH_INFORMATION and TIME_EXPRESSION facetable and save the configuration. Go back to the search console, make a new query, and observe the facets again. Experiment with these facets to see what suits your needs best.

Make additional queries and use the facets to refine the search results. You can use the following queries to get started, but you can also experiment with your own:

What is a common painkiller?
Is parcetamol safe for children?
How to manage high blood pressure?
When should BCG vaccine be administered?

You can observe how domain-specific facets improve the search experience.

Infrastructure cleanup

To delete the infrastructure that was deployed as part of the CloudFormation stack, delete the stack from the AWS CloudFormation console. Stack deletion can take 20–30 minutes.

When the stack status shows as Delete Complete, go to the Events tab and confirm that each of the resources has been removed. You can also cross-verify by checking on the Amazon Kendra console that the index is deleted.

You must delete your data source bucket separately because it wasn’t created as part of the CloudFormation stack.

Conclusion

In this post, we demonstrated how to automate the process to enrich the content by generating domain-specific metadata for an Amazon Kendra index using Amazon Comprehend or Amazon Comprehend Medical, thereby improving the user experience for the search solution.

This example used the entities detected by Amazon Comprehend Medical to generate the Amazon Kendra metadata. Depending on the domain of the content repository, you can use a similar approach with the pretrained model or custom trained models of Amazon Comprehend. Try out our solution and let us know what you think! You can further enhance the metadata by using other elements such as protected health information (PHI) for Amazon Comprehend Medical and events, key phrases, personally identifiable information (PII), dominant language, sentiment, and syntax for Amazon Comprehend.

About the Authors

Abhinav Jawadekar is a Senior Partner Solutions Architect at Amazon Web Services. Abhinav works with AWS partners to help them in their cloud journey.

Udi Hershkovich has been a Principal WW AI/ML Service Specialist at AWS since 2018. Prior to AWS, Udi held multiple leadership positions with AI startups and Enterprise initiatives including co-founder and CEO at LeanFM Technologies, offering ML-powered predictive maintenance in facilities management, CEO of Safaba Translation Solutions, a machine translation startup acquired by Amazon in 2015, and Head of Professional Services for Contact Center Intelligence at Amdocs. Udi holds Law and Business degrees from the Interdisciplinary Center in Herzliya, Israel, and lives in Pittsburgh, Pennsylvania, USA.

Using hyperboloids to embed knowledge graphs

May 7, 2021

by admin Amazon AWS

Novel embedding scheme enables a 7% to 33% improvement over its best-performing predecessors in handling graph queries.Read More

Create a serverless pipeline to translate large documents with Amazon Translate

May 6, 2021

by Jay Rao Amazon AWS

In our previous post, we described how to translate documents using the real-time translation API from Amazon Translate and AWS Lambda. However, this method may not work for files that are too large. They may take too much time, triggering the 15-minute timeout limit of Lambda functions. One can use batch API, but this is available only in seven AWS Regions (as of this blog’s publication). To enable translation of large files in regions where Batch Translation is not supported, we created the following solution.

In this post, we walk you through performing translation of large documents.

Architecture overview

Compared to the architecture featured in the post Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API, our architecture has one key difference: the presence of AWS Step Functions, a serverless function orchestrator that makes it easy to sequence Lambda functions and multiple services into business-critical applications. Step Functions allows us to keep track of running the translation, managing retrials in case of errors or timeouts, and orchestrating event-driven workflows.

The following diagram illustrates our solution architecture.

This event-driven architecture shows the flow of actions when a new document lands in the input Amazon Simple Storage Service (Amazon S3) bucket. This event triggers the first Lambda function, which acts as the starting point of the Step Functions workflow.

The following diagram illustrates the state machine and the flow of actions.

The Process Document Lambda function is triggered when the state machine starts; this function performs all the activities required to translate the documents. It accesses the file from the S3 bucket, downloads it locally in the environment in which the function is run, reads the file contents, extracts short segments from the document that can be passed through the real-time translation API, and uses the API’s output to create the translated document.

Other mechanisms are implemented within the code to avoid failures, such as handling an Amazon Translate throttling error and Lambda function timeout by taking action and storing the progress that was made in a /temp folder 30 seconds before the function times out. These mechanisms are critical for handling large text documents.

When the function has successfully finished processing, it uploads the translated text document in the output S3 bucket inside a folder for the target language code, such as en for English. The Step Functions workflow ends when the Lambda function moves the input file from the /drop folder to the /processed folder within the input S3 bucket.

We now have all the pieces in place to try this in action.

Deploy the solution using AWS CloudFormation

You can deploy this solution in your AWS account by launching the provided AWS CloudFormation stack. The CloudFormation template provisions the necessary resources needed for the solution. The template creates the stack the us-east-1 Region, but you can use the template to create your stack in any Region where Amazon Translate is available. As of this writing, Amazon Translate is available in 16 commercial Regions and AWS GovCloud (US-West). For the latest list of Regions, see the AWS Regional Services List.

To deploy the application, complete the following steps:

Launch the CloudFormation template by choosing Launch Stack:

Choose Next.

Alternatively, on the AWS CloudFormation console, choose Create stack with new resources (standard), choose Amazon S3 URL as the template source, enter https://s3.amazonaws.com/aws-ml-blog/artifacts/create-a-serverless-pipeline-to-translate-large-docs-amazon-translate/translate.yml, and choose Next.

For Stack name, enter a unique stack name for this account; for example, serverless-document-translation.
For InputBucketName, enter a unique name for the S3 bucket the stack creates; for example, serverless-translation-input-bucket.

The documents are uploaded to this bucket before they are translated. Use only lower-case characters and no spaces when you provide the name of the input S3 bucket. This operation creates a new bucket, so don’t use the name of an existing bucket. For more information, see Bucket naming rules.

For OutputBucketName, enter a unique name for your output S3 bucket; for example, serverless-translation-output-bucket.

This bucket stores the documents after they are translated. Follow the same naming rules as your input bucket.

For SourceLanguageCode, enter the language code that your input documents are in; for this post we enter auto to detect the dominant language.
For TargetLanguageCode, enter the language code that you want your translated documents in; for example, en for English.

For more information about supported language codes, see Supported Languages and Language Codes.

Choose Next.

On the Configure stack options page, set any additional parameters for the stack, including tags.
Choose Next.
Select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
Choose Create stack.

Stack creation takes about a minute to complete.

Translate your documents

You can now upload a text document that you want to translate into the input S3 bucket, under the drop/ folder.

The following screenshot shows our sample document, which contains a sentence in Greek.

This action starts the workflow, and the translated document automatically shows up in the output S3 bucket, in the folder for the target language (for this example, en). The length of time for the file to appear depends on the size of the input document.

Our translated file looks like the following screenshot.

You can also track the state machine’s progress on the Step Functions console, or with the relevant API calls.

Let’s try the solution with a larger file. The test_large.txt file contains content from multiple AWS blog posts and other content written in German (for example, we use all the text from the post AWS DeepLens (Version 2019) kommt nach Deutschland und in weitere Länder).

This file is much bigger than the file in previous test. We upload the file in the drop/ folder of the input bucket.

On the Step Functions console, you can confirm that the pipeline is running by checking the status of the state machine.

On the Graph inspector page, you can get more insights on the status of the state machine at any given point. When you choose a step, the Step output tab shows the completion percentage.

When the state machine is complete, you can retrieve the translated file from the output bucket.

The following screenshot shows that our file is translated in English.

Troubleshooting

If you don’t see the translated document in the output S3 bucket, check Amazon CloudWatch Logs for the corresponding Lambda function and look for potential errors. For cost-optimization, by default, the solution uses 256 MB of memory for the Process Document Lambda function. While processing a large document, if you see Runtime.ExitError for the function in the CloudWatch Logs, increase the function memory.

Other considerations

It’s worth highlighting the power of the automatic language detection feature of Amazon Translate, captured as auto in the SourceLanguageCode field that we specified when deploying the CloudFormation stack. In the previous examples, we submitted a file containing text in Greek and another file in German, and they were both successfully translated into English. With our solution, you don’t have to redeploy the stack (or manually change the source language code in the Lambda function) every time you upload a source file with a different language. Amazon Translate detects the source language and starts the translation process. Post deployment, if you need to change the target language code, you can either deploy a new CloudFormation stack or update the existing stack.

This solution uses the Amazon Translate synchronous real-time API. It handles the maximum document size limit (5,000 bytes) by splitting the document into paragraphs (ending with a newline character). If needed, it further splits each paragraph into sentences (ending with a period). You can modify these delimiters based on your source text. This solution can support a maximum of 5,000 bytes for a single sentence and it only handles UTF-8 formatted text documents with .txt or .text file extensions. You can modify the Python code in the Process Document Lambda function to handle different file formats.

In addition to Amazon S3 costs, the solution incurs usage costs from Amazon Translate, Lambda, and Step Functions. For more information, see Amazon Translate pricing, Amazon S3 pricing, AWS Lambda pricing, and AWS Step Functions pricing.

Conclusion

In this post, we showed the implementation of a serverless pipeline that can translate documents in real time using the real-time translation feature of Amazon Translate and the power of Step Functions as orchestrators of individual Lambda functions. This solution allows for more control and for adding sophisticated functionality to your applications. Come build your advanced document translation pipeline with Amazon Translate!

For more information, see the Amazon Translate Developer Guide and Amazon Translate resources. If you’re new to Amazon Translate, try it out using our Free Tier, which offers 2 million characters per month for free for the first 12 months, starting from your first translation request.

About the Authors

Jay Rao is a Senior Solutions Architect at AWS. He enjoys providing technical guidance to customers and helping them design and implement solutions on AWS.

Seb Kasprzak is a Solutions Architect at AWS. He spends his days at Amazon helping customers solve their complex business problems through use of Amazon technologies.

Nikiforos Botis is a Solutions Architect at AWS. He enjoys helping his customers succeed in their cloud journey, and is particularly interested in AI/ML technologies.

Bobbie Couhbor is a Senior Solutions Architect for Digital Innovation at AWS, helping customers solve challenging problems with emerging technology, such as machine learning, robotics, and IoT.

How Genworth built a serverless ML pipeline on AWS using Amazon SageMaker and AWS Glue

May 6, 2021

by Liam Pearson Amazon AWS

This post is co-written with Liam Pearson, a Data Scientist at Genworth Mortgage Insurance Australia Limited.

Genworth Mortgage Insurance Australia Limited is a leading provider of lenders mortgage insurance (LMI) in Australia; their shares are traded on Australian Stock Exchange as ASX: GMA.

Genworth Mortgage Insurance Australia Limited is a lenders mortgage insurer with over 50 years of experience and volumes of data collected, including data on dependencies between mortgage repayment patterns and insurance claims. Genworth wanted to use this historical information to train Predictive Analytics for Loss Mitigation (PALM) machine learning (ML) models. With the ML models, Genworth could analyze recent repayment patterns for each of the insurance policies to prioritize them in descending order of likelihood (chance of a claim) and impact (amount insured). Genworth wanted to run batch inference on ML models in parallel and on schedule while keeping the amount of effort to build and operate the solution to the minimum. Therefore, Genworth and AWS chose Amazon SageMaker batch transform jobs and serverless building blocks to ingest and transform data, perform ML inference, and process and publish the results of the analysis.

Genworth’s Advanced Analytics team engaged in an AWS Data Lab program led by Data Lab engineers and solutions architects. In a pre-lab phase, they created a solution architecture to fit specific requirements Genworth had, especially around security controls, given the nature of the financial services industry. After the architecture was approved and all AWS building blocks identified, training needs were determined. AWS Solutions Architects conducted a series of hands-on workshops to provide the builders at Genworth with the skills required to build the new solution. In a 4-day intensive collaboration, called a build phase, the Genworth Advanced Analytics team used the architecture and learnings to build an ML pipeline that fits their functional requirements. The pipeline is fully automated and is serverless, meaning that there is no maintenance, scaling issues, or downtime. Post-lab activities were focused on productizing the pipeline and adopting it as a blueprint for other ML use cases.

In this post, we (the joint team of Genworth and AWS Architects) explain how we approached the design and implementation of the solution, the best practices we followed, the AWS services we used, and the key components of the solution architecture.

Solution overview

We followed the modern ML pipeline pattern to implement a PALM solution for Genworth. The pattern allows ingestion of data from various sources, followed by transformation, enrichment, and cleaning of the data, then ML prediction steps, finishing up with the results made available for consumption with or without data wrangling of the output.

In short, the solution implemented has three components:

Data ingestion and preparation
ML batch inference using three custom developed ML models
Data post processing and publishing for consumption

The following is the architecture diagram of the implemented solution.

Let’s discuss the three components in more detail.

Component 1: Data ingestion and preparation

Genworth source data is published weekly into a staging table in their Oracle on-premises database. The ML pipeline starts with an AWS Glue job (Step 1, Data Ingestion, in the diagram) connecting to the Oracle database over an AWS Direct Connect connection secured with VPN to ingest raw data and store it in an encrypted Amazon Simple Storage Service (Amazon S3) bucket. Then a Python shell job runs using AWS Glue (Step 2, Data Preparation) to select, clean, and transform the features used later in the ML inference steps. The results are stored in another encrypted S3 bucket used for curated datasets that are ready for ML consumption.

Component 2: ML batch inference

Genworth’s Advanced Analytics team has already been using ML on premises. They wanted to reuse pretrained model artifacts to implement a fully automated ML inference pipeline on AWS. Furthermore, the team wanted to establish an architectural pattern for future ML experiments and implementations, allowing them to iterate and test ideas quickly in a controlled environment.

The three existing ML artifacts forming the PALM model were implemented as a hierarchical TensorFlow neural network model using Keras. The models seek to predict the probability of an insurance policy submitting a claim, the estimated probability of a claim being paid, and the magnitude of that possible claim.

Because each ML model is trained on different data, the input data needs to be standardized accordingly. Individual AWS Glue Python shell jobs perform this data standardization specific to each model. Three ML models are invoked in parallel using SageMaker batch transform jobs (Step 3, ML Batch Prediction) to perform the ML inference and store the prediction results in the model outputs S3 bucket. SageMaker batch transform manages the compute resources, installs the ML model, handles data transfer between Amazon S3 and the ML model, and easily scales out to perform inference on the entire dataset.

Component 3: Data postprocessing and publishing

Before the prediction results from the three ML models are ready for use, they require a series of postprocessing steps, which were performed using AWS Glue Python shell jobs. The results are aggregated and scored (Step 4, PALM Scoring), business rules applied (Step 5, Business Rules), the files generated (Step 6, User Files Generation), and data in the files validated (Step 7, Validation) before publishing the output of these steps back to a table in the on-premises Oracle database (Step 8, Delivering the Results). The solution uses Amazon Simple Notification Service (Amazon SNS) and Amazon CloudWatch Events to notify users via email when the new data becomes available or any issues occur (Step 10, Alerts & Notifications).

All of the steps in the ML pipeline are decoupled and orchestrated using AWS Step Functions, giving Genworth the ease of implementation, the ability to focus on the business logic instead of the scaffolding, and the flexibility they need for future experiments and other ML use cases. The following diagram shows the ML pipeline orchestration using a Step Functions state machine.

Business benefit and what’s next

By building a modern ML platform, Genworth was able to automate an end-to-end ML inference process, which ingests data from an Oracle database on premises, performs ML operations, and helps the business make data-driven decisions. Machine learning helps Genworth simplify high-value manual work performed by the Loss Mitigation team.

This Data Lab engagement has demonstrated the importance of making modern ML and analytics tools available to teams within an organization. It has been a remarkable experience witnessing how quickly an idea can be piloted and, if successful, productionized.

In this post, we showed you how easy it is to build a serverless ML pipeline at scale with AWS Data Analytics and ML services. As we discussed, you can use AWS Glue for a serverless, managed ETL processing job and SageMaker for all your ML needs. All the best on your build!

Genworth, Genworth Financial, and the Genworth logo are registered service marks of Genworth Financial, Inc. and used pursuant to license.

About the Authors

Liam Pearson is a Data Scientist at Genworth Mortgage Insurance Australia Limited who builds and deploys ML models for various teams within the business. In his spare time, Liam enjoys seeing live music, swimming and—like a true millennial—enjoying some smashed avocado.

Maria Sokolova is a Solutions Architect at Amazon Web Services. She helps enterprise customers modernize legacy systems and accelerates critical projects by providing technical expertise and transformations guidance where they’re needed most.

Vamshi Krishna Enabothala is a Data Lab Solutions Architect at AWS. Vamshi works with customers on their use cases, architects a solution to solve their business problems, and helps them build a scalable prototype. Outside of work, Vamshi is an RC enthusiast, building and playing with RC equipment (cars, boats, and drones), and also enjoys gardening.

Perform batch fraud predictions with Amazon Fraud Detector without writing code or integrating an API

May 6, 2021

by Bilal Ali Amazon AWS

Amazon Fraud Detector is a fully managed service that makes it easy to identify potentially fraudulent online activities, such as the creation of fake accounts or online payment fraud. Unlike general-purpose machine learning (ML) packages, Amazon Fraud Detector is designed specifically to detect fraud. Amazon Fraud Detector combines your data, the latest in ML science, and more than 20 years of fraud detection experience from Amazon.com and AWS to build ML models tailor-made to detect fraud in your business.

After you train a fraud detection model that is customized to your business, you create rules to interpret the model’s outputs and create a detector to contain both the model and rules. You can then evaluate online activities for fraud in real time by calling your detector through the GetEventPrediction API and passing details about a single event in each request. But what if you don’t have the engineering support to integrate the API, or you want to quickly evaluate many events at once? Previously, you needed to create a custom solution using AWS Lambda and Amazon Simple Storage Service (Amazon S3). This required you to write and maintain code, and it could only evaluate a maximum of 4,000 events at once. Now, you can generate batch predictions in Amazon Fraud Detector to quickly and easily evaluate a large number of events for fraud.

Solution overview

To use the batch predictions feature, you must complete the following high-level steps:

Create and publish a detector that contains your fraud prediction model and rules, or simply a ruleset.
Create an input S3 bucket to upload your file to and, optionally, an output bucket to store your results.
Create a CSV file that contains all the events you want to evaluate.
Perform a batch prediction job through the Amazon Fraud Detector console.
Review your results in the CSV file that is generated and stored to Amazon S3.

Create and publish a detector

You can create and publish a detector version using the Amazon Fraud Detector console or via the APIs. For console instructions, see Get started (console).

Create the input and output S3 buckets

Create an S3 bucket on the Amazon S3 console where you upload your CSV files. This is your input bucket. Optionally, you can create a second output bucket where Amazon Fraud Detector stores the results of your batch predictions as CSV files. If you don’t specify an output bucket, Amazon Fraud Detector stores both your input and output files in the same bucket.

Make sure you create your buckets in the same Region as your detector. For more information, see Creating a bucket.

Create a sample CSV file of event records

Prepare a CSV file that contains the events you want to evaluate. In this file, include a column for each variable in the event type associated to your detector. In addition, include columns for:

EVENT_ID – An identifier for the event, such as a transaction number. The field values must satisfy the following regular expression pattern: ^[0-9a-z_-]+$.
ENTITY_ID – An identifier for the entity performing the event, such as an account number. The field values must also satisfy the following regular expression pattern: ^[0-9a-z_-]+$.
EVENT_TIMESTAMP – A timestamp, in ISO 8601 format, for when the event occurred.
ENTITY_TYPE – The entity that performs the event, such as a customer or a merchant.

Column header names must match their corresponding Amazon Fraud Detector variable names exactly. The preceding four required column header names must be uppercase, and the column header names for the variables associated to your event type must be lowercase. You receive an error for any events in your file that have missing values.

In your CSV file, each row corresponds to one event for which you want to generate a prediction. The CSV file can be up to 50 MB, which allows for about 50,000-100,000 events depending on your event size. The following screenshot shows an example of an input CSV file.

For more information about Amazon Fraud Detector variable data types and formatting, see Create a variable.

Perform a batch prediction

Upload your CSV file to your input bucket. Now it’s time to start a batch prediction job.

On the Amazon Fraud Detector console, choose Batch predictions in the navigation pane.

This page contains a summary of past batch prediction jobs.

Choose New batch prediction.

For Job name¸ you can enter a name for your job or let Amazon Fraud Detector assign a random name.
For Detector and Detector version, choose the detector and version you want to use for your batch prediction.
For IAM role, if you already have an AWS Identity and Access Management (IAM) role, you can choose it from the drop-down menu. Alternatively, you can create one by choosing Create IAM role.

When creating a new IAM role, you can specify different buckets for the input and output files or enter the same bucket name for both.

If you use an existing IAM role such as the one that you use for accessing datasets for model training, you need to ensure the role has the s3:PutObject permission attached before starting a batch predictions job.

After you choose your IAM role, for Data Location, enter the S3 URI for your input file.
Choose Start.

You’re returned to the Batch predictions page, where you can see the job you just created. Batch prediction job processing times vary based on how many events you’re evaluating. For example, a 20 MB file (about 20,000 events) takes about 12 minutes. You can view the status of the job at any time on the Amazon Fraud Detector console. Choosing the job name opens a job detail page with additional information like the input and output data locations.

Review your batch prediction results

After the job is complete, you can download your output file from the S3 bucket you designated. To find the file quickly, choose the link under Output data location on the job detail page.

The output file has all the columns you provided in your input file, plus three additional columns:

STATUS – Shows Success if the event was successfully evaluated or an error code if the event couldn’t be evaluated
OUTCOMES – Denotes which outcomes were returned by your ruleset
MODEL_SCORES – Denotes the risk scores that were returned by any models called by your ruleset

The following screenshot shows an example of an output CSV file.

Conclusion

Congrats! You have successfully performed a batch of fraud predictions. You can use the batch predictions feature to test changes to your fraud detection logic, such as a new model version or updated rules. You can also use batch predictions to perform asynchronous fraud evaluations, like a daily check of all accounts created in the past 24 hours.

Depending on your use case, you may want to use your prediction results in other AWS services. For example, you can analyze the prediction results in Amazon QuickSight or send results that are high risk to Amazon Augmented AI (Amazon A2I) for a human review of the prediction. You may also want to use Amazon CloudWatch to schedule recurring batch predictions.

Amazon Fraud Detector has a 2-month free trial that includes 30,000 predictions per month. After that, pricing starts at $0.005 per prediction for rules-only predictions and $0.03 for ML-based predictions. For more information, see Amazon Fraud Detector pricing. For more information about Amazon Fraud Detector, including links to additional blog posts, sample notebooks, user guide, and API documentation, see Amazon Fraud Detector.

If you have any questions or comments, let us know in the comments!

About the Author

Bilal Ali is a Sr. Product Manager working on Amazon Fraud Detector. He listens to customers’ problems and finds ways to help them better fight fraud and abuse. He spends his free time watching old Jeopardy episodes and searching for the best tacos in Austin, TX.

Michael Bronstein aims to unite the deep learning community

May 6, 2021

by admin Amazon AWS

The ARA recipient is pioneering geometric deep learning, an approach that not only promises breakthroughs, but also a way to unify the machine learning “zoo”.Read More

Improving explainable AI’s explanations

May 6, 2021

by admin Amazon AWS

Causal analysis improves both the classification accuracy and the relevance of the concepts identified by popular concept-based explanatory models.Read More

Automatically scale Amazon Kendra query capacity units with Amazon EventBridge and AWS Lambda

May 5, 2021

by Juan Bustos Amazon AWS

Data is proliferating inside the enterprise and employees are using more applications than ever before to get their jobs done, in fact according to Okta Inc., the number of software apps deployed by large firms across all industries world-wide has increased 68%, reaching an average of 129 apps per company.

As employees continue to self-serve and the number of applications they use grows, so will the likelihood that critical business information will remain hard to find or get lost between systems, negatively impacting workforce productivity and operating costs.

Amazon Kendra is an intelligent search service powered by machine learning (ML). Unlike conventional search technologies, Amazon Kendra reimagines search by unifying unstructured data across multiple data sources as part of a single searchable index. It’s deep learning and natural language processing capabilities then make it easy for you to get relevant answers when you need them.

Amazon Kendra Enterprise Edition includes storage capacity for 500,000 documents (150 GB of storage) and a query capacity of 40,000 queries per day (0.5 queries per second), and allows you to adjust index capacity by increasing or decreasing your query and storage capacity units as needed.

However, usage patterns and business needs are not always predictable. In this post we’ll demonstrate how you can automatically scale your Amazon Kendra index based on a time schedule using Amazon EventBridge and AWS Lambda. By doing this you can increase capacity for peak usage, avoid service throttling, maintain flexibility, and control costs.

Solution overview

Amazon Kendra provides a dashboard that allows you to evaluate the average number of queries per second for your index. With this information, you can estimate the number of additional capacity units your workload requires at a specific point in time.

For example, the following graph shows that during business hours, a surge occurs in the average queries per second, but after hours, the number of queries reduces. We base our solution on this pattern to set up an EventBridge scheduled event that triggers the automatic scaling Lambda function.

The following diagram illustrates our architecture.

You can deploy the solution into your account two different ways:

Deploy an AWS Serverless Application Model (AWS SAM) template:
- Clone the project from the aws-samples repository on GitHub and follow the instructions.
Create the resources by using the AWS Management Console. In this post, we walk you through the following steps:
- Set up the Lambda function for scaling
- Configure permissions for the function
- Test the function
- Set up an EventBridge scheduled event

Set up the Lambda function

To create the Lambda function that we use for scaling, we create a function using the Python runtime (for this post, we use the Python 3.8 runtime).

Use the following code as the content of your lambda_function.py code:

#
# Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
# 
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files (the "Software"), to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so.
# 
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A
# PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT
# HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
# OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
# SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
#

'''
 Changes the number of Amazon Kendra Enterprise Edition index capacity units

 Parameters
 ----------
 event : dict
 Lambda event

 Returns
 -------
 The additional capacity action or an error
'''

import json
import boto3
from botocore.exceptions import ClientError

# Variable declaration
KENDRA = boto3.client("kendra")
# Define your Amazon Kendra Enterprise Edition index ID
INDEX_ID = "<YOUR-INDEX-ID>"
# Define your baseline units
DEFAULT_UNITS = 0
# Define your the number of Query Capacity Units needed for increased capacity
ADDITIONAL_UNITS= 1


def add_capacity(INDEX_ID,capacity_units):
    try:
        response = KENDRA.update_index(
            Id=INDEX_ID,
            CapacityUnits={
                'QueryCapacityUnits': int(capacity_units),
                'StorageCapacityUnits': 0
                
            })
        return(response)
    except Exception as e:
        raise e

    
def reset_capacity(INDEX_ID,DEFAULT_UNITS):
    try:
        response = KENDRA.update_index(
            Id=INDEX_ID,
            CapacityUnits={
            'QueryCapacityUnits': DEFAULT_UNITS,
            'StorageCapacityUnits': 0
        })
    except Exception as e:
        raise e

  
def current_capacity(INDEX_ID):
    try:
        response = KENDRA.describe_index(
        Id=INDEX_ID)
        return(response)
    except Exception as e:
        raise e  


def lambda_handler(event,context):
    print("Checking for query capacity units......")
    response = current_capacity(INDEX_ID)
    currentunits = response['CapacityUnits']['QueryCapacityUnits']
    print ("Current query capacity units are: "+str(currentunits))
    status = response['Status']
    print ("Current index status is: "+status)
    # If index is stuck in UPDATE state, don't attempt changing the capacity
    if status == "UPDATING":
        return ("Index is currently being updated. No changes have been applied")
    if status == "ACTIVE":
        if currentunits == 0:
            print ("Adding query capacity...")
            response = add_capacity(INDEX_ID,ADDITIONAL_UNITS)    
            print(response)
            return response
        else:
            print ("Removing query capacity....")
            response = reset_capacity(INDEX_ID, DEFAULT_UNITS)
            print(response)
            return response
    else:
         response = "Index is not ready to modify capacity. No changes have been applied."
         return(response)

You must modify the following variables to match with your environment:

# Define your Amazon Kendra Enterprise Edition index ID
INDEX_ID = "<YOUR-INDEX-ID>"
# Define your baseline units
DEFAULT_UNITS = 1
# Define your the number of Query Capacity Units needed for increased capacity
ADDITIONAL_UNITS = 4

INDEX_ID – The ID for your index; you can check it on the Amazon Kendra console.
DEFAULT_UNITS – The number of query processing units that your Amazon Kendra Enterprise Edition requires to operate at minimum capacity. This number can range from 0–20 (you can request more capacity). 0 represents that no extra capacity units are provisioned to your Amazon Kendra Enterprise Edition index, which leaves it with a default capacity of 0.5 queries per second.
ADDITIONAL_UNITS – The number of query capacity units you require at those times where additional capacity is required. This value can range from 1–20 (you can request additional capacity).

Configure function permissions

To query the status of your index and to modify the number of query capacity units, you need to attach a policy to your Lambda function AWS Identity and Access Management (IAM) execution role with those permissions.

On the Lambda console, navigate to your function.
On the Permissions tab, choose the execution role.

The IAM console opens automatically.

On the Permissions tab, choose Attach policies.

Choose Create policy.

A new tab opens.

On the JSON tab, add the following content (make sure to provide your account and user information):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "MyPolicy",
            "Effect": "Allow",
            "Action": [
                "kendra:UpdateIndex",
                "kendra:DescribeIndex"
            ],
            "Resource": "arn:aws:kendra:<YOUR-AWS-REGION>:<YOUR-ACCOUNT-ID>:index/<YOUR-INDEX-ID>"
        }
    ]
}

Choose Next: Tags.
Choose Next: Review.
For Name, enter a policy name (for this post, we use AmazonKendra_UpdateIndex).
Choose Create policy.
On the Attach permissions page, choose the refresh icon.
Filter to find the policy you created.
Select the policy and choose Attach policy.

Test the function

You can test your Lambda function by running a test event. For more information, see Invoke the Lambda function.

On the Lambda console, navigate to your function.
Create a new test event by choosing Test.

Select Create new test event.
For Event template, because your function doesn’t require any input from the event, you can choose the hello-world event template.

Choose Create.
Choose Test.

On the Lambda function logs, you can see the following messages:

Function Logs
START RequestId: 9b2382b7-0229-4b2b-883e-ba0f6b149513 Version: $LATEST
Checking for capacity units......
Current capacity units are: 1
Current index status is: ACTIVE
Adding capacity...

Set up an EventBridge scheduled event

An EventBridge scheduled event is an EventBridge event that is triggered on a regular schedule. This section shows how to create an EventBridge scheduled event that runs every day at 7 AM UTC and at 8 PM UTC to trigger the kendra-index-scaler Lambda function. This allows your index to scale up with the additional query capacity units at 7 AM and scale down at 8 PM.

When you set up EventBridge scheduled events, you do so for the UTC time zone, so you need to calculate the time offset. For example, to run the event at 7 AM Central Standard Time (CST), you need to set the time to 1 PM UTC. If you want to accommodate for daylight savings, you have to create a different rule to account for the difference.

On the EventBridge console, in the navigation pane, under Events, choose Rules.
Choose Create rule.

For Name, enter a name for your rule (for this post, we use kendra-index-scaler).

In the Define pattern section, select Schedule.
Select Cron expression and enter 0 7,20 * * ? *.

We use this cron expression to trigger the EventBridge event every day at 7 AM and 8 PM.

In the Select event bus section, select AWS default event bus.

In the Select targets section, for Target, choose Lambda function.
For Function, enter the function you created earlier (lambda_function_kendra_index_handler).

Choose Create.

You can check Amazon CloudWatch Logs for the lambda_function_kendra_index_handler function and see how it behaves depending on your index’s query capacity units.

Conclusion

In this post, you deployed a mechanism to automatically scale additional query processing units for your Amazon Kendra Enterprise Edition index.

As a next step, you could periodically review your usage patterns in order to plan the schedule to accommodate your query volume. To learn more about Amazon Kendra’s use cases, benefits, and how to get started with it, visit the webpage!

About the Authors

Juan Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.

Tapodipta Ghosh is a Senior Architect. He leads the Content And Knowledge Engineering Machine Learning team that focuses on building models related to AWS Technical Content. He also helps our customers with AI/ML strategy and implementation using our AI Language services like Amazon Kendra.

Tom McMahon is a Product Marketing Manager on the AI Services team at AWS. He’s passionate about technology and storytelling and has spent time across a wide-range of industries including healthcare, retail, logistics, and ecommerce. In his spare time he enjoys spending time with family, music, playing golf, and exploring the amazing Pacific northwest and its surrounds.

Automate multi-modality, parallel data labeling workflows with Amazon SageMaker Ground Truth and AWS Step Functions

May 5, 2021

by Vidya Sagar Ravipati Amazon AWS

This is the first in a two-part series on the Amazon SageMaker Ground Truth hierarchical labeling workflow and dashboards. In Part 1, we look at creating multi-step labeling workflows for hierarchical label taxonomies using AWS Step Functions. In Part 2 (coming soon), we look at how to build dashboards for analyzing dataset annotations and worker performance metrics on data lakes generated as output from the complex workflows and derive insights.

Data labeling often requires a single data object to include multiple types of annotations, or multi-type, such as 2D boxes (bounding boxes), lines, and segmentation masks, all on a single image. Additionally, to create high-quality machine learning (ML) models using labeled data, you need a way to monitor the quality of the labels. You can do this by creating a workflow in which labeled data is audited and adjusted as needed. This post introduces a solution to address both of these labeling challenges using an automotive dataset, and you can extend this solution for use with any type of dataset.

For our use case, assume you have a large quantity of automotive video data filmed from one or more angles on a moving vehicle (for example, some Multi-Object Tracking (MOT) scenes) and you want to annotate the data using multiple types of annotations. You plan to use this data to train a cruise control, lane-keeping ML algorithm. Given the task at hand, it’s imperative that you use high-quality labels to train the model.

First, you must identify the types of annotations you want to add to your video frames. Some of the most important objects to label for this use case are other vehicles in the frame, road boundaries, and lanes. To do this, you define a hierarchical label taxonomy, which defines the type of labels you want to add to each video, and the order in which you want the labels to be added. The Ground Truth video tracking labeling job supports bounding box, polyline, polygon, and keypoint annotations. In this use case, vehicles are annotated using 2-dimensional boxes, or bounding boxes, and road boundaries and curves are annotated with a series of flexible lines segments, referred to as polylines.

Second, you need to establish a workflow to ensure label quality. To do this, you can create an audit workflow to verify the labels generated by your pipeline are of high enough quality to be useful for model training. In this audit workflow, you can greatly improve label accuracy by building a multi-step review pipeline that allows annotations to be audited, and if necessary, adjusted by a second reviewer who may be a subject matter expert.

Based on the size of the dataset and data objects, you should also consider the time and resources required to create and maintain this pipeline. Ideally, you want this series of labeling jobs to be started automatically, only requiring human operation to specify the input data and workflow.

The solution used in this post uses Ground Truth, AWS CloudFormation, Step Functions, and Amazon DynamoDB to create a series of labeling jobs that run in a parallel and hierarchical fashion. You use a hierarchical label taxonomy to create labeling jobs of different modalities (polylines and bounding boxes), and you add secondary human review steps to improve annotation quality and final results.

For this post, we demonstrate the solution in the context of the automotive space, but you can easily apply this general pipeline to labeling pipelines involving images, videos, text, and more. In addition, we demonstrate a workflow that is extensible, allowing you to reduce the total number of frames that need human review by adding automated quality checks and maintaining data quality at scale. In this use case, we use these checks to find anomalies in MOT time series data like video object tracking annotations.

We walk through a use case in which we generate multiple types of annotations for an automotive scene. Specifically, we run four labeling jobs per input video clip: an initial labeling of vehicles, initial labeling of lanes, and then an adjustment job per initial job with a separate quality assurance workforce.

We demonstrate the various extension points in our Step Function workflow that can allow you to run automated quality assurance checks. This allows for clip filtering between and after jobs have completed, which can result in high-quality annotations for a fraction of the cost.

AWS services used to implement this solution

This solution creates and manages Ground Truth labeling jobs to label video frames using multiple types of annotations. Ground Truth has native support for video datasets through its video frame object tracking task type.

This task type allows workers to create annotations across a series of video frames, providing tools to predict the next location of a bounding box in subsequent frames. It also supports multiple annotation types such as bounding boxes or polylines through the label category configuration files provided during job creation. We use these tools in this tutorial, running a job for vehicle bounding boxes and a job for lane polylines.

We use Step Functions to manage the labeling job. This solution abstracts labeling job creation so that you specify the overall workflow you want to run using a hierarchical label taxonomy, and all job management is handled by Step Functions.

The solution is implemented using CloudFormation templates that you can deploy in your AWS account. The interface to the solution is an API managed by Amazon API Gateway, which provides the ability to submit annotation tasks to the solution, which are then translated into Ground Truth labeling jobs.

Estimated costs

By deploying and using this solution, you incur the maximum cost of approximately $20 other than human labeling costs because it only uses fully managed compute resources on demand. Amazon Simple Storage Service (Amazon S3), AWS Lambda, Amazon SageMaker, API Gateway, Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), AWS Glue, and Step Functions are included in the AWS Free Tier, with charges for additional use. For more information, see the following pricing pages:

Ground Truth pricing depends on the type of workforce that you use. If you’re a new user of Ground Truth, we suggest that you use a private workforce and include yourself as a worker to test your labeling job configuration. For more information, see Amazon SageMaker Ground Truth pricing.

Solution overview

In this two-part series, we discuss an architecture pattern that allows you to build a pipeline for orchestrating multi-step data labeling workflows that have workers add different types of annotation in parallel using Ground Truth. You also learn how you can analyze the dataset annotations produced by the workflow as well as worker performance. The first post covers the Step Functions workflow that automates advanced ML data labeling workflows using Ground Truth for chaining and hierarchical label taxonomies. The second post describes how to build data lakes on dataset annotations from Ground Truth and worker metrics and use these data lakes to derive insights or analyze the performance of your workers and dataset annotation quality using advanced analytics.

The following diagram depicts the hierarchical workflow, which you can use to run groups of labeling jobs in sequential steps, or levels, in which each labeling job in a single level runs in parallel.

The solution is composed of two main parts:

Use an API to trigger the orchestration workflow.
Run the individual steps of the workflow to achieve the labeling pipeline.

Trigger the orchestration workflow with an API

The CloudFormation template launched in this solution uses API Gateway to expose an endpoint for you to trigger batch labeling jobs. After you send the post request to the API Gateway endpoint, it runs a Lambda function to trigger the workflow.

The following table contains the two main user-facing APIs relevant to running batch, which represents multi-level labeling jobs.

URL	Request Type	Description
{endpointUrl}/batch/create	POST	API triggers a new batch of labeling jobs
{endpointUrl}/batch/show	GET	APIs describe current state of the batch job run

Run the workflow

For the orchestration of steps, we use Step Functions as a managed solution. When the batch job creation API is triggered, a Lambda function triggers a Step Functions workflow like the following. This begins the annotation input processing.

Let’s discuss the steps in more detail.

Transformation step

The first step is to preprocess the data. The current implementation converts the notebook inputs into the internal manifest file data type shared across multiple steps. This step doesn’t currently perform any complex processing, but you can further customize this step by adding custom data preprocessing logic to this function. For example, if your dataset was encoded in raw videos, you could perform frame splitting and manifest generation within transformation rather than in a separate notebook. Alternatively, if you’re using this solution to create a 3D point cloud labeling pipeline, you may want to add logic to extract pose data in a world coordinate system using the camera and LiDAR extrinsic matrices.

TriggerLabelingFirstLevel

When the data preprocessing is complete, the Ground Truth API operation CreateLabelingJob is used to launch labeling jobs. These labeling jobs are responsible for annotating datasets that are tied to the first level.

CheckForFirstLevelComplete

This step waits for the FIRST_LEVEL Ground Truth labeling jobs triggered from the TriggerLabelingFirstStep. When the job trigger is complete, this step waits for all the created labeling jobs to complete. An external listener Lambda function monitors the status of the labeling jobs, and when all the pending labeling jobs are done, it runs the sendTokenSucess API to signal to this state to proceed to the next step. Failure cases are handled using appropriate error clauses and timeouts in the step definition.

SendSecondLevelSNSAndCheckResponse

This step performs postprocessing on the output of the first-level job. For example, if your requirements are to only send 10% of frames to the adjustment jobs, you can implement this logic here by filtering the set of outputs from the first job.

TriggerLabelingSecondLevel

When the data postprocessing from the first-level is complete, CreateLabelingJobs is used to launch labeling jobs to complete annotations at the second level. At this stage, a private workforce reviews the quality of annotations of the first-level labeling jobs and updates annotations as needed.

CheckForSecondLevelComplete

This step is the same wait step as CheckForFirstLevelComplete, but this step simply waits for the jobs that are created from the second level.

SendThirdLevelSNSAndCheckResponse

This step is the same post-processing step as SendSecondLevelSNSAndCheckResponse, but this step does the postprocessing of the second-level output and feeds as input to the third-level labeling job.

TriggerLabelingThirdLevel

This is the same logic as TriggerLabelingSecondLevel, but the labeling jobs are triggered that are annotated as third level. At this stage, the private workforce is updating annotations for quality of the second-level labeling job.

CopyLogsAndSendBatchCompleted

This Lambda function logs and sends SNS messages to notify users that the batch is complete. It’s also a placeholder for any post-processing logic that you may want to run. Common postprocessing includes transforming the labeled data into a format compatible with a customer-specific data format.

Prerequisites

Before getting started, make sure you have the following prerequisites:

An AWS account.
A notebook AWS Identity and Access Management (IAM) role with the permissions required to complete this walkthrough. Your IAM role must have the required permissions attached. If you don’t require granular permission, attach the following AWS managed policies:
- AmazonS3FullAccess
- AmazonAPIGatewayInvokeFullAccess
- AmazonSageMakerFullAccess
Familiarity with Ground Truth, AWS CloudFormation, and Step Functions.
A SageMaker workforce. For this post, we use a private workforce. You can create a workforce on the SageMaker console. Note the Amazon Cognito user pool identifier and the app client identifier after your workforce is created. You use these values to tell the CloudFormation stack deployment which workforce to create work teams, which represent the group of labelers. You can find these values in the Private workforce summary section on the console after you create your workforce, or when you call DescribeWorkteam.

The following GIF demonstrates how to create a private workforce. For step-by-step instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.

Launch the CloudFormation stack

Now that we’ve seen the structure of the solution, we deploy it into our account so we can run an example workflow. All our deployment steps are managed by AWS CloudFormation—it creates resources in Lambda, Step Functions, DynamoDB, and API Gateway for you.

You can launch the stack in AWS Region us-east-1 on the CloudFormation console by choosing Launch Stack:

On the CloudFormation console, select Next, and then modify the following template parameters to customize the solution.

You can locate the CognitoUserPoolClientId and CognitoUserPoolId in the SageMaker console.

CognitoUserPoolClientId: App client ID of your private workforce.
CognitoUserPoolId: ID of the user pool associated with your private workforce.

To locate these values in the console:

Open the SageMaker console at https://console.aws.amazon.com/sagemaker/
Select Labeling workforces in the navigation pane.
Choosing the Private
Use the values in the Private work team summary Use the App client for the CognitoUserPoolClientId and use Amazon Cognito user pool for the CognitoUserPoolId.

For this tutorial, you can use the default values for the following parameters.

GlueJobTriggerCron: Cron expression to use when scheduling reporting AWS Glue cron job. The results from annotations generated with SageMaker Ground Truth and worker performance metrics are used to create a dashboard in Amazon QuickSight. This will be explained in detail as part of second part. The outputs from SageMaker annotations and worker performance metrics shows up in Athena queries after processing the data with AWS Glue. By default, AWS Glue cron jobs run every hour.
JobCompletionTimeout: Number of seconds to wait before treating a labeling job as failed and moving to the BatchError state.
LoggingLevel: This is used internally and can be ignored. Logging level to change verbosity of logs. Accepts values DEBUG and PROD.

Prefix: A prefix to use when naming resources used to creating and manage labeling jobs and worker metrics.

To launch the stack in a different AWS Region, use the instructions found in the README of the GitHub repository.

After you deploy the solution, two new work teams are in the private workforce you created earlier: smgt-workflow-first-level and smgt-workflow-second-level. These are the default work teams used by the solution if no overrides are specified, and the smgt-workflow-second-level work team is used for labeling second-level and third-level jobs. You should add yourself to both work teams to see labeling tasks created by the solution. To learn how to add yourself to a private work team, see Add or Remove Workers.

You also need to go the the API Gateway console and look for the deployed API prefixed with smgt-workflow and note its ID. The notebook needs to reference this ID so it can determine which API URL to call.

Launch the notebook

After you deploy the solution into your account, you’re ready to launch a notebook to interact with it and start new workflows. In this section, we walk through the following steps:

Set up and access the notebook instance.
Obtain the example dataset.
Prepare Ground Truth input files.

Set up the SageMaker notebook instance

In this example notebook, you learn how to map a simple taxonomy consisting of a vehicle class and a lane class to Ground Truth label category configuration files. A label category configuration file is used to define the labels that workers use to annotation your images. Next, you learn how to launch and configure the solution that runs the pipeline using a CloudFormation template. You can also further customize this code, for example by customizing the batch creation API call to run labeling for a different combination of task types.

To create a notebook instance and access the notebook used in this post, complete the following steps:

Create a notebook instance with the following parameters:
1. Use ml.t2.medium to launch the notebook instance.
2. Increase the ML storage volume size to at least 10 GB.
3. Select the notebook IAM role described in prerequisites. This role allows your notebook to upload your dataset to Amazon S3 and call the solution APIs.
Open Jupyter Lab or Jupyter to access your notebook instances.
In Jupyter, choose the SageMaker Examples In Jupyter Lab, choose the SageMaker icon.
Choose Ground Truth Labeling Jobs and then choose the job sagemaker_ground_truth_workflows.ipynb.
If you’re using Jupyter, choose Use to copy the notebook to your instance and run it. If you’re in Jupyter lab, choose Create a Copy.

Obtain the example dataset

Complete the following steps to set up your dataset:

Download MOT17.zip using the Download Dataset section of the notebook.

This download is approximately 5 GB and takes several minutes.

Unzip MOT17.zip using the notebook’s Unzip dataset
Under the Copy Data to S3 header, run the cell to copy one set of video frames dataset to Amazon S3.

Prepare the Ground Truth input files

To use the solution, we need to create a manifest file. This file tells Ground Truth where your dataset is. We also need two label category configuration files to describe our label names, and the labeling tool to use for each (bounding box or polyline).

Run the cells under Generate Manifest to obtain a list of frames in a video from the dataset. We take 150 frames at half the frame rate of the video as an example.
Continue running cells under Generate Manifest to build a sequence file describing our video frames, and then to create a manifest file referring to our sequence file.
Run the cell under Generate Label Category Configuration Files to create two new files: a vehicle label configuration file (which uses the bounding box tool), and a lane label configuration file (which uses the polyline tool).
Copy the manifest file and label the category configuration files to Amazon S3 by running the Send data to S3

At this point, you have prepared all inputs to the labeling jobs and are ready to begin operating the solution.

To learn more about Ground Truth video frame labeling jobs and chaining, see the following references:

Run an example workflow

In this section, we walk through the steps to run an example workflow on the automotive dataset. We create a multi-modality workflow, perform both initial and audit labeling, then view our completed annotations.

Create a workflow batch

This solution orchestrates a workflow of Ground Truth labeling jobs to run both video object tracking bounding box jobs and polyline jobs, as well as automatically create adjustment jobs after the initial labeling. This workflow batch is configured through the batch_create API available to the solution.

Run the cell under Batch Creation Demo in the notebook. This passes your input manifest and label category configuration S3 URIs to a new workflow batch.

The cell should output the ID of the newly created workflow batch, for example:

Batch processor successfully triggered with BatchId : nb-ccb0514c

Complete the first round of labeling tasks

To simulate workers completing labeling, we log in as a worker in the first-level Ground Truth work team and complete the labeling task.

Run the cell under Sign-in To Worker Portal to get a link to log in to the worker portal.

An invitation should have already been sent to your email address if you invited yourself to the solution-generated first-level and second-level work teams.

Two tasks should be available, one with ending in vehicle and one ending in lane, corresponding to the two jobs we created during workflow batch creation.

Open each task and add some dummy labels by choosing and dragging on the image frames.
Choose Submit on each task.

Complete the second round of labeling tasks

Our workflow specified we wanted adjustment jobs auto-launched for each first-level job. We now complete the second round of labeling tasks.

Still in the worker portal, wait for tasks with vehicle-audit and lane-audit to appear.
Open each task in the worker portal, noting that the prior level’s labels are still visible.

These adjustment tasks could be performed by a more highly trained quality assurance group in a different work team.

Make adjustments as desired and choose Pass or Fail on each annotation.
When you’re finished, choose Submit.

View the completed annotations

We can view details about the completed workflow batch by running the batch show API.

Run the cell under Batch Show Demo.

This queries the solution’s database for all complete workflow run batches, and should output your batch ID when your batch is complete.

We can get more specific details about a batch by running the cell under Batch Detailed Show Demo.

This takes the ID of a batch in the system and returns status information and the locations of all input and output manifests for each created job.

Copy and enter the field jobOutputS3Url for any of the jobs and verify the manifest file for that job is downloaded.

This file contains a reference to your input data sequence as well as the S3 URI of the output annotations for each sequence.

Final results

When all labeling jobs in the pipeline are complete, an SNS message is published on the default status SNS topic. You can subscribe to SNS topics using an email address for verifying the solution’s functionality. The message includes the batch ID used during batch creation, a message about the batch completion, and the same information the batch/show API provides under a batchInfo key. You can parse this message to extract metadata about the completed labeling jobs in the second level of the pipeline.

{
    "batchId": "nb-track-823f6d3e",
    "message": "Batch processing has completed successfully.",
    "batchInfo": {
        "batchId": "nb-track-823f6d3e",
        "status": "COMPLETE",
        "inputLabelingJobs": [
            {
                "jobName": "nb-track-823f6d3e-vehicle",
                "taskAvailabilityLifetimeInSeconds": "864000",
                "inputConfig": {
                    "inputManifestS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/MOT17-13-SDP.manifest"
                },
                "jobModality": "VideoObjectTracking",
                "taskTimeLimitInSeconds": "604800",
                "maxConcurrentTaskCount": "100",
                "workteamArn": "arn:aws:sagemaker:us-west-2:322552456788:workteam/private-crowd/smgt-workflow-1-first-level",
                "jobType": "BATCH",
                "jobLevel": "1",
                "labelCategoryConfigS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/vehicle_label_category.json"
            },
            {
                "jobName": "nb-track-823f6d3e-lane",
                "taskAvailabilityLifetimeInSeconds": "864000",
                "inputConfig": {
                    "inputManifestS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/MOT17-13-SDP.manifest"
                },
                "jobModality": "VideoObjectTracking",
                "taskTimeLimitInSeconds": "604800",
                "maxConcurrentTaskCount": "100",
                "workteamArn": "arn:aws:sagemaker:us-west-2:322552456788:workteam/private-crowd/smgt-workflow-1-first-level",
                "jobType": "BATCH",
                "jobLevel": "1",
                "labelCategoryConfigS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/lane_label_category.json"
            },
            {
                "jobName": "nb-track-823f6d3e-vehicle-audit",
                "taskAvailabilityLifetimeInSeconds": "864000",
                "inputConfig": {
                    "chainFromJobName": "nb-track-823f6d3e-vehicle"
                },
                "jobModality": "VideoObjectTrackingAudit",
                "taskTimeLimitInSeconds": "604800",
                "maxConcurrentTaskCount": "100",
                "workteamArn": "arn:aws:sagemaker:us-west-2:322552456788:workteam/private-crowd/smgt-workflow-1-first-level",
                "jobType": "BATCH",
                "jobLevel": "2"
            },
            {
                "jobName": "nb-track-823f6d3e-lane-audit",
                "taskAvailabilityLifetimeInSeconds": "864000",
                "inputConfig": {
                    "chainFromJobName": "nb-track-823f6d3e-lane"
                },
                "jobModality": "VideoObjectTrackingAudit",
                "taskTimeLimitInSeconds": "604800",
                "maxConcurrentTaskCount": "100",
                "workteamArn": "arn:aws:sagemaker:us-west-2:322552456788:workteam/private-crowd/smgt-workflow-1-first-level",
                "jobType": "BATCH",
                "jobLevel": "2"
            }
        ],
        "firstLevel": {
            "status": "COMPLETE",
            "numChildBatches": "2",
            "numChildBatchesComplete": "2",
            "jobLevels": [
                {
                    "batchId": "nb-track-823f6d3e-first_level-nb-track-823f6d3e-lane",
                    "batchStatus": "COMPLETE",
                    "labelingJobName": "nb-track-823f6d3e-lane",
                    "labelAttributeName": "nb-track-823f6d3e-lane-ref",
                    "labelCategoryS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/lane_label_category.json",
                    "jobInputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/MOT17-13-SDP.manifest",
                    "jobInputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-input.s3.amazonaws.com/tracking_manifests/MOT17-13-SDP.manifest?...",
                    "jobOutputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectDetection/nb-track-823f6d3e-first_level-nb-track-823f6d3e-lane/output/nb-track-823f6d3e-lane/manifests/output/output.manifest",
                    "jobOutputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectDetection/nb-track-823f6d3e-first_level-nb-track-823f6d3e-lane/output/nb-track-823f6d3e-lane/manifests/output/output.manifest?..."
                },
                {
                    "batchId": "nb-track-823f6d3e-first_level-nb-track-823f6d3e-vehicle",
                    "batchStatus": "COMPLETE",
                    "labelingJobName": "nb-track-823f6d3e-vehicle",
                    "labelAttributeName": "nb-track-823f6d3e-vehicle-ref",
                    "labelCategoryS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/vehicle_label_category.json",
                    "jobInputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/MOT17-13-SDP.manifest",
                    "jobInputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-input.s3.amazonaws.com/tracking_manifests/MOT17-13-SDP.manifest?...",
                    "jobOutputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectTracking/nb-track-823f6d3e-first_level-nb-track-823f6d3e-vehicle/output/nb-track-823f6d3e-vehicle/manifests/output/output.manifest",
                    "jobOutputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectTracking/nb-track-823f6d3e-first_level-nb-track-823f6d3e-vehicle/output/nb-track-823f6d3e-vehicle/manifests/output/output.manifest?..."
                }
            ]
        },
        "secondLevel": {
            "status": "COMPLETE",
            "numChildBatches": "2",
            "numChildBatchesComplete": "2",
            "jobLevels": [
                {
                    "batchId": "nb-track-823f6d3e-second_level-nb-track-823f6d3e-vehicle-audit",
                    "batchStatus": "COMPLETE",
                    "labelingJobName": "nb-track-823f6d3e-vehicle-audit",
                    "labelAttributeName": "nb-track-823f6d3e-vehicle-audit-ref",
                    "labelCategoryS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/label_category_input/nb-track-823f6d3e-second_level-nb-track-823f6d3e-vehicle-audit/category-file.json",
                    "jobInputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectTracking/nb-track-823f6d3e-first_level-nb-track-823f6d3e-vehicle/output/nb-track-823f6d3e-vehicle/manifests/output/output.manifest",
                    "jobInputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectTracking/nb-track-823f6d3e-first_level-nb-track-823f6d3e-vehicle/output/nb-track-823f6d3e-vehicle/manifests/output/output.manifest?...",
                    "jobOutputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectTrackingAudit/nb-track-823f6d3e-second_level-nb-track-823f6d3e-vehicle-audit/output/nb-track-823f6d3e-vehicle-audit/manifests/output/output.manifest",
                    "jobOutputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectTrackingAudit/nb-track-823f6d3e-second_level-nb-track-823f6d3e-vehicle-audit/output/nb-track-823f6d3e-vehicle-audit/manifests/output/output.manifest?..."
                },
                {
                    "batchId": "nb-track-823f6d3e-second_level-nb-track-823f6d3e-lane-audit",
                    "batchStatus": "COMPLETE",
                    "labelingJobName": "nb-track-823f6d3e-lane-audit",
                    "labelAttributeName": "nb-track-823f6d3e-lane-audit-ref",
                    "labelCategoryS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/label_category_input/nb-track-823f6d3e-second_level-nb-track-823f6d3e-lane-audit/category-file.json",
                    "jobInputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectDetection/nb-track-823f6d3e-first_level-nb-track-823f6d3e-lane/output/nb-track-823f6d3e-lane/manifests/output/output.manifest",
                    "jobInputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectDetection/nb-track-823f6d3e-first_level-nb-track-823f6d3e-lane/output/nb-track-823f6d3e-lane/manifests/output/output.manifest?...",
                    "jobOutputS3Uri": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectDetectionAudit/nb-track-823f6d3e-second_level-nb-track-823f6d3e-lane-audit/output/nb-track-823f6d3e-lane-audit/manifests/output/output.manifest",
                    "jobOutputS3Url": "https://smgt-workflow-1-322552456788-us-west-2-batch-processing.s3.amazonaws.com/batch_manifests/VideoObjectDetectionAudit/nb-track-823f6d3e-second_level-nb-track-823f6d3e-lane-audit/output/nb-track-823f6d3e-lane-audit/manifests/output/output.manifest?..."
                }
            ]
        },
        "thirdLevel": {
            "status": "COMPLETE",
            "numChildBatches": "0",
            "numChildBatchesComplete": "0",
            "jobLevels": []
        }
    },
    "token": "arn:aws:states:us-west-2:322552456788:execution:smgt-workflow-1-batch-process:nb-track-823f6d3e-8432b929",
    "status": "SUCCESS"
}

Within each job metadata blob, a jobOutputS3Url field contains a presigned URL to access the output manifest of this particular job. The output manifest contains the results of data labeling in augmented manifest format, which you can parse to retrieve annotations by indexing the JSON object with <jobName>-ref. This field points to an S3 location containing all annotations for the given video clip.

{
    "source-ref": "s3://smgt-workflow-1-322552456788-us-west-2-batch-input/tracking_manifests/MOT17-13-SDP_seq.json",
    "nb-track-93aa7d01-vehicle-ref": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectTracking/nb-track-93aa7d01-first_level-nb-track-93aa7d01-vehicle/output/nb-track-93aa7d01-vehicle/annotations/consolidated-annotation/output/0/SeqLabel.json",
    "nb-track-93aa7d01-vehicle-ref-metadata": {
        "class-map": {"0": "Vehicle"},
        "job-name": "labeling-job/nb-track-93aa7d01-vehicle",
        "human-annotated": "yes",
        "creation-date": "2021-04-05T17:43:02.469000",
        "type": "groundtruth/video-object-tracking",
    },
    "nb-track-93aa7d01-vehicle-audit-ref": "s3://smgt-workflow-1-322552456788-us-west-2-batch-processing/batch_manifests/VideoObjectTrackingAudit/nb-track-93aa7d01-second_level-nb-track-93aa7d01-vehicle-audit/output/nb-track-93aa7d01-vehicle-audit/annotations/consolidated-annotation/output/0/SeqLabel.json",
    "nb-track-93aa7d01-vehicle-audit-ref-metadata": {
        "class-map": {"0": "Vehicle"},
        "job-name": "labeling-job/nb-track-93aa7d01-vehicle-audit",
        "human-annotated": "yes",
        "creation-date": "2021-04-05T17:55:33.284000",
        "type": "groundtruth/video-object-tracking",
        "adjustment-status": "unadjusted",
    },
}

For example, for bounding box jobs, the SeqLabel.json file contains bounding box annotations for each annotated frame (in this case, only the first frame is annotated):

{
  "tracking-annotations": [
    {
      "annotations": [
        {
          "height": 66,
          "width": 81,
          "top": 547,
          "left": 954,
          "class-id": "0",
          "label-category-attributes": {},
          "object-id": "3c02d0f0-9636-11eb-90fe-6dd825b8de95",
          "object-name": "Vehicle:1"
        },
        {
          "height": 98,
          "width": 106,
          "top": 545,
          "left": 1079,
          "class-id": "0",
          "label-category-attributes": {},
          "object-id": "3d957ee0-9636-11eb-90fe-6dd825b8de95",
          "object-name": "Vehicle:2"
        }
      ],
      "frame-no": "0",
      "frame": "000001.jpg",
      "frame-attributes": {}
    }
  ]
}

Because the batch completion SNS message contains all output manifest files from the jobs launched in parallel, you can perform any postprocessing of your annotations based on this message. For example, if you have a specific serialization format for these annotations that combines vehicle bounding boxes and lane annotations, you can get the output manifest of the lane job as well as the vehicle job, then merge based on frame number and convert to your desired final format.

To learn more about Ground Truth output data formats, see Output Data.

Clean up

To avoid incurring future charges, run the Clean up section of the notebook to delete all the resources including S3 objects and the CloudFormation stack. When the deletion is complete, make sure to stop and delete the notebook instance that is hosting the current notebook script.

Conclusion

This two-part series provides you with a reference architecture to build an advanced data labeling workflow comprised of a multi-step data labeling pipeline, adjustment jobs, and data lakes for corresponding dataset annotations and worker metrics as well as updated dashboards.

In this post, you learned how to take video frame data and trigger a workflow to run multiple Ground Truth labeling jobs, generating two different types of annotations (bounding boxes and polylines). You also learned how you can extend the pipeline to audit and verify the labeled dataset and how to retrieve the audited results. Lastly, you saw how to reference the current progress of batch jobs using the BatchShow API.

For more information about the data lake for Ground Truth dataset annotations and worker metrics from Ground Truth, check back to the Ground Truth blog for the second blog post in this series(coming soon).

Try out the notebook and customize it for your input datasets by adding additional jobs or audit steps, or by modifying the data modality of the jobs. Further customization of solution could include, but is not limited, to:

Adding additional types of annotations such as semantic segmentation masks or keypoints
Adding automated quality assurance and filtering to the Step Functions workflow to only send low-quality annotations to the next level of review
Adding third or fourth levels of quality review for additional, more specialized types of reviews

This solution is built using serverless technologies on top of Step Functions, which makes it highly customizable and applicable for a wide variety of applications.

About the Authors

Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.

Jeremy Feltracco is a Software Development Engineer with the Amazon ML Solutions Lab at Amazon Web Services. He uses his background in computer vision, robotics, and machine learning to help AWS customers accelerate their AI adoption.

Jae Sung Jang is a Software Development Engineer. His passion lies with automating manual process using AI Solutions and Orchestration technologies to ensure business execution.

Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MxNet, and AutoGluon.

Solution overview

Prerequisites

Architecture

Create an Amazon S3 bucket to be used as a data source

Deploy the infrastructure as a CloudFormation stack

Get the data and generate the metadata

Review Amazon Kendra configuration and start the data source sync

Observe Amazon Kendra index facet definition

Query the repository of WHO Model List of Essential Medicines

Infrastructure cleanup

Conclusion

About the Authors

Architecture overview

Deploy the solution using AWS CloudFormation

Translate your documents

Troubleshooting

Other considerations

Conclusion

About the Authors

Solution overview

Component 1: Data ingestion and preparation

Component 2: ML batch inference

Component 3: Data postprocessing and publishing

Business benefit and what’s next

About the Authors

Solution overview

Create and publish a detector

Create the input and output S3 buckets

Create a sample CSV file of event records

Perform a batch prediction

Review your batch prediction results

Conclusion

About the Author

Solution overview

Set up the Lambda function

Configure function permissions

Test the function

Set up an EventBridge scheduled event

Conclusion

About the Authors

AWS services used to implement this solution

Estimated costs

Solution overview

Trigger the orchestration workflow with an API

Run the workflow

Transformation step

TriggerLabelingFirstLevel

CheckForFirstLevelComplete

SendSecondLevelSNSAndCheckResponse

TriggerLabelingSecondLevel

CheckForSecondLevelComplete

SendThirdLevelSNSAndCheckResponse

TriggerLabelingThirdLevel

CopyLogsAndSendBatchCompleted

Prerequisites

Launch the CloudFormation stack

Launch the notebook

Set up the SageMaker notebook instance

Obtain the example dataset

Prepare the Ground Truth input files

Run an example workflow

Create a workflow batch

Complete the first round of labeling tasks

Complete the second round of labeling tasks

View the completed annotations

Final results

Clean up

Conclusion

About the Authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.