Build BI dashboards for your Amazon SageMaker Ground Truth labels and worker metadata

This is the second in a two-part series on the Amazon SageMaker Ground Truth hierarchical labeling workflow and dashboards. In Part 1: Automate multi-modality, parallel data labeling workflows with Amazon SageMaker Ground Truth and AWS Step Functions, we looked at how to create multi-step labeling workflows for hierarchical label taxonomies using AWS Step Functions. In Part 2, we look at how to build dashboards and derive insights for analyzing dataset annotations and worker performance metrics on data lakes generated as output from the complex workflows.

Amazon SageMaker Ground Truth (Ground Truth) is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning (ML). This post introduces a solution that you can use to create customized business intelligence (BI) dashboards using Ground Truth labeling job output data. You can use these dashboards to analyze annotation quality, worker metrics, and more.

In Part 1, we presented a solution to create multiple types of annotations for a single input data object and check annotation quality, using a series of multi-step labeling jobs that run in a parallel, hierarchical fashion using Step Functions. The solution results in high-quality annotations using Ground Truth. The format of these annotations is explained in Output Data, and each takes the form of one or more JSON manifest files in Amazon Simple Storage Service (Amazon S3). You now need a mechanism to dynamically fetch these manifests, publish them onto your analytical datastore, and use them to create meaningful reports in an automated fashion. This allows ML practitioners and data scientists to track annotation progress and quality, and allows MLOps and annotation operations teams to gain insights about the annotations and track worker performance. For example, these interested parties may want to see the following reports generated from Ground Truth output data:

  • Annotation-level reports – These reports include the following:
    • The number of annotations done in a specified time frame.
    • Filtering based on label attributes. A label attribute is a Ground Truth feature that workers can use to provide metadata about individual annotations. For example, you can create a label attribute to have workers identify vehicle type (sedan, SUV, bus) or vehicle status (parked or moving).
    • The number of frames per label or frame attributes in a labeling job. A frame attribute is a Ground Truth feature that workers can use to provide metadata about video frames. For example, you can create a frame attribute to have workers identify frame quality (blurry or clear) and add a visualization to show the number of good (clear) vs. bad (blurry) frames.
    • The number of tasks audited or adjusted by a reviewer (in Part 1, this is a second-level or third-level worker).
    • If you have workers audit labels from previous labeling jobs, you can enumerate audit results for each label (such as car or bush) using label attributes (such as correctly or incorrectly labeled).
  • Worker-level reports – These reports include the following:
    • The number of Ground Truth jobs worked on by each worker.
    • The total number of labels created by each individual annotator.
    • For one or more labeling jobs, the total amount of time spent by each worker annotating data objects.
    • The minimum, average, and maximum time taken to label data objects by each worker.
    • The same statistics aggregated across the entire data annotator team.

In this post, we walk you through the process of generating a data lake for annotations and worker metadata from Ground Truth output data and build visual dashboards on those datasets to gain business insights using Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight.

If you completed Part 1 of this series, you can skip the prerequisite and deployment steps and start setting up the AWS Glue ETL job used to process the output data generated from that tutorial. If you didn’t complete Part 1, make sure to complete the prerequisites and deploy the solution, before enabling the AWS Glue workflow.

AWS services used to implement this solution

This post walks you through how to create helpful visualizations for analyzing Ground Truth output data to derive insights into annotations and throughput and efficiency of your own private workers. The walkthrough uses the following AWS services:

  • Amazon Athena – Allows you to perform ad hoc queries on S3 data using SQL, and query the QuickSight dataset for manual data analysis.
  • AWS Glue – Helps prepare your data for analysis or ML. AWS Glue is a serverless data preparation service that makes it easy to extract, clean, enrich, normalize, and load data. We use the following features:
    • An AWS Glue crawler to crawl the dataset and prepare metadata without loading it into a database. This reduces the cost of running an expensive database; you can store and run visuals from raw data files stored in an inexpensive, highly scalable, and durable S3 bucket.
    • AWS Glue ETL jobs to extract, transform, and load (ETL) additional data. A job is the business logic that performs the ETL work in AWS Glue.
    • The AWS Glue Data Catalog, which acts as a central metadata repository. This makes your data available for search and query using services such as Athena.
  • Amazon QuickSight – Generates insights and builds visualizations with your data. QuickSight lets you easily create and publish interactive dashboards. You can choose from an extensive library of visualizations, charts, and tables, and add interactive features such as drill-downs and filters. For more information about setting up a dashboard, see Getting Started with Data Analysis in Amazon QuickSight.
  • Amazon S3 – Stores the Ground Truth output data. Amazon S3 is the core service at the heart of the modern data architecture. Amazon S3 is unlimited, durable, elastic, and cost-effective for storing data or creating data lakes. You can use a data lake on Amazon S3 for reporting, analytics, artificial intelligence (AI), and machine learning (ML), because it can be shared across AWS big data services.

Solution overview

In Part 1 of this series, we discuss an architecture pattern that allows you to build a pipeline for orchestrating multi-step data labeling workflows that have workers add different types of annotations to data objects, in parallel, using Ground Truth. In this post, you learn how you can analyze the dataset annotations as well as worker performance. This solution builds data lakes using Ground Truth output data (annotations and worker metadata) and uses these data lakes to derive insights about or analyze the performance of your workers and dataset annotation quality using advanced analytics.

The code for Part 1 and Part 2 is located in the amazon-sagemaker-examples GitHub repo.

The following diagram illustrates this architecture, which is an end-to-end pipeline consisting of two components:

  • Workflow pipeline – A hierarchical workflow built using Ground Truth, AWS CloudFormation, Step Functions, Amazon DynamoDB, and AWS Lambda. This is covered in detail in Part 1.
  • Ground Truth reporting pipeline – A pipeline used to build BI dashboards using AWS Glue, Athena, and QuickSight to analyze and visualize Ground Truth output data and metadata generated by the AWS Glue ETL job. We discuss this in more detail in the next section.

Ground Truth reporting pipeline

The reporting pipeline is built on the Ground Truth output data stored in Amazon S3 (referred to as the Ground Truth bucket).

The data is processed and the tables are created in the Data Catalog using the following steps:

  1. An AWS Glue crawler crawls the data labeling job output data, which is in JSON format, to determine the schema of your data, and creates a metadata table in your Data Catalog.
  2. The Data Catalog contains references to data that is used as sources and targets of your ETL jobs. The data is saved to an AWS Glue processing bucket.
  3. The ETL job retrieves worker metrics from the Ground Truth bucket and adds worker information from Amazon Cognito such as user name and email address. The job stores this data in the processed bucket (${Prefix}-${AWS::AccountId}-${AWS::Region}-wm-glue-output/processed_worker_metrics/). The job changes the format from JSON to Parquet for faster querying.
  4. A crawler crawls the processed worker metrics data from the processed AWS Glue bucket. A crawler also crawls the annotations folder and output manifests folder to generate annotations and manifest tables.
  5. For each crawler, AWS Glue adds tables (annotations table, output manifest tables, and worker metrics table) to the Data Catalog in the {Prefix}-gluedatabase database.
  6. Athena queries and retrieves the Ground Truth output data stored in the S3 data lake using the Data Catalog.
  7. The query results are visualized in QuickSight using dashboards.
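The enrichment in step 3 can be illustrated with a minimal Python sketch. This is not the solution's AWS Glue job; it only shows the idea of joining worker answers to Amazon Cognito user details by the worker sub ID. The record layout (workerMetadata, identityData, sub, timeSpentInSeconds) is assumed from the Ground Truth worker response format, and the function name, the cognito_users lookup, and the sample values are hypothetical.

```python
from typing import Dict, Iterable, List


def enrich_worker_metrics(answers: Iterable[dict], cognito_users: Dict[str, dict]) -> List[dict]:
    """Attach user name and email to each worker answer, keyed by Cognito sub."""
    enriched = []
    for answer in answers:
        # Assumed layout: each answer carries the Cognito "sub" of the worker.
        sub = answer["workerMetadata"]["identityData"]["sub"]
        # Fall back to the sub itself when the user pool has no matching user.
        user = cognito_users.get(sub, {})
        enriched.append(
            {
                "worker_sub": sub,
                "user_name": user.get("user_name", sub),
                "email": user.get("email"),
                "timeSpentInSeconds": answer["timeSpentInSeconds"],
            }
        )
    return enriched


# Hypothetical sample data: one known worker and one unknown worker.
answers = [
    {"timeSpentInSeconds": 42.5, "workerMetadata": {"identityData": {"sub": "abc-123"}}},
    {"timeSpentInSeconds": 10.0, "workerMetadata": {"identityData": {"sub": "zzz-999"}}},
]
cognito_users = {"abc-123": {"user_name": "jane", "email": "jane@example.com"}}
rows = enrich_worker_metrics(answers, cognito_users)
```

When a sub ID has no match in the user pool, the sketch keeps the sub ID as the user name, so unmatched workers still appear in reports.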

As shown in the following dashboard examples, you can configure and display the top priority statistics at the top of the dashboard, such as total count of labeled vehicles, quality of labels and frames in a batch, and worker performance metrics. You can create additional visualizations according to your business needs. For more information, see Working with Visual Types in Amazon QuickSight.

The following table includes worker performance summary statistics.

The following dashboard shows several visualizations (from left to right, top to bottom):

  • The number of vehicles labeled, broken up by vehicle type
  • The number of annotations that passed and failed an audit quality check
  • The number of good-quality (pass) and bad-quality (fail) video frames in the labeling job, identified by workers using frame attributes
  • The number of parked vehicles (stationary) vs. moving vehicles (dynamic), identified by workers using label attributes
  • A histogram displaying the total number of vehicles labeled per frame
  • Tables displaying the quality of frames and audit results for multiple video frame labeling jobs


Prerequisites

If you’re continuing from Part 1 of this series, you can skip this step and move on to enabling the AWS Glue workflow.

If you didn’t complete the demo in Part 1, you need the following resources:

  • An AWS account.
  • An AWS Identity and Access Management (IAM) user with access to Amazon S3, AWS Glue, and Athena. If you don’t require granular permission, attach the following AWS managed policies:
    • AmazonS3FullAccess
    • AmazonSageMakerFullAccess
  • Familiarity with Ground Truth, AWS CloudFormation, and Step Functions.
  • An Amazon SageMaker workforce. For this demonstration, we use a private workforce. You can create a workforce through the SageMaker console. Note the Amazon Cognito user pool ID and the App client ID after you create your workforce. You use these values to tell the AWS CloudFormation deployment which workforce to use to create work teams, which represent groups of labelers. You can find these values in the Private workforce summary page on the Ground Truth area of the Amazon SageMaker console after you create your workforce, or when you call DescribeWorkteam. The following GIF demonstrates how to create a private workforce. For step-by-step instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.

Deploy the solution

If you didn’t complete the tutorial outlined in Part 1, you can use the sample data provided for this post to create a sample dashboard. If you completed Part 1, you can skip this section and proceed to enabling the AWS Glue workflow.

Launch the dashboard stack

To launch the resources required to create a sample dashboard with example data, you can launch the stack in AWS Region us-east-1 on the AWS CloudFormation console by choosing Launch Stack:

On the AWS CloudFormation console, choose Next, and modify the parameter for CognitoUserPoolId to identify the user pool associated with your private workforce. You can locate this information on the SageMaker console:

  1. On the SageMaker console, choose Labeling workforces in the navigation pane.
  2. Find the values on the Private workforce summary page.
  3. Use the App client value for CognitoUserPoolClientId and the Amazon Cognito user pool value for CognitoUserPoolId.

Additionally, enter a prefix to use when naming resources. We use this for creating and managing labeling jobs and worker metrics.

For this post, you can use the default values for the following parameters:

  • GlueJobTriggerCron – The cron expression to use when scheduling the reporting AWS Glue cron job. The results from annotations generated with Ground Truth and the worker performance metrics are used to create a dashboard in QuickSight. The outputs from the SageMaker annotations and worker performance metrics show up in Athena queries after processing the data with AWS Glue. By default, AWS Glue cron jobs run every hour.
  • BatchProcessingInputBucketId – The bucket that contains the SMGT output data under the batch manifests folder. By default, the ML blogs bucket (aws-ml-blog) is defined and contains the SMGT output data.
  • LoggingLevel – The logging level to change the verbosity of the logs. Accepts values DEBUG and PROD. This is used internally and can be ignored.
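For reference, AWS Glue schedules use a six-field cron expression (minutes, hours, day-of-month, month, day-of-week, year). The following sketch builds an hourly expression of the kind GlueJobTriggerCron expects. The helper name is hypothetical, and the stack's actual default value isn't shown in this post.

```python
def glue_hourly_cron(minute: int = 0) -> str:
    """Return an AWS Glue cron expression that fires once an hour."""
    if not 0 <= minute <= 59:
        raise ValueError("minute must be between 0 and 59")
    # Field order: minutes, hours, day-of-month, month, day-of-week, year.
    # "?" means "no specific value" for the day-of-week field.
    return f"cron({minute} * * * ? *)"


expression = glue_hourly_cron()
print(expression)
```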

To launch the stack in a different AWS Region, use the instructions found in the README of the GitHub repository.

After you deploy the solution, use the next section to enable an AWS Glue workflow used to generate the BI dashboards.

Enable the AWS Glue workflow

If you completed Part 1, you launched a CloudFormation stack to create the Ground Truth labeling framework, used Ground Truth to annotate the MOT17 automotive dataset for vehicles, road boundaries, and lanes, and audited the frames for annotation quality. To feed this data into the reporting dashboard set up by the Ground Truth labeling framework, you need to connect the output infrastructure that you previously set up to Athena and QuickSight. Athena can treat data in Amazon S3 as a relational database and allows you to run SQL queries on your data. QuickSight runs those queries on your behalf and creates visualizations of your data.

The following workflow allows Athena to run SQL queries on the example data. Complete the following steps to enable the workflow:

  1. On the AWS Glue console, in the left navigation pane, under ETL, choose Workflows.
  2. Select the SMGT-Glue-Workflow workflow.
  3. On the Actions menu, choose Run.

If you don’t want to start the workflow now, you can wait—it automatically runs hourly.

AWS Glue takes some time to spin up its resources during the first run, so allow approximately 30 minutes for the workflow to finish. The completed workflow shows up on the Workflows page.
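If you prefer to start the workflow programmatically instead of from the console, the following is a minimal sketch using the AWS Glue API operations StartWorkflowRun and GetWorkflowRun through a boto3-style client. The function name and polling interval are illustrative choices, and the client is passed in so that the logic can be exercised without AWS access.

```python
import time


def run_glue_workflow(glue_client, name: str, poll_seconds: float = 60.0) -> str:
    """Start a Glue workflow run and wait until it is no longer RUNNING."""
    run_id = glue_client.start_workflow_run(Name=name)["RunId"]
    while True:
        run = glue_client.get_workflow_run(Name=name, RunId=run_id)["Run"]
        if run["Status"] != "RUNNING":
            # For example COMPLETED, STOPPED, or ERROR.
            return run["Status"]
        time.sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials with AWS Glue permissions):
#   import boto3
#   status = run_glue_workflow(boto3.client("glue"), "SMGT-Glue-Workflow")
```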

This pipeline is set up in the reporting.yml file. Currently, the AWS Glue workflow runs the pipeline through the ScheduledJobTrigger resource, which has the flag StartOnCreation: false. If you want to run this pipeline on a schedule, switch this flag to true.

Datasets surfaced

All the following metadata and manifest external tables act as base source tables for Ground Truth (SMGT), and they persist values in the same form as they are captured within Ground Truth, with some customization to link the output worker ID to identifiable worker information, such as a user name, in the worker metadata. This provides flexibility for auditing and changing analytical needs.

The database ${Prefix}-${AWS::AccountId}-${AWS::Region}-gluedatabase contains four tables, which are surfaced using the AWS Glue workflow. For our demonstration, we use smgt-gluedatabase as the database name. The tables are as follows:

  • An annotations table, called annotations_batch_manifests
  • Two output manifest tables (one each for first-level jobs and second-level jobs)
    • The labeling job table output_manifest_videoobjecttracking
    • The audit job table output_manifest_videoobjecttrackingaudit
  • A worker metrics table, called worker_metrics_processed_worker_metrics

The following screenshot shows the sample output of the tables under the AWS Glue database.

Connect Athena with the data lake

You can use Athena to connect to your S3 data lake and run SQL queries, which QuickSight uses to create visualizations.

If this is your first time using Athena, you need to configure the Athena query result location to the reporting S3 bucket created for the Athena workgroup. For more information, see Specifying a Query Result Location.

  1. On the Athena console, choose Settings in the navigation bar.
  2. For Query result location, enter the S3 URL for the location of the bucket created for the Athena workgroup. The format is s3://${Prefix}-${AWS::AccountId}-${AWS::Region}-athena/. Note that the trailing slash is required.
  3. Leave the other fields unchanged.
  4. Choose Save.
  5. In the Athena Query Editor, run the following SQL queries to verify that the reporting stack is configured properly:
SELECT * FROM "smgt-gluedatabase"."annotations_batch_manifests" limit 10;
SELECT * FROM "smgt-gluedatabase"."worker_metrics_processed_worker_metrics" limit 10;
SELECT * FROM "smgt-gluedatabase"."output_manifest_videoobjecttracking" limit 10;
SELECT * FROM "smgt-gluedatabase"."output_manifest_videoobjecttrackingaudit" limit 10;

You must have at least one Ground Truth job completed to generate these tables.
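The same verification can be scripted. The following sketch generates one query per table and, optionally, submits the queries through the Athena StartQueryExecution API using a boto3-style client. The function names are hypothetical, and the workgroup name you pass in depends on the prefix you chose when deploying the stack.

```python
TABLES = [
    "annotations_batch_manifests",
    "worker_metrics_processed_worker_metrics",
    "output_manifest_videoobjecttracking",
    "output_manifest_videoobjecttrackingaudit",
]


def verification_queries(database: str, limit: int = 10) -> list:
    """Build one SELECT per reporting table."""
    return [f'SELECT * FROM "{database}"."{table}" LIMIT {limit}' for table in TABLES]


def submit_queries(athena_client, database: str, workgroup: str) -> list:
    """Submit each verification query and return the query execution IDs."""
    execution_ids = []
    for query in verification_queries(database):
        response = athena_client.start_query_execution(
            QueryString=query,
            QueryExecutionContext={"Database": database},
            WorkGroup=workgroup,
        )
        execution_ids.append(response["QueryExecutionId"])
    return execution_ids


queries = verification_queries("smgt-gluedatabase")

# Example usage (requires boto3 and AWS credentials; the workgroup name is a placeholder):
#   import boto3
#   submit_queries(boto3.client("athena"), "smgt-gluedatabase", "<your-workgroup>")
```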

The following screenshot shows our output.

Visualize in QuickSight

You’re now ready to visualize your data in QuickSight.

Set up QuickSight

In this section, you update permissions in your QuickSight account to provide access to the S3 reporting buckets. For more information, see Accessing Data Sources. You also import the data from Athena to SPICE so that QuickSight can display it.

  1. On the QuickSight console, choose your user name on the application bar, and choose Manage QuickSight.
  2. Choose Security & permissions.
  3. Under QuickSight access to AWS services, choose Add or remove.

A list of available AWS services is displayed.

  4. Under Amazon S3, choose details and then choose Select S3 buckets.
  5. Do one of the following:
    1. Option 1 (completed Part 1): If you completed Part 1, select the following S3 buckets:
      1. In S3 Buckets Linked to QuickSight Account, under S3 buckets, choose the following S3 buckets:
        1. {Prefix}-workflow-{account-ID}-{region}-batch-processing
        2. {Prefix}-workflow-{account-ID}-{region}-wm-glue-output
        3. {Prefix}-workflow-{account-ID}-{region}-athena
      2. In S3 Write permissions for Athena Workgroup, choose the following S3 bucket:
        1. {Prefix}-workflow-{account-ID}-{region}-athena
    2. Option 2 (did not complete Part 1): If you did not complete Part 1 and used the Launch Stack option in this post, select the following S3 buckets:
      1. In S3 Buckets Linked to QuickSight Account, under S3 buckets, choose the following S3 buckets:
        1. {Prefix}-{account-ID}-{region}-wm-glue-output
        2. {Prefix}-{account-ID}-{region}-athena
      2. In S3 Write permissions for Athena Workgroup, choose the following S3 bucket:
        1. {Prefix}-{account-ID}-{region}-athena
      3. In S3 Buckets You Can Access Across AWS, under S3 buckets, choose the following S3 bucket:
        1. aws-ml-blog
  6. In both cases, after you’ve selected the buckets described above, choose Finish to close the Select Amazon S3 buckets dialog box.
  7. Choose Update to finish updating the permissions.

Create datasets

Create a new dataset using Athena as the source.

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.
  3. In the FROM NEW DATA SOURCES section, choose Athena.
  4. For Data source name, enter Worker Metrics.
  5. For Athena workgroup, enter {Prefix}ReportsWorkGroup.
  6. Choose Create data source.
  7. For Database: contain sets of tables, choose smgt-gluedatabase.
  8. Select Use custom SQL and enter the following query:
SELECT *, cardinality(ans.trackingannotations.framedata.entries) as tasks FROM "smgt-gluedatabase"."worker_metrics_processed_worker_metrics", unnest(answercontent) as t(ans);
  9. Choose Edit/Preview data.
  10. For Custom SQL Name, enter Worker Metrics Dataset.
  11. Choose Apply.
  12. Choose Save & Visualize.
  13. Choose Visualize.

In addition to creating the worker metrics dataset, you should also create annotation datasets.

The following code creates a label-level dataset for vehicles:

SELECT job_name,each_ann.height,each_ann.width,each_ann.top,each_ann."left",each_ann."label-category-attributes".moving,each_ann."label-category-attributes".vehicle_type,each_ann."label-category-attributes".audit,each_ann."object-name",each_ann from
(SELECT ann.annotations, partition_1 as job_name FROM "smgt-gluedatabase"."annotations_batch_manifests", unnest("tracking-annotations") as t(ann) where cardinality(ann.annotations) != 0) as data, unnest(data.annotations) as t(each_ann);

The following code creates a frame-level dataset for vehicles:

SELECT ann."frame-no",ann.frame,ann."frame-attributes"."number_of_vehicles",ann."frame-attributes"."quality_of_the_frame",ann.annotations, cardinality(ann.annotations) as num_labels, partition_1 as job_name, ann FROM "smgt-gluedatabase"."annotations_batch_manifests", unnest("tracking-annotations") as t(ann) where cardinality(ann.annotations) != 0
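To make the unnest and cardinality logic in the two annotation queries easier to follow, here is a rough Python equivalent that runs on an in-memory record. It's only an illustration of the flattening, not part of the solution. The field names come from the queries above, while the function names and the sample record are made up.

```python
def label_rows(record: dict, job_name: str) -> list:
    """One row per label, like the nested unnest in the label-level query."""
    rows = []
    for frame in record["tracking-annotations"]:  # unnest("tracking-annotations")
        for label in frame["annotations"]:  # unnest(data.annotations)
            attributes = label.get("label-category-attributes", {})
            rows.append(
                {
                    "job_name": job_name,
                    "object-name": label["object-name"],
                    "vehicle_type": attributes.get("vehicle_type"),
                    "moving": attributes.get("moving"),
                }
            )
    return rows


def frame_rows(record: dict, job_name: str) -> list:
    """One row per non-empty frame, like the frame-level query."""
    return [
        {"job_name": job_name, "frame-no": frame["frame-no"], "num_labels": len(frame["annotations"])}
        for frame in record["tracking-annotations"]
        if len(frame["annotations"]) != 0  # cardinality(ann.annotations) != 0
    ]


# Made-up record with one labeled frame and one empty frame.
record = {
    "tracking-annotations": [
        {
            "frame-no": 0,
            "annotations": [
                {"object-name": "car:1", "label-category-attributes": {"vehicle_type": "sedan", "moving": "yes"}},
                {"object-name": "car:2", "label-category-attributes": {"vehicle_type": "bus", "moving": "no"}},
            ],
        },
        {"frame-no": 1, "annotations": []},
    ]
}
labels = label_rows(record, "demo-job")
frames = frame_rows(record, "demo-job")
```

The label-level result has one row per annotation, whereas the frame-level result has one row per frame with a label count, which is why the two datasets feed different visuals.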

Next, you create a new analysis that imports the data from Athena to SPICE so that QuickSight can display it.

  1. On the All analyses page, choose New analysis.
  2. Choose the dataset that you just created and then choose Create analysis.

Create a worker metrics dashboard

QuickSight enables you to visualize your tabular data. For more information, see Creating an Amazon QuickSight Visual.

The following table summarizes several useful worker metric graphs that you can add to your dashboard.

Table Name | Graph Type | Field Wells (in order: Value, X-axis, Row, Columns)
Total time spent labeling by a worker | Vertical stacked bar chart | timespentinseconds (Sum), user name, modality
Total time spent by modality | AutoGraph | timespentinseconds (Sum), modality
Worker metrics table | Table | timespentinseconds (Sum), timespentinseconds (Max), timespentinseconds (Min), Average Time Taken Per Video (Average), user name

You can add these tables to your QuickSight dashboard by creating a visual and customizing according to your requirements.
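As a cross-check for the worker metrics table visual, the same aggregates can be computed outside QuickSight. The following sketch assumes rows shaped like the worker metrics dataset, with a user_name and a timespentinseconds value per task. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_time_by_worker(rows: list) -> dict:
    """Compute sum, min, max, and average task time for each worker."""
    times = defaultdict(list)
    for row in rows:
        times[row["user_name"]].append(row["timespentinseconds"])
    return {
        worker: {"sum": sum(values), "min": min(values), "max": max(values), "average": mean(values)}
        for worker, values in times.items()
    }


# Made-up rows: two tasks for one worker and one task for another.
rows = [
    {"user_name": "jane", "timespentinseconds": 10},
    {"user_name": "jane", "timespentinseconds": 30},
    {"user_name": "raj", "timespentinseconds": 20},
]
summary = summarize_time_by_worker(rows)
```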

The following are best practices for using the tables:

For more information about how to create visuals, calculated fields, parameters, controls, and visual tables, see Dashboard Building 101.

The following example visualization uses the Amazon Cognito worker sub IDs to identify worker metadata (such as email addresses). If you didn’t complete Part 1 and are using the example data provided for this post, these sub IDs aren’t associated with worker metadata in Amazon Cognito, so the sub ID appears in place of user names in the table. To learn more about using worker sub IDs with worker information, see Tracking the throughput of your private labeling team through Amazon SageMaker Ground Truth.

Create an annotation dashboard

The following table summarizes several useful annotation graphs that you can add to your dashboard.


Table Name | Graph Type | Field Wells (in order: Value, Y-axis, Row, Columns)
Number of vehicles | Pie chart | vehicle_type (Count), vehicle_type
Annotation level quality | Donut chart | audit
Frame level quality | Donut chart | quality_of_the_frame
Number of parked vehicles vs. vehicles in motion | Donut chart | moving
Maximum number of vehicles in a frame | Horizontal bar chart | number_of_vehicles (Count)
Quality of the frame per job | Table | quality_of_the_frame (Count), job_name, quality_of_the_frame
Quality of the labels per job | Table | audit (Count), job_name, audit

The following screenshot shows a sample dashboard for these annotation reports.

Save the reports tables as CSV

To download your worker metrics and annotation reports as a CSV file, choose the respective sheet. In the Options section, choose Menu options and then choose Export to CSV.

For more information, see Exporting Data.

Schedule a data refresh in QuickSight

To refresh your dashboard every hour, set the SPICE refresh schedule to be 1 hour for newly created datasets. For instructions, see Refreshing a Dataset on a Schedule.
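Besides the scheduled refresh, a SPICE refresh can also be triggered on demand through the QuickSight CreateIngestion API. The following is a minimal sketch with a boto3-style client passed in. The function name is hypothetical, and the account ID and dataset ID are placeholders that you would replace with your own values.

```python
import uuid


def refresh_spice_dataset(quicksight_client, account_id: str, dataset_id: str) -> str:
    """Start a SPICE ingestion for one dataset and return its initial status."""
    response = quicksight_client.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=str(uuid.uuid4()),  # each ingestion needs its own ID
    )
    return response["IngestionStatus"]


# Example usage (requires boto3 and AWS credentials with QuickSight permissions):
#   import boto3
#   refresh_spice_dataset(boto3.client("quicksight"), "<account-id>", "<dataset-id>")
```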

The preceding sections show sample QuickSight dashboards built from data ingested from the Ground Truth output data.

Customize the solution

If you want to build dashboards on your current Ground Truth output data directories, you can make customizations:

  • The reporting pipeline CloudFormation template is set up in reporting.yml. The pipeline is set up for the video frame object tracking labeling use case, in which the annotations are stored in an output sequence file for each sequence of video frames that are labeled and not in the output manifest file. If your annotations are in the output manifest file, you can remove the annotation crawler and use output manifest tables for your dashboards. To learn more about the output data format for the task types supported by Ground Truth, see Output Data.
  • The S3 path for outputs of all the Ground Truth jobs in the reporting.yml CloudFormation template points to s3://${BatchProcessingInputBucketId}/batch_manifests/. To use your data and new jobs, change the multiple mentions of this path in the reporting.yml template to the path to your Ground Truth job output data.
  • All the queries used for building the dashboards are based on attributes used in the Ground Truth label category configuration file used in this example notebook. You can customize the queries for annotation reports based on attributes used in your label configuration file.

Clean up

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the raw and processed datasets.
  2. Cancel your QuickSight subscription.
  3. On the Athena console, delete the Athena workgroup named ${Prefix}-${AWS::AccountId}-${AWS::Region}-SMGTReportsWorkGroup
  4. On the AWS CloudFormation console, delete the stack you created to remove the resources the CloudFormation template created.


Conclusion

This two-part series provides you with a reference architecture to build an advanced data labeling workflow composed of a multi-step data labeling pipeline, adjustment jobs, and data lakes for the corresponding dataset annotations and worker metrics, as well as updated dashboards.

In this post, you learned how to generate data lakes for annotations and worker metadata from Ground Truth output data generated from Part 1 using Ground Truth, Amazon S3, and AWS Glue. Then we discussed how to build visual dashboards for your annotation and worker metadata reports on those data lakes to derive business insights using Athena and QuickSight.

To learn more about automatic model building, selection, and deployment of custom classification models, refer to Automate multi-modality, parallel data labeling workflows with Amazon SageMaker Ground Truth and AWS Step Functions.

Try out the notebook and customize it for your label configuration by adding additional jobs or audit steps, or by modifying the data modality of the jobs. Further customization could include, but is not limited to:

  • Adding additional types of annotations such as semantic segmentation masks or keypoints
  • Adding different types of visuals and analyses
  • Adding different types of modalities such as point cloud or image classification

This solution is built using serverless technologies on top of AWS Glue and Amazon S3, which makes it highly customizable and applicable for a wide variety of applications. We encourage you to extend this pipeline to your data analytics and visualization use cases—there are many more transformations in AWS Glue, capabilities to build complex queries using Athena, and prebuilt visuals in QuickSight to explore.

Happy building!

About the Authors

Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.



Gaurav Rele is a Data Scientist at the Amazon ML Solutions Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.




Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MXNet, and AutoGluon.



Build a scalable machine learning pipeline for ultra-high resolution medical images using Amazon SageMaker

Neural networks have proven effective at solving complex computer vision tasks such as object detection, image similarity, and classification. With the evolution of low-cost GPUs, the computational cost of building and deploying a neural network has drastically reduced. However, most techniques are designed to handle pixel resolutions commonly found in visual media. For example, typical resolution sizes are 544 and 416 pixels for YOLOv3, 300 and 512 pixels for SSD, and 224 pixels for VGG. Training a classifier over a dataset consisting of gigapixel images (10^9+ pixels) such as satellite or digital pathology images is computationally challenging. These images cannot be directly input into a neural network because each GPU is limited by available memory. This requires specific preprocessing techniques such as tiling to be able to process the original images in smaller chunks. Furthermore, due to the large size of these images, the overall training time tends to be high, often requiring several days or weeks without the use of proper scaling techniques such as distributed training.

In this post, we explain how to build a highly scalable machine learning (ML) pipeline to fulfill three objectives:


In this post, we use a dataset consisting of whole-slide digital pathology images obtained from The Cancer Genome Atlas (TCGA) to accurately and automatically classify them as LUAD (adenocarcinoma), LUSC (squamous cell carcinoma), or normal lung tissue, where LUAD and LUSC are the two most prevalent subtypes of lung cancer. The dataset is available for public use by NIH and NCI.

The raw high-resolution images are in SVS format. SVS files are used for archiving and analyzing Aperio microscope images. You can apply the techniques and tools used in this post to any ultra high-resolution image dataset, including satellite images.

The following is a sample image of a tissue slide. This single image contains over a quarter of a billion pixels, and occupies over 750 MB of memory. This image cannot be fed directly to a neural network in its original form, so we must tile the image into many smaller images.

The following are samples of tiled images generated after preprocessing the preceding tissue slide image. These RGB 3-channel images are of size 512×512 and can be directly used as inputs to a neural network. Each of these tiled images is assigned the same label as the parent slide. Additionally, tiled images with more than 50% background are discarded.
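The memory figure and the need for tiling follow from simple arithmetic. The sketch below assumes uncompressed 8-bit RGB values (one byte per channel) and uses the rounded pixel count from the text. The tile count is only an upper bound, because the slide's exact dimensions aren't given here and background tiles are discarded.

```python
pixels = 250_000_000  # "over a quarter of a billion pixels"
channels = 3  # RGB
bytes_per_value = 1  # assumption: uncompressed 8-bit channel values

# Raw size of the slide in memory, in megabytes.
size_mb = pixels * channels * bytes_per_value / 1_000_000

# Upper bound on the number of 512x512 tiles that the pixels could fill.
tile_pixels = 512 * 512
max_tiles = pixels // tile_pixels

print(size_mb, max_tiles)
```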

Architecture overview

The following figure shows the overall end-to-end architecture, from the original raw images to inference. First, we use SageMaker Processing to tile, zoom, and sort the images into train and test splits, and then package them into the necessary number of shards for distributed SageMaker training. Second, a SageMaker training job loads the Docker container from Amazon Elastic Container Registry (Amazon ECR). The job uses Pipe mode to read the data from the prepared shards of images, trains the model, and stores the final model artifact in Amazon Simple Storage Service (Amazon S3). Finally, we deploy the trained model on a real-time inference endpoint that loads the appropriate Docker container (from Amazon ECR) and model (from Amazon S3) to process inference requests with low latency.

Data preprocessing using SageMaker Processing

The SVS slide images are preprocessed in three steps:

  • Tiling images – The images are tiled by non-overlapping 512×512 pixel windows, and tiles containing over 50% background are discarded. The tiles are stored as JPEG images.
  • Converting images to TFRecords – We use SageMaker Pipe mode to reduce our training time, which requires the data to be available in a proto-buffer format. TFRecord is a popular proto-buffer format used for training models with TensorFlow. We explain SageMaker Pipe mode and proto-buffer format in more detail in the following section.
  • Sorting TFRecords – We sort the dataset into test, train, and validation cohorts for a three-way classifier (LUAD/LUSC/Normal). The TCGA dataset can have multiple slide images corresponding to a single patient. We need to make sure all the tiles generated from slides corresponding to the same patient occupy the same split to avoid data leakage. For the test set, we create per-slide TFRecord containing all the tiles from that slide so that we can evaluate the model in the way it will be used in deployment.
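
To illustrate the patient-level split, the following sketch assigns every slide to a cohort based only on its patient identifier, so that all slides (and therefore all tiles) from one patient land in the same split. It assumes TCGA-style slide names in which the first three dash-separated fields identify the patient; the split ratios and the hash-based assignment are hypothetical choices for the example, not the method used in the processing script.

```python
import hashlib


def patient_id(slide_name):
    """Return the patient part of a TCGA-style slide name (first three dash-separated fields)."""
    return "-".join(slide_name.split("-")[:3])


def assign_split(slide_name, train=0.7, valid=0.15):
    """Map a slide to 'train', 'valid', or 'test' using only its patient ID."""
    digest = hashlib.md5(patient_id(slide_name).encode("utf-8")).hexdigest()
    # Turn the hash into a stable number in [0, 1) so the assignment is repeatable.
    bucket = int(digest, 16) % 1000 / 1000.0
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"
```

Because the assignment depends only on the patient ID, adding new slides for an existing patient never moves that patient across splits.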

The following is the preprocessing code:

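import os
import random

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def generate_tf_records(base_folder, input_files, output_file, n_image, slide=None):
    # image_example() is a helper from the processing script that builds a tf.train.Example
    record_file = output_file

    count = n_image
    with tf.io.TFRecordWriter(record_file) as writer:
        while count:
            # Sample a (filename, label) pair at random
            filename, label = random.choice(input_files)
            temp_img = plt.imread(os.path.join(base_folder, filename))
            # Skip tiles that are not full 512x512 RGB images
            if temp_img.shape != (512, 512, 3):
                continue
            count -= 1

            image_string = np.float32(temp_img).tobytes()
            slide_string = slide.encode('utf-8') if slide else None
            tf_example = image_example(image_string, label, slide_string)
            writer.write(tf_example.SerializeToString())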

We use SageMaker Processing for the preceding preprocessing steps, which allows us to run data preprocessing or postprocessing, feature engineering, data validation, and model evaluation workloads with SageMaker. Processing jobs accept data from Amazon S3 as input and store processed output data back into Amazon S3.

A benefit of using SageMaker Processing is the ease of distributing inputs across multiple compute instances. We can simply set the s3_data_distribution_type='ShardedByS3Key' parameter to divide data equally among all processing containers.

Importantly, the number of processing instances matches the number of GPUs we will use for distributed training with Horovod (i.e., 16). The reasoning becomes clearer when we introduce Horovod training.

The processing script is available on GitHub.

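from sagemaker.processing import Processor, ProcessingInput, ProcessingOutput

# Assumption: role (IAM execution role) and instance_type are defined earlier;
# Processor requires both, but they are elided in this listing.
processor = Processor(image_uri=image_name,
                      role=role,
                      instance_count=16,               # run the job on 16 instances
                      instance_type=instance_type,
                      base_job_name='processing-base') # should be unique name
processor.run(
    inputs=[ProcessingInput(
        source=f's3://<bucket_name>/tcga-svs', # s3 input prefix
        s3_data_distribution_type='ShardedByS3Key', # Split the data across instances
        destination='/opt/ml/processing/input')], # local path on the container
    outputs=[ProcessingOutput(
        source='/opt/ml/processing/output', # local output path on the container
        destination=f's3://<bucket_name>/tcga-svs-tfrecords/')], # output s3 location
    arguments=['10000']) # number of tiled images per TF record for training dataset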

Distributed model training using SageMaker Training

Taking ML models from conceptualization to production is typically complex and time-consuming. We have to manage large amounts of data to train the model, choose the best algorithm for training it, manage the compute capacity while training it, and then deploy the model into a production environment. SageMaker reduces this complexity by making it much easier to build and deploy ML models. It manages the underlying infrastructure to train your model at petabyte scale and deploy it to production.

After we preprocess the whole-slide images, we still have hundreds of gigabytes of data. Training on a single instance (GPU or CPU) would take several days or weeks to finish. To speed things up, we need to distribute the workload of training a model across multiple instances. For this post, we focus on distributed deep learning based on data parallelism using Horovod, a distributed training framework, and SageMaker Pipe mode.

Horovod: A cross-platform distributed training framework

When training a model with a large amount of data, the data needs to be distributed across multiple CPUs or GPUs on either a single instance or multiple instances. Deep learning frameworks provide their own methods to support distributed training. Horovod is a popular framework-agnostic toolkit for distributed deep learning. It utilizes an allreduce algorithm for fast distributed training (compared with a parameter server approach) and includes multiple optimization methods to make distributed training faster. For more examples of distributed training with Horovod on SageMaker, see Multi-GPU and distributed training using Horovod in Amazon SageMaker Pipe mode and Reducing training time with Apache MXNet and Horovod on Amazon SageMaker.

SageMaker Pipe mode

You can provide input to SageMaker in either File mode or Pipe mode. In File mode, the input files are copied to the training instance. With Pipe mode, the dataset is streamed directly to your training instances. This means that the training jobs start sooner, compute and download can happen in parallel, and less disk space is required. Therefore, we recommend Pipe mode for large datasets.

SageMaker Pipe mode requires data to be in a protocol buffer format. Protocol buffers are language-neutral, platform-neutral, extensible mechanisms for serializing structured data. TFRecord is a popular proto-buffer format used for training models with TensorFlow. TFRecords are optimized for use with TensorFlow in multiple ways. First, they make it easy to combine multiple datasets and integrate seamlessly with the data import and preprocessing functionality provided by the library. Second, you can store sequence data—for instance, a time series or word encodings—in a way that allows for very efficient and (from a coding perspective) convenient import of this type of data.

The following diagram illustrates data access with Pipe mode.

Data sharding with SageMaker Pipe mode

You should keep in mind a few considerations when working with SageMaker Pipe mode and Horovod:

  • The data that is streamed through each pipe is mutually exclusive of the other pipes. The number of pipes dictates the number of data shards that need to be created.
  • Horovod wraps the training script for each compute instance. This means that data for each compute instance needs to be from a different shard.
  • With the SageMaker Training parameter S3DataDistributionType set to ShardedByS3Key, we can share a pipe with more than one instance. The data is streamed in round-robin fashion across instances.

To illustrate this better, let’s say we use two instances (A and B) of type ml.p3.8xlarge. Each ml.p3.8xlarge instance has four GPUs. We create four pipes (P1, P2, P3, and P4) and set S3DataDistributionType = 'ShardedByS3Key'. As shown in the following table, each pipe equally distributes the data between two instances in a round-robin fashion. This is the core concept needed in setting up pipes with Horovod. Because Horovod wraps the training script for each GPU, we need to create as many pipes as there are GPUs per training instance.
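
The round-robin behavior in this example can be mimicked with a toy model. The sketch below is only a simulation under the assumption that each pipe hands its files to the instances in turn; the function name and file names are hypothetical, and the actual assignment is made by SageMaker, not by user code.

```python
def distribute_round_robin(pipe_files, instances):
    """Assign each pipe's files to instances in turn.

    Returns a nested dict of the form {instance: {pipe: [files]}}.
    """
    plan = {inst: {pipe: [] for pipe in pipe_files} for inst in instances}
    for pipe, files in pipe_files.items():
        for k, name in enumerate(files):
            # File k of a pipe goes to instance k modulo the number of instances.
            plan[instances[k % len(instances)]][pipe].append(name)
    return plan


# Four pipes (one per GPU on an instance), two files each, shared by instances A and B.
pipes = {f"P{p}": [f"P{p}-file{k}" for k in range(2)] for p in range(1, 5)}
plan = distribute_round_robin(pipes, ["A", "B"])
```

In this toy setup, every instance reads from all four pipes, and no file is delivered to more than one instance.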

The following code shards the data in Amazon S3 for each pipe. Each shard should have a separate prefix in Amazon S3.

import boto3

# Define distributed training hyperparameters
train_instance_count = 4
gpus_per_host = 4
num_of_shards = gpus_per_host * train_instance_count

distributions = {'mpi': {
    'enabled': True,
    'processes_per_host': gpus_per_host
    }
}

# Sharding
bucket = '<bucket_name>'  # bucket name only, without the s3:// scheme
s3 = boto3.resource('s3')
client = boto3.client('s3')
result = client.list_objects(Bucket=bucket, Prefix='tcga-svs-tfrecords/train/', Delimiter='/')

j = -1
for i in range(num_of_shards):
    copy_source = {
        'Bucket': bucket,
        'Key': result['Contents'][i]['Key']
    }
    # Start a new shard prefix every gpus_per_host files
    if i % gpus_per_host == 0:
        j += 1
    dest = 'tcga-svs-tfrecords/train_sharded/' + str(j) + '/' + result['Contents'][i]['Key'].split('/')[2]
    s3.meta.client.copy(copy_source, bucket, dest)

# Define inputs to SageMaker estimator
svs_tf_sharded = f's3://<bucket_name>/tcga-svs-tfrecords'
shuffle_config = sagemaker.session.ShuffleConfig(234)
train_s3_uri_prefix = svs_tf_sharded
remote_inputs = {}

for idx in range(gpus_per_host):
    train_s3_uri = f'{train_s3_uri_prefix}/train_sharded/{idx}/'
    train_s3_input = s3_input(train_s3_uri, distribution ='ShardedByS3Key', shuffle_config=shuffle_config)
    remote_inputs[f'train_{idx}'] = train_s3_input
    remote_inputs['valid_{}'.format(idx)] = '{}/valid'.format(svs_tf_sharded)
remote_inputs['test'] = '{}/test'.format(svs_tf_sharded)

We use a SageMaker estimator to launch training on four instances of ml.p3.8xlarge. Each instance has four GPUs. Thus, there are a total of 16 GPUs. See the following code:

local_hyperparameters = {'epochs': 5, 'batch-size' : 16, 'num-train':160000, 'num-val':8192, 'num-test':8192}

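estimator_dist = TensorFlow(base_job_name='svs-horovod-cloud-pipe',
                            # ... remaining arguments are elided in this listing ...
                            input_mode='Pipe')
estimator_dist.fit(remote_inputs, wait=True)  # assumption: launched with the inputs defined above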

The following code snippet of the training script shows how to orchestrate Horovod with TensorFlow for distributed training:

mpi = False
if 'sagemaker_mpi_enabled' in args.fw_params:
    if args.fw_params['sagemaker_mpi_enabled']:
        import horovod.keras as hvd
        mpi = True
        # Horovod: initialize Horovod.
        hvd.init()
        # Pin GPU to be used to process local rank (one GPU per process)
        gpus = tf.config.experimental.list_physical_devices('GPU')
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
else:
    hvd = None

callbacks = []
# Additional ModelCheckpoint arguments are elided in this listing
if mpi:
    # Only the rank 0 worker writes checkpoints
    if hvd.rank() == 0:
        callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.ckpt'))
else:
    callbacks.append(ModelCheckpoint(args.output_dir + '/checkpoint-{epoch}.ckpt'))

train_dataset = train_input_fn(hvd, mpi)
valid_dataset = valid_input_fn(hvd, mpi)
test_dataset = test_input_fn()
model = model_def(args.learning_rate, mpi, hvd)
logging.info("Starting training")

# Each worker processes 1/size of the data per epoch
size = 1
if mpi:
    size = hvd.size()

# The fit call is partly elided in this listing; arguments other than the two
# step counts are assumptions based on the names defined above.
model.fit(train_dataset,
          steps_per_epoch=((args.num_train // args.batch_size) // size),
          epochs=args.epochs,
          validation_data=valid_dataset,
          validation_steps=((args.num_val // args.batch_size) // size),
          callbacks=callbacks)

Because Pipe mode streams the data to each of our instances, the training script cannot calculate the data size during training (which is needed to compute steps_per_epoch). The parameter is therefore provided manually as a hyperparameter to the TensorFlow estimator. Additionally, the number of data points must be specified so that it can be divided equally amongst the GPUs. An unequal division could lead to a Horovod deadlock, because the time taken by each GPU to complete the training process is no longer identical. To ensure that the data points are equally divided, we use the same number of instances for preprocessing as the number of GPUs for training. In our example, this number is 16.
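
As a worked example of this arithmetic, the following sketch computes the per-worker step counts from the hyperparameters shown earlier (160,000 training tiles, 8,192 validation tiles, batch size 16) and 16 GPUs. The helper name `steps_per_worker` is introduced here for illustration only.

```python
def steps_per_worker(num_examples, batch_size, num_workers):
    """Number of batches each worker processes per epoch (integer division, as in the training script)."""
    return (num_examples // batch_size) // num_workers


num_gpus = 16
train_steps = steps_per_worker(160000, 16, num_gpus)  # 160000 // 16 = 10000 batches, split over 16 workers
valid_steps = steps_per_worker(8192, 16, num_gpus)    # 8192 // 16 = 512 batches, split over 16 workers

# Both dataset sizes divide evenly by batch_size * num_gpus, so no worker gets a short epoch.
evenly_divided = 160000 % (16 * num_gpus) == 0 and 8192 % (16 * num_gpus) == 0
```

With these values, every GPU runs the same number of steps per epoch, which is the condition needed to avoid workers waiting on each other.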

Inference and deployment

After we train the model using SageMaker, we deploy it for inference on new images. To set up a persistent endpoint to get one prediction at a time, use SageMaker hosting services. To get predictions for an entire dataset, use SageMaker batch transform.

In this post, we deploy the trained model as a SageMaker endpoint. The following code deploys the model to an m4 instance, reads tiled image data from TFRecords, and generates a slide-level prediction:

# Generate predictor object from trained model
# Assumption: instance_type holds an ml.m4 instance type; the exact value is elided in this listing
predictor = estimator_dist.deploy(initial_instance_count=1,
                                  instance_type=instance_type)

# Tile-level prediction
raw_image_dataset = tf.data.TFRecordDataset(f'images/{local_file}') # read a TFrecord
# Assumption: parse_tile_example stands in for the parsing function elided in this listing
parsed_image_dataset = raw_image_dataset.map(parse_tile_example) # Parse TFrecord to JPEGs

pred_scores_list = []
for i, element in enumerate(parsed_image_dataset):
    image = element[0].numpy()
    label = element[1].numpy()
    slide = element[2].numpy().decode()
    if i == 0:
        print(f"Making tile-level predictions for slide: {slide}...")

    print(f"Querying endpoint for a prediction for tile {i+1}...")
    pred_scores = predictor.predict(np.expand_dims(image, axis=0))['predictions'][0]
    pred_scores_list.append(pred_scores)  # collect scores for the slide-level average
    pred_class = np.argmax(pred_scores)
    if i > 0 and i % 10 == 0:
        plt.title(f'Tile {i} prediction: {pred_class}')
        plt.imshow(image / 255)

# Slide-level prediction (average score over all tiles)
mean_pred_scores = np.mean(np.vstack(pred_scores_list), axis=0)
mean_pred_class = np.argmax(mean_pred_scores)
print(f"Slide-level prediction for {slide}:", mean_pred_class)

The model is trained on individual tile images. During inference, the SageMaker endpoint provides classification scores for each tile. These scores are averaged out across all tiles to generate the slide-level score and prediction. The following diagram illustrates this workflow.

A majority vote scheme would also be appropriate.
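
The two aggregation schemes can give different answers, as the following sketch shows. It is a plain-Python illustration with made-up tile scores; the function names are introduced here for the example and are not part of the post's code.

```python
from collections import Counter


def mean_score_prediction(tile_scores):
    """Average the class scores over all tiles and return the best class index."""
    n_classes = len(tile_scores[0])
    means = [sum(s[c] for s in tile_scores) / len(tile_scores) for c in range(n_classes)]
    return max(range(n_classes), key=means.__getitem__)


def majority_vote_prediction(tile_scores):
    """Let each tile vote for its top class and return the most common vote."""
    votes = Counter(max(range(len(s)), key=s.__getitem__) for s in tile_scores)
    return votes.most_common(1)[0][0]


# One very confident tile for class 0, two mildly confident tiles for class 1.
scores = [[0.9, 0.1, 0.0], [0.4, 0.6, 0.0], [0.4, 0.6, 0.0]]
by_mean = mean_score_prediction(scores)
by_vote = majority_vote_prediction(scores)
```

Averaging rewards confident tiles, whereas voting treats every tile equally, so the choice depends on how much weight a few high-confidence tiles should carry.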

To perform inference on a large new batch of slide images, you can run a batch transform job for offline predictions on the dataset in Amazon S3 on multiple instances. Once the processed TFRecords are retrieved from Amazon S3, you can replicate the preceding steps to generate a slide-level classification for each of the new images.


Conclusion

In this post, we introduced a scalable machine learning pipeline for ultra high-resolution images that uses SageMaker Processing, SageMaker Pipe mode, and Horovod. The pipeline simplifies the convoluted process of large-scale training of a classifier over a dataset consisting of images that approach the gigapixel scale. With SageMaker and Horovod, we eased the process by distributing inputs across multiple compute instances, which reduces training time. We also provided a simple but effective strategy to aggregate tile-level predictions to produce slide-level inference.

For more information about SageMaker, see Build, train, and deploy a machine learning model with Amazon SageMaker. For the complete example to run on SageMaker, in which Pipe mode and Horovod are applied together, see the GitHub repo.


References

  1. Nicolas Coudray, Paolo Santiago Ocampo, Theodore Sakellaropoulos, Navneet Narula, Matija Snuderl, David Fenyö, Andre L. Moreira, Narges Razavian, Aristotelis Tsirigos. “Classification and mutation prediction from non–small cell lung cancer histopathology images using deep learning”. Nature Medicine, 2018; DOI: 10.1038/s41591-018-0177-5

About the Authors

Karan Sindwani is a Data Scientist at Amazon Machine Learning Solutions where he builds and deploys deep learning models. He specializes in the area of computer vision. In his spare time, he enjoys hiking.



Vinay Hanumaiah is a Deep Learning Architect at Amazon ML Solutions Lab, where he helps customers build AI and ML solutions to accelerate their business challenges. Prior to this, he contributed to the launch of AWS DeepLens and Amazon Personalize. In his spare time, he enjoys time with his family and is an avid rock climber.




Ryan Brand is a Data Scientist in the Amazon Machine Learning Solutions Lab. He has specific experience in applying machine learning to problems in healthcare and the life sciences, and in his free time he enjoys reading history and science fiction.




Tatsuya Arai, Ph.D. is a biomedical engineer turned deep learning data scientist on the Amazon Machine Learning Solutions Lab team. He believes in the true democratization of AI and that the power of AI shouldn’t be exclusive to computer scientists or mathematicians.


Build a cognitive search and a health knowledge graph using AWS AI services

Medical data is highly contextual and heavily multi-modal, yet each data silo is typically treated separately. To bridge these silos, a knowledge graph-based approach integrates data across domains and represents complex scientific knowledge more naturally. For example, three major components of electronic health records (EHR) are diagnosis codes, primary notes, and specific medications. Because these are stored in different data silos, secondary use of these documents to accurately identify patients with a specific observable trait is a crucial challenge. By connecting those different sources, subject matter experts have a richer pool of data to understand how different concepts such as diseases and symptoms interact with one another, which helps them conduct their research. This ultimately helps healthcare and life sciences researchers and practitioners create better insights from the data for a variety of use cases, such as drug discovery and personalized treatments.

In this post, we use Amazon HealthLake to export EHR data in the Fast Healthcare Interoperability Resources (FHIR) data format. We then build a knowledge graph based on key entities extracted and harmonized from the medical data. Amazon HealthLake also extracts and transforms unstructured medical data, such as medical notes, so it can be searched and analyzed. Together with Amazon Kendra and Amazon Neptune, we allow domain experts to ask a natural language question, surface the results and relevant documents, and show connected key entities such as treatments, inferred ICD-10 codes, medications, and more across records and documents. This allows for easy analysis of co-occurrence of key entities, co-morbidities analysis, and patient cohort analysis in an integrated solution. Combining effective search capabilities and data mining through graph networks reduces time and cost for users to find relevant information around patients and improve knowledge serviceability surrounding EHRs. The code base for this post is available on the GitHub repo.

Solution overview

In this post, we use the output from Amazon HealthLake for two purposes.

First, we index EHRs into Amazon Kendra for accurate, semantic ranking of patient notes, which helps physicians find relevant patient notes more efficiently and compare them with those of other patients who share similar characteristics. This shifts from a lexical search to a semantic search that introduces context around the query, which results in better search output (see the following screenshot).

Second, we use Neptune to build knowledge graph applications that let users view metadata associated with patient notes in a simpler, normalized view, which allows us to highlight the important characteristics stemming from a document (see the following screenshot).

The following diagram illustrates our architecture.

The steps to implement the solution are as follows:

  1. Create and export Amazon HealthLake data.
  2. Extract patient visit notes and metadata.
  3. Load patient notes data into Amazon Kendra.
  4. Load the data into Neptune.
  5. Set up the backend and front end to run the web app.

Create and export Amazon HealthLake data

As a first step, create a data store using Amazon HealthLake either via the Amazon HealthLake console or the AWS Command Line Interface (AWS CLI). For this post, we focus on the AWS CLI approach.

  1. We use AWS Cloud9 to create a data store with the following code, replacing <<your_data_store_name>> with a unique name:
aws healthlake create-fhir-datastore --region us-east-1 --datastore-type-version R4 --preload-data-config PreloadDataType="SYNTHEA" --datastore-name "<<your_data_store_name>>"

The preceding code uses a preloaded dataset from Synthea, which is supported in FHIR version R4, to explore how to use Amazon HealthLake output. Running the code produces a response similar to the following code. This step takes approximately 30 minutes to complete at the time of writing:

{
	"DatastoreEndpoint": "<<your_data_store_id>>/r4/",
	"DatastoreArn": "arn:aws:healthlake:us-east-1:<<your_AWS_account_number>>:datastore/fhir/<<your_data_store_id>>",
	"DatastoreStatus": "CREATING",
	"DatastoreId": "<<your_data_store_id>>"
}

You can check the status of completion either on the Amazon HealthLake console or in the AWS Cloud9 environment.

  2. To check the status in AWS Cloud9, run the following code and wait until DatastoreStatus changes from CREATING to ACTIVE:
aws healthlake describe-fhir-datastore --datastore-id "<<your_data_store_id>>" --region us-east-1
  3. When the status changes to ACTIVE, get the role ARN from the HEALTHLAKE-KNOWLEDGE-ANALYZER-IAMROLE stack in AWS CloudFormation, associated with the physical ID AmazonHealthLake-Export-us-east-1-HealthDataAccessRole, and copy the ARN in the linked page.
  4. In AWS Cloud9, use the following code to export the data from Amazon HealthLake to the Amazon Simple Storage Service (Amazon S3) bucket generated from AWS Cloud Development Kit (AWS CDK) and note the job-id output:
aws healthlake start-fhir-export-job --output-data-config S3Uri="s3://hl-synthea-export-<<your_AWS_account_number>>/export-$(date +"%d-%m-%y")" --datastore-id <<your_data_store_id>> --data-access-role-arn arn:aws:iam::<<your_AWS_account_number>>:role/AmazonHealthLake-Export-us-east-1-HealthKnoMaDataAccessRole
  5. Verify that the export job is complete using the following code with the job-id obtained from the last code you ran (when the export is complete, JobStatus in the output states COMPLETED):
aws healthlake describe-fhir-export-job --datastore-id <<your_data_store_id>> --job-id <<your_job_id>>
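
The status checks in the preceding steps can also be automated rather than rerun by hand. The following is a minimal polling sketch; `wait_for_status` and its parameters are names introduced here, and the status source is passed in as a function so the helper does not depend on any particular SDK.

```python
import time


def wait_for_status(fetch_status, target, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() until it returns target; raise TimeoutError if it never does."""
    for _ in range(max_polls):
        status = fetch_status()
        if status == target:
            return status
        # Wait before asking again so the service is not queried in a tight loop.
        sleep(poll_seconds)
    raise TimeoutError(f"status did not reach {target!r} after {max_polls} polls")
```

For example, fetch_status could wrap the describe-fhir-datastore call shown earlier and return the DatastoreStatus field, with target set to ACTIVE; for the export job, it could return JobStatus with target set to COMPLETED. How that call is made (AWS CLI subprocess or an SDK client) is left open here and should be checked against the relevant documentation.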

Extract patient visit notes and metadata

The next step involves decoding patient visits to obtain the raw text. We import the file DocumentReference-0.ndjson (shown in the following screenshot of Amazon S3) from the Amazon HealthLake export step we previously completed into the Amazon SageMaker notebook instance deployed by the AWS CDK. First, save the notebook provided in the GitHub repo to the SageMaker instance. Then, run the notebook to automatically locate and import the DocumentReference-0.ndjson files from Amazon S3.

For this step, use the provisioned SageMaker notebook instance to run the notebook. The first part of the notebook creates a text file that contains the notes from each patient’s visit and saves it to an Amazon S3 location. Because multiple visits could exist for a single patient, a unique identifier combines the patient’s unique ID and the visit ID. These patient notes are used to perform semantic search with Amazon Kendra.
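
As a rough illustration of this decoding step, the following sketch reads one line of a DocumentReference NDJSON export, decodes the base64-encoded note, and builds an identifier from the patient and encounter references. It assumes the standard FHIR R4 DocumentReference layout (subject, context.encounter, and content.attachment.data); the function name and the underscore-joined identifier format are assumptions, and the notebook in the repo may do this differently.

```python
import base64
import json


def extract_note(ndjson_line):
    """Return (note_id, text) for one FHIR DocumentReference resource."""
    resource = json.loads(ndjson_line)
    # References look like "Patient/<id>" and "Encounter/<id>"; keep only the ID part.
    patient = resource["subject"]["reference"].split("/")[-1]
    encounter = resource["context"]["encounter"][0]["reference"].split("/")[-1]
    # The note body is stored as base64-encoded data on the first attachment.
    encoded = resource["content"][0]["attachment"]["data"]
    text = base64.b64decode(encoded).decode("utf-8")
    return f"{patient}_{encounter}", text
```

Applying this function to each line of the export yields one uniquely named note per patient visit, which is the shape of data that the indexing step expects.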

The next step in the notebook involves creating triples based on the automatically extracted metadata. By creating and saving the metadata in an Amazon S3 location, an AWS Lambda function gets triggered to generate the triples surrounding the patient visit notes.
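
For readers unfamiliar with the triple format loaded later into Neptune, the following sketch formats a single statement as an N-Quads line. The function name and the example.org URIs are hypothetical; the actual vocabulary produced by the Lambda function is not shown in this post.

```python
def to_nquad(subject, predicate, obj, graph, literal=False):
    """Format one statement as an N-Quads line: <s> <p> <o or "literal"> <g> ."""
    if literal:
        # Escape characters that would otherwise break the quoted literal.
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} <{graph}> ."


line = to_nquad(
    "http://example.org/note/1",          # hypothetical note URI
    "http://example.org/hasDiagnosis",    # hypothetical predicate
    "Hypertension",
    "http://example.org/graph",           # hypothetical named graph
    literal=True,
)
```

Each line is a self-contained statement, which is why a folder of such files can be bulk loaded without any additional schema definition.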

Load patient notes data into Amazon Kendra

The text files uploaded to the source path of the S3 bucket need to be crawled and indexed. For this post, an Amazon Kendra developer edition index is created during the AWS CDK deployment and connected to the raw patient notes.

  1. On the AWS CloudFormation console under the HEALTHLAKE-KNOWLEDGE-ANALYZER-CORE stack, search for kendra on the Resources tab and take note of the index ID and data source ID (copy the first part of the physical ID before the pipe ( | )).

  2. Back in AWS Cloud9, run the following command to synchronize the patient notes in Amazon S3 to Amazon Kendra:
aws kendra start-data-source-sync-job --id <<data_source_id_2nd_circle>> --index-id <<index_id_1st_circle>>
  3. You can verify when the sync status is complete by running the following command:
aws kendra describe-data-source --id <<data_source_id_2nd_circle>> --index-id <<index_id_1st_circle>>

Because the ingested data is very small, it should immediately show that Status is ACTIVE upon running the preceding command.

Load the data into Neptune

In this next step, we access the Amazon Elastic Compute Cloud (Amazon EC2) instance that was spun up and load the triples from Amazon S3 into Neptune using the following code:

curl -X POST \
    -H 'Content-Type: application/json' \
    https://healthlake-knowledge-analyzer-vpc-and-neptune-neptunedbcluster.cluster-<<your_unique_id>> -d '
    {
    "source": "s3://<<your_Amazon_S3_bucket>>/stdized-data/neptune_triples/nquads/",
    "format": "nquads",
    "iamRoleArn": "arn:aws:iam::<<your_AWS_account_number>>:role/KNOWLEDGE-ANALYZER-IAMROLE-ServiceRole",
    "region": "us-east-1",
    "failOnError": "TRUE"
    }'

Set up the backend and front end to run the web app

The preceding step should take a few seconds to complete. In the meantime, configure the EC2 instance to access the web app. Make sure to have both Python and Node installed in the instance.

  1. Run the following code in the terminal of the instance:
sudo iptables -t nat -I PREROUTING -p tcp --dport 80 -j REDIRECT --to-ports 3000

This routes the public address to the deployed app.

  2. Copy the two folders titled ka-webapp and ka-server-webapp and upload them to a folder named dev in the EC2 instance.
  3. For the front end, create a screen by running the following command:
screen -S front
  4. In this screen, change the folder to ka-webapp and run npm install.
  5. After installation, open the file .env.development, enter the Amazon EC2 public IPv4 address, and save the file.
  6. Run npm start and then detach the screen.
  7. For the backend, create another screen by entering:
screen -S back
  8. Change the folder to ka-server-webapp and run pip install -r requirements.txt.
  9. When the libraries are installed, enter the following code:
  10. Detach from the current screen, and using any browser, go to the Amazon EC2 public IPv4 address to access the web app.

Try searching for a patient diagnosis and choose a document link to visualize the knowledge graph of that document.

Next steps

In this post, we integrate data output from Amazon HealthLake into both a search and graph engine to semantically search relevant information and highlight important entities linked to documents. You can further expand this knowledge graph and link it to other ontologies such as MeSH and MedDRA.

Furthermore, this provides a foundation to further integrate other clinical datasets and expand this knowledge graph to build a data fabric. You can make queries on historical population data, chaining structured and language-based searches for cohort selection to correlate disease with patient outcome.

Clean up

To clean up your resources, complete the following steps:

  1. To delete the stacks created, enter the following commands in the order given to properly remove all resources:
  2. While the preceding commands are in progress, delete the Amazon HealthLake data store that was created:
$ aws healthlake delete-fhir-datastore --datastore-id <<your_data_store_id>>
  3. To verify it’s been deleted, check the status by running the following command:
$ aws healthlake describe-fhir-datastore --datastore-id "<<your_data_store_id>>" --region us-east-1
  4. Check the AWS CloudFormation console to ensure that all associated stacks starting with HEALTHLAKE-KNOWLEDGE-ANALYZER have been deleted successfully.


Conclusion

Amazon HealthLake provides a managed service based on the FHIR standard to allow you to build health and clinical solutions. Connecting the output of Amazon HealthLake to Amazon Kendra and Neptune gives you the ability to build a cognitive search and a health knowledge graph to power your intelligent application.

Building on top of this approach can enable researchers and front-line physicians to easily search across clinical notes and research articles by simply typing their question into a web browser. Each piece of clinical evidence is tagged, indexed, and structured using machine learning to provide evidence-based topics on things like transmission, risk factors, therapeutics, and incubation. This functionality is valuable for clinicians and scientists because it allows them to quickly ask a question to validate and advance their clinical decision support or research.

Try this out on your own! Deploy this solution using Amazon HealthLake in your AWS account by deploying the example on GitHub.

About the Authors

Prithiviraj Jothikumar, PhD, is a Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and spending time to meditate.



Phi Nguyen is a solutions architect at AWS helping customers with their cloud journey with a special focus on data lake, analytics, semantics technologies and machine learning. In his spare time, you can find him biking to work, coaching his son’s soccer team or enjoying nature walk with his fami



Parminder Bhatia is a science leader in the AWS Health AI, currently building deep learning algorithms for clinical domain at scale. His expertise is in machine learning and large scale text analysis techniques in low resource settings, especially in biomedical, life sciences and healthcare technologies. He enjoys playing soccer, water sports and traveling with his family.



Garin Kessler is a Senior Data Science Manager at Amazon Web Services, where he leads teams of data scientists and application architects to deliver bespoke machine learning applications for customers. Outside of AWS, he lectures on machine learning and neural language models at Georgetown. When not working, he enjoys listening to (and making) music of questionable quality with friends and family.


Dr. Taha Kass-Hout is Director of Machine Learning and Chief Medical Officer at Amazon Web Services, and leads our Health AI strategy and efforts, including Amazon Comprehend Medical and Amazon HealthLake. Taha is also working with teams at Amazon responsible for developing the science, technology, and scale for COVID-19 lab testing. A physician and bioinformatician, Taha served two terms under President Obama, including the first Chief Health Informatics officer at the FDA. During this time as a public servant, he pioneered the use of emerging technologies and cloud (CDC’s electronic disease surveillance), and established widely accessible global data sharing platforms, the openFDA, that enabled researchers and the public to search and analyze adverse event data, and precisionFDA (part of the Presidential Precision Medicine initiative).


Improve the streaming transcription experience with Amazon Transcribe partial results stabilization

Whether you’re watching a live broadcast of your favorite soccer team, having a video chat with a vendor, or calling your bank about a loan payment, streaming speech content is everywhere. You can apply a streaming transcription service to generate subtitles for content understanding and accessibility, to create metadata to enable search, or to extract insights for call analytics. These transcription services process streaming audio content and generate partial transcription results until it provides a final transcription for a segment of continuous speech. However, some words or phrases in these partial results might change, as the service further understands the context of the audio.

We’re happy to announce that Amazon Transcribe now allows you to enable and configure partial results stabilization for streaming audio transcriptions. Amazon Transcribe is an automatic speech recognition (ASR) service that enables developers to add real-time speech-to-text capabilities into their applications for on-demand and streaming content. Instead of waiting for an entire sentence to be transcribed, you can now control the stabilization level of partial results. Amazon Transcribe offers three settings: High, Medium, and Low. Setting stabilization to High allows a greater portion of the partial results to be fixed, with only the last few words changing during the transcription process. This feature gives you more flexibility in your streaming transcription workflows based on the user experience you want to create.

In this post, we walk through the benefits of this feature and how to enable it via the Amazon Transcribe console or the API.

How partial results stabilization works

Let’s dive deeper into this with an example.

During your daily conversations, you may think you hear a certain word or phrase, but later realize that it was incorrect based on additional context. Let’s say you were talking to someone about food, and you heard them say “Tonight, I will eat a pear…” However, when the speaker finishes, you realize they actually said “Tonight I will eat a pair of pancakes.” Just as humans may change our understanding based on the information at hand, Amazon Transcribe uses machine learning (ML) to self-correct the transcription of streaming audio based on the context it receives. To enable this, Amazon Transcribe uses partial results.

During the streaming transcription process, Amazon Transcribe outputs chunks of the results with an IsPartial flag. Results with this flag marked as true are the ones that Amazon Transcribe may change in the future depending on the additional context received. After Amazon Transcribe determines that it has sufficient context to exceed a certain confidence threshold, the results are stabilized and the IsPartial flag for that specific partial result is marked false. The window size of these partial results can range from a few words to multiple sentences depending on the stream context.

The following image displays how the partial results are generated (and edited) in Amazon Transcribe for streaming transcription.

Results stabilization enables more control over the latency and accuracy of transcription results. Depending on the use case, you may prioritize one over the other. For example, when providing live subtitles, high stabilization of results may be preferred because speed is more important than accuracy. On the other hand, for use cases like content moderation, lower stabilization may be preferred because accuracy is more important than latency.

A high stability level enables quicker stabilization of transcription results by limiting the window of context for stabilizing results, but can lead to lower overall accuracy. On the other hand, a low stability level leads to more accurate transcription results, but the partial transcription results are more likely to change.

With the streaming transcription API, you can now control the stability of the partial results in your transcription stream.

Now let’s look at how to use the feature.

Access partial results stabilization via the Amazon Transcribe console

To start using partial results stabilization on the Amazon Transcribe console, complete the following steps:

  1. On the Amazon Transcribe console, make sure you’re in a Region that supports Amazon Transcribe Streaming.

For this post, we use us-east-1.

  2. In the navigation pane, choose Real-time transcription.
  3. Under Additional settings, enable Partial results stabilization.
  4. Select your stability level.

You can choose between three levels:

  • High – Provides the most stable partial transcription results with lower accuracy compared to Medium and Low settings. Results are less likely to change as additional context is gathered.
  • Medium – Provides partial transcription results that have a balance between stability and accuracy.
  • Low – Provides relatively less stable partial transcription results with higher accuracy compared to High and Medium settings. Results get updated as additional context is gathered and utilized.

  5. Choose Start streaming to play a stream and check the results.

Access partial results stabilization via the API

In this section, we demonstrate streaming with HTTP/2. You can enable your preferred level of partial results stabilization in an API request.

You enable this feature via the enable-partial-results-stabilization flag and the partial-results-stability level input parameters:

POST /stream-transcription HTTP/2
x-amzn-transcribe-language-code: LanguageCode
x-amzn-transcribe-sample-rate: MediaSampleRateHertz
x-amzn-transcribe-media-encoding: MediaEncoding
x-amzn-transcribe-session-id: SessionId
x-amzn-transcribe-enable-partial-results-stabilization: true
x-amzn-transcribe-partial-results-stability: low | medium | high
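As a minimal sketch, the request headers above could be assembled in Python as follows. The function name `build_stream_headers` and its parameters are hypothetical and not part of any AWS SDK; the sketch only builds the header dictionary and validates the stability level. A real request still needs Signature Version 4 signing and event stream encoding of the audio, which are not shown.

```python
# Hypothetical helper: assembles the x-amzn-transcribe-* headers shown above.
ALLOWED_STABILITY = {"low", "medium", "high"}


def build_stream_headers(language_code, sample_rate_hz, media_encoding,
                         session_id=None, stability=None):
    """Return the streaming request headers as a dict of strings."""
    headers = {
        "x-amzn-transcribe-language-code": language_code,
        "x-amzn-transcribe-sample-rate": str(sample_rate_hz),
        "x-amzn-transcribe-media-encoding": media_encoding,
    }
    if session_id is not None:
        headers["x-amzn-transcribe-session-id"] = session_id
    if stability is not None:
        level = stability.lower()
        if level not in ALLOWED_STABILITY:
            raise ValueError(f"stability must be one of {sorted(ALLOWED_STABILITY)}")
        # Both headers are sent together: the flag enables the feature,
        # and the level selects low, medium, or high stability.
        headers["x-amzn-transcribe-enable-partial-results-stabilization"] = "true"
        headers["x-amzn-transcribe-partial-results-stability"] = level
    return headers


headers = build_stream_headers("en-US", 16000, "pcm", stability="High")
for name, value in headers.items():
    print(f"{name}: {value}")
```

Leaving `stability` unset omits both stabilization headers, so the same helper covers requests with and without the feature.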

Enabling partial results stabilization introduces the additional parameter flag Stable in the API response at the item level in the transcription results. If a partial results item in the streaming transcription result has the Stable flag marked as true, the corresponding item transcription in the partial results doesn’t change irrespective of any subsequent context identified by Amazon Transcribe. If the Stable flag is marked as false, there is still a chance that the corresponding item may change in the future, until the IsPartial flag is marked as false.
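To illustrate how a client might use the Stable flag, here is a small sketch that splits one result into text that will no longer change and text that may still be revised. The function name `split_by_stability` and the sample dictionary are assumptions for illustration; only the field names (Alternatives, Items, Content, Stable, IsPartial) come from the API response described in this post. Joining items with spaces is a simplification that ignores punctuation items.

```python
def split_by_stability(result):
    """Return (stable_text, unstable_text) for one streaming result."""
    items = result["Alternatives"][0]["Items"]
    stable, unstable = [], []
    for item in items:
        # A final result (IsPartial is false) no longer changes at all;
        # in a partial result, only items flagged Stable are fixed.
        is_fixed = (not result["IsPartial"]) or item.get("Stable", False)
        (stable if is_fixed else unstable).append(item["Content"])
    return " ".join(stable), " ".join(unstable)


# Hypothetical partial result, trimmed to the fields used here.
partial = {
    "IsPartial": True,
    "Alternatives": [{"Items": [
        {"Content": "Amazon", "Stable": True},
        {"Content": "is", "Stable": True},
        {"Content": "the", "Stable": False},
    ]}],
}
print(split_by_stability(partial))  # ('Amazon is', 'the')
```

A subtitle renderer could, for example, draw the first string in its final style and the second in a lighter style until it stabilizes.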

The following code shows our API response:

{
    "Alternatives": [
        {
            "Items": [
                {
                    "Confidence": 0,
                    "Content": "Amazon",
                    "EndTime": 1.22,
                    "Stable": true,
                    "StartTime": 0.78,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "is",
                    "EndTime": 1.63,
                    "Stable": true,
                    "StartTime": 1.46,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "the",
                    "EndTime": 1.76,
                    "Stable": true,
                    "StartTime": 1.64,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 0,
                    "Content": "largest",
                    "EndTime": 2.31,
                    "Stable": true,
                    "StartTime": 1.77,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                },
                {
                    "Confidence": 1,
                    "Content": "rainforest",
                    "EndTime": 3.34,
                    "Stable": true,
                    "StartTime": 2.4,
                    "Type": "pronunciation",
                    "VocabularyFilterMatch": false
                }
            ],
            "Transcript": "Amazon is the largest rainforest "
        }
    ],
    "EndTime": 4.33,
    "IsPartial": false,
    "ResultId": "f4b5d4dd-b685-4736-b883-795dc3f7f636",
    "StartTime": 0.78
}


This post introduces the recently launched partial results stabilization feature in Amazon Transcribe. For more information, see the Amazon Transcribe Partial results stabilization documentation.

To learn more about the Amazon Transcribe Streaming Transcription API, check out Using Amazon Transcribe streaming With HTTP/2 and Using Amazon Transcribe streaming with WebSockets.

About the Author

Alex Chirayath is an SDE in the Amazon Machine Learning Solutions Lab. He helps customers adopt AWS AI services by building solutions to address common business problems.


The Washington Post Launches Audio Articles Voiced by Amazon Polly 

AWS is excited to announce that The Washington Post is integrating Amazon Polly to provide their readers with audio access to stories across The Post’s entire spectrum of web and mobile platforms, starting with technology stories. Amazon Polly is a service that turns text into lifelike speech, allowing you to create applications that talk, and build entirely new categories of speech-enabled products. Post subscribers live busy lives with limited time to read the news. The goal is to unlock the Post’s world-class written journalism in audio form and give readers a convenient way to stay up to date on the news, like listening while doing other things.

In The Post’s announcement, Kat Downs Mulder, managing editor, says, “Whether you’re listening to a story while multitasking or absorbing a compelling narrative while on a walk, audio unlocks new opportunities to engage with our journalism in more convenient ways. We saw that trend throughout last year as readers who listened to audio articles on our apps engaged more than three times longer with our content. We’re doubling-down on our commitment to audio and will be experimenting rapidly and boldly in this space. The full integration of Amazon Polly within our publishing ecosystem is a big step that offers readers this powerful convenience feature at scale, while ensuring a high-quality and consistent audio experience across all our platforms for our subscribers and readers.”

Integrating Amazon Polly into The Post’s publishing workflow has been straightforward. When an article is ready for publication, the written content management system (CMS) publishes the text article and simultaneously sends the text to the audio CMS, where Amazon Polly processes the article text to produce an audio recording of the article. The audio is delivered as an MP3 file and published in conjunction with the written portion of the article.

Figure 1: High-level architecture of Washington Post article creation

Last year, The Post began testing article narration using the text-to-speech accessibility capabilities in the iOS and Android operating systems. While there were promising signs around engagement, some listeners noted that the voices sounded robotic. The Post started testing other options and ended up choosing Amazon Polly because of its high-quality automated voices. “We’ve tested users’ perceptions to both human and automated voices and found high levels of satisfaction with Amazon Polly’s offering. Integrating Amazon Polly into our publishing workflow also gives us the ability to offer a consistent listening experience across platforms and experiment with new functions that we believe our subscribers will enjoy,” says Ryan Luu, senior product manager at The Post.

Over the coming months, The Post will be adding voice support for new sections, new languages and better usability. “We plan to introduce new features like more playback controls, text highlighting as you listen, and audio versions of Spanish articles,” said Luu. “We also hope to give readers the ability to create audio playlists to make it easy for subscribers to queue up stories they’re interested in and enjoy that content on the go.”

Amazon Polly is a text-to-speech service that powers audio access to news articles for media publishers like Gannett (the publisher of USA Today), The Globe and Mail (the biggest newspaper in Canada), and leading publishing companies such as BlueToad and Trinity Audio. In addition, Amazon Polly provides natural sounding voices in a variety of languages and personas to give content a voice in other sectors such as education, healthcare, and gaming.

For more information, see What Is Amazon Polly? and log in to the Amazon Polly console to try it out for free. To experience The Post’s new audio articles, listen to the story “Did you get enough steps in today? Maybe one day you’ll ask your ‘smart’ shirt.”

About the Author

Esther Lee is a Product Manager for AWS Language AI Services. She is passionate about the intersection of technology and education. Out of the office, Esther enjoys long walks along the beach, dinners with friends and friendly rounds of Mahjong.


Build an anomaly detection model from scratch with Amazon Lookout for Vision

A common problem in manufacturing is verifying that products meet quality standards. You can manually inspect a subset of the products, but manual inspection usually doesn’t scale to meet demand as production grows. In this post, I go through the steps of creating an end-to-end machine vision solution that identifies visual anomalies in products using Amazon Lookout for Vision. I show you how to train a model that performs anomaly detection, use the model in real time, update the model when new data is available, and monitor the model.

Solution overview

Imagine a factory producing Lego bricks. The bricks are transported on a conveyor belt in front of a camera that determines if they meet the factory’s quality standards. When a brick on the belt breaks a light beam, the device takes a photo and sends it to Amazon Lookout for Vision for anomaly detection. If a defective brick is identified, it’s pushed off the belt by a pusher.

The following diagram illustrates the architecture of our anomaly detection solution, which uses Amazon Lookout for Vision, Amazon Simple Storage Service (Amazon S3), and a Raspberry Pi.

Amazon Lookout for Vision is a machine learning (ML) service that uses machine vision to help you identify visual defects in products without needing any ML experience. It uses deep learning to remove the need for carefully calibrated environments in terms of lighting and camera angle, which many existing machine vision techniques require.

To get started with Amazon Lookout for Vision, you need to provide data for the service to use when training the underlying deep learning models. The dataset used in this post consists of 289 normal and 116 anomalous images of a Lego brick, which are hosted in an S3 bucket that I have made public so you can download the dataset.

To make the scenario more realistic, I’ve varied the lighting and camera position between images. Additionally, the dataset includes 20 test images and 9 new images, both normal and anomalous, which I use later to test and update the model. The anomalous images were created by drawing on and scratching the brick, changing the brick color, adding other bricks, and breaking off small pieces to simulate production defects. The following image shows the physical setup used when collecting training images.


To follow along with this post, you’ll need the following:

  • An AWS account to train and use Amazon Lookout for Vision
  • A camera (for this post, I use a Pi camera)
  • A device that can run code (I use a Raspberry Pi 4)

Train the model

To use the dataset when training a model, you first upload the training data to Amazon S3 and create an Amazon Lookout for Vision project. A project is an abstraction around the training dataset and multiple model versions. You can think of a project as a collection of the resources that relate to a specific machine vision use case. For instance, in this post, I use one dataset but create multiple model versions as I gradually optimize the model for the use case with new data, all within the boundaries of one project.

You can use the SDK, AWS Command Line Interface (AWS CLI), and AWS Management Console to perform all the steps required to create and train a model. For this post, I use a combination of the AWS CLI and the console to train and start the model, and use the SDK to send images for anomaly detection from the Raspberry Pi.

To train the model, we complete the following high-level steps:

  1. Upload the training data to Amazon S3.
  2. Create an Amazon Lookout for Vision project.
  3. Create an Amazon Lookout for Vision dataset.
  4. Train the model.

Upload the training data to Amazon S3

To get started, complete the following steps:

  1. Download the dataset to your computer.
  2. Create an S3 bucket and upload the training data.

I named my bucket l4vdemo, but bucket names need to be globally unique, so make sure to change it if you copy the following code. Make sure to keep the folder structure in the dataset, because Amazon Lookout for Vision uses it to label normal and anomalous images automatically based on folder name. You could use the integrated labeling tool on the Amazon Lookout for Vision console or Amazon SageMaker Ground Truth to label the data, but the automatic labeler allows you to keep the folder structure and save some time.

aws s3 sync s3://aws-ml-blog/artifacts/Build-an-anomaly-detection-model-from-scratch-with-L4V/ data/

aws s3 mb s3://l4vdemo

aws s3 sync data s3://l4vdemo

Create an Amazon Lookout for Vision project

You’re now ready to create your project.

  1. On the Amazon Lookout for Vision console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter a name.
  4. Choose Create project.

Create the dataset

For this post, I create a single dataset and import the training data from the S3 bucket I uploaded the data to in Step 1.

  1. Choose Create dataset.
  2. Select Import images from S3 bucket.
  3. For S3 URI, enter the URI for your bucket (for this post, s3://l4vdemo/, but make sure to use the unique bucket name you created).
  4. For Automatic labeling, select Automatically attach labels to images based on the folder name.

This allows you to use the existing folder structure to infer whether your images are normal or anomalous.

  5. Choose Create dataset.

Train the model

After we create the dataset, the number of labeled and unlabeled images should be visible in the Filters pane, as well as the number of normal and anomalous images.

  1. To start training a deep learning model, choose Train model.

Model training can take a few hours depending on the number of images in the training dataset.

  2. When training is complete, in the navigation pane, choose Models under your project.

You should see the newly created model listed with a status of Training complete.

  3. Choose the model to see performance metrics such as precision, recall, and F1 score, as well as training duration and other model metadata.

Use the model

Now that a model is trained, let’s test it on data it hasn’t seen before. To use the model, you must first start hosting it to provision all backend resources required to perform real-time inference.

aws lookoutvision start-model \
--project-name lego-demo \
--model-version 1 \
--min-inference-units 1

When starting the model hosting, you pass both project name and model version as arguments to identify the model. You also need to specify the number of inference units to use; each unit enables approximately five requests per second.
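Starting a model is asynchronous, so a script usually has to wait until hosting is ready before sending images. The following is a minimal sketch of that wait using the boto3 `start_model` and `describe_model` calls. The helper name `start_model_and_wait`, the polling interval, and the retry limit are assumptions for illustration, and the client is passed in so that the helper can be exercised without AWS credentials.

```python
import time


def start_model_and_wait(client, project_name, model_version,
                         min_inference_units=1, poll_seconds=30, max_polls=60):
    """Start hosting a model version and poll until its status is HOSTED.

    `client` is expected to behave like boto3.client('lookoutvision').
    """
    client.start_model(ProjectName=project_name,
                       ModelVersion=model_version,
                       MinInferenceUnits=min_inference_units)
    for _ in range(max_polls):
        description = client.describe_model(ProjectName=project_name,
                                            ModelVersion=model_version)
        status = description["ModelDescription"]["Status"]
        if status == "HOSTED":
            return status
        if status == "HOSTING_FAILED":
            raise RuntimeError(f"Hosting failed for model version {model_version}")
        time.sleep(poll_seconds)  # Still starting: wait before checking again.
    raise TimeoutError("Model did not reach HOSTED status in time")


# Example usage (requires AWS credentials and the boto3 package):
#   import boto3
#   start_model_and_wait(boto3.client("lookoutvision"), "lego-demo", "1")
```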

To use the hosted model, use the detect-anomalies command and pass in the project and model version along with the image to perform inference on:

aws lookoutvision detect-anomalies \
--project-name lego-demo \
--model-version 1 \
--content-type image/jpeg \
--body test/test-1611853160.2488434.jpg

The dataset we use in this post contains 20 test images, and I encourage you to test the model with different images.

When performing inference on an anomalous brick, the response could look like the following:

{
    "DetectAnomalyResult": {
        "Source": {
            "Type": "direct"
        },
        "IsAnomalous": true,
        "Confidence": 0.9754859209060669
    }
}

The flag IsAnomalous is true and Amazon Lookout for Vision also provides a confidence score that tells you how sure the model is of its classification. The service always provides a binary classification, but you can use the confidence score to make more well-informed decisions, such as whether to scrap the brick directly or send it for manual inspection. You could also persist images with lower confidence scores and use them to update the model, which I show you how to do in the next section.
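One way to act on the confidence score is sketched below: high-confidence anomalies are scrapped, high-confidence normal bricks are accepted, and everything else goes to manual inspection. The function name `route_brick`, the action labels, and the 0.9 thresholds are assumptions for illustration and should be tuned to your own process; only the IsAnomalous and Confidence fields come from the service response.

```python
def route_brick(result, scrap_threshold=0.9, accept_threshold=0.9):
    """Map a DetectAnomalyResult dict to 'scrap', 'accept', or 'manual_inspection'."""
    confidence = result["Confidence"]
    if result["IsAnomalous"] and confidence >= scrap_threshold:
        return "scrap"
    if not result["IsAnomalous"] and confidence >= accept_threshold:
        return "accept"
    # Low-confidence predictions are good candidates for human review
    # and for later use when updating the model.
    return "manual_inspection"


response = {
    "DetectAnomalyResult": {
        "Source": {"Type": "direct"},
        "IsAnomalous": True,
        "Confidence": 0.9754859209060669,
    }
}
print(route_brick(response["DetectAnomalyResult"]))  # scrap
```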

Keep in mind that you’re charged for the model as long as it’s running, so stop it when you no longer need it:

aws lookoutvision stop-model \
--project-name lego-demo \
--model-version 1

Update the model

As new data becomes available, you may want to maintain or update the model to accommodate new types of defects and increase the model’s overall performance. The dataset contains nine images in the new-data folder, which I use to update the model. To update an Amazon Lookout for Vision model, you run a trial detection, verify the machine predictions to correct any mistakes, and add the verified images to your training dataset.

Run a trial detection

To run a trial detection, complete the following steps:

  1. On the Amazon Lookout for Vision console, under your model in the navigation pane, choose Trial detections.
  2. Choose Run trial detection.
  3. For Trial name, enter a name.
  4. For Import images, select Import images from S3 bucket.
  5. For S3 URI, enter the URI of the new-data folder that you uploaded in Step 1 of training the model.
  6. Choose Run trial.

Verify machine predictions

When the trial detection is complete, you can verify the machine predictions.

  1. Choose Verify machine predictions.
  2. Select either Correct or Incorrect to label the images.
  3. When all the images have been labeled, choose Add verified images to dataset.

This updates your training dataset with the new data.

Retrain the model

After you update your training dataset with the new data, you can see that the number of labeled images in your dataset has increased, along with the number of verified images.

  1. Choose Train model to train a new version of the model.
  2. While the new model is training, verify on the Models page that a new model version is being trained.

When the training is complete, you can view model performance metrics on the Models page and start using the new version.


Anomaly detection application

Now that I’ve trained my model, let’s use it with the Raspberry Pi to sort Lego bricks. In this use case, I’ve set up a Raspberry Pi with a camera that gets triggered whenever a break beam sensor senses a Lego brick. We use the following code:

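import boto3
from picamera import PiCamera
import my_break_beam_sensor  # Placeholder module: replace with your own sensor.
import my_pusher  # Placeholder module: replace with your own pusher.

l4v_client = boto3.client('lookoutvision')
image_path = '/home/pi/Desktop/my_image.jpg'

with PiCamera() as camera:
    while True:
        if my_break_beam_sensor.isBroken():  # A brick has interrupted the light beam.
            camera.capture(image_path)  # Take a photo of the brick.
            with open(image_path, 'rb') as image:
                # Send the photo to the hosted model for anomaly detection.
                response = l4v_client.detect_anomalies(ProjectName='lego-demo',
                                                       ModelVersion='1',
                                                       ContentType='image/jpeg',
                                                       Body=image.read())

            is_anomalous = response['DetectAnomalyResult']['IsAnomalous']
            if is_anomalous:
                my_pusher.push()  # Push the defective brick off the belt.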

Monitoring the model

When the system is up and running, you can use the Amazon Lookout for Vision dashboard to visualize metadata from the projects you have running, such as the number of detected anomalies during a specific time period. The dashboard provides an overview of all current projects, as well as aggregated information like total anomaly ratio.


The cost of the solution is based on the time to train the model and the time the model is running. You can divide the cost across all analyzed products to get a per-product cost. Assuming one brick is analyzed per second nonstop for a month, the cost of the solution, excluding hardware and training, is around $0.001 per brick, assuming we’re using 1 inference unit. However, if you increase production speed and analyze five bricks per second, the cost is around $0.0002 per brick.
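The per-brick figures follow from dividing the hourly hosting price by the number of bricks analyzed per hour. The sketch below reproduces that arithmetic. The hourly price of 4.00 USD per inference unit is an assumption chosen because it matches the figures quoted here, not a number stated in this post, so check the current Amazon Lookout for Vision pricing before relying on it.

```python
SECONDS_PER_HOUR = 3600

# Assumed price in USD per inference unit per hour (not stated in this post).
ASSUMED_HOURLY_PRICE = 4.00


def cost_per_item(hourly_price_per_unit, items_per_second, inference_units=1):
    """Hosting cost per analyzed item, excluding hardware and training."""
    items_per_hour = items_per_second * SECONDS_PER_HOUR
    return hourly_price_per_unit * inference_units / items_per_hour


print(round(cost_per_item(ASSUMED_HOURLY_PRICE, 1), 4))  # 0.0011
print(round(cost_per_item(ASSUMED_HOURLY_PRICE, 5), 4))  # 0.0002
```

Because the hosting price is fixed per hour, the per-brick cost falls in proportion to throughput until the capacity of one inference unit is reached.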


Now you know how to use Amazon Lookout for Vision to train, run, update, and monitor an anomaly detection application. The use case in this post is of course simplified; you will have other requirements specific to your needs. Many factors affect the total end-to-end latency when performing inference on an image. The Amazon Lookout for Vision model runs in the cloud, which means that you need to evaluate and test network availability and bandwidth to ensure that the requirements can be met. To avoid creating bottlenecks, you can use a circuit breaker in your application to manage timeouts and prevent congestion in case of network issues.
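The circuit breaker mentioned above could be sketched as follows. This is a simplified illustration rather than a production implementation: the class name, the failure limit, and the cool-down period are all assumptions. After a number of consecutive failures, the breaker rejects calls immediately for a cool-down period instead of letting slow requests pile up, and then allows a single trial call.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, retries after a cool-down."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # Time at which the circuit opened, or None if closed.

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: skipping call")
            # Cool-down elapsed: let one trial call through (half-open state).
            self._opened_at = None
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # Any success closes the circuit again.
        return result


# Example usage with a hypothetical inference function:
#   breaker = CircuitBreaker(max_failures=3, reset_seconds=30)
#   response = breaker.call(l4v_client.detect_anomalies, ProjectName='lego-demo', ...)
```

While the circuit is open, the application can fall back to a safe default, such as diverting bricks to manual inspection.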

Now that you know how to train, test, use, and update an ML model for anomaly detection, try it out with your own data! For further details about Amazon Lookout for Vision, visit the webpage.

About the Authors

Niklas Palm is a Solutions Architect at AWS in Stockholm, Sweden, where he helps customers across the Nordics succeed in the cloud. He’s particularly passionate about serverless technologies along with IoT and machine learning. Outside of work, Niklas is an avid cross-country skier and snowboarder as well as a master egg boiler.
