This is the second in a two-part series on the Amazon SageMaker Ground Truth hierarchical labeling workflow and dashboards. In Part 1: Automate multi-modality, parallel data labeling workflows with Amazon SageMaker Ground Truth and AWS Step Functions, we looked at how to create multi-step labeling workflows for hierarchical label taxonomies using AWS Step Functions. In Part 2, we look at how to build dashboards and derive insights for analyzing dataset annotations and worker performance metrics on data lakes generated as output from the complex workflows.
Amazon SageMaker Ground Truth (Ground Truth) is a fully managed data labeling service that makes it easy to build highly accurate training datasets for machine learning (ML). This post introduces a solution that you can use to create customized business intelligence (BI) dashboards using Ground Truth labeling job output data. You can use these dashboards to analyze annotation quality, worker metrics, and more.
In Part 1, we presented a solution to create multiple types of annotations for a single input data object and check annotation quality, using a series of multi-step labeling jobs that run in a parallel, hierarchical fashion using Step Functions. The solution results in high-quality annotations using Ground Truth. The format of these annotations is explained in Output Data, and each takes the form of one or more JSON manifest files in Amazon Simple Storage Service (Amazon S3). You now need a mechanism to dynamically fetch these manifests, publish them to your analytical datastore, and use them to create meaningful reports in an automated fashion. This allows ML practitioners and data scientists to track annotation progress and quality, and allows MLOps and annotation operations teams to gain insights about the annotations and track worker performance. For example, these interested parties may want to see the following reports generated from Ground Truth output data:
- Annotation-level reports – These reports include the following:
- The number of annotations done in a specified time frame.
- Filtering based on label attributes. A label attribute is a Ground Truth feature that workers can use to provide metadata about individual annotations. For example, you can create a label attribute to have workers identify vehicle type (sedan, SUV, bus) or vehicle status (parked or moving).
- The number of frames per label or frame attributes in a labeling job. A frame attribute is a Ground Truth feature that workers can use to provide metadata about video frames. For example, you can create a frame attribute to have workers identify frame quality (blurry or clear) and add a visualization to show the number of good (clear) vs. bad (blurry) frames.
- The number of tasks audited or adjusted by a reviewer (in Part 1, this is a second-level or third-level worker).
- If you have workers audit labels from previous labeling jobs, you can enumerate audit results for each label (such as car or bush) using label attributes (such as correctly or incorrectly labeled).
- Worker-level reports – These reports include the following:
- The number of Ground Truth jobs worked on by each worker.
- The total number of labels created by each individual annotator.
- For one or more labeling jobs, the total amount of time spent by each worker annotating data objects.
- The minimum, average, and maximum time taken to label data objects by each worker.
- Aggregate statistics for these metrics across the entire data annotation team.
In this post, we walk you through the process of generating a data lake for annotations and worker metadata from Ground Truth output data and building visual dashboards on those datasets to gain business insights, using Amazon S3, AWS Glue, Amazon Athena, and Amazon QuickSight.
If you completed Part 1 of this series, you can skip the prerequisite and deployment steps and start setting up the AWS Glue ETL job used to process the output data generated from that tutorial. If you didn’t complete Part 1, make sure to complete the prerequisites and deploy the solution, before enabling the AWS Glue workflow.
AWS services used to implement this solution
This post walks you through creating visualizations for analyzing Ground Truth output data to derive insights into your annotations and into the throughput and efficiency of your private workforce. The walkthrough uses the following AWS services:
- Amazon Athena – Allows you to perform ad hoc SQL queries on data in Amazon S3, and to query the data behind the QuickSight datasets for manual data analysis.
- AWS Glue – Helps prepare your data for analysis or ML. AWS Glue is a serverless data preparation service that makes it easy to extract, clean, enrich, normalize, and load data. We use the following features:
- An AWS Glue crawler to crawl the dataset and prepare metadata without loading it into a database. This reduces the cost of running an expensive database; you can store and run visuals from raw data files stored in an inexpensive, highly scalable, and durable S3 bucket.
- AWS Glue ETL jobs to extract, transform, and load (ETL) additional data. A job is the business logic that performs the ETL work in AWS Glue.
- The AWS Glue Data Catalog, which acts as a central metadata repository. This makes your data available for search and query using services such as Athena.
- Amazon QuickSight – Generates insights and builds visualizations with your data. QuickSight lets you easily create and publish interactive dashboards. You can choose from an extensive library of visualizations, charts, and tables, and add interactive features such as drill-downs and filters. For more information about setting up a dashboard, see Getting Started with Data Analysis in Amazon QuickSight.
- Amazon S3 – Stores the Ground Truth output data. Amazon S3 is at the heart of the modern data architecture: it is unlimited, durable, elastic, and cost-effective for storing data or creating data lakes. You can use a data lake on Amazon S3 for reporting, analytics, artificial intelligence (AI), and ML, because it can be shared across AWS big data services.
Solution overview
In Part 1 of this series, we discuss an architecture pattern that allows you to build a pipeline for orchestrating multi-step data labeling workflows that have workers add different types of annotations to data objects, in parallel, using Ground Truth. In this post, you learn how you can analyze the dataset annotations as well as worker performance. This solution builds data lakes using Ground Truth output data (annotations and worker metadata) and uses these data lakes to derive insights about or analyze the performance of your workers and dataset annotation quality using advanced analytics.
The code for Part 1 and Part 2 is located in the amazon-sagemaker-examples GitHub repo.
The following diagram illustrates this architecture, which is an end-to-end pipeline consisting of two components:
- Workflow pipeline – A hierarchical workflow built using Ground Truth, AWS CloudFormation, Step Functions, Amazon DynamoDB, and AWS Lambda. This is covered in detail in Part 1.
- Ground Truth reporting pipeline – A pipeline used to build BI dashboards using AWS Glue, Athena, and QuickSight to analyze and visualize Ground Truth output data and metadata generated by the AWS Glue ETL job. We discuss this in more detail in the next section.
Ground Truth reporting pipeline
The reporting pipeline is built on the Ground Truth output data stored in Amazon S3 (referred to as the Ground Truth bucket).
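Before the crawlers and ETL job run, it can help to see what this raw output looks like. The following is a minimal sketch that reads one output manifest from the Ground Truth bucket and prints the attribute names of each record; the bucket and key are placeholders, so substitute values from one of your own labeling jobs.

```python
import json

import boto3

s3 = boto3.client("s3")

# Placeholder values -- replace with your Ground Truth bucket and the
# output.manifest key of one of your labeling jobs.
BUCKET = "my-prefix-workflow-123456789012-us-east-1-batch-processing"
KEY = "batch_manifests/my-labeling-job/manifests/output/output.manifest"

obj = s3.get_object(Bucket=BUCKET, Key=KEY)

# An output manifest is JSON Lines: one JSON object per labeled data object.
for line in obj["Body"].read().decode("utf-8").splitlines():
    if not line.strip():
        continue
    record = json.loads(line)
    # Each record keeps the original input reference plus the job's output and
    # metadata attributes; the exact attribute names depend on the job.
    print(sorted(record.keys()))
```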
The data is processed and the tables are created in the Data Catalog using the following steps:
- An AWS Glue crawler crawls the data labeling job output data, which is in JSON format, to determine the schema of your data, and creates a metadata table in your Data Catalog (see the sketch after this list for running a crawler on demand).
- The Data Catalog contains references to data that is used as sources and targets of your ETL jobs. The data is saved to an AWS Glue processing bucket.
- The ETL job retrieves worker metrics from the Ground Truth bucket and adds worker information from Amazon Cognito, such as user name and email address. The job stores this data in the processed bucket (${Prefix}-${AWS::AccountId}-${AWS::Region}-wm-glue-output/processed_worker_metrics/) and changes the format from JSON to Parquet for faster querying.
- A crawler crawls the processed worker metrics data from the processed AWS Glue bucket. A crawler also crawls the annotations folder and output manifests folder to generate annotations and manifest tables.
- For each crawler, AWS Glue adds tables (the annotations table, output manifest tables, and worker metrics table) to the Data Catalog in the {Prefix}-gluedatabase database.
- Athena queries and retrieves the Ground Truth output data stored in the S3 data lake using the Data Catalog.
- The query results are visualized in QuickSight dashboards.
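The crawlers normally run as part of the scheduled AWS Glue workflow described later in this post, but while you iterate you can also trigger one on demand. The following is a minimal boto3 sketch; the crawler name is a placeholder, so substitute the annotations or worker metrics crawler that the reporting stack creates in your account.

```python
import time

import boto3

glue = boto3.client("glue")

# Placeholder name -- use the annotations (or worker metrics) crawler created
# by the reporting stack in your account.
CRAWLER_NAME = "smgt-annotations-crawler"

glue.start_crawler(Name=CRAWLER_NAME)

# Poll until the crawler returns to the READY state.
while True:
    state = glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"]
    print("Crawler state:", state)
    if state == "READY":
        break
    time.sleep(30)
```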
As shown in the following dashboard examples, you can configure and display the top priority statistics at the top of the dashboard, such as total count of labeled vehicles, quality of labels and frames in a batch, and worker performance metrics. You can create additional visualizations according to your business needs. For more information, see Working with Visual Types in Amazon QuickSight.
The following table includes worker performance summary statistics.
The following dashboard shows several visualizations (from left to right, top to bottom):
- The number of vehicles labeled, broken up by vehicle type
- The number of annotations that passed and failed an audit quality check
- The number of good-quality (pass) and bad-quality (fail) video frames in the labeling job, identified by workers using frame attributes
- The number of parked vehicles (stationary) vs. moving vehicles (dynamic), identified by workers using label attributes
- A histogram displaying the total number of vehicles labeled per frame
- Tables displaying the quality of frames and audit results for multiple video frame labeling jobs
Prerequisites
If you’re continuing from Part 1 of this series, you can skip this step and move on to enabling the AWS Glue workflow.
If you didn’t complete the demo in Part 1, you need the following resources:
- An AWS account.
- An AWS Identity and Access Management (IAM) user with access to Amazon S3, AWS Glue, and Athena. If you don’t require granular permission, attach the following AWS managed policies:
  - AmazonS3FullAccess
  - AmazonSageMakerFullAccess
- Familiarity with Ground Truth, AWS CloudFormation, and Step Functions.
- An Amazon SageMaker workforce. For this demonstration, we use a private workforce. You can create a workforce through the SageMaker console. Note the Amazon Cognito user pool ID and the App client ID after you create your workforce. You use these values to tell the AWS CloudFormation deployment which workforce to use to create work teams, which represent groups of labelers. You can find these values on the Private workforce summary page in the Ground Truth area of the Amazon SageMaker console after you create your workforce, or when you call DescribeWorkteam (a minimal example of retrieving them with the SDK follows this list). The following GIF demonstrates how to create a private workforce. For step-by-step instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.
- A QuickSight account. If necessary, create an Enterprise account in QuickSight.
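If you'd rather retrieve the user pool ID and app client ID programmatically than from the console, the following is a minimal boto3 sketch of the DescribeWorkteam call mentioned above; the work team name is a placeholder.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Placeholder work team name -- use the name of your private work team.
response = sagemaker.describe_workteam(WorkteamName="my-private-workteam")

# For a private (Amazon Cognito) workforce, the member definition carries the
# user pool and app client IDs used by the CloudFormation parameters.
cognito = response["Workteam"]["MemberDefinitions"][0]["CognitoMemberDefinition"]

print("CognitoUserPoolId:", cognito["UserPool"])
print("CognitoUserPoolClientId:", cognito["ClientId"])
```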
Deploy the solution
If you didn’t complete the tutorial outlined in Part 1, you can use the sample data provided for this post to create a sample dashboard. If you completed Part 1, you can skip this section and proceed to enabling the AWS Glue workflow.
Launch the dashboard stack
To launch the resources required to create a sample dashboard with example data, you can launch the stack in the us-east-1 AWS Region on the AWS CloudFormation console by choosing Launch Stack:
On the AWS CloudFormation console, choose Next, and modify the CognitoUserPoolId parameter to identify the user pool associated with your private workforce. You can locate this information on the SageMaker console:
- On the SageMaker console, choose Labeling workforces in the navigation pane.
- Find the values on the Private tab.
- Use the App client value for CognitoUserPoolClientId and the Amazon Cognito user pool value for CognitoUserPoolId.
Additionally, enter a prefix to use when naming resources. We use this prefix for creating and managing labeling jobs and worker metrics.
For this post, you can use the default values for the following parameters:
- GlueJobTriggerCron – The cron expression to use when scheduling the reporting AWS Glue cron job. The results from annotations generated with Ground Truth and the worker performance metrics are used to create a dashboard in QuickSight. The outputs from the SageMaker annotations and worker performance metrics show up in Athena queries after processing the data with AWS Glue. By default, AWS Glue cron jobs run every hour.
- BatchProcessingInputBucketId – The bucket that contains the SMGT output data under the batch_manifests folder. By default, the ML blogs bucket (aws-ml-blog) is defined and contains the SMGT output data.
- LoggingLevel – The logging level to change the verbosity of the logs. Accepts values DEBUG and PROD. This is used internally and can be ignored.
To launch the stack in a different AWS Region, use the instructions found in the README of the GitHub repository.
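If you prefer the AWS SDK to the Launch Stack button, the following sketch shows the general shape of the call; the stack name, template URL, and parameter values are placeholders, so take the real template location and parameter keys from the reporting.yml template in the repository.

```python
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

STACK_NAME = "smgt-reporting-dashboard"  # placeholder

cloudformation.create_stack(
    StackName=STACK_NAME,
    # Placeholder URL -- point this at the reporting.yml template location
    # given in the GitHub repository README.
    TemplateURL="https://example-bucket.s3.amazonaws.com/reporting.yml",
    Parameters=[
        {"ParameterKey": "CognitoUserPoolId", "ParameterValue": "us-east-1_EXAMPLE"},
        {"ParameterKey": "CognitoUserPoolClientId", "ParameterValue": "exampleclientid"},
    ],
    # The template creates IAM roles, so IAM capabilities are assumed here.
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Wait for the stack to finish before enabling the AWS Glue workflow.
cloudformation.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)
```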
After you deploy the solution, use the next section to enable an AWS Glue workflow used to generate the BI dashboards.
Enable the AWS Glue workflow
If you completed Part 1, you launched a CloudFormation stack that created the Ground Truth labeling framework, annotated the MOT17 automotive dataset (vehicles, and road boundaries and lanes) using Ground Truth, and audited the frames for annotation quality. To turn that data flow into the reporting dashboard, you need to connect the output infrastructure that you previously set up to Athena and QuickSight. Athena can treat data in Amazon S3 as a relational database and allows you to run SQL queries on your data. QuickSight runs those queries on your behalf and creates visualizations of your data.
The following workflow allows Athena to run SQL queries on the example data. Complete the following steps to enable the workflow:
- On the AWS Glue console, in the left navigation pane, under ETL, choose Workflows.
- Select the SMGT-Glue-Workflow workflow.
- On the Actions menu, choose Run.
If you don’t want to start the workflow now, you can wait—it automatically runs hourly.
AWS Glue takes some time to spin up its resources during the first run, so allow approximately 30 minutes for the workflow to finish. The completed workflow shows up on the Workflows page.
This pipeline is set up in the reporting.yml file. Currently, the pipeline is run using the AWS Glue workflow, using the ScheduledJobTrigger resource with the flag StartOnCreation: false. If you want to run this pipeline on a schedule, switch this flag to true.
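If you'd rather start the workflow from code (for example, right after a labeling batch completes) than from the console, something like the following boto3 sketch should work; it uses the workflow name shown above.

```python
import time

import boto3

glue = boto3.client("glue")

WORKFLOW_NAME = "SMGT-Glue-Workflow"

run_id = glue.start_workflow_run(Name=WORKFLOW_NAME)["RunId"]

# Poll until the run finishes; the first run can take roughly 30 minutes.
while True:
    run = glue.get_workflow_run(Name=WORKFLOW_NAME, RunId=run_id)["Run"]
    print("Workflow status:", run["Status"])
    if run["Status"] in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(60)
```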
Datasets surfaced
All the following metadata and manifest external tables act as base source tables for Ground Truth (SMGT), and they persist values in the same form as they are captured within Ground Truth, with some customization to link the output worker ID to identifiable worker information, such as a user name, in the worker metadata. This provides flexibility for auditing and changing analytical needs.
The database ${Prefix}-${AWS::AccountId}-${AWS::Region}-gluedatabase contains four tables, which are surfaced using the AWS Glue workflow. For our demonstration, we use smgt-gluedatabase as the database name. The tables are as follows:
- An annotations table, called annotations_batch_manifests
- Two output manifest tables (one each for first-level jobs and second-level jobs):
  - The labeling job table output_manifest_videoobjecttracking
  - The audit job table output_manifest_videoobjecttrackingaudit
- A worker metrics table, called worker_metrics_processed_worker_metrics
The following screenshot shows the sample output of the tables under the AWS Glue database.
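You can also confirm the tables from code once the workflow has finished. The following is a minimal sketch that lists every table the crawlers added to the demo database, along with its S3 location.

```python
import boto3

glue = boto3.client("glue")

# Demo database name from this post; yours follows the {Prefix}-gluedatabase
# naming convention.
DATABASE = "smgt-gluedatabase"

paginator = glue.get_paginator("get_tables")
for page in paginator.paginate(DatabaseName=DATABASE):
    for table in page["TableList"]:
        location = table.get("StorageDescriptor", {}).get("Location", "")
        print(table["Name"], "->", location)
```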
Connect Athena with the data lake
You can use Athena to connect to your S3 data lake and run SQL queries, which QuickSight uses to create visualizations.
If this is your first time using Athena, you need to configure the Athena query result location to the reporting S3 bucket created for the Athena workgroup. For more information, see Specifying a Query Result Location.
- On the Athena console, choose Settings in the navigation bar.
- For Query result location, enter the S3 URL for the location of the bucket created for the Athena workgroup. The format is s3://${Prefix}-${AWS::AccountId}-${AWS::Region}-athena/ (the trailing slash is required).
- Leave the other fields unchanged.
- Choose Save.
- In the Athena Query Editor, run the following SQL queries to verify that the reporting stack is configured properly:
SELECT * FROM "smgt-gluedatabase"."annotations_batch_manifests" limit 10;
SELECT * FROM "smgt-gluedatabase"."worker_metrics_processed_worker_metrics" limit 10;
SELECT * FROM "smgt-gluedatabase"."output_manifest_videoobjecttracking" limit 10;
SELECT * FROM "smgt-gluedatabase"."output_manifest_videoobjecttrackingaudit" limit 10;
You must have at least one Ground Truth job completed to generate these tables.
The following screenshot shows our output.
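You can run the same verification queries programmatically. The following is a minimal boto3 sketch; the workgroup name is a placeholder, so use the ReportsWorkGroup that the stack created in your account (its query result location is already configured).

```python
import time

import boto3

athena = boto3.client("athena")

QUERY = 'SELECT * FROM "smgt-gluedatabase"."annotations_batch_manifests" limit 10;'

# Placeholder workgroup name -- check the Athena console for the exact
# {Prefix}...ReportsWorkGroup name created by the stack.
execution_id = athena.start_query_execution(
    QueryString=QUERY,
    WorkGroup="smgt-ReportsWorkGroup",
)["QueryExecutionId"]

# Wait for the query to finish, then print the first few rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

results = athena.get_query_results(QueryExecutionId=execution_id)
for row in results["ResultSet"]["Rows"][:5]:
    print([col.get("VarCharValue") for col in row["Data"]])
```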
Visualize in QuickSight
You’re now ready to visualize your data in QuickSight.
Set up QuickSight
In this section, you update permissions in your QuickSight account to provide access to the S3 reporting buckets. For more information, see Accessing Data Sources. You also import the data from Athena to SPICE so that QuickSight can display it.
- On the QuickSight console, choose your user name on the application bar, and choose Manage QuickSight.
- Choose Security & permissions.
- Under QuickSight access to AWS services, choose Add or remove.
A list of available AWS services is displayed.
- Under Amazon S3, choose Details, and then choose Select S3 buckets.
- Do one of the following:
  - Option 1 (completed Part 1): If you completed Part 1 and are running this section, select the following S3 buckets:
    - In S3 Buckets Linked to QuickSight Account, under S3 buckets, choose the following buckets:
      - {Prefix}-workflow-{account-ID}-{region}-batch-processing
      - {Prefix}-workflow-{account-ID}-{region}-wm-glue-output
      - {Prefix}-workflow-{account-ID}-{region}-athena
    - In S3 Write permissions for Athena Workgroup, choose the following bucket:
      - {Prefix}-workflow-{account-ID}-{region}-athena
  - Option 2 (did not complete Part 1): If you did not complete Part 1 and used the launch stack option in this post, select the following S3 buckets:
    - In S3 Buckets Linked to QuickSight Account, under S3 buckets, choose the following buckets:
      - {Prefix}-{account-ID}-{region}-wm-glue-output
      - {Prefix}-{account-ID}-{region}-athena
    - In S3 Write permissions for Athena Workgroup, choose the following bucket:
      - {Prefix}-{account-ID}-{region}-athena
    - In S3 Buckets You Can Access Across AWS, under S3 buckets, choose the following bucket:
      - aws-ml-blog
- In both cases, after you’ve selected the buckets described above, choose Finish to close the Select Amazon S3 buckets dialog box.
- Choose Update to finish updating the permissions.
Create datasets
Create a new dataset using Athena as the source.
- On the QuickSight console, choose Datasets.
- Choose New dataset.
- In the FROM NEW DATA SOURCES section, choose Athena.
- For Data source name, enter Worker Metrics.
- For Athena workgroup, enter {Prefix}ReportsWorkGroup.
- Choose Create data source.
- For Database: contain sets of tables, choose the smgt-gluedatabase database.
- Select Use custom SQL and enter the following query:
SELECT *, cardinality(ans.trackingannotations.framedata.entries) as tasks FROM "smgt-gluedatabase"."worker_metrics_processed_worker_metrics", unnest(answercontent) as t(ans);
- Choose Edit/Preview data.
- For Custom SQL Name, enter Worker Metrics Dataset.
- Choose Apply.
- Choose Save & Visualize.
- Choose Visualize.
- In addition to creating the worker metrics dataset, you should also create annotation datasets.
The following code creates a label-level dataset for vehicles:
SELECT job_name,each_ann.height,each_ann.width,each_ann.top,each_ann."left",each_ann."label-category-attributes".moving,each_ann."label-category-attributes".vehicle_type,each_ann."label-category-attributes".audit,each_ann."object-name",each_ann from
(SELECT ann.annotations, partition_1 as job_name FROM "smgt-gluedatabase"."annotations_batch_manifests", unnest("tracking-annotations") as t(ann) where cardinality(ann.annotations) != 0) as data, unnest(data.annotations) as t(each_ann);
The following code creates a frame-level dataset for vehicles:
SELECT ann."frame-no",ann.frame,ann."frame-attributes"."number_of_vehicles",ann."frame-attributes"."quality_of_the_frame",ann.annotations, cardinality(ann.annotations) as num_labels, partition_1 as job_name, ann FROM "smgt-gluedatabase"."annotations_batch_manifests", unnest("tracking-annotations") as t(ann) where cardinality(ann.annotations) != 0
Next, you create a new analysis that imports the data from Athena to SPICE so that QuickSight can display it.
- On the All analyses page, choose New analysis.
- Choose the dataset that you just created and then choose Create analysis.
Create a worker metrics dashboard
QuickSight enables you to visualize your tabular data. For more information, see Creating an Amazon QuickSight Visual.
The following table summarizes several useful worker metric graphs that you can add to your dashboard.
| Table Name | Graph Type | Field Wells Value | Field Wells X-axis | Field Wells Row | Field Wells Columns | Group/Color |
| --- | --- | --- | --- | --- | --- | --- |
| Total time spent labeling by a worker | Vertical stacked bar chart | timespentinseconds (Sum) | user name | modality | | |
| Total time spent by modality | AutoGraph | timespentinseconds (Sum) | modality | | | |
| Worker metrics table | Table | timespentinseconds (Sum), tasks (Sum), timespentinseconds (Max), timespentinseconds (Min), Average Time Taken Per Video (Average) | user name | | | |
You can add these tables to your QuickSight dashboard by creating a visual and customizing according to your requirements.
The following are best practices for using the tables:
- Always sort your X-axis alphabetically by hour or day.
- Use conditional formatting to highlight large numbers of bad-quality frames. For an example, see Highlight Critical Insights with Conditional Formatting in Amazon QuickSight.
- Rename tables after they’re initially created, to help disambiguate them.
For more information about how to create visuals, calculated fields, parameters, controls, and visual tables, see Dashboard Building 101.
The following example visualization uses the Amazon Cognito worker sub IDs to identify worker metadata (such as email addresses). If you didn’t complete Part 1 and are using the example data provided for this post, these sub IDs aren’t associated with worker metadata in Amazon Cognito, so the sub ID appears in place of user names in the table. To learn more about using worker sub IDs with worker information, see Tracking the throughput of your private labeling team through Amazon SageMaker Ground Truth.
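If you completed Part 1 and want to resolve a worker sub ID to a user name and email address yourself, the following is a minimal sketch against the Amazon Cognito API; the user pool ID and sub value are placeholders.

```python
import boto3

cognito_idp = boto3.client("cognito-idp")

# Placeholder values -- use the user pool behind your private workforce and a
# worker sub ID taken from the worker metrics table.
USER_POOL_ID = "us-east-1_EXAMPLE"
WORKER_SUB = "11111111-2222-3333-4444-555555555555"

response = cognito_idp.list_users(
    UserPoolId=USER_POOL_ID,
    Filter=f'sub = "{WORKER_SUB}"',
)

for user in response["Users"]:
    attributes = {attr["Name"]: attr["Value"] for attr in user["Attributes"]}
    print(user["Username"], attributes.get("email"))
```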
Create an annotation dashboard
The following table summarizes several useful annotation graphs that you can add to your dashboard.
| Table Name | Graph Type | Field Wells Value | Field Wells Y-axis | Field Wells Row | Field Wells Columns | Group/Color |
| --- | --- | --- | --- | --- | --- | --- |
| Number of vehicles | Pie Chart | vehicle_type (Count) | vehicle_type | | | |
| Annotation level quality | Donut Chart | audit | | | | |
| Frame level quality | Donut Chart | quality_of_the_frame | | | | |
| Number of parked vehicles vs vehicles in motion | Donut Chart | moving | | | | |
| Maximum number of vehicles in a frame | Horizontal Bar Chart | number_of_vehicles (Count) | | | | |
| Quality of the frame per Job | Table | quality_of_the_frame (Count) | | job_name | quality_of_the_frame | |
| Quality of the labels per Job | Table | audit (Count) | | job_name | audit | |
The following screenshot shows a sample dashboard for these annotation reports.
Save the reports tables as CSV
To download your worker metrics and annotation reports as a CSV file, choose the respective sheet. In the Options section, choose Menu options and then choose Export to CSV.
For more information, see Exporting Data.
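As an alternative to the console export, every query that runs through Athena already writes its result set as a CSV object to the workgroup's query result location, so you can download that object directly. The following is a minimal sketch; the query execution ID is a placeholder (for example, reuse one from the verification queries earlier).

```python
import boto3

athena = boto3.client("athena")
s3 = boto3.client("s3")

# Placeholder -- the ID of an Athena query you already ran.
EXECUTION_ID = "replace-with-your-query-execution-id"

# Athena records where it stored the results of that execution.
output_location = athena.get_query_execution(QueryExecutionId=EXECUTION_ID)[
    "QueryExecution"]["ResultConfiguration"]["OutputLocation"]

# OutputLocation looks like s3://bucket/prefix/<execution-id>.csv
bucket, key = output_location.replace("s3://", "").split("/", 1)
s3.download_file(bucket, key, "report.csv")
print("Saved", output_location, "to report.csv")
```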
Schedule a data refresh in QuickSight
To refresh your dashboard every hour, set the SPICE refresh schedule to be 1 hour for newly created datasets. For instructions, see Refreshing a Dataset on a Schedule.
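Scheduled refreshes are configured in the QuickSight console, but if you ever want to refresh SPICE on demand (for example, right after an AWS Glue workflow run finishes), a minimal sketch using the QuickSight API follows; the dataset ID is a placeholder that you can look up on the Datasets page or with list_data_sets.

```python
import uuid

import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Placeholder -- the ID of the SPICE dataset you created above.
DATASET_ID = "replace-with-your-dataset-id"

# Each ingestion needs a unique ID; a random UUID is fine for ad hoc refreshes.
quicksight.create_ingestion(
    AwsAccountId=account_id,
    DataSetId=DATASET_ID,
    IngestionId=str(uuid.uuid4()),
)
```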
The preceding sections show the sample QuickSight dashboards after data is ingested from the Ground Truth output data.
Customize the solution
If you want to build dashboards on your current Ground Truth output data directories, you can make the following customizations:
- The reporting pipeline is set up in the reporting.yml CloudFormation template. The pipeline is set up for the video frame object tracking labeling use case, in which the annotations are stored in an output sequence file for each sequence of video frames that is labeled, not in the output manifest file. If your annotations are in the output manifest file, you can remove the annotation crawler and use the output manifest tables for your dashboards. To learn more about the output data format for the task types supported by Ground Truth, see Output Data.
- The S3 path for outputs of all the Ground Truth jobs in the reporting.yml CloudFormation template points to s3://${BatchProcessingInputBucketId}/batch_manifests/. To use your own data and new jobs, change the multiple mentions of this path in the reporting.yml template to the path of your Ground Truth job output data (or repoint the deployed crawlers directly, as in the sketch after this list).
- All the queries used for building the dashboards are based on attributes used in the Ground Truth label category configuration file used in this example notebook. You can customize the queries for annotation reports based on the attributes used in your label configuration file.
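If you've already deployed the stack and just want to repoint the deployed annotations crawler at your own output prefix while you experiment, something like the following should work; both names are placeholders, and the ETL job and the other crawlers defined in reporting.yml reference the same path, so they need the equivalent change.

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names -- substitute the annotations crawler created by the
# reporting stack and the S3 prefix that holds your own Ground Truth output.
CRAWLER_NAME = "smgt-annotations-crawler"
NEW_OUTPUT_PATH = "s3://my-bucket/my-ground-truth-output/"

# Note: update_crawler replaces the crawler's targets with exactly what you
# pass here.
glue.update_crawler(
    Name=CRAWLER_NAME,
    Targets={"S3Targets": [{"Path": NEW_OUTPUT_PATH}]},
)
```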
Clean up
To remove all resources created throughout this process and prevent additional costs, complete the following steps:
- On the Amazon S3 console, delete the S3 bucket that contains the raw and processed datasets.
- Cancel your QuickSight subscription.
- On the Athena console, delete the Athena workgroup named ${Prefix}-${AWS::AccountId}-${AWS::Region}-SMGTReportsWorkGroup.
- On the AWS CloudFormation console, delete the stack you created to remove the resources the CloudFormation template created.
Conclusion
This two-part series provides you with a reference architecture to build an advanced data labeling workflow comprising a multi-step data labeling pipeline, adjustment jobs, data lakes for the corresponding dataset annotations and worker metrics, and updated dashboards.
In this post, you learned how to generate data lakes for annotations and worker metadata from Ground Truth output data generated from Part 1 using Ground Truth, Amazon S3, and AWS Glue. Then we discussed how to build visual dashboards for your annotation and worker metadata reports on those data lakes to derive business insights using Athena and QuickSight.
To learn more about the multi-step labeling pipeline that produces this output data, refer to Automate multi-modality, parallel data labeling workflows with Amazon SageMaker Ground Truth and AWS Step Functions.
Try out the notebook and customize it for your label configuration by adding additional jobs or audit steps, or by modifying the data modality of the jobs. Further customization could include, but is not limited to:
- Adding additional types of annotations such as semantic segmentation masks or keypoints
- Adding different types of visuals and analyses
- Adding different types of modalities such as point cloud or image classification
This solution is built using serverless technologies on top of AWS Glue and Amazon S3, which makes it highly customizable and applicable for a wide variety of applications. We encourage you to extend this pipeline to your data analytics and visualization use cases—there are many more transformations in AWS Glue, capabilities to build complex queries using Athena, and prebuilt visuals in QuickSight to explore.
Happy building!
About the Authors
Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.
Gaurav Rele is a Data Scientist at the Amazon ML Solution Lab, where he works with AWS customers across different verticals to accelerate their use of machine learning and AWS Cloud services to solve their business challenges.
Talia Chopra is a Technical Writer in AWS specializing in machine learning and artificial intelligence. She works with multiple teams in AWS to create technical documentation and tutorials for customers using Amazon SageMaker, MxNet, and AutoGluon.