Achieving 1.85x higher performance for deep learning based object detection with an AWS Neuron compiled YOLOv4 model on AWS Inferentia

In this post, we show you how to deploy a TensorFlow based YOLOv4 model, using Keras optimized for inference on AWS Inferentia based Amazon EC2 Inf1 instances. You will set up a benchmarking environment to evaluate throughput and precision, comparing Inf1 with comparable Amazon EC2 G4 GPU-based instances. Deploying YOLOv4 on AWS Inferentia provides the highest throughput, lowest latency with minimal latency jitter, and the lowest cost per image.

The following charts show a 2-hour run in which Inf1 provides higher throughput and lower latency. The Inf1 instances achieved up to 1.85 times higher throughput and 37% lower cost per image when compared to the most optimized Amazon EC2 G4 GPU-based instances.

In addition, the following graph shows that the P90 inference latency is 60% lower on Inf1, with significantly lower variance than on the G4 instances.

When you use the AWS Neuron data type auto-casting feature, there is no measurable degradation in accuracy. The compiler automatically converts the pipeline to mixed precision with BF16 data types for increased performance. The model reaches 48.7% mean average precision—thanks to the state-of-the-art YOLOv4 model implementation.

About AWS Inferentia and AWS Neuron SDK

AWS Inferentia chips are custom built by AWS to provide high inference performance at the lowest cost of inference in the cloud. They offer seamless features such as auto-conversion of trained FP32 models to Bfloat16, and an elastic compute architecture that supports a wide range of machine learning (ML) model types, including image recognition, object detection, natural language processing (NLP), and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the Inferentia chips. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch, and comes pre-installed in the AWS Deep Learning AMIs. Therefore, deploying deep learning models on AWS Inferentia is done in the same familiar environment used in other platforms, and your applications benefit from the boost in performance and lowest cost.

Since its launch, the Neuron SDK has seen dramatic improvement in the breadth of models that deliver high performance at a fraction of the cost. This includes NLP models like the popular BERT, image classification models (ResNet, VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

You Only Look Once

Object detection stands out as a computer vision (CV) task that has seen large accuracy improvements (average precision at an IoU threshold of 0.50 above 70%) due to deep learning model architectures. An object detection model tries to localize and classify objects in an image, allowing for applications ranging from real-time inspection of manufacturing defects to medical imaging and tracking your favorite player and the ball in a soccer match.

Addressing the real-time inference challenges of such computer vision tasks is key for deploying these models at scale.

YOLO is part of the deep learning (DL) single-stage object detection model family, which includes models such as Single-Shot Detector (SSD) and RetinaNet. These models are usually built from stacking a backbone, neck, and head neural network that together perform detection and classification tasks. The main predictions are bounding boxes for identified objects and associated classes.

The backbone network takes care of extracting features of the input image, while the head gets trained on the supervised task, to predict the edges of the bounding box and classify its contents. The addition of a neck neural network allows for the head network to process features from intermediate steps of the backbone. The whole pipeline processes the images only once, hence the name You Only Look Once (YOLO).

On the other hand, models with two-stage detectors process further features from the previous convolutional layers to obtain proposals of regions, prior to generating object class prediction. In this way, the network focuses on detecting and classifying objects on regions of high object probability.

The following diagram illustrates this architecture (from YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv:2004.10934v1).

Single-stage models allow for multiple predictions of the same object in a single image. These predictions get disambiguated later by a process called non-max suppression (NMS), which takes care of leaving only the highest probability bounding box and label for the object. It’s a less computationally costly workflow than the two-stage approach.
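To make the NMS step concrete, the following is a minimal sketch of greedy, class-agnostic non-max suppression in NumPy. It is an illustration only, not the implementation used inside the YOLOv4 model in this post; the function names and the 0.5 IoU threshold are assumptions chosen for the example, and production pipelines typically apply NMS per class.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one [x1, y1, x2, y2] box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    # Intersection is zero when the boxes do not overlap.
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes that overlap it too much."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep


# Two near-duplicate detections of one object, plus one separate object.
boxes = np.array([[10, 10, 110, 110], [12, 12, 112, 112], [200, 200, 260, 260]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.92, so only the higher-scoring one survives; the third box does not overlap either and is kept.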

Models like YOLO are all about performance. Its latest incarnation, version 4, aims at pushing the prediction accuracy further. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection shows how real-time inference can be achieved above the human perception of around 30 frames per second (FPS). In this post, you explore ways to push the performance of this model even further and use AWS Inferentia as a cost-effective hardware accelerator for real-time object detection.

Prerequisites

For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP.

Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.

Building a YOLOv4 predictor from a pre-trained model

To start building the model, set up an inf1.2xlarge EC2 instance in AWS, with 8 vCPU cores and 16 GB of memory. The Inf1 instance allows for optimizing the ratio between CPU and Inferentia devices through the selection of inf1.xlarge or inf1.2xlarge. We found that for YOLOv4, the optimal CPU to accelerator balance is achieved with inf1.2xlarge. Going up to the second size instance improves throughput for a lower cost per image. Use the AWS Deep Learning AMI (Ubuntu 18.04) version 34.0 (ami-06a25ee8966373068) in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia. For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.

Next you can connect to the instance through SSH, activate the aws_neuron_tensorflow_p36 Conda environment, and update the Neuron compiler to the latest release. The compilation script depends on requirements listed in the YOLOv4 tutorial posted on the Neuron GitHub repo. Install them by running the following code in the terminal:

pip install neuron-cc tensorflow-neuron requests pillow matplotlib pycocotools==2.0.1 torch~=1.5.0 --force --extra-index-url=https://pip.repos.neuron.amazonaws.com

You can also run the following steps directly from the provided Jupyter notebook. If doing so, skip to the Running a performance benchmark on Inferentia section to explore the performance benefits of running YOLOv4 on AWS Inferentia.

The benchmark of the models requires an object detection validation dataset. Start by downloading the COCO 2017 validation dataset. The COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset, with over 300,000 images and 1.5 million object instances. The 2017 version of COCO contains 5,000 images for validation.

To download the dataset, enter the following code on the terminal:

curl -LO http://images.cocodataset.org/zips/val2017.zip
curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q val2017.zip
unzip annotations_trainval2017.zip

When the download is complete, you should see a val2017 and an annotations folder available in your working directory. At this stage, you’re ready to build and compile the model.
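If you want to confirm the layout before moving on, a quick check along these lines can help. This is a small sketch, assuming the archives were extracted into the current working directory; `check_coco_layout` is a hypothetical helper name, not part of the tutorial code.

```python
from pathlib import Path


def check_coco_layout(root="."):
    """Return the number of validation images and whether the annotation file exists."""
    root = Path(root)
    image_count = len(list((root / "val2017").glob("*.jpg")))
    has_annotations = (root / "annotations" / "instances_val2017.json").is_file()
    return image_count, has_annotations


if __name__ == "__main__":
    count, has_annotations = check_coco_layout(".")
    # The COCO 2017 validation set is expected to contain 5,000 images.
    print("validation images found:", count)
    print("instances_val2017.json present:", has_annotations)
```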

The GitHub repo contains the script yolo_v4_coco_saved_model.py for downloading the pretrained weights of a PyTorch implementation of YOLOv4, and the model definition for YOLOv4 using TensorFlow 1.15 and Keras. The code was adapted from an earlier implementation and converts the PyTorch checkpoint to a Keras h5 saved model. This implementation of YOLOv4 is optimized to run on AWS Inferentia. For more information about optimizations, see Working with YOLO v4 using AWS Neuron SDK.

To download, convert, and save your Keras model to the yolo_v4_coco_saved_model folder, enter the following code:

python3 yolo_v4_coco_saved_model.py ./yolo_v4_coco_saved_model

To instantiate a new predictor from the saved model, use tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model') on your inference script.

The following code implements a single batch predictor and image annotation script, so you can test the saved model:

import json
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

yolo_pred_cpu = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model')
image_path = './val2017/000000581781.jpg'
with open(image_path, 'rb') as f:
    feeds = {'image': [f.read()]}

results = yolo_pred_cpu(feeds)

# load annotations to decode classification result
with open('./annotations/instances_val2017.json') as f:
    annotate_json = json.load(f)
label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}

# draw picture and bounding boxes
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(Image.open(image_path).convert('RGB'))

wanted = results['scores'][0] > 0.1

for xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):
    xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]
    rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')
    ax.add_patch(rect)
    rx, ry = rect.get_xy()
    rx = rx + rect.get_width() / 2.0
    ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,
                ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))
plt.show()

The performance in this setup isn’t optimal because you ran YOLO only on CPU. Despite the native parallelization from TensorFlow, the eight cores aren’t enough to bring the inference time close to real time. For that, you use AWS Inferentia.

Compiling YOLOv4 to run on AWS Inferentia

The compilation of YOLOv4 uses the TensorFlow-Neuron API tfn.saved_model.compile, working directly with the saved model directory created before. To further reduce the Neuron runtime overhead, two extra arguments are added to the compiler call: no_fuse_ops and minimum_segment_size.

The first argument, no_fuse_ops, partitions the graph prior to casting the FP16 tensors running in the sub-graph back to FP32, as defined in the model script. This allows for operations that run more efficiently on CPU to be skipped while the Neuron compiler runs its automatic smart partitioning. The argument minimum_segment_size sets the minimum number of operations in a sub-graph, which forces trivially compilable sections to run on CPU. For more information, see Reference: TensorFlow-Neuron Compilation API.

To compile the model, enter the following code:

import shutil
import tensorflow as tf
import tensorflow.neuron as tfn


def no_fuse_condition(op):
    return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1/Cast', 'lambda_2/Cast', 'lambda_3/Cast'])

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './yolo_v4_coco_saved_model')
    no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]

shutil.rmtree('./yolo_v4_coco_saved_model_neuron', ignore_errors=True)

result = tfn.saved_model.compile(
                './yolo_v4_coco_saved_model', './yolo_v4_coco_saved_model_neuron',
                # we partition the graph before casting from float16 to float32, to help reduce the output tensor size by 1/2
                no_fuse_ops=no_fuse_ops,
                # to enforce trivial compilable subgraphs to run on CPU
                minimum_segment_size=100,
                batch_size=1,
                dynamic_batch_size=True,
)

print(result)

On an inf1.2xlarge, the compilation takes only a few minutes and outputs the ratio of the graph operations run on the AWS Inferentia chip. For our model, it’s approximately 79%. As mentioned earlier, to optimize the compiled model for performance, the target of the compilation shouldn’t be to maximize operations on the AWS Inferentia chip, but to balance the use of the available CPUs for efficient combined hardware utilization.

AWS Inferentia is designed to reach peak throughput at small—usually single-digit—batch sizes. When optimizing a specific model for throughput, explore compiling the model with different values of the batch_size argument and test what batch size yields the maximum throughput for your model. In the case of our YOLOv4 model, the best batch size is 1.

Replace the model path on the predictor instantiation to tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron') for a comparison with the previous CPU only inference. You get similar detection accuracy at a fraction of the inference time, approximately 40 milliseconds.

Setting up a benchmarking pipeline

To set up a performance measuring pipeline, create a multi-threaded loop running inference on all the COCO images downloaded. The code available in the notebook adapts the original implementation of the eval function. The following adapted version implements a ThreadPoolExecutor to send four parallel prediction calls at a time:

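import time
from concurrent import futures

import numpy as np

# get_image_as_bytes, analyze_bbox, and bbox_eval are helper functions defined in the notebook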

def evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):
    batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)

    # warm up
    yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})

    with futures.ThreadPoolExecutor(4) as exe:
        fut_im_list = []
        fut_list = []
        start_time = time.time()
        for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):
            if len(batch_img_bytes) != eval_batch_size:
                continue
            fut = exe.submit(yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})
            fut_im_list.append((batch_im_id, batch_im_name))
            fut_list.append(fut)
        bbox_list = []
        count = 0
        for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):
            results = fut.result()
            bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))
            for _ in batch_im_id:
                count += 1
                if count % 100 == 0:
                    print('Test iter {}'.format(count))
        
        print('==================== Performance Measurement ====================')
        print('Finished inference on {} images in {} seconds'.format(len(images), time.time() - start_time))
        print('=================================================================')
    
    # start evaluation
    box_ap_stats = bbox_eval(anno_file, bbox_list)
    return box_ap_stats

Additional helper functions are used to calculate average precision scores of the deployed model.
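The helpers themselves are in the notebook and aren't reproduced here. As a rough illustration of what the scoring step involves, the following sketch uses the pycocotools package installed earlier. The function names `to_coco_detection` and `coco_bbox_stats` are hypothetical and are not the notebook's own helpers; the sketch assumes boxes arrive as [x1, y1, x2, y2] in pixel coordinates.

```python
def to_coco_detection(image_id, category_id, xyxy, score):
    """Convert one [x1, y1, x2, y2] box into the COCO results format ([x, y, width, height])."""
    x1, y1, x2, y2 = (float(v) for v in xyxy)
    return {
        "image_id": int(image_id),
        "category_id": int(category_id),
        "bbox": [x1, y1, x2 - x1, y2 - y1],
        "score": float(score),
    }


def coco_bbox_stats(anno_file, detections):
    """Run the standard COCO bounding box evaluation and return its summary statistics."""
    # Imported here so the converter above can be used without pycocotools installed.
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO(anno_file)
    coco_dt = coco_gt.loadRes(detections)  # a list of result dictionaries
    coco_eval = COCOeval(coco_gt, coco_dt, "bbox")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()  # prints the AP/AR table
    return coco_eval.stats
```

In a pipeline like the one above, the category ID passed to the converter would come from a class index to COCO category ID mapping, because COCO category IDs are not contiguous.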

Running a performance benchmark on Inferentia

To run the COCO evaluation and benchmark the time to infer over the 5,000 images, run the evaluate function as shown in the following code:

val_coco_root = './val2017'
val_annotate = './annotations/instances_val2017.json'
clsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,
               15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,
               27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,
               39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,
               51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,
               63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,
               75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
eval_batch_size = 8

with open(val_annotate, 'r', encoding='utf-8') as f2:
    for line in f2:
        line = line.strip()
        dataset = json.loads(line)
        images = dataset['images']

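# yolo_pred is the predictor loaded from the Neuron compiled model, as described earlier
yolo_pred = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron')
box_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, clsid2catid)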

When the evaluation is complete, you see logs on the screen like the following:

…

Test iter 4500
Test iter 4600
Test iter 4700
Test iter 4800
Test iter 4900
==================== Performance Measurement ====================
Finished inference on 5000 images in 47.50522780418396 seconds
=================================================================

…

Accumulating evaluation results...
DONE (t=6.78s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.487
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.741
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.573
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.601
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.657
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744

At 5,000 images processed in 47 seconds, this deployment achieves 106 FPS, 3.5 times faster than the real-time threshold of 30 FPS. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection lists the results for batch one performance over the same COCO 2017 dataset running on an NVIDIA Volta GPU, such as the V100. The largest frame rate obtained was 96 FPS, at 41.2% mAP. Our model architecture and deployment achieve higher mAP, 48.7%, with a higher frame rate.

To have a direct comparison between AWS Inferentia, NVIDIA Volta, and Turing architectures, we replicated the same experiment in two GPU-based instances, g4dn.xlarge and p3.2xlarge, by running the exact same model prior to compilation, with no further GPU optimization. This time we achieved 39 FPS and 111 FPS for the g4dn.xlarge and p3.2xlarge, respectively.

A YOLO model deployed in production usually doesn’t see a defined batch of 5,000 images at a time. To measure production-like performance, we set up a prediction-only multi-threaded pipeline that runs inference for extended periods.

For a total time of 2 hours, we continually ran 8 parallel prediction calls with a batch of 4 images on each, totaling 32 images at a time. To maximize GPU throughput and try to decrease the performance gap between the Inf1 and G4 instances, we use the TensorFlow XLA compiler. This setup mimics a live endpoint behavior running at maximum throughput.
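The following is a minimal sketch of what such a prediction-only loop could look like. It is an assumption about the structure, not the code used for the published results: `timed_call`, `run_load_test`, and `fake_predictor` are hypothetical names, and the stand-in predictor sleeps instead of running the model so the sketch runs anywhere.

```python
import time
from concurrent import futures

import numpy as np


def timed_call(predictor, feed):
    """Run one prediction and return its latency in seconds."""
    start = time.perf_counter()
    predictor(feed)
    return time.perf_counter() - start


def run_load_test(predictor, feed, duration_s, num_workers=8, batch_size=4):
    """Keep num_workers prediction calls in flight for duration_s seconds."""
    latencies = []
    start = time.perf_counter()
    deadline = start + duration_s
    with futures.ThreadPoolExecutor(num_workers) as exe:
        pending = {exe.submit(timed_call, predictor, feed) for _ in range(num_workers)}
        while pending:
            done, pending = futures.wait(pending, return_when=futures.FIRST_COMPLETED)
            for fut in done:
                latencies.append(fut.result())
                # Replace each finished call with a new one until the deadline passes.
                if time.perf_counter() < deadline:
                    pending.add(exe.submit(timed_call, predictor, feed))
    elapsed = time.perf_counter() - start
    images = len(latencies) * batch_size
    return {
        "images": images,
        "fps": images / elapsed,
        "p90_latency_s": float(np.percentile(latencies, 90)),
    }


def fake_predictor(feed):
    """Stand-in for the real predictor: sleeps instead of running the model."""
    time.sleep(0.01)
    return {}


stats = run_load_test(fake_predictor, {"image": [b""] * 4}, duration_s=0.5)
print(stats)
```

For a run like the one described here, you would pass the compiled predictor, a feed holding a batch of 4 encoded images, and a duration of 7,200 seconds.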

GPU thermal throttling

In contrast to AWS Inferentia chips, GPU throughput is inversely proportional to GPU temperature. GPU temperature can vary on endpoints running for extended periods at high throughput, which leads to FPS and latency fluctuations. This effect is known as thermal throttling. Some production systems cap throughput below the maximum achievable to avoid performance swings over time. The following graph shows the average FPS over 30-second increments for the duration of the test. We observed up to 12% variation of the FPS rolling average on the GPU instance. On AWS Inferentia, this variation is below 3% for a substantially larger FPS average.
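One way to derive a windowed FPS series like this from raw measurements is sketched below. It assumes you recorded the completion time of every prediction call, in seconds from the start of the test. The function names are hypothetical, and the variation measure used here (spread between the fastest and slowest window relative to the mean) is an assumption, because the post doesn't state exactly how its percentage was computed.

```python
import numpy as np


def fps_per_window(completion_times_s, images_per_call, window_s=30.0):
    """Average FPS in consecutive windows, given the completion time of each call."""
    times = np.asarray(completion_times_s, dtype=float)
    window_index = np.floor(times / window_s).astype(int)
    calls_per_window = np.bincount(window_index)
    return calls_per_window * images_per_call / window_s


def variation_pct(fps_series):
    """Spread between the fastest and slowest window, as a percentage of the mean."""
    fps_series = np.asarray(fps_series, dtype=float)
    return (fps_series.max() - fps_series.min()) / fps_series.mean() * 100


# Synthetic example: 600 calls in the first 30 seconds, 480 in the next 30.
times = np.concatenate([
    np.linspace(0, 30, 600, endpoint=False),
    np.linspace(30, 60, 480, endpoint=False),
])
fps = fps_per_window(times, images_per_call=4)
print(fps.tolist())                  # [80.0, 64.0]
print(round(variation_pct(fps), 1))  # 22.2
```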

During the 2-hour period, we ran inference on over 856,000 images on the inf1.2xlarge instance. On the g4dn.xlarge, the maximum number of inferences achieved was 486,000. That amounts to 76% more images processed over the same amount of time using AWS Inferentia! Latency averages for batch 4 inference are also 60% lower for AWS Inferentia.

Using the total throughput collected during our 2-hour test, we calculated that the price of running 1 million inferences is $1.362 on an inf1.2xlarge in the us-east-1 Region. For the g4dn.xlarge, the price is $2.163, which makes the YOLOv4 object detection pipeline 37% less expensive on AWS Inferentia.
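The percentages quoted above follow directly from the reported totals. The short sketch below reproduces that arithmetic using only the rounded figures stated in this post. The `cost_per_million` helper is a hypothetical illustration of how such a price can be derived from an hourly instance price that you supply; no hourly prices are assumed here.

```python
TEST_SECONDS = 2 * 60 * 60        # 2-hour run

inf1_images = 856_000             # inf1.2xlarge total, as reported above
g4dn_images = 486_000             # g4dn.xlarge total, as reported above
inf1_cost_per_million = 1.362     # USD, as reported above
g4dn_cost_per_million = 2.163     # USD, as reported above

inf1_fps = inf1_images / TEST_SECONDS
g4dn_fps = g4dn_images / TEST_SECONDS
extra_images_pct = (inf1_images / g4dn_images - 1) * 100
price_reduction_pct = (1 - inf1_cost_per_million / g4dn_cost_per_million) * 100

print(f"average FPS: Inf1 {inf1_fps:.1f}, G4 {g4dn_fps:.1f}")  # 118.9 and 67.5
print(f"extra images on Inf1: {extra_images_pct:.0f}%")        # 76%
print(f"price reduction: {price_reduction_pct:.0f}%")          # 37%


def cost_per_million(hourly_price_usd, images_per_hour):
    """Cost in USD of 1 million inferences at a sustained hourly throughput."""
    return hourly_price_usd / images_per_hour * 1_000_000
```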

Safely shutting down and cleaning up

On the Amazon EC2 console, choose the instances used to perform the benchmark, and choose Terminate from the Actions drop-down menu. Terminating the instance discards data stored only in the instance’s home volume. You can persist the compiled model in an Amazon Simple Storage Service (Amazon S3) bucket, so it can be reused later. If you’ve made changes to the code inside the instances, remember to persist those as well.

Conclusion

In this post, you walked through the steps of optimizing a TensorFlow YOLOv4 model to run on AWS Inferentia. You explored AWS Neuron optimizations that yield better model performance with improved average precision, and in a much more cost-effective way. In production, the Neuron compiled model is up to 37% less expensive in the long run, with little fluctuation in throughput and latency, when compared to the most optimized GPU instance.

Some of the steps described in this post also apply to other ML model types and frameworks. For more information, see the AWS Neuron SDK GitHub repo.

Learn more about the AWS Inferentia chip and the Amazon EC2 Inf1 instances to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.


About the Authors

Fabio Nonato de Paula is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of ML and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.

Haichen Li is a software development engineer in the AWS Neuron SDK team. He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.

Samuel Jacob is a senior software engineer in the AWS Neuron team. He works on AWS Neuron runtime to enable high performance inference data paths between AWS Neuron SDK and AWS Inferentia hardware. He also works on tools to analyze and improve AWS Neuron SDK performance. Outside of work, you can catch him playing video games or tinkering with small boards such as RaspberryPi.

Collaborating with AI to create Bach-like compositions in AWS DeepComposer

AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). We recently launched the Edit melody feature, which allows you to add, remove, or edit specific notes, giving you full control of the pitch, length, and timing for each note. In this post, you can learn to use the Edit melody feature to collaborate with the autoregressive convolutional neural network (AR-CNN) algorithm and create interesting Bach-style compositions.

Through human-AI collaboration, we can surpass what humans and AI systems can create independently. For example, you can seek inspiration from AI to create art or music outside your area of expertise or offload the more routine tasks, like creating variations on a melody, and focus on the more interesting and creative tasks. Alternatively, you can assist the AI by correcting mistakes or removing artifacts it creates. You can also influence the output generated by the AI system by controlling the various training and inference parameters.

You can co-create music in the AWS DeepComposer Music Studio by collaborating with the AI (AR-CNN) model using the Edit melody feature. The AR-CNN Bach model modifies a melody note by note to guide the track towards sounding more Bach-like. You can modify four advanced parameters when you perform inference to influence how the input melody is modified:

  • Maximum notes to add – Changes the maximum number of notes added to your original melody
  • Maximum notes to remove – Changes the maximum number of notes removed from your original melody
  • Sampling iterations – Changes the exact number of times you add or remove a note based on note-likelihood distributions inferred by the model
  • Creative risk – Allows the AI model to deviate from creating Bach-like harmonies

The values you choose directly impact the composition created by the model by nudging the model in one way or another. For more information about these parameters, see AWS DeepComposer Learning Capsule on using the AR-CNN model.

Although the advanced parameters allow you to guide the output the AR-CNN model creates, they don’t provide note-level control over the music produced. For example, the AR-CNN model allows you to control the number of notes to add or remove during inference, but you don’t have control over the exact notes the model adds or removes.

The Edit melody feature bridges this gap by providing an interactive view of the generated melody so you can add missing notes, remove out-of-tune notes, or even change a note’s pitch and length. This granular level of editing facilitates better human-AI collaboration. It enables you to correct mistakes the model makes and harmonize the output to your liking, giving you more ownership of the creation process.

For this post, we explore the use case of co-creating Bach-like background music to match the following video.

Collaborating with AI using the AWS DeepComposer Music Studio

To start composing your melody, complete the following steps:

  1. Open the AWS DeepComposer Music Studio console.
  2. Choose an Input melody.

You can record a custom melody, import a melody, or choose a sample melody on the console. For this post, we experimented with two melodies: the New World sample melody and a custom melody we created using the MIDI keyboard.

New World melody:

Custom melody:

  3. Choose the Autoregressive generative AI technique.
  4. Choose the Autoregressive CNN Bach model.

There are several considerations when choosing the advanced parameters. First, we wanted the original input melody to be recognizable. After some iterating, we found that setting the Maximum notes to add to 60 and Maximum notes to remove to 40 created a desirable outcome. For Creative risk, we wanted the model to create something interesting and adventurous. At the same time, we realized that a very high Creative risk value would deviate too much from the Bach style, so we took a moderate approach and chose a Creative risk of 2.

  5. You can repeat these steps a few times to iteratively create music.

Editing your input melody

After the AR-CNN model has generated a composition to your satisfaction, you can use the Edit melody feature to modify the melody and try to match the video’s transitions as much as possible.

  1. Choose the right arrow to open the input melody section.
  2. Choose Edit melody.
  3. On the Edit melody page, edit your track in any of the following ways:
    • Choose a cell (double-click) to add or remove a note at that pitch or time.
    • Drag a cell up or down to change a note’s pitch.
    • Drag the edge of a cell left or right to change a note’s length.
  4. When finished, choose Apply changes.

We drew inspiration from the AI-generated notes in different ways. For the New World melody, we noticed the model added short and bouncy notes (the circles with solid lines in the following screenshot), which made the composition sound similar to an American folk song. To match that style, we added a few notes in the second half of the composition (the dotted-lined circles).

For our custom melody, we noticed the model changed the chords slightly earlier than expected (see the following screenshot). This created lingering and overlapping sounds that we liked for the mountain road scenes.

On the other hand, we noticed the AI model needed our help to remove some notes that sounded out of place. After we listened to the track a few times, we decided to change some pitches manually to nudge the track towards something that sounded a bit more harmonious.

Generating accompaniments using the GAN generative AI technique

After using the AR-CNN Bach model to explore options for our melody track, we decided to try using a different generative AI model (GAN) to create musical accompaniments.

  1. Under Model parameters, for Generative AI technique, choose Generative adversarial network.
  2. Feed the edited compositions to the GAN model to generate accompaniments.

We chose the MuseGAN generative algorithm and the Symphony model because we wanted to create accompaniments to match the serene and somber setting in the video.

  3. You can optionally export your compositions into a music-editing tool of your choice to change the instrument set and perform post-processing.

Let’s watch the videos containing our AI-inspired creations in the background.

The first video uses the New World melody.

The following video uses our custom melody.

Conclusion

In this post, we demonstrated how to use the Edit melody feature in the AWS DeepComposer Music Studio to collaborate with generative AI models and create interesting Bach-style compositions. You can modify a melody to your liking by adding, removing, and editing specific notes. This gives you full control of the pitch, length, and timing for each note to produce an original melody.


About the Authors

Rahul Suresh is an Engineering Manager with the AWS AI org, where he has been working on AI-based products that make machine learning accessible to all developers. Prior to joining AWS, Rahul was a Senior Software Developer at Amazon Devices and helped launch highly successful smart home products. Rahul is passionate about building machine learning systems at scale and is always looking to get these advanced technologies into the hands of customers. In addition to his professional career, Rahul is an avid reader and a history buff.

Enoch Chen is a Senior Technical Program Manager for AWS AI Devices. He is a big fan of machine learning and loves to explore innovative AI applications. Recently he helped bring DeepComposer to thousands of developers. Outside of work, Enoch enjoys playing piano and listening to classical music.

Carlos Daccarett is a Front-End Engineer at AWS. He loves bringing design mocks to life. In his spare time, he enjoys hiking, golfing, and snowboarding.

Dylan Jackson is a Senior ML Engineer and AI Researcher at AWS. He works to build experiences which facilitate the exploration of AI/ML, making new and exciting techniques accessible to all developers. Before AWS, Dylan was a Senior Software Developer at Goodreads where he leveraged both a full-stack engineering and machine learning skillset to protect millions of readers from spam, high-volume robotic traffic, and scaling bottlenecks. Dylan is passionate about exploring both the theoretical underpinnings and the real-world impact of AI/ML systems. In addition to his professional career, he enjoys reading, cooking, and working on small crafts projects.

Evaluating an automatic speech recognition service

Over the past few years, many automatic speech recognition (ASR) services have entered the market, offering a variety of different features. When deciding whether to use a service, you may want to evaluate its performance and compare it to another service. This evaluation process often analyzes a service along multiple vectors such as feature coverage, customization options, security, performance and latency, and integration with other cloud services.

Depending on your needs, you’ll want to check for features such as speaker labeling, content filtering, and automatic language identification. Basic transcription accuracy is often a key consideration during these service evaluations. In this post, we show how to measure the basic transcription accuracy of an ASR service in six easy steps, provide best practices, and discuss common mistakes to avoid.

Illustration showing a table of contents: The evaluation basics, six steps for performing an evaluation, and best practices and common mistakes to avoid.

The evaluation basics

Defining your use case and performance metric

Before starting an ASR performance evaluation, you first need to consider your transcription use case and decide how to measure a good or bad performance. Literal transcription accuracy is often critical. For example, how many word errors are in the transcripts? This question is especially important if you pay annotators to review the transcripts and manually correct the ASR errors, and you want to minimize how much of the transcript needs to be re-typed.

The most common metric for speech recognition accuracy is called word error rate (WER), which is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems. WER is the proportion of transcription errors that the ASR system makes relative to the number of words that were actually said. The lower the WER, the more accurate the system. Consider this example:

Reference transcript (what the speaker said): well they went to the store to get sugar

Hypothesis transcript (what the ASR service transcribed): they went to this tour kept shook or

In this example, the ASR service doesn’t appear to be accurate, but how many errors did it make? To quantify WER, there are three categories of errors:

  • Substitutions – When the system transcribes one word in place of another. Transcribing the fifth word as this instead of the is an example of a substitution error.
  • Deletions – When the system misses a word entirely. In the example, the system deleted the first word well.
  • Insertions – When the system adds a word into the transcript that the speaker didn’t say, such as or inserted at the end of the example.

Of course, counting errors in terms of substitutions, deletions, and insertions isn’t always straightforward. If the speaker says “to get sugar” and the system transcribes kept shook or, one person might count that as a deletion (to), two substitutions (kept instead of get and shook instead of sugar), and an insertion (or). A second person might count that as three substitutions (kept instead of to, shook instead of get, and or instead of sugar). Which is the correct approach?

WER gives the system the benefit of the doubt, and counts the minimum number of possible errors. In this example, the minimum number of errors is six. The following aligned text shows how to count errors to minimize the total number of substitutions, deletions, and insertions:

REF: WELL they went to THE  STORE TO   GET   SUGAR
HYP: **** they went to THIS TOUR  KEPT SHOOK OR
     D                 S    S     S    S     S

Many ASR evaluation tools use this format. The first line shows the reference transcript, labeled REF, and the second line shows the hypothesis transcript, labeled HYP. The words in each transcript are aligned, with errors shown in uppercase. If a word was deleted from the reference or inserted into the hypothesis, asterisks are shown in place of the word that was deleted or inserted. The last line shows D for the word that was deleted by the ASR service, and S for words that were substituted.

Don’t worry if these aren’t the actual errors that the system made. With the standard WER metric, the goal is to find the minimum number of words that you need to correct. For example, the ASR service probably didn’t really confuse “get” and “shook,” which sound nothing alike. The system probably misheard “sugar” as “shook or,” which do sound very similar. If you take that into account (and there are variants of WER that do), you might end up counting seven or eight word errors. However, for the simple case here, all that matters is counting how many words you need to correct without needing to identify the exact mistakes that the ASR service made.

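You might recognize this as the Levenshtein edit distance between the reference and the hypothesis. WER is defined as the normalized Levenshtein edit distance:

WER = (S + D + I) / N

Here S, D, and I are the numbers of substitutions, deletions, and insertions, and N is the number of words in the reference transcript.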

In other words, it’s the minimum number of words that need to be corrected to change the hypothesis transcript into the reference transcript, divided by the number of words that the speaker originally said. Our example would have the following WER calculation:

WER = (5 substitutions + 1 deletion + 0 insertions) / 9 reference words = 6 / 9 ≈ 0.67

WER is often multiplied by 100, so the WER in this example might be reported as 0.67, 67%, or 67. This means the service made errors for 67% of the reference words. Not great! The best achievable WER score is 0, which means that every word is transcribed correctly with no inserted words. On the other hand, there is no worst WER score—it can even go above 1 (above 100%) if the system made a lot of insertion errors. In that case, the system is actually making more errors than there are words in the reference—not only does it get all the words wrong, but it also manages to add new wrong words to the transcript.
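
The calculation can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any particular evaluation tool. The function names word_edit_distance and word_error_rate are our own, and the sketch assumes that words are separated by whitespace and that the reference is not empty.

```python
def word_edit_distance(reference: str, hypothesis: str) -> int:
    """Minimum number of word substitutions, deletions, and insertions."""
    ref = reference.split()
    hyp = hypothesis.split()
    # previous[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    return word_edit_distance(reference, hypothesis) / len(reference.split())


reference = "well they went to the store to get sugar"
hypothesis = "they went to this tour kept shook or"
errors = word_edit_distance(reference, hypothesis)
rate = word_error_rate(reference, hypothesis)
print(errors, round(rate, 2))
```

Only two rows of the distance table are kept in memory at a time, which is enough when you need the error count but not the alignment itself.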

For other performance metrics besides WER, see the section Adapting the performance metric to your use case later in this post.

Normalizing and preprocessing your transcripts

When calculating WER and many other metrics, keep in mind that the problem of text normalization can drastically affect the calculation. Consider this example:

Reference: They will tell you again: our ballpark estimate is $450.

ASR hypothesis: They’ll tell you again our ball park estimate is four hundred fifty dollars.

The following code shows how most tools would count the word errors if you just leave the transcripts as-is:

REF: THEY WILL    tell you AGAIN: our **** BALLPARK estimate is **** ******* ***** $450.   
HYP: **** THEY'LL tell you AGAIN  our BALL PARK     estimate is FOUR HUNDRED FIFTY DOLLARS.
     D    S                S          I    S                    I    I       I     S

The word error rate would therefore be:

WER = (4 substitutions + 1 deletion + 4 insertions) / 10 reference words = 9 / 10 = 0.90

According to this calculation, there were errors for 90% of the reference words. That doesn’t seem right. The ASR hypothesis is basically correct, with only small differences:

  • The words they will are contracted to they’ll
  • The colon after again is omitted
  • The term ballpark is spelled as a single compound word in the reference, but as two words in the hypothesis
  • $450 is spelled with numerals and a currency symbol in the reference, but the ASR system spells it using the alphabet as four hundred fifty dollars

The problem is that you can write down the original spoken words in more than one way. The reference transcript spells them one way and the ASR service spells them in a different way. Depending on your use case, you may or may not want to count these written differences as errors that are equivalent to missing a word entirely.

If you don’t want to count these kinds of differences as errors, you should normalize both the reference and the hypothesis transcripts before you calculate WER. Normalizing involves changes such as:

  • Lowercasing all words
  • Removing punctuation (except apostrophes)
  • Contracting words that can be contracted
  • Expanding written abbreviations to their full forms (such as Dr. to doctor)
  • Spelling all compound words with spaces (such as blackboard to black board or part-time to part time)
  • Converting numerals to words (or vice-versa)

If there are other differences that you don’t want to count as errors, you might consider additional normalizations. For example, some languages have multiple spellings for some words (such as favorite and favourite) or optional diacritics (such as naïve vs. naive), and you may want to convert these to a single spelling before calculating WER. We also recommend removing filled pauses like uh and um, which are irrelevant for most uses of ASR, and therefore shouldn’t be included in the WER calculation.
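
A few of these normalizations are easy to script. The following Python sketch lowercases the text, removes punctuation except apostrophes, and applies a small rewrite table. The normalize function and the REWRITES table are hypothetical names for illustration, and the sketch deliberately leaves out contractions, abbreviations, and numeral conversion, which usually need language-specific rules.

```python
import re

# Hypothetical, use-case-specific rewrites; extend to match your own conventions.
REWRITES = {"website": "web site", "ballpark": "ball park"}


def normalize(text: str, rewrites: dict = REWRITES) -> str:
    text = text.lower().replace("\u2019", "'")  # straighten curly apostrophes
    text = re.sub(r"[^\w\s']", " ", text)        # drop punctuation except apostrophes
    words = [rewrites.get(word, word) for word in text.split()]
    return " ".join(words)


print(normalize("They will tell you again: our ballpark estimate is $450."))
```

Apply the same function to both the reference and the hypothesis transcripts so that neither side is favored by the normalization.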

A second, related issue is that WER by definition counts the number of whole word errors. Many tools define words as strings separated by spaces for this calculation, but not all writing systems use spaces to separate words. In this case, you may need to tokenize the text before calculating WER. Alternatively, for writing systems where a single character often represents a word (such as Chinese), you can calculate a character error rate instead of a word error rate, using the same procedure.

Six steps for performing an ASR evaluation

To evaluate an ASR service using WER, complete the following steps:

  1. Choose a small sample of recorded speech.
  2. Transcribe it carefully by hand to create reference transcripts.
  3. Run the audio sample through the ASR service.
  4. Create normalized ASR hypothesis transcripts.
  5. Calculate WER using an open-source tool.
  6. Make an assessment using the resulting measurement.

Choosing a test sample

Choosing a good sample of speech to evaluate is critical, and you should do this before you create any ASR transcripts in order to avoid biasing the results. You should think about the sample in terms of utterances. An utterance is a short, uninterrupted stretch of speech that one speaker produces without any silent pauses. The following are three example utterances:

An utterance is sometimes one complete sentence, but people don’t always talk in complete sentences—they hesitate, start over, or jump between multiple thoughts within the same utterance. Utterances are often only one or two words long and are rarely more than 50 words. For the test sample, we recommend selecting utterances that are 25–50 words long. However, this is flexible and can be adjusted if your audio contains mostly short utterances, or if short utterances are especially important for your application.

Your test sample should include at least 800 spoken utterances. Ideally, each utterance should be spoken by a different person, unless you plan to transcribe speech from only a few individuals. Choose utterances from representative portions of your audio. For example, if there is typically background traffic noise in half of your audio, then half of the utterances in your test sample should include traffic noise as well. If you need to extract utterances from long audio files, you can use a tool like Audacity.

Creating reference transcripts

The next step is to create reference transcripts by listening to each utterance in your test sample and writing down what the speaker said word-for-word. Creating these reference transcripts by hand can be time-consuming, but it’s necessary for performing the evaluation. Write the transcript for each utterance on its own line in a plain text file named reference.txt, as shown below.

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still under warranty so i wanted to see if someone could come look at it
no i checked everywhere the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 402 disabled account id after i filled out the form

The reference transcripts are extremely literal, including when the speaker hesitates and restarts in the third utterance (on the on your). If the transcripts are in English, write them using all lowercase with no punctuation except for apostrophes, and in general be sure to pay attention to the text normalization issues that we discussed earlier. In this example, besides lowercasing and removing punctuation from the text, compound words have been normalized by spelling them as two words (ice maker, web site), the initialism I.D. has been spelled as a single lowercase word id, and the number 402 is spelled using numerals rather than the alphabet. By applying these same strategies to both the reference and the hypothesis transcripts, you can ensure that different spelling choices aren’t counted as word errors.

Running the sample through the ASR service

Now you’re ready to run the test sample through the ASR service. For instructions on doing this on the Amazon Transcribe console, see Create an Audio Transcript. If you’re running a large number of individual audio files, you may prefer to use the Amazon Transcribe developer API.
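
If you use the API, submitting one transcription job per audio file might look like the following Python sketch, which uses the boto3 start_transcription_job call. The job name pattern, the S3 location, the language code, and the media format are assumptions for illustration. Job names must be unique within your account, so adjust the pattern if you rerun the evaluation.

```python
def build_transcription_request(job_name: str, media_uri: str) -> dict:
    """Keyword arguments for the Amazon Transcribe StartTranscriptionJob call."""
    return {
        "TranscriptionJobName": job_name,
        "LanguageCode": "en-US",  # assumption: US English audio
        "MediaFormat": "wav",     # assumption: WAV files
        "Media": {"MediaFileUri": media_uri},
    }


def submit_jobs(media_uris: list) -> list:
    """Start one transcription job per audio file and return the job names."""
    import boto3  # imported here so the request builder works without AWS access

    transcribe = boto3.client("transcribe")
    job_names = []
    for index, uri in enumerate(media_uris, start=1):
        request = build_transcription_request(f"asr-eval-{index:04d}", uri)
        transcribe.start_transcription_job(**request)
        job_names.append(request["TranscriptionJobName"])
    return job_names


# Hypothetical bucket and key; replace with the location of your own test sample.
example = build_transcription_request(
    "asr-eval-0001", "s3://my-bucket/utterances/0001.wav"
)
print(example)
```

The jobs run asynchronously, so you need to wait for each one to complete before you collect its transcript.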

Creating ASR hypothesis transcripts

Take the hypothesis transcripts generated by the ASR service and paste them into a plain text file with one utterance per line. The order of the utterances must correspond exactly to the order in the reference transcript file that you created: if line 3 of your reference transcripts file has the reference for the utterance pat went to the store, then line 3 of your hypothesis transcripts file should have the ASR output for that same utterance.

The following is the ASR output for the three utterances:

Hi I'm calling about a refrigerator I bought from you The ice maker stopped working and it's still in the warranty so I wanted to see if someone could come look at it
No I checked everywhere in the mailbox The package room I asked my neighbor who sometimes gets my packages but it hasn't shown up yet
I tried to update my address on the on your website but it just says error code 40 to Disabled Accounts idea after I filled out the form

These transcripts aren’t ready to use yet—you need to normalize them first using the same normalization conventions that you used for the reference transcripts. First, lowercase the text and remove punctuation except apostrophes, because differences in case or punctuation aren’t considered as errors for this evaluation. The word website should be normalized to web site to match the reference transcript. The number is already spelled with numerals, and it looks like the initialism I.D. was transcribed incorrectly, so no need to do anything there.

After the ASR outputs have been normalized, the final hypothesis transcripts look like the following:

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still in the warranty so i wanted to see if someone could come look at it
no i checked everywhere in the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 40 to disabled accounts idea after i filled out the form

Save these transcripts to a plain text file named hypothesis.txt.

Calculating WER

Now you’re ready to calculate WER by comparing the reference and hypothesis transcripts. This post uses the open-source asr-evaluation tool to calculate WER, but other tools such as SCTK or JiWER are also available.

Install the asr-evaluation tool (if you’re using it) with pip install asr-evaluation, which makes the wer script available on the command line. Use the following command to compare the reference and hypothesis text files that you created:

wer -i reference.txt hypothesis.txt

The script prints something like the following:

REF: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still ** UNDER warranty so i wanted to see if someone could come look at it
HYP: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still IN THE   warranty so i wanted to see if someone could come look at it
SENTENCE 1
Correct          =  96.9%   31   (    32)
Errors           =   6.2%    2   (    32)
REF: no i checked everywhere ** the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
HYP: no i checked everywhere IN the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
SENTENCE 2
Correct          = 100.0%   24   (    24)
Errors           =   4.2%    1   (    24)
REF: i tried to update my address on the on your web site but it just says error code ** 402 disabled ACCOUNT  ID   after i filled out the form
HYP: i tried to update my address on the on your web site but it just says error code 40 TO  disabled ACCOUNTS IDEA after i filled out the form
SENTENCE 3
Correct          =  89.3%   25   (    28)
Errors           =  14.3%    4   (    28)
Sentence count: 3
WER:     8.333% (         7 /         84)
WRR:    95.238% (        80 /         84)
SER:   100.000% (         3 /          3)

If you want to calculate WER manually instead of using a tool, you can do so by calculating the Levenshtein edit distance between the reference and hypothesis transcript pairs divided by the total number of words in the reference transcripts. When you’re calculating the Levenshtein edit distance between the reference and hypothesis, be sure to calculate word-level edits, rather than character-level edits, unless you’re evaluating a written language where every character is a word.
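
As a rough sketch of the manual route, the following Python code builds the word-level edit distance table for each utterance pair, walks back through it to label the edits, and pools the counts over all utterances. The function names count_errors and corpus_wer are our own. The split between substitutions, deletions, and insertions can differ from what a tool reports when several minimal alignments exist, but the total number of errors is the same.

```python
def count_errors(ref_words, hyp_words):
    """Return (substitutions, deletions, insertions) for one minimal alignment."""
    n, m = len(ref_words), len(hyp_words)
    # dist[i][j] = minimum edits between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + mismatch,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)
    # Walk back from the end of both transcripts to label each edit.
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        both = i > 0 and j > 0
        mismatch = 1 if both and ref_words[i - 1] != hyp_words[j - 1] else 0
        if both and dist[i][j] == dist[i - 1][j - 1] + mismatch:
            subs += mismatch
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return subs, dels, ins


def corpus_wer(references, hypotheses):
    """Pool errors and reference words over all utterance pairs."""
    if len(references) != len(hypotheses):
        raise ValueError("reference and hypothesis files must have the same number of lines")
    total_errors = total_words = 0
    for ref_line, hyp_line in zip(references, hypotheses):
        ref_words, hyp_words = ref_line.split(), hyp_line.split()
        total_errors += sum(count_errors(ref_words, hyp_words))
        total_words += len(ref_words)
    return total_errors, total_words, total_errors / total_words


references = [
    "hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still under warranty so i wanted to see if someone could come look at it",
    "no i checked everywhere the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet",
    "i tried to update my address on the on your web site but it just says error code 402 disabled account id after i filled out the form",
]
hypotheses = [
    "hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still in the warranty so i wanted to see if someone could come look at it",
    "no i checked everywhere in the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet",
    "i tried to update my address on the on your web site but it just says error code 40 to disabled accounts idea after i filled out the form",
]
errors, words, rate = corpus_wer(references, hypotheses)
print(errors, words, round(100 * rate, 3))
```

To run this on your own data, read reference.txt and hypothesis.txt into two lists of lines and pass them to corpus_wer.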

In the evaluation output above, you can see the alignment between each reference transcript REF and hypothesis transcript HYP. Errors are printed in uppercase, or using asterisks if a word was deleted or inserted. This output is useful if you want to re-count the number of errors and recalculate WER manually to exclude certain types of words and errors from your calculation. It’s also useful to verify that the WER tool is counting errors correctly.

At the end of the output, you can see the overall WER: 8.333%. Before you go further, skim through the transcript alignments that the wer script printed out. Check whether the references correspond to the correct hypotheses. Do the error alignments look reasonable? Are there any text normalization differences that are being counted as errors that shouldn’t be?

Making an assessment

What should the WER be if you want good transcripts? The lower the WER, the more accurate the system. However, the WER threshold that determines whether an ASR system is suitable for your application ultimately depends on your needs, budget, and resources. You’re now equipped to make an objective assessment using the best practices we shared, but only you can decide what error rate is acceptable.

You may want to compare two ASR services to determine if one is significantly better than the other. If so, you should repeat the previous three steps for each service, using exactly the same test sample. Then, count how many utterances have a lower WER for the first service compared to the second service. If you’re using asr-evaluation, the WER for each individual utterance is shown as the percentage of Errors below each utterance.

If one service has a lower WER than the other for at least 429 of the 800 test utterances, you can conclude that this service provides better transcriptions of your audio. 429 represents a conventional threshold for statistical significance when using a sign test for this particular sample size. If your sample doesn’t have exactly 800 utterances, you can manually calculate the sign test to decide if one service has a significantly lower WER than the other. This test assumes that you followed good practices and chose a representative sample of utterances.
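
For a sample of a different size, the sign test can be computed directly from the binomial distribution. The following Python sketch is one way to do that with the standard library. The function name sign_test_p_value is our own, the test is two-sided, and utterances on which the two services tie are assumed to be left out of both counts.

```python
from math import comb


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test; tied utterances are excluded from both counts."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n if both services were equally good.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


print(sign_test_p_value(429, 371))
print(sign_test_p_value(428, 372))
```

A p-value below 0.05 is the conventional threshold for concluding that one service has a significantly lower WER than the other.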

Adapting the performance metric to your use case

Although this post uses the standard WER metric, the most important consideration when evaluating ASR services is to choose a performance metric that reflects your use case. WER is a great metric if the hypothesis transcripts will be corrected, and you want to minimize the number of words to correct. If this isn’t your goal, you should carefully consider other metrics.

For example, if your use case is keyword extraction and your goal is to see how often a specific set of target keywords occur in your audio, you might prefer to evaluate ASR transcripts using metrics such as precision, recall, or F1 score for your keyword list, rather than WER.
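
As an illustration, keyword precision, recall, and F1 can be computed by comparing how often each keyword occurs in the reference and in the hypothesis for the same utterance. The following Python sketch is one possible definition, not a standard one. The function name keyword_scores is our own, and it assumes single-word keywords and normalized transcripts.

```python
from collections import Counter


def keyword_scores(references, hypotheses, keywords):
    """Precision, recall, and F1 for keyword occurrences, pooled over utterances."""
    true_pos = false_pos = false_neg = 0
    for ref_line, hyp_line in zip(references, hypotheses):
        ref_counts = Counter(ref_line.split())
        hyp_counts = Counter(hyp_line.split())
        for keyword in keywords:
            in_ref, in_hyp = ref_counts[keyword], hyp_counts[keyword]
            true_pos += min(in_ref, in_hyp)       # keyword said and transcribed
            false_pos += max(in_hyp - in_ref, 0)  # keyword transcribed but not said
            false_neg += max(in_ref - in_hyp, 0)  # keyword said but not transcribed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


references = ["the ice maker is still under warranty",
              "error code 402 disabled account id"]
hypotheses = ["the ice maker is still in the warranty",
              "error code 40 to disabled accounts idea"]
precision, recall, f1 = keyword_scores(
    references, hypotheses, {"warranty", "account", "id"}
)
print(precision, recall, f1)
```

In this small example, the hypothesis never produces a keyword that wasn't said, but it misses account and id, so precision is high while recall is low.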

If you’re creating automatic captions that won’t be corrected, you might prefer to evaluate ASR systems in terms of how useful the captions are to viewers, rather than the minimum number of word errors. With this in mind, you can roughly divide English words into two categories:

  • Content words – Verbs like “run”, “write”, and “find”; nouns like “cloud”, “building”, and “idea”; and modifiers like “tall”, “careful”, and “quickly”
  • Function words – Pronouns like “it” and “they”; determiners like “the” and “this”; conjunctions like “and”, “but”, and “or”; prepositions like “of”, “in”, and “over”; and several other kinds of words

For creating uncorrected captions and extracting keywords, it’s more important to transcribe content words correctly than function words. For these use cases, we recommend ignoring function words and any errors that don’t involve content words in your calculation of WER. There is no definite list of function words, but this file provides one possible list for North American English.

Common mistakes to avoid

If you’re comparing two ASR services, it’s important to evaluate the ASR hypothesis transcript produced by each service using a true reference transcript that you create by hand, rather than comparing the two ASR transcripts to each other. Comparing ASR transcripts to each other lets you see how different the systems are, but won’t give you any sense of which service is more accurate.

We emphasized the importance of text normalization for calculating WER. When you’re comparing two different ASR services, the services may offer different features, such as true-casing, punctuation, and number normalization. Therefore, the ASR output for two systems may be different even if both systems correctly recognized exactly the same words. This needs to be accounted for in your WER calculation, so you may need to apply different text normalization rules for each service to compare them fairly.

Avoid informally eyeballing ASR transcripts to evaluate their quality. Your evaluation should be tailored to your needs, such as minimizing the number of corrections, maximizing caption usability, or counting keywords. An informal visual evaluation is sensitive to features that stand out from the text, like capitalization, punctuation, proper names, and numerals. However, if these features are less important than word accuracy for your use case—such as if the transcripts will be used for automatic keyword extraction and never seen by actual people—then an informal visual evaluation won’t help you make the best decision.

Useful resources

The following are tools and open-source software that you may find useful:

Conclusion

This post discusses a few of the key elements needed to evaluate the performance aspect of an ASR service in terms of word accuracy. However, word accuracy is only one of the many dimensions that you need to evaluate when choosing a particular ASR service. It’s critical that you include other parameters such as the ASR service’s total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing.


About the Authors

Scott Seyfarth is a Data Scientist at AWS AI. He works on improving the Amazon Transcribe and Transcribe Medical services. Scott is also a phonetician and a linguist who has done research on Armenian, Javanese, and American English.

 

 

 

Paul Zhao is a Product Manager at AWS AI. He manages Amazon Transcribe and Amazon Transcribe Medical. In his past life, Paul was a serial entrepreneur, having launched and operated two startups with successful exits.

Simplify data management with new APIs in Amazon Personalize

Amazon Personalize now makes it easier to manage your growing item and user catalogs with new APIs to incrementally add items and users in your datasets to create personalized recommendations. With the new putItems and putUsers APIs, you can simplify the process of managing your datasets. You no longer need to upload an entire dataset containing historical records and new records just to include new records in your recommendations. Providing new records to Amazon Personalize when they become available reduces your latency for incorporating new information, ensuring your recommendations remain relevant to your users and item catalog.

Based on over 20 years of personalization experience at Amazon.com, Amazon Personalize enables you to improve customer engagement by powering personalized product and content recommendations and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create higher-quality recommendations for your websites and applications. You can get started without any prior ML experience and use simple APIs to easily build sophisticated personalization capabilities in just a few clicks. Amazon Personalize processes and examines your data, identifies what is meaningful, and trains and optimizes a personalization model that is customized for your data. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

This post walks you through the process of incrementally modifying your items and users datasets in Amazon Personalize.

Adding new items and users to your datasets

For this use case, we create a dataset group with an interaction dataset, an item dataset (item metadata) and a user dataset using the Amazon Personalize CLI. For instructions on creating a dataset group, see Getting Started (CLI).

  1. Create an Interactions dataset using the following schema and import data using the interactions-100k.csv data file:
{
	"type": "record",
	"name": "Interactions",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "USER_ID",
			"type": "string"
		},
		{
			"name": "ITEM_ID",
			"type": "string"
		},
		{
			"name": "EVENT_TYPE",
			"type": [
				"string"
			]
		},
		{
			"name": "EVENT_VALUE",
			"type": [
				"null",
				"float"
			]
		},
		{
			"name": "TIMESTAMP",
			"type": "long"
		}
	]
}
  2. Create an Items dataset using the following schema and import data using the csv data file:
{
	"type": "record",
	"name": "Items",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "ITEM_ID",
			"type": "string"
		},
		{
			"name": "GENRE",
			"type": ["null", "string"],
			"categorical": true
		}
	],
	"version": "1.0"
}
  3. Create a Users dataset using the following schema and import data using the csv data file:
{
	"type": "record",
	"name": "Users",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "USER_ID",
			"type": "string"
		},
		{
			"name": "AGE",
			"type": "int"
		},
		{
			"name": "GENDER",
			"type": "string"
		}
	],
	"version": "1.0"
}

Now that you have created your datasets, you can add data to them in two different ways:

  • Using bulk import for item and user datasets from Amazon Simple Storage Service (Amazon S3). For more information, see Preparing and Importing Data.
  • Using the new putUsers and putItems APIs. You can incrementally add up to 10 records per call to the Users dataset using the putUsers API and to the Items dataset using the putItems API.

For the putUsers call, the Users dataset required schema field (USER_ID) is mapped to the camel case userId. For the putItems call, the Items dataset required schema field (ITEM_ID) is mapped to the camel case itemId.

The following code adds two new users to the Users dataset via the putUsers API:

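import boto3

# Client for the Amazon Personalize events API
personalize_events = boto3.client('personalize-events')

# 'properties' is a JSON string whose keys match the Users schema fields
personalize_events.put_users(
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/USERS",
    users=[
        {
            'userId': "489",
            'properties': '{"AGE": "29", "GENDER": "F"}'
        },
        {
            'userId': "650",
            'properties': '{"AGE": "65", "GENDER": "F"}'
        }
    ]
)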

The following code adds a new item to the Items dataset via the putItems API:

# personalize_events is a boto3 client for the 'personalize-events' service
personalize_events.put_items(
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/ITEMS",
    items=[
        {
            'itemId': "432",
            'properties': '{"GENRE": "Action"}'
        }
    ]
)
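
Because each call accepts at most 10 records, a larger set of new users has to be split into batches. The following Python sketch shows one way to do that. The helper names to_user_record, batches, and put_all_users and the example users are hypothetical, and the sketch omits error handling and retries.

```python
import json


def to_user_record(user_id: str, attributes: dict) -> dict:
    """Build one entry for the users list; properties must be a JSON string."""
    return {"userId": user_id, "properties": json.dumps(attributes)}


def batches(records: list, size: int = 10):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def put_all_users(dataset_arn: str, records: list) -> None:
    """Send every record to the Users dataset, one batch per putUsers call."""
    import boto3  # imported here so the helpers above work without AWS access

    personalize_events = boto3.client("personalize-events")
    for batch in batches(records):
        personalize_events.put_users(datasetArn=dataset_arn, users=batch)


# Hypothetical users for illustration.
records = [to_user_record(str(1000 + n), {"AGE": "30", "GENDER": "F"}) for n in range(23)]
print([len(batch) for batch in batches(records)])
```

The same pattern applies to putItems, with itemId in place of userId and the items parameter in place of users.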

An HTTP/1.1 200 response is returned for successful record creation. In cases where your new item or user doesn’t match your dataset’s defined schema, you receive an InvalidInputException detailing the total number of records in your request that don’t match the schema.

For new records created (incrementally or via bulk upload) with the same userId or itemId as a record that already exists in the Users or Items dataset, the most recently created record (ingested by Amazon Personalize) is used in new solutions or solution versions.

Additionally, records added using putUsers or putItems are persisted until your dataset is deleted, so be sure to delete your dataset in the dataset group before importing a refreshed dataset. Amazon Personalize doesn’t replace your catalog or user data management systems.

Incorporating the newly added users and items in recommendations and filters

Now that you’ve added new items and new users to your datasets, incorporating this information into your Amazon Personalize solutions makes sure that recommendations remain timely and relevant for your users. When not using the aws-user-personalization recipe, solution re-training is needed to include these new items in your personalized recommendations.

If you have exploration enabled in an Amazon Personalize recipe, your new items are included in recommendations as soon as your next campaign update is complete. New events generated by your users’ interactions with these items are incorporated when you train a new solution or solution version in this dataset group.

Any filters you created in the dataset group are updated with your new item and user data within 15 minutes from the last dataset import job completion or the last incremental record. This update allows your campaigns to use your most recent data when filtering recommendations for your users.

Summary

Amazon Personalize allows you to easily manage your growing item and user catalogs so your personalized product and content recommendations keep pace with your business and your customers. For more information about optimizing your user experience with Amazon Personalize, see What Is Amazon Personalize?


About the Authors

Matt Chwastek is a Senior Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build and use machine learning solutions. In his spare time, he enjoys reading and photography.

 

 

 

 

Gaurav Singh Chauhan is a Software Engineer for Amazon Personalize and works on architecting software systems and big data pipelines that serve customers at scale. Gaurav has a B.Tech in Computer Science from IIT Bombay, India. Outside of work, he likes all things outdoors and is an avid runner. In his spare time, he likes reading about and exploring new technologies. He tweets on startups, technology, and India at @bazingaurav.

 

 

Announcing the winner of the AWS DeepComposer Chartbusters The Sounds of Science challenge

We’re excited to announce the top 10 compositions and the winner of the AWS DeepComposer Chartbusters The Sounds of Science challenge. AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). Chartbusters is a global monthly challenge where you can use AWS DeepComposer to create original compositions and compete to top the charts and win prizes. To participate in The Sounds of Science, developers composed background music for a video clip using the Autoregressive CNN (AR-CNN) algorithm and edited notes with the newly launched Edit melody feature to better match the provided video.

Top 10 compositions

The high-quality submissions made it challenging for our judges to select the chart-toppers. Our panel of experts—Kesha Williams, Sally Revell, and Prachi Kumar—selected the top 10 ranked compositions by evaluating the quality of the music, creativity, and how well the music matched the video clip.

The winner of The Sounds of Science is… (cue drum roll) Sungin Lee! You can listen to his winning composition and the top 10 compositions on SoundCloud or on the AWS DeepComposer console. The top 10 compositions for the Sounds of Science challenge are:

Sungin will receive an AWS DeepComposer Chartbusters gold record and will tell his story in an upcoming post, right here on the AWS ML blog.

Congratulations, Sungin Lee!

It’s time to move on to the next Chartbusters challenge, Track or Treat, which is Halloween-themed. The challenge launches today and is open until October 23, 2020.


About the Author

Maryam Rezapoor is a Senior Product Manager with the AWS AI Ecosystem team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

Join AWS and NVIDIA at GTC, October 5–9

Starting Monday, October 5, 2020, the NVIDIA GPU Technology Conference (GTC) is offering online sessions for you to learn AWS best practices to accomplish your machine learning (ML), virtual workstations, high performance computing (HPC), and internet of things (IoT) goals faster and more easily.

Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training, cost-effective ML inference, flexible remote virtual workstations, and powerful HPC computations. At the edge, you can use AWS IoT Greengrass and SageMaker Neo to extend a wide range of AWS Cloud services and ML inference to NVIDIA-based edge devices so the devices can act locally on the data they generate.

AWS is a Global Diamond Sponsor of the conference.

Available sessions

The following sessions are available from AWS:

A Developer’s Guide to Choosing the Right GPUs for Deep Learning (Scheduled session IDs: A22318, A22319, A22320, and A22321)

  • As a deep learning developer or data scientist, you can choose from multiple GPU EC2 instance types based on your training and deployment requirements. You can access instances with different GPU memory sizes, NVIDIA GPU architectures, capabilities (precisions, Tensor Cores, NVLink), GPUs per instance, number of vCPUs, system memory, and network bandwidth. We’ll share some guidance on how you can choose the right GPU instance on AWS for your deep learning projects. You’ll get all the information you need to make an informed choice for GPU instance for your training and inference workload.
  • Speaker: Shashank Prasanna, Senior Developer Advocate, AI/ML, Amazon Web Services

Virtual Workstations on AWS for Digital Content Creation (On-Demand session IDs: A22276, A22311, A22312, and A22314)

  • Virtual workstations on AWS enable studios, departments, and freelancers to take on bigger projects, work from anywhere, and pay only for what they need. Running on Amazon EC2 G4 instances, virtual workstations employ the power of NVIDIA T4 Tensor Core GPUs and Quadro technology, the visual computing platform trusted by creative and technical professionals. Virtual workstations have become essential to creative professionals seeking cloud solutions that enable remote teams to work more efficiently, and keep creative productions moving forward. Join this session to learn more about how virtual workstations on AWS work, who is using them today, and how to get started.
  • Speaker: Haley Kannall, CG Supervisor, Amazon Web Services

Empower DeepStream Applications with AWS Data Services (On-Demand session IDs: A22279, A22315, A22316, and A22317)

  • We’ll discuss how we can optimize edge video inferencing performance by leveraging AWS infrastructure and NVIDIA DeepStream. We’ll emphasize three major features at the edge: (1) massively deploying trained models to NVIDIA Jetson devices using AWS IoT Greengrass, (2) local communication and control between AWS IoT Greengrass engines and DeepStream applications through MQTT messaging, and (3) uploading inferencing results to the cloud for further analytics.
  • Speaker: Yuxin Yang, IoT Architect, Amazon Web Services

GPU-Powered Edge Computing Applications Enabled by AWS Wavelength (On-Demand session IDs: A22374, A22375, A22376, and A22377)

  • In this presentation, we provide an overview of AWS Wavelength, how it integrates with the Mobile Edge carrier network and improves the performance of Mobile Edge applications. Wavelength Zones are AWS infrastructure deployments that embed AWS compute and storage services within telecommunications providers’ datacenters at the edge of the 5G network, so application traffic can reach application servers running in Wavelength Zones without leaving the mobile providers’ network. Customers with edge data processing needs such as image and video recognition, inference, data aggregation, and responsive analytics can use Wavelength to perform low-latency operations and processing right where their data is generated, reducing the need to move large amounts of data to be processed in centralized locations. We deep dive into these Mobile Edge applications running at the AWS Wavelength Zones using Amazon EC2 G4 instances powered by NVIDIA T4 Tensor Core GPUs.
  • Speaker: Sebastian Dreisch, Head of Wavelength GTM, Amazon Web Services

Next Generation Cloud Platform for Autonomous Vehicle (AV) Development (Scheduled session ID: A21517)

  • Development of autonomous driving systems presents a massive computational challenge, including processing petabytes of sensor data, which impacts time to market, scale, and cost, throughout the development cycle. Training, testing, validating, and deploying self-driving systems requires large-scale compute and storage infrastructure to support the end-to-end workflow. AWS offers a highly scalable and reliable solution for AV development including the latest generation GPUs from NVIDIA. By attending this webinar, you will learn about AWS AV solution architectures for data ingest, data management, simulation, and distributed model training, as well as strategies for cost optimization. NVIDIA will share new details about the next generation NVIDIA Ampere (A100) architecture. Attendees will walk away with an understanding of how AWS and NVIDIA can help streamline AV development and reduce IT costs and time-to-market.
  • Speakers: Shyam Kumar, Principal HPC Business Development Manager, Amazon Web Services, and Norm Marks, Global Senior Director, Automotive Industry, NVIDIA

Embracing Volatility: Using ML to Become More Efficient Amid Epic Uncertainty (Scheduled session ID: A22219)

  • We’re all used to change. In business, change is often predictable—different seasons, large-scale events, and new releases all drive fluctuations we’re used to. But right now, there’s nothing normal about the changes you’re facing. The only constant is uncertainty. And uncertainty is expensive. In the absence of an omniscient crystal ball, the next best thing is cloud and ML. This presentation is going to cover how to deal with the unexpected. Whether it’s rapidly changing traffic, shifting data sources, or model drift, we’ll cover how you can better manage spikes and dips of all sizes and improve predictions with AI to maximize your efficiencies today.
  • Speaker: Allie Miller, US Head of ML Business Development for Startups and Venture Capital, Amazon Web Services

Accelerating Data Science with NVIDIA RAPIDS (Scheduled session ID: A22042)

  • Data science workflows have become increasingly computationally intensive in recent years, and GPUs have stepped up to address this challenge. With the RAPIDS suite of open-source software libraries and APIs, data scientists can run end-to-end data science and analytics pipelines entirely on GPUs, allowing organizations to deliver results faster than ever. The AWS Cloud lets you access a large number of powerful NVIDIA GPUs with Amazon EC2 P3 based on V100 GPUs, Amazon EC2 G4 based on T4 GPUs, and upcoming A100-based GPU instances. We’ll go through the end-to-end process of running RAPIDS on AWS. We’ll start by running RAPIDS libraries on a single GPU instance. Next, we’ll see how you can run large-scale hyperparameter search experiments with RAPIDS and Amazon SageMaker. Finally, we’ll run RAPIDS distributed ML using Dask clusters on Amazon EKS and Amazon ECS.
  • Speaker: Shashank Prasanna, Senior Developer Advocate, AI/ML, Amazon Web Services

Interactive Scientific Visualization on AWS with NVIDIA IndeX SDK (On-Demand session ID: A21610)

  • Scientific visualization is critical to understanding complex phenomena modeled using HPC simulations. However, it has been challenging to do this effectively due to the inability to visualize large data volumes (> 1 PB) and lack of collaborative workflow solutions. NVIDIA IndeX on AWS, a 3D volumetric interactive visualization toolkit, addresses these problems by providing a scalable scientific visualization solution. NVIDIA IndeX allows you to make real-time modifications and navigate to the most pertinent parts of the data to gather better insights faster. IndeX leverages GPU clusters for scalable, real-time visualization and computing of multi-valued volumetric data together with embedded geometry data. We’ll demonstrate 3D volume rendering at scale on AWS using IndeX.
  • Speakers: Karthik Raman, Senior Solutions Architect, HPC, Amazon Web Services, and Dragos Tatulea, Software Engineer, NVIDIA

Conclusion

You can also visit AWS and NVIDIA to learn more or apply for a free trial to use NVIDIA GPU-based Amazon EC2 P3 instances powered by NVIDIA V100 Tensor Core GPUs and Amazon EC2 G4 instances powered by NVIDIA T4 Tensor Core GPUs. Learn more about GTC on the GTC 2020 website. We look forward to seeing you there!


About the Author

Geoff Murase is a Senior Product Marketing Manager for AWS EC2 accelerated computing instances, helping customers meet their compute needs by providing access to hardware-based compute accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In his spare time, he enjoys playing basketball and biking with his family.

Building an end-to-end intelligent document processing solution using AWS

As organizations grow, so does their need for better document processing. In industries such as healthcare, legal, insurance, and banking, the continuous influx of paper-based or PDF documents (like invoices, health charts, and insurance claims) has pushed businesses to consider evolving their document processing capabilities. In such scenarios, businesses and organizations find themselves in a race against time to deploy a sophisticated document analysis pipeline that can handle these documents in an automated and scalable fashion.

You can use Amazon Textract and Amazon Augmented AI (Amazon A2I) to process critical documents, and you can build NLP-based entity recognition models with Amazon SageMaker Ground Truth, Amazon Comprehend, and Amazon A2I. This post introduces another way to create a retrainable end-to-end document analysis solution with Amazon Textract, Amazon Comprehend, and Amazon A2I.

This solution takes scanned images of physical documents as input and extracts the text using Amazon Textract. It sends the text to be analyzed by a custom entity recognizer trained in Amazon Comprehend. Machine learning applications such as Amazon Comprehend work well at scale, and to get as close as possible to 100% accuracy, you can use human reviewers to review and validate low-confidence predictions. You can also use this human input to improve your underlying machine learning models: the output from Amazon Comprehend is sent to human reviewers through Amazon A2I, and their corrections are fed back to retrain the models and improve quality in future iterations. In addition, you can use Amazon A2I to provide human oversight of your machine learning models by randomly sending some data for human review to sample the output quality of your custom entity recognizer. With the help of these services, this automated pipeline can scale to millions of documents and allow businesses to do more detailed analysis of their documents.
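As a rough illustration of the two review triggers described above (low-confidence predictions and random sampling), the following sketch decides whether a document's predictions should go to human reviewers. It is not taken from the solution's code: the function name, the 0.8 score threshold, and the 5% sample rate are hypothetical values chosen for the example. The only thing it assumes about Amazon Comprehend is that each detected entity carries a Score field.

```python
import random
from typing import Optional


def needs_human_review(entities, score_threshold=0.8, sample_rate=0.05,
                       rng: Optional[random.Random] = None):
    """Return True if a document's predictions should be sent for human review.

    `entities` is a list of dicts with a "Score" key, as in Amazon Comprehend
    entity output. The threshold and sample rate are hypothetical defaults.
    """
    rng = rng or random.Random()
    # Trigger 1: at least one prediction is below the confidence threshold.
    if any(entity["Score"] < score_threshold for entity in entities):
        return True
    # Trigger 2: randomly sample a fraction of confident documents for auditing.
    return rng.random() < sample_rate
```

Setting sample_rate to 0 disables the random audit, and setting it to 1 sends every document for review.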

Solution overview

The following diagram illustrates the solution architecture.

This solution takes images (scanned documents or screenshots or pictures of documents) as input. You can upload these files programmatically or through the AWS Management Console into an Amazon Simple Storage Service (Amazon S3) bucket in the input folder. This action triggers an AWS Lambda function, TextractComprehendLambda, through event notifications.

The TextractComprehendLambda function sends the image to Amazon Textract to extract the text from the image. When it acquires the results, it collates the results and sends the text to the Amazon Comprehend custom entity recognizer. The custom entity recognizer is a pre-trained model that identifies entities in the text that are valuable to your business. This post demonstrates how to do this, in detail, in the following sections.
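The first half of that flow can be sketched in a few lines. This is a minimal illustration, not the code of the deployed function: the helper names are ours, and the sketch assumes the standard S3 event notification payload and the Blocks list returned by the Amazon Textract DetectDocumentText API.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Yield (bucket, key) pairs from an S3 event notification payload."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def collate_lines(textract_response):
    """Join the LINE blocks of a DetectDocumentText response into one string."""
    lines = [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]
    return "\n".join(lines)


# In a Lambda handler, the response would come from a call such as:
#   boto3.client("textract").detect_document_text(
#       Document={"S3Object": {"Bucket": bucket, "Name": key}})
```

Filtering on LINE blocks avoids duplicating text, because the same response also contains a WORD block for every word on the page.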

The custom entity recognizer stores the results in a separate bucket, which acts as temporary storage for this data. This bucket has another event notification, which triggers the ComprehendA2ILambda function. This Lambda function takes the output from the custom entity recognizer, processes it, and sends the results to Amazon A2I by creating a human loop for review and verification.

Amazon A2I starts the human loop, providing reviewers an interface to double-check the results and tag any entities that the custom entity recognition process may have missed. These reviewers submit their responses through the Amazon A2I worker console. When the human loop is complete, Amazon A2I sends an Amazon CloudWatch event, which triggers the HumanReviewCompleted Lambda function.

The HumanReviewCompleted function checks if the human reviewers have added any more annotations (because they found more custom entities). If the human reviewers found something that the custom entity recognizer missed, the function creates a new file called updated_entity_list.csv. This file contains all the entities that weren’t present in the previous training dataset.
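The check for reviewer-added entities can be sketched as follows. This is an illustration under stated assumptions rather than the function's actual code: it assumes each reviewer annotation has label, startOffset, and endOffset fields (the shape that the crowd-entity-annotation element produces), and the function name and case-insensitive matching are our own choices.

```python
def new_entities(original_text, annotations, known_entities):
    """Return (phrase, label) pairs that reviewers tagged but that are not yet known.

    `annotations` is a list of dicts with "label", "startOffset", and
    "endOffset" keys (assumed shape of the reviewer output).
    `known_entities` is an iterable of phrases already in the entity list.
    """
    known = {phrase.lower() for phrase in known_entities}
    found = []
    for annotation in annotations:
        # Recover the highlighted phrase from its character offsets.
        phrase = original_text[annotation["startOffset"]:annotation["endOffset"]].strip()
        key = phrase.lower()
        if phrase and key not in known:
            known.add(key)  # avoid reporting the same phrase twice
            found.append((phrase, annotation["label"]))
    return found
```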

At the end of each day, a CloudWatch alarm triggers the NewEntityCheck function. This function compares the entity_list.csv file and the updated_entity_list.csv file to check if any new entities were added in the last day. If so, it starts a new Amazon Comprehend custom entity recognizer training job and enables the CloudWatch time-based event trigger that triggers the CERTrainingCompleteCheck function every 15 minutes.
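The daily comparison amounts to a set difference between the two entity lists. The following sketch assumes both files use the Amazon Comprehend entity list layout, a CSV with Text and Type header columns; the function names are hypothetical and the file contents are passed in as strings to keep the example self-contained.

```python
import csv
import io


def read_entity_list(csv_text):
    """Parse an entity list CSV (Text,Type header) into a set of (text, type) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row["Text"].strip(), row["Type"].strip()) for row in reader}


def entities_added(original_csv, updated_csv):
    """Return the entities present in the updated list but not in the original."""
    return sorted(read_entity_list(updated_csv) - read_entity_list(original_csv))
```

A non-empty result is the signal to start a new training job; an empty result means the current recognizer is still up to date.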

The CERTrainingCompleteCheck function checks if the Amazon Comprehend custom entity recognizer has finished training. If so, the function adds the entries from updated_entity_list.csv to entity_list.csv so it doesn’t train the model again, unless even more entities are found by the human reviewers. It also disables its own CloudWatch time-based event trigger, because it doesn’t need to check the training process until it starts again. The next invocation of the TextractComprehendLambda function uses the new custom entity recognizer, which has learned from the human reviewers’ previous feedback.
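A minimal sketch of the training status check might look like the following. It relies on the Amazon Comprehend DescribeEntityRecognizer API, which reports a Status such as TRAINING, TRAINED, or IN_ERROR. The function name is ours, and the client is passed in as a parameter (for example, boto3.client("comprehend")) so that the logic can be exercised without AWS access.

```python
def training_finished(comprehend_client, recognizer_arn):
    """Return True once the custom entity recognizer has finished training.

    Raises RuntimeError if training failed, so the caller does not keep
    polling every 15 minutes for a job that cannot complete.
    """
    response = comprehend_client.describe_entity_recognizer(
        EntityRecognizerArn=recognizer_arn
    )
    status = response["EntityRecognizerProperties"]["Status"]
    if status == "IN_ERROR":
        raise RuntimeError(f"Training failed for recognizer {recognizer_arn}")
    return status == "TRAINED"
```

When this returns True, the function would merge the entity lists and disable its own scheduled rule, as described above.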

All these Lambda functions use AWS Systems Manager Parameter Store for sharing, retaining, and updating the various variables, like which custom entity recognizer is the current one and where all the data is stored.
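Sharing state through Parameter Store can be wrapped in a small helper like the one below. This is a sketch, not the solution's code: the class name, the /tda/ prefix, and the parameter name in the usage note are hypothetical. The get_parameter and put_parameter calls are the standard AWS Systems Manager API operations, and the client is injected (for example, boto3.client("ssm")).

```python
class ParameterStore:
    """Thin wrapper around the Systems Manager Parameter Store API."""

    def __init__(self, ssm_client, prefix="/tda/"):
        # The prefix is a hypothetical namespace for this solution's parameters.
        self._ssm = ssm_client
        self._prefix = prefix

    def get(self, name):
        """Read a string parameter, such as the ARN of the current recognizer."""
        response = self._ssm.get_parameter(Name=self._prefix + name)
        return response["Parameter"]["Value"]

    def set(self, name, value):
        """Create or overwrite a string parameter."""
        self._ssm.put_parameter(
            Name=self._prefix + name, Value=value, Type="String", Overwrite=True
        )
```

For example, after a new recognizer finishes training, one function could call store.set("CurrentRecognizerArn", new_arn) and the next invocation of another function could read it back with store.get("CurrentRecognizerArn").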

We demonstrate this solution in the us-east-1 Region, but you can run it in any compatible Region. For more information about availability of services in your Region, see the AWS Region Table.

Prerequisites

This post requires that you have an AWS account with appropriate AWS Identity and Access Management (IAM) permissions to launch the AWS CloudFormation template.

Deploying your solution

To deploy your solution, you complete the following high-level steps:

  1. Create an S3 bucket.
  2. Create a custom entity recognizer.
  3. Create a human review workflow.
  4. Deploy the CloudFormation stack.

Creating an S3 bucket

You first create the main bucket for this post. You use it to receive the input (the original scans of documents), and store the outputs for each step of the analysis. The Lambda functions pick up the results at the end of each state and collate them for further use and record-keeping. For instructions on creating a bucket, see Create a Bucket.

Capture the name of the S3 bucket and save it to use later in this walkthrough. We refer to this bucket as <primary_bucket> in this post. Replace this with the name of your actual bucket as you follow along.

Creating a custom entity recognizer

Amazon Comprehend allows you to bring your own training data, and train custom entity recognition models to customize the entity recognition process to your business-specific use cases. You can do this without having to write any code or have any in-house machine learning (ML) expertise. For this post, we provide a training dataset and document image, but you can use your own datasets when customizing Amazon Comprehend to suit your use case.

  1. Download the training dataset.
  2. Locate the bucket you created on the Amazon S3 console.

For this post, we use the bucket textract-comprehend-a2i-data, but you should use the name that you used for <primary_bucket>.

  3. Open the bucket and choose Create folder.
  4. For Name, enter comprehend_data.
  5. Uncompress the file you downloaded earlier and upload the files to the comprehend_data folder.
  6. On the Amazon Comprehend console, choose Launch Amazon Comprehend.
  7. Under Customization, choose Custom entity recognition.
  8. Choose Train Recognizer to open the entity recognizer training page.
  9. For Recognizer name, enter a name.

The name that you choose appears in the console hereafter, so something human readable and easily identifiable is ideal.

  10. For Custom entity types, enter your custom entity type (for this post, we enter DEVICE).

At the time of this writing, you can have up to 25 entity types per custom entity recognizer in Amazon Comprehend.

  11. In the Training data section, select Using entity list and training docs.
  12. Add the paths to entity_list.csv and raw_txt.csv for your <primary_bucket>.
  13. In the IAM role section, select Create a new role.
  14. For Name suffix, enter a suffix you can identify later (for this post, we enter TDA).
  15. Leave the remaining settings as default and choose Train.
  16. When the training is complete, choose your recognizer and copy the ARN for your custom entity recognizer for future use.
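If you prefer to script these console steps, the same training job can be described as a request to the Amazon Comprehend CreateEntityRecognizer API. The following sketch only builds the request; the helper name is ours, the role ARN is a placeholder you must supply, and the English language code is an assumption about the sample dataset. The file names and the comprehend_data folder are the ones used in the preceding steps.

```python
def recognizer_request(name, bucket, role_arn, entity_type="DEVICE",
                       prefix="comprehend_data"):
    """Build keyword arguments for Comprehend's create_entity_recognizer call."""
    return {
        "RecognizerName": name,
        "DataAccessRoleArn": role_arn,  # IAM role that lets Comprehend read the bucket
        "LanguageCode": "en",  # assumption: the training documents are in English
        "InputDataConfig": {
            "EntityTypes": [{"Type": entity_type}],
            # Equivalent of "Using entity list and training docs" on the console.
            "Documents": {"S3Uri": f"s3://{bucket}/{prefix}/raw_txt.csv"},
            "EntityList": {"S3Uri": f"s3://{bucket}/{prefix}/entity_list.csv"},
        },
    }
```

You would pass the result to boto3.client("comprehend").create_entity_recognizer(**request); the response contains the EntityRecognizerArn that you need later in this walkthrough.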

Creating a human review workflow

To create a human review workflow, you need to have three things ready:

  • Reviewing workforce – A work team is a group of people that you select to review your documents. You can create a work team from a workforce, which is made up of Amazon Mechanical Turk workers, vendor-managed workers, or your own private workers that you invite to work on your tasks. Whichever workforce type you choose, Amazon A2I takes care of sending tasks to workers. For this post, you create a work team using a private workforce and add yourself to the team to preview the Amazon A2I workflow.
  • Worker task template – This is a template that defines what the console looks like to the reviewers.
  • S3 bucket – This is where the output of Amazon A2I is stored. You already created a bucket earlier, so this post uses the same bucket.

Creating a workforce

To create and manage your private workforce, you can use the Labeling workforces page on the Amazon SageMaker console. When following the instructions, you can create a private workforce by entering worker emails or importing a pre-existing workforce from an Amazon Cognito user pool.

If you already have a work team, you can use the same work team with Amazon A2I and skip to the following section.

To create your private work team, complete the following steps:

  1. Navigate to the Labeling workforces page on the Amazon SageMaker console.
  2. On the Private tab, choose Create private team.
  3. Choose Invite new workers by email.
  4. For this post, enter your email address to work on your document processing tasks.

You can enter a list of up to 50 email addresses, separated by commas, into the Email addresses box.

  5. Enter an organization name and contact email.
  6. Choose Create private team.
  7. After you create a private team, choose the team to start adding reviewers to your private workforce.
  8. On the Workers tab, choose Add workers to team.
  9. Enter the email addresses you want to add and choose Invite new workers.

After you add the workers (in this case, yourself), you get an email invitation. The following screenshot shows an example email.

After you choose the link and change your password, you’re registered as a verified worker for this team. Your one-person team is now ready to review.

  10. Choose the link for Labeling Portal Sign-in URL and log in using the credentials generated in the previous step.

You should see a page similar to the following screenshot.

This is the Amazon A2I worker portal.

Creating a worker task template

You can use a worker template to customize the interface and instructions that your workers see when working on your tasks. To create a worker task template, complete the following steps:

  1. Navigate to the Worker task templates page on the Amazon SageMaker console.

For this post, we use Region us-east-1. For availability details for Amazon A2I and Amazon Comprehend in your preferred Region, see the AWS Region Table.

  2. Choose Create template.
  3. For Template name, enter custom-entity-review-template.
  4. In the Template editor field, enter the code from the following task-template.html.zip file:
<!-- Copyright Amazon.com, Inc. and its affiliates. All Rights Reserved.
SPDX-License-Identifier: MIT

Licensed under the MIT License. See the LICENSE accompanying this file
for the specific language governing permissions and limitations under
the License. -->

<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>

<crowd-entity-annotation
        name="crowd-entity-annotation"
        header="Highlight parts of the text below"
        labels="{{ task.input.labels | to_json | escape }}"
        initial-value="{{ task.input.initialValue }}"
        text="{{ task.input.originalText }}"
>
    <full-instructions header="Named entity recognition instructions">
        <ol>
            <li><strong>Read</strong> the text carefully.</li>
            <li><strong>Highlight</strong> words, phrases, or sections of the text.</li>
            <li><strong>Choose</strong> the label that best matches what you have highlighted.</li>
            <li>To <strong>change</strong> a label, choose highlighted text and select a new label.</li>
            <li>To <strong>remove</strong> a label from highlighted text, choose the X next to the abbreviated label name on the highlighted text.</li>
            <li>You can select all of a previously highlighted text, but not a portion of it.</li>
        </ol>
    </full-instructions>

    <short-instructions>
        Highlight the custom entities that went missing.
    </short-instructions>

</crowd-entity-annotation>

<script>
    document.addEventListener('all-crowd-elements-ready', () => {
        document
            .querySelector('crowd-entity-annotation')
            .shadowRoot
            .querySelector('crowd-form')
            .form;
    });
</script>

  5. Choose Create.
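The template reads three values from the task input: task.input.labels, task.input.initialValue, and task.input.originalText. The following sketch shows one way a Lambda function could build a matching human loop request from Amazon Comprehend output. It is an illustration, not the solution's code: the helper names and the entity-review- name prefix are hypothetical, and passing initialValue as a list is an assumption (because the template renders it without a to_json filter, you may need to serialize it to a JSON string first).

```python
import json
import uuid


def to_annotation(entity):
    """Map a Comprehend entity onto the fields used by crowd-entity-annotation."""
    return {
        "label": entity["Type"],
        "startOffset": entity["BeginOffset"],
        "endOffset": entity["EndOffset"],
    }


def human_loop_request(flow_definition_arn, text, entities, labels=("DEVICE",)):
    """Build keyword arguments for the A2I runtime start_human_loop call."""
    content = {
        "originalText": text,
        "labels": [{"label": label} for label in labels],
        # Assumption: pre-populate the UI with what the recognizer already found.
        "initialValue": [to_annotation(entity) for entity in entities],
    }
    return {
        # Human loop names must be unique; this prefix is a hypothetical choice.
        "HumanLoopName": f"entity-review-{uuid.uuid4().hex}",
        "FlowDefinitionArn": flow_definition_arn,
        "HumanLoopInput": {"InputContent": json.dumps(content)},
    }
```

You would pass the result to boto3.client("sagemaker-a2i-runtime").start_human_loop(**request), using the ARN of the human review workflow that you create in the next section.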

Creating a human review workflow

Human review workflows allow human reviewers to audit the custom entities that are detected using Amazon Comprehend on an ongoing basis. To create a human review workflow, complete the following steps:

  1. Navigate to the Human review workflow page on the Amazon SageMaker console.
  2. Choose Create human review workflow.
  3. In the Workflow settings section, for Name, enter a unique workflow name.
  4. For S3 bucket, enter the S3 bucket where you want to store the human review results.

For this post, we use the same bucket that we created earlier, but add the suffix /a2i-raw-output. For example, if you created a bucket called textract-comprehend-a2i-data, enter the path s3://textract-comprehend-a2i-data/a2i-raw-output. This subfolder contains the edits that the reviewers make in all the human review workflow jobs that are created for Amazon Comprehend custom entity recognition. (Replace the bucket name with the value of <primary_bucket>.)

  5. For IAM role, choose Create a new role from the drop-down menu.

Amazon A2I can create a role automatically for you.

  6. For S3 buckets you specify, select Specific S3 buckets.
  7. Enter the name of the S3 bucket you created earlier (<primary_bucket>).
  8. Choose Create.

You see a confirmation when role creation is complete and your role is now pre-populated in the IAM role drop-down menu.

  9. For Task type, select Custom.
  10. In the Worker task template section, for Template, choose custom-entity-review-template.
  11. For Task description, add a description that briefly describes the task for your workers.
  12. In the Workers section, select Private.
  13. For Private teams, choose textract-comprehend-a2i-review-team.
  14. Choose Create.

You see a confirmation when human review workflow creation is complete.

Copy the workflow ARN and save it somewhere. You need this in the upcoming steps. You also need to keep the Amazon A2I Worker Portal (created earlier) open and ready after this step.

Deploying the CloudFormation stack

Launch the following CloudFormation stack to deploy the stack required for running the entire flow:

This creates the remaining elements for running your human review workflow for the custom entity recognizer. When creating the stack, enter the following values:

  • CustomEntityRecognizerARN – The ARN for the custom entity recognizer.
  • CustomEntityTrainingDatasetS3URI – The path to the training dataset that you used for creating the custom entity recognizer.
  • CustomEntityTrainingListS3URI – The path to the entity list that you used for training the custom entity recognizer.
  • FlowDefinitionARN – The ARN of the human review workflow.
  • S3BucketName – The name of the bucket you created.
  • S3ComprehendBucketName – A unique name for an empty S3 bucket that stores temporary output from Amazon Comprehend. You don’t need to create this bucket yourself; the CloudFormation template creates it for you, so just provide a unique name here.

Choose the defaults of the stack deployment wizard. On the Review page, in the Capabilities and transforms section, select the three check-boxes and choose Create stack.

You need to confirm that the stack was deployed successfully on your account. You can do so by navigating to the AWS CloudFormation console and looking for the stack name TDA.

When the status of the stack changes to CREATE_COMPLETE, you have successfully deployed the document analysis solution to your account.
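You can also check the stack status from a script instead of the console. The following sketch uses the AWS CloudFormation DescribeStacks API; the function name is ours, the default stack name TDA is the one mentioned above, and the client is passed in (for example, boto3.client("cloudformation")).

```python
def stack_is_deployed(cloudformation_client, stack_name="TDA"):
    """Return True if the stack has reached CREATE_COMPLETE."""
    response = cloudformation_client.describe_stacks(StackName=stack_name)
    # describe_stacks returns a list; a named lookup yields a single entry.
    status = response["Stacks"][0]["StackStatus"]
    return status == "CREATE_COMPLETE"
```

Any other status, such as CREATE_IN_PROGRESS or ROLLBACK_COMPLETE, returns False, so check the Events tab on the AWS CloudFormation console if the stack does not complete.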

Testing the solution

You can now test the end-to-end flow of this solution. To test each component, you complete the following high-level steps:

  1. Upload a file.
  2. Verify the Amazon Comprehend job status.
  3. Review the worker portal.
  4. Verify the changes were recorded.

Uploading a file

In real-world situations, when businesses receive a physical document, they scan, photocopy, email, or upload it to some form of an image-based format for safe-keeping as a backup mechanism. The following is the sample document we use in this post.

To upload the file, complete the following steps:

  1. Download the image.
  2. On the Amazon S3 console, navigate to your <primary_bucket>.
  3. Choose Create folder.
  4. For Name, enter input.
  5. Choose Save.
  6. Upload the image you downloaded into this folder.

This upload triggers the TextractComprehendLambda function, which sends the uploaded image to Amazon Textract and then sends the extracted text to Amazon Comprehend.

Verifying Amazon Comprehend job status

You can now verify that the Amazon Comprehend job is working.

  1. On the Amazon Comprehend console, choose Analysis jobs.
  2. Verify that your job is in status In progress.

When the status switches to Completed, you can proceed to the next step.

Reviewing the worker portal

You can now test out the human review worker portal.

  1. Navigate to the Amazon A2I worker portal that you created.

You should have a new job waiting to be processed.

  2. Select the job and choose Start working.

You’re redirected to the review screen.

  3. Tag any new entities that the algorithm missed.
  4. When you’re finished, choose Submit.

Verifying that the changes were recorded

Now that you have submitted your input in the Amazon A2I worker portal, the HumanReviewCompleted Lambda function adds the newly identified entities to the existing ones and stores them as a separate entity list in the S3 bucket. You can verify that this has happened by navigating to <primary_bucket> on the Amazon S3 console.

In the folder comprehend_data, you should see a new file called updated_entity_list.csv.

The NewEntityCheck Lambda function uses this file at the end of each day to compare against the original entity_list.csv file. If new entities are in the updated_entity_list.csv file, the model is retrained and replaces the older custom entity recognition model.

This allows the Amazon Comprehend custom entity recognition model to improve continuously by incorporating the feedback received from human reviewers through Amazon A2I. Over time, this can reduce the need for reviewers and manual intervention by analyzing documents in a more intelligent and sophisticated manner.

Cost

With this solution, you can now process scanned and physical documents at scale and do ML-powered analysis on them. The cost to run this example is less than $5.00. For more information about exact costs, see Amazon Textract pricing, Amazon Comprehend pricing, and Amazon A2I pricing.

Cleaning up

To avoid incurring future charges, delete the resources when not in use.

Conclusion

This post demonstrated how you can build an end-to-end document analysis solution for analyzing scanned images of documents using Amazon Textract, Amazon Comprehend, and Amazon A2I. This allows you to create review workflows for the critical documents you need to analyze using your own private workforce, and provides increased accuracy and context.

This solution also demonstrated how you can improve your Amazon Comprehend custom entity recognizer over time by retraining the models on the newer entities that the reviewers identify.

For the code used in this walkthrough, see the GitHub repo. For information about adding another review layer for Amazon Textract using Amazon A2I, see Using Amazon Textract with Amazon Augmented AI for processing critical documents.


About the Author

Purnesh Tripathi is a Solutions Architect at Amazon Web Services. He has been a data scientist in his previous life, and is passionate about the benefits that Machine Learning and Artificial Intelligence bring to a business. He works with small and medium businesses, and startups in New Zealand to help them innovate faster using AWS.
