Achieving 1.85x higher performance for deep learning based object detection with an AWS Neuron compiled YOLOv4 model on AWS Inferentia

In this post, we show you how to deploy a TensorFlow-based YOLOv4 model, built with Keras and optimized for inference, on AWS Inferentia based Amazon EC2 Inf1 instances. You will set up a benchmarking environment to evaluate throughput and precision, comparing Inf1 with comparable Amazon EC2 G4 GPU-based instances. Deploying YOLOv4 on AWS Inferentia provides the highest throughput, lowest latency with minimal latency jitter, and the lowest cost per image.

The following charts show a 2-hour run in which Inf1 provides higher throughput and lower latency. The Inf1 instances achieved up to 1.85 times higher throughput and 37% lower cost per image when compared to the most optimized Amazon EC2 G4 GPU-based instances.

In addition, the following graph shows that P90 inference latency is 60% lower on Inf1, with significantly lower variance compared to the G4 instances.

When you use the AWS Neuron data type auto-casting feature, there is no measurable degradation in accuracy. The compiler automatically converts the pipeline to mixed precision with BF16 data types for increased performance. The model reaches 48.7% mean average precision (mAP), thanks to the state-of-the-art YOLOv4 model implementation.

About AWS Inferentia and AWS Neuron SDK

AWS Inferentia chips are custom built by AWS to provide high inference performance at the lowest cost of inference in the cloud. They offer seamless features such as automatic conversion of trained FP32 models to bfloat16, and a flexible machine learning (ML) compute architecture that supports a wide range of model types, from image recognition and object detection to natural language processing (NLP) and modern recommender models.

AWS Neuron is a software development kit (SDK) consisting of a compiler, runtime, and profiling tools that optimize the ML inference performance of the Inferentia chips. Neuron is natively integrated with popular ML frameworks such as TensorFlow and PyTorch, and comes pre-installed in the AWS Deep Learning AMIs. Therefore, deploying deep learning models on AWS Inferentia is done in the same familiar environment used on other platforms, and your applications benefit from the boost in performance and lower cost.

Since its launch, the Neuron SDK has seen dramatic improvement in the breadth of models that deliver high performance at a fraction of the cost. This includes NLP models like the popular BERT, image classification models (ResNet, VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new Neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

You Only Look Once

Object detection stands out as a computer vision (CV) task that has seen large accuracy improvements (average precision at 50 IoU > 70) due to deep learning model architectures. An object detection model tries to localize and classify objects in an image, allowing for applications ranging from real-time inspection of manufacturing defects to medical imaging and tracking your favorite player and the ball in a soccer match.

Addressing the real-time inference challenges of such computer vision tasks is key for deploying these models at scale.

YOLO is part of the deep learning (DL) single-stage object detection model family, which includes models such as Single-Shot Detector (SSD) and RetinaNet. These models are usually built from stacking a backbone, neck, and head neural network that together perform detection and classification tasks. The main predictions are bounding boxes for identified objects and associated classes.

The backbone network takes care of extracting features from the input image, while the head is trained on the supervised task to predict the edges of the bounding box and classify its contents. The addition of a neck neural network allows the head network to process features from intermediate stages of the backbone. The whole pipeline processes the images only once, hence the name You Only Look Once (YOLO).

On the other hand, models with two-stage detectors further process features from the earlier convolutional layers to obtain region proposals prior to generating object class predictions. In this way, the network focuses on detecting and classifying objects in regions of high object probability.

The following diagram illustrates this architecture (from YOLOv4: Optimal Speed and Accuracy of Object Detection, arXiv:2004.10934v1).

Single-stage models allow for multiple predictions of the same object in a single image. These predictions are disambiguated later by a process called non-max suppression (NMS), which keeps only the highest-probability bounding box and label for each object. It's a less computationally costly workflow than the two-stage approach.
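
To make the idea concrete, the following is a minimal sketch of a greedy NMS pass over candidate boxes; the (x1, y1, x2, y2) box format and the IoU threshold are illustrative assumptions, not the exact implementation used inside YOLOv4.

import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep only the highest-scoring box among heavily overlapping candidates."""
    order = np.argsort(scores)[::-1]  # process boxes from highest to lowest score
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        # intersection of the kept box with the remaining boxes (x1, y1, x2, y2 format)
        x1 = np.maximum(boxes[best, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[best, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[best, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[best, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_best + area_rest - inter)
        # drop boxes that overlap the kept box above the IoU threshold
        order = order[1:][iou <= iou_threshold]
    return keep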

Models like YOLO are all about performance. The latest incarnation, version 4, aims at pushing the prediction accuracy further. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection shows how real-time inference can be achieved above the human perception threshold of around 30 frames per second (FPS). In this post, you explore ways to push the performance of this model even further and use AWS Inferentia as a cost-effective hardware accelerator for real-time object detection.

Prerequisites

For this walkthrough, you need an AWS account with access to the AWS Management Console and the ability to create Amazon Elastic Compute Cloud (Amazon EC2) instances with public-facing IP.

Working knowledge of AWS Deep Learning AMIs and Jupyter notebooks with Conda environments is beneficial, but not required.

Building a YOLOv4 predictor from a pre-trained model

To start building the model, set up an inf1.2xlarge EC2 instance in AWS, with 8 vCPU cores and 16 GB of memory. The Inf1 family allows you to optimize the ratio between CPU and Inferentia devices by choosing between instance sizes such as inf1.xlarge and inf1.2xlarge. We found that for YOLOv4, the optimal CPU-to-accelerator balance is achieved with inf1.2xlarge: moving up to this second instance size improves throughput for a lower cost per image. Use the AWS Deep Learning AMI (Ubuntu 18.04) version 34.0 (ami-06a25ee8966373068) in the US East (N. Virginia) Region. This AMI comes pre-packaged with the Neuron SDK and the required Neuron runtime for AWS Inferentia. For more information about running AWS Deep Learning AMIs on EC2 instances, see Launching and Configuring a DLAMI.

Next you can connect to the instance through SSH, activate the aws_neuron_tensorflow_p36 Conda environment, and update the Neuron compiler to the latest release. The compilation script depends on requirements listed in the YOLOv4 tutorial posted on the Neuron GitHub repo. Install them by running the following code in the terminal:

pip install neuron-cc tensorflow-neuron requests pillow matplotlib pycocotools==2.0.1 torch~=1.5.0 --force --extra-index-url=https://pip.repos.neuron.amazonaws.com

You can also run the following steps directly from the provided Jupyter notebook. If doing so, skip to the Running a performance benchmark on Inferentia section to explore the performance benefits of running YOLOv4 on AWS Inferentia.

The benchmark of the models requires an object detection validation dataset. Start by downloading the COCO 2017 validation dataset. COCO (Common Objects in Context) is a large-scale object detection, segmentation, and captioning dataset, with over 300,000 images and 1.5 million object instances. The 2017 version of COCO contains 5,000 images for validation.

To download the dataset, enter the following code on the terminal:

curl -LO http://images.cocodataset.org/zips/val2017.zip
curl -LO http://images.cocodataset.org/annotations/annotations_trainval2017.zip
unzip -q val2017.zip
unzip annotations_trainval2017.zip

When the download is complete, you should see a val2017 and an annotations folder available in your working directory. At this stage, you’re ready to build and compile the model.

The GitHub repo contains the script yolo_v4_coco_saved_model.py for downloading the pretrained weights of a PyTorch implementation of YOLOv4, and the model definition for YOLOv4 using TensorFlow 1.15 and Keras. The code was adapted from an earlier implementation and converts the PyTorch checkpoint to a Keras h5 saved model. This implementation of YOLOv4 is optimized to run on AWS Inferentia. For more information about optimizations, see Working with YOLO v4 using AWS Neuron SDK.

To download, convert, and save your Keras model to the yolo_v4_coco_saved_model folder, enter the following code:

python3 yolo_v4_coco_saved_model.py ./yolo_v4_coco_saved_model

To instantiate a new predictor from the saved model, use tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model') in your inference script.

The following code implements a single batch predictor and image annotation script, so you can test the saved model:

import json
import tensorflow as tf
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

yolo_pred_cpu = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model')
image_path = './val2017/000000581781.jpg'
with open(image_path, 'rb') as f:
    feeds = {'image': [f.read()]}

results = yolo_pred_cpu(feeds)

# load annotations to decode classification result
with open('./annotations/instances_val2017.json') as f:
    annotate_json = json.load(f)
label_info = {idx+1: cat['name'] for idx, cat in enumerate(annotate_json['categories'])}

# draw picture and bounding boxes
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(Image.open(image_path).convert('RGB'))

wanted = results['scores'][0] > 0.1

for xyxy, label_no_bg in zip(results['boxes'][0][wanted], results['classes'][0][wanted]):
    xywh = xyxy[0], xyxy[1], xyxy[2] - xyxy[0], xyxy[3] - xyxy[1]
    rect = patches.Rectangle((xywh[0], xywh[1]), xywh[2], xywh[3], linewidth=1, edgecolor='g', facecolor='none')
    ax.add_patch(rect)
    rx, ry = rect.get_xy()
    rx = rx + rect.get_width() / 2.0
    ax.annotate(label_info[label_no_bg + 1], (rx, ry), color='w', backgroundcolor='g', fontsize=10,
                ha='center', va='center', bbox=dict(boxstyle='square,pad=0.01', fc='g', ec='none', alpha=0.5))
plt.show()

The performance in this setup isn’t optimal because you ran YOLO only on CPU. Despite the native parallelization from TensorFlow, the eight cores aren’t enough to bring the inference time close to real time. For that, you use AWS Inferentia.

Compiling YOLOv4 to run on AWS Inferentia

The compilation of YOLOv4 uses the TensorFlow-Neuron API tfn.saved_model.compile, working directly with the saved model directory created earlier. To further reduce the Neuron runtime overhead, two extra arguments are added to the compiler call: no_fuse_ops and minimum_segment_size.

The first argument, no_fuse_ops, partitions the graph prior to casting the FP16 tensors running in the sub-graph back to FP32, as defined in the model script. This allows operations that run more efficiently on CPU to be skipped while the Neuron compiler runs its automatic smart partitioning. The argument minimum_segment_size sets the minimum number of operations a sub-graph must contain to be compiled for Inferentia, forcing trivially compilable sections to run on CPU instead. For more information, see Reference: TensorFlow-Neuron Compilation API.

To compile the model, enter the following code:

import shutil
import tensorflow as tf
import tensorflow.neuron as tfn


def no_fuse_condition(op):
    return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1/Cast', 'lambda_2/Cast', 'lambda_3/Cast'])

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './yolo_v4_coco_saved_model')
    no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]

shutil.rmtree('./yolo_v4_coco_saved_model_neuron', ignore_errors=True)

result = tfn.saved_model.compile(
                './yolo_v4_coco_saved_model', './yolo_v4_coco_saved_model_neuron',
                # we partition the graph before casting from float16 to float32, to help reduce the output tensor size by 1/2
                no_fuse_ops=no_fuse_ops,
                # to enforce trivial compilable subgraphs to run on CPU
                minimum_segment_size=100,
                batch_size=1,
                dynamic_batch_size=True,
)

print(result)

On an inf1.2xlarge, the compilation takes only a few minutes and outputs the ratio of graph operations that run on the AWS Inferentia chip. For our model, it's approximately 79%. As mentioned earlier, to optimize the compiled model for performance, the target of the compilation shouldn't be to maximize operations on the AWS Inferentia chip, but to balance the use of the available CPUs for efficient combined hardware utilization.

AWS Inferentia is designed to reach peak throughput at small—usually single-digit—batch sizes. When optimizing a specific model for throughput, explore compiling the model with different values of the batch_size argument and test what batch size yields the maximum throughput for your model. In the case of our YOLOv4 model, the best batch size is 1.
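
One way to run that exploration is to compile the model once per candidate batch size and benchmark each artifact. The sketch below reuses the no_fuse_ops list computed earlier; measure_throughput is a hypothetical placeholder for your own benchmark (for example, the evaluate function shown later in this post).

import tensorflow.neuron as tfn

# Sketch: compile one saved model per candidate batch size, then benchmark each.
# measure_throughput is a hypothetical helper you would implement yourself.
for batch_size in [1, 2, 4, 8]:
    output_dir = './yolo_v4_coco_saved_model_neuron_bs{}'.format(batch_size)
    tfn.saved_model.compile(
        './yolo_v4_coco_saved_model', output_dir,
        no_fuse_ops=no_fuse_ops,
        minimum_segment_size=100,
        batch_size=batch_size,
        dynamic_batch_size=True,
    )
    fps = measure_throughput(output_dir, batch_size)
    print('batch_size={}: {:.1f} FPS'.format(batch_size, fps))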

Replace the model path in the predictor instantiation with tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron') for a comparison with the previous CPU-only inference. You get similar detection accuracy at a fraction of the inference time, approximately 40 milliseconds.
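
As a quick sanity check, you can time a single warm prediction against the compiled model; a minimal sketch:

import time
import tensorflow as tf

yolo_pred_inf1 = tf.contrib.predictor.from_saved_model('./yolo_v4_coco_saved_model_neuron')

with open('./val2017/000000581781.jpg', 'rb') as f:
    feeds = {'image': [f.read()]}

yolo_pred_inf1(feeds)  # warm-up call to load the model onto the Inferentia chip
start = time.time()
yolo_pred_inf1(feeds)
print('Single-image inference time: {:.1f} ms'.format((time.time() - start) * 1000))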

Setting up a benchmarking pipeline

To set up a performance measuring pipeline, create a multi-threaded loop running inference on all the COCO images downloaded. The code available in the notebook adapts the original implementation of the eval function. The following adapted version implements a ThreadPoolExecutor to send four parallel prediction calls at a time:

import time
from concurrent import futures

import numpy as np

def evaluate(yolo_predictor, images, eval_pre_path, anno_file, eval_batch_size, _clsid2catid):
    batch_im_id_list, batch_im_name_list, batch_img_bytes_list = get_image_as_bytes(images, eval_pre_path)

    # warm up
    yolo_predictor({'image': np.array(batch_img_bytes_list[0], dtype=object)})

    with futures.ThreadPoolExecutor(4) as exe:
        fut_im_list = []
        fut_list = []
        start_time = time.time()
        for batch_im_id, batch_im_name, batch_img_bytes in zip(batch_im_id_list, batch_im_name_list, batch_img_bytes_list):
            if len(batch_img_bytes) != eval_batch_size:
                continue
            fut = exe.submit(yolo_predictor, {'image': np.array(batch_img_bytes, dtype=object)})
            fut_im_list.append((batch_im_id, batch_im_name))
            fut_list.append(fut)
        bbox_list = []
        count = 0
        for (batch_im_id, batch_im_name), fut in zip(fut_im_list, fut_list):
            results = fut.result()
            bbox_list.extend(analyze_bbox(results, batch_im_id, _clsid2catid))
            for _ in batch_im_id:
                count += 1
                if count % 100 == 0:
                    print('Test iter {}'.format(count))
        
        print('==================== Performance Measurement ====================')
        print('Finished inference on {} images in {} seconds'.format(len(images), time.time() - start_time))
        print('=================================================================')
    
    # start evaluation
    box_ap_stats = bbox_eval(anno_file, bbox_list)
    return box_ap_stats

Additional helper functions are used to calculate average precision scores of the deployed model.
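
For reference, a helper like bbox_eval typically wraps the pycocotools COCOeval API; the following sketch shows the general shape of such a function, not necessarily the exact code used in the notebook.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

def bbox_eval(anno_file, bbox_list):
    # bbox_list holds detections in COCO results format:
    # {'image_id': ..., 'category_id': ..., 'bbox': [x, y, w, h], 'score': ...}
    coco_gt = COCO(anno_file)
    coco_dt = coco_gt.loadRes(bbox_list)
    coco_eval = COCOeval(coco_gt, coco_dt, 'bbox')
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()   # prints the AP/AR table shown in the next section
    return coco_eval.stats  # stats[0] is mAP at IoU=0.50:0.95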

Running a performance benchmark on Inferentia

To run the COCO evaluation and benchmark the time to infer over the 5,000 images, run the evaluate function as shown in the following code:

val_coco_root = './val2017'
val_annotate = './annotations/instances_val2017.json'
clsid2catid = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 10: 11, 11: 13, 12: 14, 13: 15, 14: 16,
               15: 17, 16: 18, 17: 19, 18: 20, 19: 21, 20: 22, 21: 23, 22: 24, 23: 25, 24: 27, 25: 28, 26: 31,
               27: 32, 28: 33, 29: 34, 30: 35, 31: 36, 32: 37, 33: 38, 34: 39, 35: 40, 36: 41, 37: 42, 38: 43,
               39: 44, 40: 46, 41: 47, 42: 48, 43: 49, 44: 50, 45: 51, 46: 52, 47: 53, 48: 54, 49: 55, 50: 56,
               51: 57, 52: 58, 53: 59, 54: 60, 55: 61, 56: 62, 57: 63, 58: 64, 59: 65, 60: 67, 61: 70, 62: 72,
               63: 73, 64: 74, 65: 75, 66: 76, 67: 77, 68: 78, 69: 79, 70: 80, 71: 81, 72: 82, 73: 84, 74: 85,
               75: 86, 76: 87, 77: 88, 78: 89, 79: 90}
eval_batch_size = 8

with open(val_annotate, 'r', encoding='utf-8') as f2:
    for line in f2:
        line = line.strip()
        dataset = json.loads(line)
        images = dataset['images']

box_ap = evaluate(yolo_pred, images, val_coco_root, val_annotate, eval_batch_size, clsid2catid)

When the evaluation is complete, you see logs on the screen like the following:

…

Test iter 4500
Test iter 4600
Test iter 4700
Test iter 4800
Test iter 4900
==================== Performance Measurement ====================
Finished inference on 5000 images in 47.50522780418396 seconds
=================================================================

…

Accumulating evaluation results...
DONE (t=6.78s).
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.487
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.741
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.531
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.330
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.546
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.604
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.357
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.573
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.601
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.657
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.744

At 5,000 images processed in 47 seconds, this deployment achieves 106 FPS, 3.5 times faster than the real-time threshold of 30 FPS. The research paper YOLOv4: Optimal Speed and Accuracy of Object Detection lists the results for batch one performance over the same COCO 2017 dataset running on an NVIDIA Volta GPU, such as the V100. The largest frame rate obtained was 96 FPS, at 41.2% mAP. Our model architecture and deployment achieve a higher mAP, 48.7%, at a higher frame rate.

To have a direct comparison between AWS Inferentia, NVIDIA Volta, and Turing architectures, we replicated the same experiment in two GPU based instances, g4dn.xlarge and p3.2xlarge, by running the exact same model prior to compilation, with no further GPU optimization. This time we achieved 39 FPS and 111 FPS for the g4dn.xlarge and p3.2xlarge, respectively.

A YOLO model deployed in production usually doesn’t see a defined batch of 5,000 images at a time. To measure production like performance, we set up a prediction-only multi-threaded pipeline that runs inference for extended periods.

For a total time of 2 hours, we continually ran 8 parallel prediction calls with a batch of 4 images on each, totaling 32 images at a time. To maximize GPU throughput and try to decrease the performance gap between the Inf1 and G4 instances, we use the TensorFlow XLA compiler. This setup mimics a live endpoint behavior running at maximum throughput.
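
The following sketch outlines that prediction-only loop: eight worker threads, each repeatedly sending a batch of four images, with the total count used to compute average FPS. It is a simplified illustration of our setup rather than the exact benchmark script.

import itertools
import time
from concurrent import futures

import numpy as np

def sustained_load(predictor, batches_of_4, duration_s=2 * 60 * 60, num_workers=8):
    """Continually run batch-4 predictions from parallel workers and report average FPS."""
    start = time.time()
    deadline = start + duration_s

    def worker():
        done = 0
        for batch in itertools.cycle(batches_of_4):  # each batch holds 4 image byte strings
            if time.time() > deadline:
                return done
            predictor({'image': np.array(batch, dtype=object)})
            done += 1

    with futures.ThreadPoolExecutor(num_workers) as exe:
        counts = [exe.submit(worker) for _ in range(num_workers)]
        total_batches = sum(fut.result() for fut in counts)

    elapsed = time.time() - start
    print('Processed {} images in {:.0f} s ({:.1f} FPS)'.format(
        total_batches * 4, elapsed, total_batches * 4 / elapsed))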

GPU thermal throttling

In contrast to AWS Inferentia chips, GPU throughput is inversely proportional to GPU temperature. GPU temperature can vary on endpoints running for extended periods at high throughput, which leads to FPS and latency fluctuations. This effect is known as thermal throttling. Some production systems define a throughput limit below the maximum achievable to avoid performance swings over time. The following graph shows the average FPS over 30-second increments for the duration of the test. We observed up to 12% variation of the FPS rolling average on the GPU instance. On AWS Inferentia, this variation is below 3% for a substantially larger FPS average.

During the 2-hour period, we ran inference on over 856,000 images on the inf1.2xlarge instance. On the g4dn.xlarge, the maximum number of inferences achieved was 486,000. That amounts to 76% more images processed over the same amount of time using AWS Inferentia! Latency averages for batch 4 inference are also 60% lower for AWS Inferentia.

Using the total throughput collected during our 2-hour test, we calculated that the price of running 1 million inferences is $1.362 on an inf1.2xlarge in the us-east-1 Region. For the g4dn.xlarge, the price is $2.163, a 37% price reduction for the YOLOv4 object detection pipeline on AWS Inferentia.
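
The cost figures follow directly from the measured throughput and each instance's hourly on-demand price. The sketch below reproduces the arithmetic; the hourly prices are approximate us-east-1 on-demand rates at the time of writing and should be verified against current pricing.

def price_per_million(images_in_2_hours, hourly_price_usd):
    hours_needed = 1000000.0 / images_in_2_hours * 2  # scale the 2-hour run to 1M images
    return hours_needed * hourly_price_usd

# Approximate on-demand prices (USD/hour) in us-east-1 at the time of writing.
print(price_per_million(856000, 0.584))  # inf1.2xlarge -> ~1.36
print(price_per_million(486000, 0.526))  # g4dn.xlarge  -> ~2.16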

Safely shutting down and cleaning up

On the Amazon EC2 console, choose the instances used to perform the benchmark, and choose Terminate from the Actions drop-down menu. Terminating the instance discards data stored only in the instance's home volume. You can persist the compiled model in an Amazon Simple Storage Service (Amazon S3) bucket, so it can be reused later. If you've made changes to the code inside the instances, remember to persist those as well.

Conclusion

In this post, you walked through the steps of optimizing a TensorFlow YOLOv4 model to run on AWS Inferentia. You explored AWS Neuron optimizations that yield better model performance with improved average precision, in a much more cost-effective way. In production, the Neuron-compiled model is up to 37% less expensive in the long run, with little fluctuation in throughput and latency, when compared to the most optimized GPU instance.

Some of the steps described in this post also apply to other ML model types and frameworks. For more information, see the AWS Neuron SDK GitHub repo.

Learn more about the AWS Inferentia chip and the Amazon EC2 Inf1 instances to get started with running your own custom ML pipelines on AWS Inferentia using the Neuron SDK.


About the Authors

Fabio Nonato de Paula is a Principal Solutions Architect for Autonomous Computing in AWS. He works with large-scale deployments of ML and AI for autonomous and intelligent systems. Fabio is passionate about democratizing access to accelerated computing and distributed ML. Outside of work, you can find Fabio riding his motorcycle on the hills of Livermore valley or reading ComiXology.


Haichen Li is a software development engineer in the AWS Neuron SDK team. He works on integrating machine learning frameworks with the AWS Neuron compiler and runtime systems, as well as developing deep learning models that benefit particularly from the Inferentia hardware.


Samuel Jacob is a senior software engineer in the AWS Neuron team. He works on AWS Neuron runtime to enable high performance inference data paths between AWS Neuron SDK and AWS Inferentia hardware. He also works on tools to analyze and improve AWS Neuron SDK performance. Outside of work, you can catch him playing video games or tinkering with small boards such as RaspberryPi.


Collaborating with AI to create Bach-like compositions in AWS DeepComposer

AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). We recently launched the Edit melody feature, which allows you to add, remove, or edit specific notes, giving you full control of the pitch, length, and timing for each note. In this post, you can learn to use the Edit melody feature to collaborate with the autoregressive convolutional neural network (AR-CNN) algorithm and create interesting Bach-style compositions.

Through human-AI collaboration, we can surpass what humans and AI systems can create independently. For example, you can seek inspiration from AI to create art or music outside your area of expertise, or offload the more routine tasks, like creating variations on a melody, and focus on the more interesting and creative tasks. Alternatively, you can assist the AI by correcting mistakes or removing artifacts it creates. You can also influence the output generated by the AI system by controlling the various training and inference parameters.

You can co-create music in the AWS DeepComposer Music Studio by collaborating with the AI (AR-CNN) model using the Edit melody feature. The AR-CNN Bach model modifies a melody note by note to guide the track towards sounding more Bach-like. You can modify four advanced parameters when you perform inference to influence how the input melody is modified:

  • Maximum notes to add – Changes the maximum number of notes added to your original melody
  • Maximum notes to remove – Changes the maximum number of notes removed from your original melody
  • Sampling iterations – Changes the exact number of times you add or remove a note based on note-likelihood distributions inferred by the model
  • Creative risk – Allows the AI model to deviate from creating Bach-like harmonies

The values you choose directly impact the composition created by the model by nudging the model in one way or another. For more information about these parameters, see AWS DeepComposer Learning Capsule on using the AR-CNN model.

Although the advanced parameters allow you to guide the output the AR-CNN model creates, they don’t provide note-level control over the music produced. For example, the AR-CNN model allows you to control the number of notes to add or remove during inference, but you don’t have control over the exact notes the model adds or removes.

The Edit melody feature bridges this gap by providing an interactive view of the generated melody so you can add missing notes, remove out-of-tune notes, or even change a note’s pitch and length. This granular level of editing facilitates better human-AI collaboration. It enables you to correct mistakes the model makes and harmonize the output to your liking, giving you more ownership of the creation process.

For this post, we explore the use case of co-creating Bach-like background music to match the following video.

Collaborating with AI using the AWS DeepComposer Music Studio

To start composing your melody, complete the following steps:

  1. Open the AWS DeepComposer Music Studio console.
  2. Choose an Input melody.

You can record a custom melody, import a melody, or choose a sample melody on the console. For this post, we experimented with two melodies: the New World sample melody and a custom melody we created using the MIDI keyboard.

New World melody:

Custom melody:

  3. Choose the Autoregressive generative AI technique.
  4. Choose the Autoregressive CNN Bach model.

There are several considerations when choosing the advanced parameters. First, we wanted the original input melody to be recognizable. After some iterating, we found that setting the Maximum notes to add to 60 and Maximum notes to remove to 40 created a desirable outcome. For Creative risk, we wanted the model to create something interesting and adventurous. At the same time, we realized that a very high Creative risk value would deviate too much from the Bach style, so we took a moderate approach and chose a Creative risk of 2.

  5. You can repeat these steps a few times to iteratively create music.

Editing your input melody

After the AR-CNN model has generated a composition to your satisfaction, you can use the Edit melody feature to modify the melody and try to match the video’s transitions as much as possible.

  1. Choose the right arrow to open the input melody section.
  2. Choose Edit melody.
  3. On the Edit melody page, edit your track in any of the following ways:
    • Choose a cell (double-click) to add or remove a note at that pitch or time.
    • Drag a cell up or down to change a note’s pitch.
    • Drag the edge of a cell left or right to change a note’s length.
  4. When finished, choose Apply changes.

We drew inspiration from the AI-generated notes in different ways. For the New World melody, we noticed the model added short and bouncy notes (the circles with solid lines in the following screenshot), which made the composition sound similar to an American folk song. To match that style, we added a few notes in the second half of the composition (the dotted-lined circles).

For our custom melody, we noticed the model changed the chords slightly earlier than expected (see the following screenshot). This created lingering and overlapping sounds that we liked for the mountain road scenes.

On the other hand, we noticed the AI model needed our help to remove some notes that sounded out of place. After we listened to the track a few times, we decided to change some pitches manually to nudge the track towards something that sounded a bit more harmonious.

Generating accompaniments using the GAN generative AI technique

After using the AR-CNN Bach model to explore options for our melody track, we decided to try using a different generative AI model (GAN) to create musical accompaniments.

  1. Under Model parameters, for Generative AI technique, choose Generative adversarial network.
  2. Feed the edited compositions to the GAN model to generate accompaniments.

We chose the MuseGAN generative algorithm and the Symphony model because we wanted to create accompaniments to match the serene and somber setting in the video.

  3. You can optionally export your compositions into a music-editing tool of your choice to change the instrument set and perform post-processing.

Let’s watch the videos containing our AI-inspired creations in the background.

The first video uses the New World melody.

The following video uses our custom melody.

Conclusion

In this post, we demonstrated how to use the Edit melody feature in the AWS DeepComposer Music Studio to collaborate with generative AI models and create interesting Bach-style compositions. You can modify a melody to your liking by adding, removing, and editing specific notes. This gives you full control of the pitch, length, and timing for each note to produce an original melody.


About the Authors

Rahul Suresh is an Engineering Manager with the AWS AI org, where he works on AI-based products that make machine learning accessible to all developers. Prior to joining AWS, Rahul was a Senior Software Developer at Amazon Devices and helped launch highly successful smart home products. Rahul is passionate about building machine learning systems at scale and is always looking to get these advanced technologies into the hands of customers. In addition to his professional career, Rahul is an avid reader and a history buff.


Enoch Chen is a Senior Technical Program Manager for AWS AI Devices. He is a big fan of machine learning and loves to explore innovative AI applications. Recently he helped bring DeepComposer to thousands of developers. Outside of work, Enoch enjoys playing piano and listening to classical music.


Carlos Daccarett is a Front-End Engineer at AWS. He loves bringing design mocks to life. In his spare time, he enjoys hiking, golfing, and snowboarding.


Dylan Jackson is a Senior ML Engineer and AI Researcher at AWS. He works to build experiences which facilitate the exploration of AI/ML, making new and exciting techniques accessible to all developers. Before AWS, Dylan was a Senior Software Developer at Goodreads where he leveraged both a full-stack engineering and machine learning skillset to protect millions of readers from spam, high-volume robotic traffic, and scaling bottlenecks. Dylan is passionate about exploring both the theoretical underpinnings and the real-world impact of AI/ML systems. In addition to his professional career, he enjoys reading, cooking, and working on small crafts projects.


Evaluating an automatic speech recognition service

Over the past few years, many automatic speech recognition (ASR) services have entered the market, offering a variety of different features. When deciding whether to use a service, you may want to evaluate its performance and compare it to another service. This evaluation process often analyzes a service along multiple vectors such as feature coverage, customization options, security, performance and latency, and integration with other cloud services.

Depending on your needs, you’ll want to check for features such as speaker labeling, content filtering, and automatic language identification. Basic transcription accuracy is often a key consideration during these service evaluations. In this post, we show how to measure the basic transcription accuracy of an ASR service in six easy steps, provide best practices, and discuss common mistakes to avoid.

Illustration showing a table of contents: The evaluation basics, six steps for performing an evaluation, and best practices and common mistakes to avoid.

The evaluation basics

Defining your use case and performance metric

Before starting an ASR performance evaluation, you first need to consider your transcription use case and decide how to measure a good or bad performance. Literal transcription accuracy is often critical. For example, how many word errors are in the transcripts? This question is especially important if you pay annotators to review the transcripts and manually correct the ASR errors, and you want to minimize how much of the transcript needs to be re-typed.

The most common metric for speech recognition accuracy is called word error rate (WER), which is recommended by the US National Institute of Standards and Technology for evaluating the performance of ASR systems. WER is the proportion of transcription errors that the ASR system makes relative to the number of words that were actually said. The lower the WER, the more accurate the system. Consider this example:

Reference transcript (what the speaker said): well they went to the store to get sugar

Hypothesis transcript (what the ASR service transcribed): they went to this tour kept shook or

In this example, the ASR service doesn’t appear to be accurate, but how many errors did it make? To quantify WER, there are three categories of errors:

  • Substitutions – When the system transcribes one word in place of another. Transcribing the fifth word as this instead of the is an example of a substitution error.
  • Deletions – When the system misses a word entirely. In the example, the system deleted the first word well.
  • Insertions – When the system adds a word into the transcript that the speaker didn’t say, such as or inserted at the end of the example.

Of course, counting errors in terms of substitutions, deletions, and insertions isn’t always straightforward. If the speaker says “to get sugar” and the system transcribes kept shook or, one person might count that as a deletion (to), two substitutions (kept instead of get and shook instead of sugar), and an insertion (or). A second person might count that as three substitutions (kept instead of to, shook instead of get, and or instead of sugar). Which is the correct approach?

WER gives the system the benefit of the doubt, and counts the minimum number of possible errors. In this example, the minimum number of errors is six. The following aligned text shows how to count errors to minimize the total number of substitutions, deletions, and insertions:

REF: WELL they went to THE  STORE TO   GET   SUGAR
HYP: **** they went to THIS TOUR  KEPT SHOOK OR
     D                 S    S     S    S     S

Many ASR evaluation tools use this format. The first line shows the reference transcript, labeled REF, and the second line shows the hypothesis transcript, labeled HYP. The words in each transcript are aligned, with errors shown in uppercase. If a word was deleted from the reference or inserted into the hypothesis, asterisks are shown in place of the word that was deleted or inserted. The last line shows D for the word that was deleted by the ASR service, and S for words that were substituted.

Don’t worry if these aren’t the actual errors that the system made. With the standard WER metric, the goal is to find the minimum number of words that you need to correct. For example, the ASR service probably didn’t really confuse “get” and “shook,” which sound nothing alike. The system probably misheard “sugar” as “shook or,” which do sound very similar. If you take that into account (and there are variants of WER that do), you might end up counting seven or eight word errors. However, for the simple case here, all that matters is counting how many words you need to correct without needing to identify the exact mistakes that the ASR service made.

You might recognize this as the Levenshtein edit distance between the reference and the hypothesis. WER is defined as the normalized Levenshtein edit distance:

WER = (substitutions + deletions + insertions) / number of words in the reference

In other words, it's the minimum number of words that need to be corrected to change the hypothesis transcript into the reference transcript, divided by the number of words that the speaker originally said. Our example would have the following WER calculation:

WER = (5 substitutions + 1 deletion + 0 insertions) / 9 reference words = 6 / 9 ≈ 0.67
WER is often multiplied by 100, so the WER in this example might be reported as 0.67, 67%, or 67. This means the service made errors for 67% of the reference words. Not great! The best achievable WER score is 0, which means that every word is transcribed correctly with no inserted words. On the other hand, there is no worst WER score—it can even go above 1 (above 100%) if the system made a lot of insertion errors. In that case, the system is actually making more errors than there are words in the reference—not only does it get all the words wrong, but it also manages to add new wrong words to the transcript.

For other performance metrics besides WER, see the section Adapting the performance metric to your use case later in this post.

Normalizing and preprocessing your transcripts

When calculating WER and many other metrics, keep in mind that the problem of text normalization can drastically affect the calculation. Consider this example:

Reference: They will tell you again: our ballpark estimate is $450.

ASR hypothesis: They’ll tell you again our ball park estimate is four hundred fifty dollars.

The following code shows how most tools would count the word errors if you just leave the transcripts as-is:

REF: THEY WILL    tell you AGAIN: our **** BALLPARK estimate is **** ******* ***** $450.   
HYP: **** THEY'LL tell you AGAIN  our BALL PARK     estimate is FOUR HUNDRED FIFTY DOLLARS.
     D    S                S          I    S                    I    I       I     S

The word error rate would therefore be:

WER = (4 substitutions + 1 deletion + 4 insertions) / 10 reference words = 9 / 10 = 0.9
According to this calculation, there were errors for 90% of the reference words. That doesn’t seem right. The ASR hypothesis is basically correct, with only small differences:

  • The words they will are contracted to they’ll
  • The colon after again is omitted
  • The term ballpark is spelled as a single compound word in the reference, but as two words in the hypothesis
  • $450 is spelled with numerals and a currency symbol in the reference, but the ASR system spells it using the alphabet as four hundred fifty dollars

The problem is that you can write down the original spoken words in more than one way. The reference transcript spells them one way and the ASR service spells them in a different way. Depending on your use case, you may or may not want to count these written differences as errors that are equivalent to missing a word entirely.

If you don’t want to count these kinds of differences as errors, you should normalize both the reference and the hypothesis transcripts before you calculate WER. Normalizing involves changes such as:

  • Lowercasing all words
  • Removing punctuation (except apostrophes)
  • Contracting words that can be contracted
  • Expanding written abbreviations to their full forms (such as Dr. to doctor)
  • Spelling all compound words with spaces (such as blackboard to black board or part-time to part time)
  • Converting numerals to words (or vice-versa)

If there are other differences that you don't want to count as errors, you might consider additional normalizations. For example, some languages have multiple spellings for some words (such as favorite and favourite) or optional diacritics (such as naïve vs. naive), and you may want to convert these to a single spelling before calculating WER. We also recommend removing filled pauses like uh and um, which are irrelevant for most uses of ASR and therefore shouldn't be included in the WER calculation.
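
As an illustration, a simple normalization pass for English transcripts might look like the following sketch; the contraction, compound, and filler lists are deliberately tiny examples that you would extend for your own data.

import re

# Minimal example mappings; extend these lists for your own data.
CONTRACTIONS = {"they will": "they'll", "do not": "don't"}
COMPOUNDS = {"ballpark": "ball park", "part-time": "part time"}
FILLERS = {"uh", "um"}

def normalize(text):
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)               # drop punctuation except apostrophes
    for full, contracted in CONTRACTIONS.items():
        text = text.replace(full, contracted)            # contract words that can be contracted
    for compound, spaced in COMPOUNDS.items():
        text = text.replace(compound, spaced)             # spell compound words with spaces
    words = [w for w in text.split() if w not in FILLERS]  # remove filled pauses
    return " ".join(words)

print(normalize("They'll tell you again: our ballpark estimate, um, is huge."))
# they'll tell you again our ball park estimate is huge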

A second, related issue is that WER by definition counts the number of whole word errors. Many tools define words as strings separated by spaces for this calculation, but not all writing systems use spaces to separate words. In this case, you may need to tokenize the text before calculating WER. Alternatively, for writing systems where a single character often represents a word (such as Chinese), you can calculate a character error rate instead of a word error rate, using the same procedure.

Six steps for performing an ASR evaluation

To evaluate an ASR service using WER, complete the following steps:

  1. Choose a small sample of recorded speech.
  2. Transcribe it carefully by hand to create reference transcripts.
  3. Run the audio sample through the ASR service.
  4. Create normalized ASR hypothesis transcripts.
  5. Calculate WER using an open-source tool.
  6. Make an assessment using the resulting measurement.

Choosing a test sample

Choosing a good sample of speech to evaluate is critical, and you should do this before you create any ASR transcripts in order to avoid biasing the results. You should think about the sample in terms of utterances. An utterance is a short, uninterrupted stretch of speech that one speaker produces without any silent pauses. The following are three example utterances:

An utterance is sometimes one complete sentence, but people don’t always talk in complete sentences—they hesitate, start over, or jump between multiple thoughts within the same utterance. Utterances are often only one or two words long and are rarely more than 50 words. For the test sample, we recommend selecting utterances that are 25–50 words long. However, this is flexible and can be adjusted if your audio contains mostly short utterances, or if short utterances are especially important for your application.

Your test sample should include at least 800 spoken utterances. Ideally, each utterance should be spoken by a different person, unless you plan to transcribe speech from only a few individuals. Choose utterances from representative portions of your audio. For example, if there is typically background traffic noise in half of your audio, then half of the utterances in your test sample should include traffic noise as well. If you need to extract utterances from long audio files, you can use a tool like Audacity.

Creating reference transcripts

The next step is to create reference transcripts by listening to each utterance in your test sample and writing down what the speaker said word for word. Creating these reference transcripts by hand can be time-consuming, but it's necessary for performing the evaluation. Write the transcript for each utterance on its own line in a plain text file named reference.txt, as shown below.

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still under warranty so i wanted to see if someone could come look at it
no i checked everywhere the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 402 disabled account id after i filled out the form

The reference transcripts are extremely literal, including when the speaker hesitates and restarts in the third utterance (on the on your). If the transcripts are in English, write them using all lowercase with no punctuation except for apostrophes, and in general be sure to pay attention to the text normalization issues that we discussed earlier. In this example, besides lowercasing and removing punctuation from the text, compound words have been normalized by spelling them as two words (ice maker, web site), the initialism I.D. has been spelled as a single lowercase word id, and the number 402 is spelled using numerals rather than the alphabet. By applying these same strategies to both the reference and the hypothesis transcripts, you can ensure that different spelling choices aren’t counted as word errors.

Running the sample through the ASR service

Now you’re ready to run the test sample through the ASR service. For instructions on doing this on the Amazon Transcribe console, see Create an Audio Transcript. If you’re running a large number of individual audio files, you may prefer to use the Amazon Transcribe developer API.

Creating ASR hypothesis transcripts

Take the hypothesis transcripts generated by the ASR service and paste them into a plain text file with one utterance per line. The order of the utterances must correspond exactly to the order in the reference transcript file that you created: if line 3 of your reference transcripts file has the reference for the utterance pat went to the store, then line 3 of your hypothesis transcripts file should have the ASR output for that same utterance.

The following is the ASR output for the three utterances:

Hi I'm calling about a refrigerator I bought from you The ice maker stopped working and it's still in the warranty so I wanted to see if someone could come look at it
No I checked everywhere in the mailbox The package room I asked my neighbor who sometimes gets my packages but it hasn't shown up yet
I tried to update my address on the on your website but it just says error code 40 to Disabled Accounts idea after I filled out the form

These transcripts aren’t ready to use yet—you need to normalize them first using the same normalization conventions that you used for the reference transcripts. First, lowercase the text and remove punctuation except apostrophes, because differences in case or punctuation aren’t considered as errors for this evaluation. The word website should be normalized to web site to match the reference transcript. The number is already spelled with numerals, and it looks like the initialism I.D. was transcribed incorrectly, so no need to do anything there.

After the ASR outputs have been normalized, the final hypothesis transcripts look like the following:

hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still in the warranty so i wanted to see if someone could come look at it
no i checked everywhere in the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
i tried to update my address on the on your web site but it just says error code 40 to disabled accounts idea after i filled out the form

Save these transcripts to a plain text file named hypothesis.txt.

Calculating WER

Now you’re ready to calculate WER by comparing the reference and hypothesis transcripts. This post uses the open-source asr-evaluation evaluation tool to calculate WER, but other tools such as SCTK or JiWER are also available.

Install the asr-evaluation tool (if you’re using it) with pip install asr-evaluation, which makes the wer script available on the command line. Use the following command to compare the reference and hypothesis text files that you created:

wer -i reference.txt hypothesis.txt

The script prints something like the following:

REF: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still ** UNDER warranty so i wanted to see if someone could come look at it
HYP: hi i'm calling about a refrigerator i bought from you the ice maker stopped working and it's still IN THE   warranty so i wanted to see if someone could come look at it
SENTENCE 1
Correct          =  96.9%   31   (    32)
Errors           =   6.2%    2   (    32)
REF: no i checked everywhere ** the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
HYP: no i checked everywhere IN the mailbox the package room i asked my neighbor who sometimes gets my packages but it hasn't shown up yet
SENTENCE 2
Correct          = 100.0%   24   (    24)
Errors           =   4.2%    1   (    24)
REF: i tried to update my address on the on your web site but it just says error code ** 402 disabled ACCOUNT  ID   after i filled out the form
HYP: i tried to update my address on the on your web site but it just says error code 40 TO  disabled ACCOUNTS IDEA after i filled out the form
SENTENCE 3
Correct          =  89.3%   25   (    28)
Errors           =  14.3%    4   (    28)
Sentence count: 3
WER:     8.333% (         7 /         84)
WRR:    95.238% (        80 /         84)
SER:   100.000% (         3 /          3)

If you want to calculate WER manually instead of using a tool, you can do so by calculating the Levenshtein edit distance between the reference and hypothesis transcript pairs divided by the total number of words in the reference transcripts. When you’re calculating the Levenshtein edit distance between the reference and hypothesis, be sure to calculate word-level edits, rather than character-level edits, unless you’re evaluating a written language where every character is a word.
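
If you do calculate it yourself, a word-level edit distance takes only a few lines of Python. The following sketch computes corpus-level WER for pairs of already-normalized reference and hypothesis transcripts.

def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions, and insertions."""
    d = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        prev = d[:]
        d[0] = i
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            d[j] = min(prev[j] + 1,         # deletion
                       d[j - 1] + 1,        # insertion
                       prev[j - 1] + cost)  # substitution or match
    return d[-1]

def corpus_wer(pairs):
    errors = sum(word_edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    ref_words = sum(len(ref.split()) for ref, hyp in pairs)
    return errors / ref_words

pairs = [("well they went to the store to get sugar",
          "they went to this tour kept shook or")]
print(corpus_wer(pairs))  # prints 0.666... (6 errors / 9 reference words)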

In the evaluation output above, you can see the alignment between each reference transcript REF and hypothesis transcript HYP. Errors are printed in uppercase, or using asterisks if a word was deleted or inserted. This output is useful if you want to re-count the number of errors and recalculate WER manually to exclude certain types of words and errors from your calculation. It’s also useful to verify that the WER tool is counting errors correctly.

At the end of the output, you can see the overall WER: 8.333%. Before you go further, skim through the transcript alignments that the wer script printed out. Check whether the references correspond to the correct hypotheses. Do the error alignments look reasonable? Are there any text normalization differences that are being counted as errors that shouldn’t be?

Making an assessment

What should the WER be if you want good transcripts? The lower the WER, the more accurate the system. However, the WER threshold that determines whether an ASR system is suitable for your application ultimately depends on your needs, budget, and resources. You’re now equipped to make an objective assessment using the best practices we shared, but only you can decide what error rate is acceptable.

You may want to compare two ASR services to determine if one is significantly better than the other. If so, you should repeat the previous three steps for each service, using exactly the same test sample. Then, count how many utterances have a lower WER for the first service compared to the second service. If you’re using asr-evaluation, the WER for each individual utterance is shown as the percentage of Errors below each utterance.

If one service has a lower WER than the other for at least 429 of the 800 test utterances, you can conclude that this service provides better transcriptions of your audio. 429 represents a conventional threshold for statistical significance when using a sign test for this particular sample size. If your sample doesn’t have exactly 800 utterances, you can manually calculate the sign test to decide if one service has a significantly lower WER than the other. This test assumes that you followed good practices and chose a representative sample of utterances.
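
If you need to compute the test yourself, the following sketch uses the binomial test from scipy (version 1.7 or later provides scipy.stats.binomtest); ties, where both services have the same WER for an utterance, are excluded.

from scipy.stats import binomtest

def service_a_is_better(wer_a, wer_b, alpha=0.05):
    """Two-sided sign test over per-utterance WER lists for services A and B."""
    wins_a = sum(a < b for a, b in zip(wer_a, wer_b))
    wins_b = sum(b < a for a, b in zip(wer_a, wer_b))
    n = wins_a + wins_b  # ties are excluded from the test
    result = binomtest(wins_a, n, p=0.5)
    return wins_a > wins_b and result.pvalue < alpha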

Adapting the performance metric to your use case

Although this post uses the standard WER metric, the most important consideration when evaluating ASR services is to choose a performance metric that reflects your use case. WER is a great metric if the hypothesis transcripts will be corrected, and you want to minimize the number of words to correct. If this isn’t your goal, you should carefully consider other metrics.

For example, if your use case is keyword extraction and your goal is to see how often a specific set of target keywords occur in your audio, you might prefer to evaluate ASR transcripts using metrics such as precision, recall, or F1 score for your keyword list, rather than WER.
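
For example, a keyword-oriented evaluation might count how many target keyword occurrences in the reference were recovered by the ASR output, rather than aligning whole transcripts; a minimal sketch:

from collections import Counter

def keyword_scores(reference, hypothesis, keywords):
    """Precision, recall, and F1 for keyword occurrences in one transcript pair."""
    ref_counts = Counter(w for w in reference.split() if w in keywords)
    hyp_counts = Counter(w for w in hypothesis.split() if w in keywords)
    true_positives = sum(min(ref_counts[k], hyp_counts[k]) for k in keywords)
    precision = true_positives / max(sum(hyp_counts.values()), 1)
    recall = true_positives / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

keywords = {'refrigerator', 'warranty', 'mailbox'}
print(keyword_scores('the refrigerator is under warranty',
                     'the refrigerator is in the warranty', keywords))
# (1.0, 1.0, 1.0): both keywords were recovered despite the other word errors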

If you’re creating automatic captions that won’t be corrected, you might prefer to evaluate ASR systems in terms of how useful the captions are to viewers, rather than the minimum number of word errors. With this in mind, you can roughly divide English words into two categories:

  • Content words – Verbs like “run”, “write”, and “find”; nouns like “cloud”, “building”, and “idea”; and modifiers like “tall”, “careful”, and “quickly”
  • Function words – Pronouns like “it” and “they”; determiners like “the” and “this”; conjunctions like “and”, “but”, and “or”; prepositions like “of”, “in”, and “over”; and several other kinds of words

For creating uncorrected captions and extracting keywords, it’s more important to transcribe content words correctly than function words. For these use cases, we recommend ignoring function words and any errors that don’t involve content words in your calculation of WER. There is no definite list of function words, but this file provides one possible list for North American English.

Common mistakes to avoid

If you’re comparing two ASR services, it’s important to evaluate the ASR hypothesis transcript produced by each service using a true reference transcript that you create by hand, rather than comparing the two ASR transcripts to each other. Comparing ASR transcripts to each other lets you see how different the systems are, but won’t give you any sense of which service is more accurate.

We emphasized the importance of text normalization for calculating WER. When you’re comparing two different ASR services, the services may offer different features, such as true-casing, punctuation, and number normalization. Therefore, the ASR output for two systems may be different even if both systems correctly recognized exactly the same words. This needs to be accounted for in your WER calculation, so you may need to apply different text normalization rules for each service to compare them fairly.

Avoid informally eyeballing ASR transcripts to evaluate their quality. Your evaluation should be tailored to your needs, such as minimizing the number of corrections, maximizing caption usability, or counting keywords. An informal visual evaluation is sensitive to features that stand out from the text, like capitalization, punctuation, proper names, and numerals. However, if these features are less important than word accuracy for your use case—such as if the transcripts will be used for automatic keyword extraction and never seen by actual people—then an informal visual evaluation won’t help you make the best decision.

Useful resources

The following tools and open-source software mentioned in this post may be useful: asr-evaluation, SCTK, JiWER, and Audacity.

Conclusion

This post discusses a few of the key elements needed to evaluate the performance aspect of an ASR service in terms of word accuracy. However, word accuracy is only one of the many dimensions that you need to evaluate when choosing a particular ASR service. It's critical that you include other parameters such as the ASR service's total feature set, ease of use, existing integrations, privacy and security, customization options, scalability implications, customer service, and pricing.


About the Authors

Scott Seyfarth is a Data Scientist at AWS AI. He works on improving the Amazon Transcribe and Transcribe Medical services. Scott is also a phonetician and a linguist who has done research on Armenian, Javanese, and American English.


Paul Zhao is a Product Manager at AWS AI. He manages Amazon Transcribe and Amazon Transcribe Medical. In his past life, Paul was a serial entrepreneur, having launched and operated two startups with successful exits.


Simplify data management with new APIs in Amazon Personalize

Amazon Personalize now makes it easier to manage your growing item and user catalogs with new APIs to incrementally add items and users in your datasets to create personalized recommendations. With the new putItems and putUsers APIs, you can simplify the process of managing your datasets. You no longer need to upload an entire dataset containing historical records and new records just to include new records in your recommendations. Providing new records to Amazon Personalize when they become available reduces your latency for incorporating new information, ensuring your recommendations remain relevant to your users and item catalog.

Based on over 20 years of personalization experience at Amazon.com, Amazon Personalize enables you to improve customer engagement by powering personalized product and content recommendations and targeted marketing promotions. Amazon Personalize uses machine learning (ML) to create higher-quality recommendations for your websites and applications. You can get started without any prior ML experience and use simple APIs to easily build sophisticated personalization capabilities in just a few clicks. Amazon Personalize processes and examines your data, identifies what is meaningful, and trains and optimizes a personalization model that is customized for your data. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

This post walks you through the process of incrementally modifying your items and users datasets in Amazon Personalize.

Adding new items and users to your datasets

For this use case, we create a dataset group with an Interactions dataset, an Items dataset (item metadata), and a Users dataset using the Amazon Personalize CLI. For instructions on creating a dataset group, see Getting Started (CLI). A minimal Boto3 sketch of the equivalent schema and dataset creation calls follows the schemas below.

  1. Create an Interactions dataset using the following schema and import data using the interactions-100k.csv data file:
{
	"type": "record",
	"name": "Interactions",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "USER_ID",
			"type": "string"
		},
		{
			"name": "ITEM_ID",
			"type": "string"
		},
		{
			"name": "EVENT_TYPE",
			"type": [
				"string"
			]
		},
		{
			"name": "EVENT_VALUE",
			"type": [
				"null",
				"float"
			]
		},
		{
			"name": "TIMESTAMP",
			"type": "long"
		}
	]
}
  2. Create an Items dataset using the following schema and import data using the csv data file:
{
	"type": "record",
	"name": "Items",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "ITEM_ID",
			"type": "string"
		},
		{
			"name": "GENRE",
			"type": [
				"null",
				"string"
			],
			"categorical": true
		}
	],
	"version": "1.0"
}
  3. Create a Users dataset using the following schema and import data using the csv data file:
{
	"type": "record",
	"name": "Users",
	"namespace": "com.amazonaws.personalize.schema",
	"fields": [
		{
			"name": "USER_ID",
			"type": "string"
		},
		{
			"name": "AGE",
			"type": "int"
		},
		{
			"name": "GENDER",
			"type": "string"
		}
	],
	"version": "1.0"
}
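
If you script the same setup with the AWS SDK for Python (Boto3) instead of the CLI, the schema and dataset creation calls for the Users dataset might look like the following sketch; the schema name, dataset name, and dataset group ARN are placeholders to replace with your own.

import json
import boto3

# Placeholder names and ARN for illustration only; substitute your own values.
personalize = boto3.client("personalize")

users_schema = {
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "AGE", "type": "int"},
        {"name": "GENDER", "type": "string"},
    ],
    "version": "1.0",
}

# Register the Avro schema with Amazon Personalize
schema_response = personalize.create_schema(
    name="crud-test-users-schema",
    schema=json.dumps(users_schema),
)

# Attach the schema to a Users dataset in an existing dataset group
dataset_response = personalize.create_dataset(
    name="crud-test-users",
    schemaArn=schema_response["schemaArn"],
    datasetGroupArn="arn:aws:personalize:region:acctID:dataset-group/crud-test",
    datasetType="Users",
)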

Now that you have created your datasets, you can add data to them in two different ways:

  • Using bulk import for the item and user datasets from Amazon Simple Storage Service (Amazon S3); for more information, see Preparing and Importing Data. A minimal sketch of a bulk import job follows this list.
  • Using the new putUsers and putItems APIs. You can incrementally add up to 10 records per call to the Users dataset using the putUsers API and to the Items dataset using the putItems API.
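
As a minimal sketch of the bulk import path with Boto3, the call might look like the following; the job name, dataset ARN, S3 location, and IAM role ARN are placeholders.

import boto3

personalize = boto3.client("personalize")

# Placeholder job name, dataset ARN, S3 location, and IAM role ARN for illustration only
import_job = personalize.create_dataset_import_job(
    jobName="crud-test-items-import",
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/ITEMS",
    dataSource={"dataLocation": "s3://your-bucket/items.csv"},
    roleArn="arn:aws:iam::acctID:role/PersonalizeS3AccessRole",
)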

For the putUsers call, the Users dataset required schema field (USER_ID) is mapped to the camel case userId. For the putItems call, the Items dataset required schema field (ITEM_ID) is mapped to the camel case itemId.

The following code adds two new users to the Users dataset via the putUsers API:

import boto3

# Incremental record ingestion uses the Amazon Personalize Events client
personalize_events = boto3.client("personalize-events")

personalize_events.put_users(
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/USERS",
    users=[
        {
            'userId': "489",
            'properties': '{"AGE": "29", "GENDER": "F"}'
        },
        {
            'userId': "650",
            'properties': '{"AGE": "65", "GENDER": "F"}'
        }
    ]
)

The following code adds a new item to the Items dataset via the putItems API:

personalize_events.put_items(
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/ITEMS",
    items=[
        {
            'itemId': "432",
            'properties': '{"GENRE": "Action"}'
        }
    ]
)

An HTTP/1.1 200 response is returned for successful record creation. In cases where your new item or user doesn’t match your dataset’s defined schema, you receive an InvalidInputException detailing the total number of records in your request that don’t match the schema.
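
As an illustration, one way to surface that error in Python is to catch botocore’s ClientError and inspect the error code; the record below (a non-string GENRE value) is a hypothetical example of a schema mismatch.

import boto3
from botocore.exceptions import ClientError

personalize_events = boto3.client("personalize-events")

try:
    personalize_events.put_items(
        datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/ITEMS",
        # Hypothetical record with a non-string GENRE value to illustrate a schema mismatch
        items=[{'itemId': "433", 'properties': '{"GENRE": 42}'}]
    )
except ClientError as e:
    if e.response["Error"]["Code"] == "InvalidInputException":
        # The error message reports how many records in the request failed validation
        print("Schema mismatch:", e.response["Error"]["Message"])
    else:
        raise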

For new records created (incrementally or via bulk upload) with the same userId or itemId as a record that already exists in the Users or Items dataset, the most recently created record (ingested by Amazon Personalize) is used in new solutions or solution versions.

Additionally, records added using putUsers or putItems are persisted until your dataset is deleted, so be sure to delete your dataset in the dataset group before importing a refreshed dataset. Amazon Personalize doesn’t replace your catalog or user data management systems.
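
If you do need to replace a dataset wholesale, the deletion step before re-importing refreshed data might look like the following sketch; the dataset ARN is a placeholder.

import boto3

personalize = boto3.client("personalize")

# Deleting the dataset also removes incrementally added records;
# recreate the dataset and run a fresh import job afterwards.
personalize.delete_dataset(
    datasetArn="arn:aws:personalize:region:acctID:dataset/crud-test/ITEMS"
)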

Incorporating the newly added users and items in recommendations and filters

Now that you’ve added new items and new users to your datasets, incorporating this information into your Amazon Personalize solutions makes sure that recommendations remain timely and relevant for your users. When not using the aws-user-personalization recipe, solution re-training is needed to include these new items in your personalized recommendations.

If you have exploration enabled in an Amazon Personalize recipe, your new items are included in recommendations as soon as your next campaign update is complete. New events generated by your users’ interactions with these items are incorporated when you train a new solution or solution version in this dataset group.

Any filters you created in the dataset group are updated with your new item and user data within 15 minutes from the last dataset import job completion or the last incremental record. This update allows your campaigns to use your most recent data when filtering recommendations for your users.

Summary

Amazon Personalize allows you to easily manage your growing item and user catalogs so your personalized product and content recommendations keep pace with your business and your customers. For more information about optimizing your user experience with Amazon Personalize, see What Is Amazon Personalize?


About the Authors

Matt Chwastek is a Senior Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build and use machine learning solutions. In his spare time, he enjoys reading and photography.

Gaurav Singh Chauhan is a Software Engineer for Amazon Personalize and works on architecting software systems and big data pipelines that serve customers at scale. Gaurav has a B.Tech in Computer Science from IIT Bombay, India. Outside of work, he likes all things outdoors and is an avid runner. In his spare time, he likes reading about and exploring new technologies. He tweets on startups, technology, and India at @bazingaurav.

Read More

Announcing the winner of the AWS DeepComposer Chartbusters The Sounds of Science challenge

Announcing the winner of the AWS DeepComposer Chartbusters The Sounds of Science challenge

We’re excited to announce the top 10 compositions and the winner of the AWS DeepComposer Chartbusters The Sounds of Science challenge. AWS DeepComposer provides a creative and hands-on experience for learning generative AI and machine learning (ML). Chartbusters is a global monthly challenge where you can use AWS DeepComposer to create original compositions and compete to top the charts and win prizes. To participate in The Sounds of Science, developers composed background music for a video clip using the Autoregressive CNN (AR-CNN) algorithm and edited notes with the newly launched Edit melody feature to better match the provided video.

Top 10 compositions

The high-quality submissions made it challenging for our judges to select the chart-toppers. Our panel of experts—Kesha Williams, Sally Revell, and Prachi Kumar—selected the top 10 ranked compositions by evaluating the quality of the music, creativity, and how well the music matched the video clip.

The winner of The Sounds of Science is… (cue drum roll) Sungin Lee! You can listen to his winning composition and the top 10 compositions on SoundCloud or on the AWS DeepComposer console. The top 10 compositions for the Sounds of Science challenge are:

Sungin will receive an AWS DeepComposer Chartbusters gold record and will tell his story in an upcoming post, right here on the AWS ML blog.

Congratulations, Sungin Lee!

It’s time to move on to the next Chartbusters challenge, Track or Treat, which is Halloween-themed. The challenge launches today and is open until October 23rd, 2020.


About the Author

Maryam Rezapoor is a Senior Product Manager with AWS AI Ecosystem team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

Read More

Join AWS and NVIDIA at GTC, October 5–9

Join AWS and NVIDIA at GTC, October 5–9

Starting Monday, October 5, 2020, the NVIDIA GPU Technology Conference (GTC) is offering online sessions for you to learn AWS best practices to accomplish your machine learning (ML), virtual workstations, high performance computing (HPC), and internet of things (IoT) goals faster and more easily.

Amazon Elastic Compute Cloud (Amazon EC2) instances powered by NVIDIA GPUs deliver the scalable performance needed for fast ML training, cost-effective ML inference, flexible remote virtual workstations, and powerful HPC computations. At the edge, you can use AWS IoT Greengrass and SageMaker Neo to extend a wide range of AWS Cloud services and ML inference to NVIDIA-based edge devices so the devices can act locally on the data they generate.

AWS is a Global Diamond Sponsor of the conference.

Available sessions

The following sessions are available from AWS:

A Developer’s Guide to Choosing the Right GPUs for Deep Learning (Scheduled session IDs: A22318, A22319, A22320, and A22321)

  • As a deep learning developer or data scientist, you can choose from multiple GPU EC2 instance types based on your training and deployment requirements. You can access instances with different GPU memory sizes, NVIDIA GPU architectures, capabilities (precisions, Tensor Cores, NVLink), GPUs per instance, number of vCPUs, system memory, and network bandwidth. We’ll share some guidance on how you can choose the right GPU instance on AWS for your deep learning projects. You’ll get all the information you need to make an informed choice of GPU instance for your training and inference workloads.
  • Speaker: Shashank Prasanna, Senior Developer Advocate, AI/ML, Amazon Web Services

Virtual Workstations on AWS for Digital Content Creation (On-Demand session IDs: A22276, A22311, A22312, and A22314)

  • Virtual workstations on AWS enable studios, departments, and freelancers to take on bigger projects, work from anywhere, and pay only for what they need. Running on Amazon EC2 G4 instances, virtual workstations employ the power of NVIDIA T4 Tensor Core GPUs and Quadro technology, the visual computing platform trusted by creative and technical professionals. Virtual workstations have become essential to creative professionals seeking cloud solutions that enable remote teams to work more efficiently, and keep creative productions moving forward. Join this session to learn more about how virtual workstations on AWS work, who is using them today, and how to get started.
  • Speaker: Haley Kannall, CG Supervisor, Amazon Web Services

Empower DeepStream Applications with AWS Data Services (On-Demand session IDs: A22279, A22315, A22316, and A22317)

  • We’ll discuss how we can optimize edge video inferencing performance by leveraging AWS infrastructure and NVIDIA DeepStream. We’ll emphasize three major features at the edge: (1) massively deploying trained models to NVIDIA Jetson devices using AWS IoT Greengrass, (2) local communication and control between AWS IoT Greengrass engines and DeepStream applications through MQTT messaging, and (3) uploading inferencing results to the cloud for further analytics.
  • Speaker: Yuxin Yang, IoT Architect, Amazon Web Services

GPU-Powered Edge Computing Applications Enabled by AWS Wavelength (On-Demand session IDs: A22374, A22375, A22376, and A22377)

  • In this presentation, we provide an overview of AWS Wavelength, how it integrates with the Mobile Edge carrier network and improves the performance of Mobile Edge applications. Wavelength Zones are AWS infrastructure deployments that embed AWS compute and storage services within telecommunications providers’ datacenters at the edge of the 5G network, so application traffic can reach application servers running in Wavelength Zones without leaving the mobile providers’ network. Customers with edge data processing needs such as image and video recognition, inference, data aggregation, and responsive analytics can use Wavelength to perform low-latency operations and processing right where their data is generated, reducing the need to move large amounts of data to be processed in centralized locations. We deep dive into these Mobile Edge applications running at the AWS Wavelength Zones using Amazon EC2 G4 instances powered by NVIDIA T4 Tensor Core GPUs.
  • Speaker: Sebastian Dreisch, Head of Wavelength GTM, Amazon Web Services

Next Generation Cloud Platform for Autonomous Vehicle (AV) Development (Scheduled session ID: A21517)

  • Development of autonomous driving systems presents a massive computational challenge, including processing petabytes of sensor data, which impacts time to market, scale, and cost throughout the development cycle. Training, testing, validating, and deploying self-driving systems requires large-scale compute and storage infrastructure to support the end-to-end workflow. AWS offers a highly scalable and reliable solution for AV development including the latest generation GPUs from NVIDIA. By attending this webinar, you will learn about AWS AV solution architectures for data ingest, data management, simulation, and distributed model training, as well as strategies for cost optimization. NVIDIA will share new details about the next generation NVIDIA Ampere (A100) architecture. Attendees will walk away with an understanding of how AWS and NVIDIA can help streamline AV development and reduce IT costs and time-to-market.
  • Speakers: Shyam Kumar, Principal HPC Business Development Manager, Amazon Web Services, and Norm Marks, Global Senior Director, Automotive Industry, NVIDIA

Embracing Volatility: Using ML to Become More Efficient Amid Epic Uncertainty (Scheduled session ID: A22219)

  • We’re all used to change. In business, change is often predictable—different seasons, large-scale events, and new releases all drive fluctuations we’re used to. But right now, there’s nothing normal about the changes you’re facing. The only constant is uncertainty. And uncertainty is expensive. In the absence of an omniscient crystal ball, the next best thing is cloud and ML. This presentation is going to cover how to deal with the unexpected. Whether it’s rapidly changing traffic, shifting data sources, or model drift, we’ll cover how you can better manage spikes and dips of all sizes and improve predictions with AI to maximize your efficiencies today.
  • Speaker: Allie Miller, US Head of ML Business Development for Startups and Venture Capital at AWS, Amazon Web Services

Accelerating Data Science with NVIDIA RAPIDS (Scheduled session ID: A22042)

  • Data science workflows have become increasingly computationally intensive in recent years, and GPUs have stepped up to address this challenge. With the RAPIDS suite of open-source software libraries and APIs, data scientists can run end-to-end data science and analytics pipelines entirely on GPUs, allowing organizations to deliver results faster than ever. The AWS Cloud lets you access a large number of powerful NVIDIA GPUs with Amazon EC2 P3 based on V100 GPUs, Amazon EC2 G4 based on T4 GPUs, and upcoming A100-based GPU instances. We’ll go through the end-to-end process of running RAPIDS on AWS. We’ll start by running RAPIDS libraries on a single GPU instance. Next, we’ll see how you can run large-scale hyperparameter search experiments with RAPIDS and Amazon SageMaker. Finally, we’ll run RAPIDS distributed ML using Dask clusters on Amazon EKS and Amazon ECS.
  • Speaker: Shashank Prasanna, Senior Developer Advocate, AI/ML, Amazon Web Services

Interactive Scientific Visualization on AWS with NVIDIA IndeX SDK (On-Demand session ID: A21610)

  • Scientific visualization is critical to understanding complex phenomena modeled using HPC simulations. However, it has been challenging to do this effectively due to the inability to visualize large data volumes (> 1 PB) and lack of collaborative workflow solutions. NVIDIA IndeX on AWS, a 3D volumetric interactive visualization toolkit, addresses these problems by providing a scalable scientific visualization solution. NVIDIA IndeX allows you to make real-time modifications and navigate to the most pertinent parts of the data to gather better insights faster. IndeX leverages GPU clusters for scalable, real-time visualization and computing of multi-valued volumetric data together with embedded geometry data. We’ll demonstrate 3D volume rendering at scale on AWS using IndeX.
  • Speakers: Karthik Raman, Senior Solutions Architect, HPC, Amazon Web Services, and Dragos Tatulea, Software Engineer, NVIDIA

Conclusion

You can visit AWS and NVIDIA to learn more or apply for a free trial to use NVIDIA GPU-based Amazon EC2 P3 instances powered by NVIDIA V100 Tensor Core GPUs and Amazon EC2 G4 instances powered by NVIDIA T4 Tensor Core GPUs. Learn more about GTC on the GTC 2020 website. We look forward to seeing you there!


About the Author

Geoff Murase is a Senior Product Marketing Manager for AWS EC2 accelerated computing instances, helping customers meet their compute needs by providing access to hardware-based compute accelerators such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). In his spare time, he enjoys playing basketball and biking with his family.

Read More