Halloween-themed AWS DeepComposer Chartbusters Challenge: Track or Treat

We are back with a spooktacular AWS DeepComposer Chartbusters challenge, Track or Treat! In this challenge, you can interactively collaborate with the ghost in the machine (learning) and compose spooky music! Chartbusters is a global monthly challenge where you can use AWS DeepComposer to create original compositions on the console using machine learning techniques, compete to top the charts, and win prizes. This challenge launches today and participants can submit their compositions until October 23, 2020.

Participation is easy: you can generate spooky compositions using one of the supported generative AI techniques and models on the AWS DeepComposer console. You can add or remove notes and interactively collaborate with AI using the Edit melody feature, and can include spooky instruments in your composition.

How to compete

To participate in Track or Treat, just do the following:

  1. Go to AWS DeepComposer Music Studio and create a melody with the keyboard, import a melody, or choose a sample melody on the console.
  2. Under Generative AI technique, for Model parameters, choose Autoregressive.
  3. For Model, choose Autoregressive CNN Bach.

You have four advanced parameters that you can choose to adjust: Maximum notes to add, Maximum notes to remove, Sampling iterations, and Creative risk.

  4. After adjusting the values to your liking, choose Enhance input melody.

  5. Choose Edit melody to add or remove notes.
  6. You can also change the note duration and pitch.

  7. When finished, choose Apply changes.

  8. Repeat these steps until you’re satisfied with the generated music.
  9. To add accompaniments to your melody, switch the Generative AI technique to Generative Adversarial Networks, then choose Generate composition.
  10. Choose the right arrow next to an accompaniment track (green bar).
  11. For Instrument type, choose Spooky.

  12. When you’re happy with your composition, choose Download composition.

You can choose to post-process your composition; however, one of the judging criteria is how close your final submission is to the track generated using AWS DeepComposer.

  13. In the navigation pane, choose Chartbusters.
  14. Choose Submit a composition.
  15. Select Import a post-processed audio track and upload your composition.
  16. Provide a track name for your composition and choose Submit.

AWS DeepComposer then submits your composition to the Track or Treat playlist on SoundCloud.

Conclusion

You’ve successfully submitted your composition to the AWS DeepComposer Chartbusters challenge Track or Treat. Now you can invite your friends and family to listen to your creation on SoundCloud, vote for their favorite, and join the fun by participating in the competition.

Although you don’t need a physical keyboard to compete, you can buy the AWS DeepComposer keyboard for $99.00 to enhance your music generation experience. To learn more about the different generative AI techniques supported by AWS DeepComposer, check out the learning capsules available on the AWS DeepComposer console.


About the Author

Maryam Rezapoor is a Senior Product Manager with AWS AI Ecosystem team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.

 


This month in AWS Machine Learning: September 2020 edition

Every day there is something new going on in the world of AWS Machine Learning—from launches to new use cases to interactive trainings. We’re packaging some of the not-to-miss information from the ML Blog and beyond for easy perusing each month. Check back at the end of each month for the latest roundup.

Launches

This month we announced native support for TorchServe in Amazon SageMaker, launched a new NFL Next Gen Stat, and enhanced our language services including Amazon Transcribe and Amazon Comprehend. Read on for our September launches:

  • TorchServe is now natively supported in Amazon SageMaker as the default model server for PyTorch inference to help you bring models to production quickly without having to write custom code.

  • With the new automatic language detection feature for Amazon Transcribe, you can now simply provide audio files and Amazon Transcribe detects the dominant language from the speech signal and generates transcriptions in the identified language. Amazon Transcribe now also expands support for Channel Identification to streaming audio transcription. With Channel Identification, you can process live audio from multiple channels, and produce a single transcript of the conversation with channel labels.
  • You can now use Amazon Comprehend to detect and redact personally identifiable information (PII) in customer emails, support tickets, product reviews, social media, and more.
  • To kick off football season, AWS and the National Football League announced new Next Gen Stats powered by AWS. One of those stats, Expected Rushing Yards, was developed at the NFL’s Big Data Bowl, powered by AWS. Expected Rushing Yards is designed to show how many rushing yards a ball carrier is expected to gain on a given carry based on the relative location, speed, and direction of blockers and defenders.

Use cases

Get ideas and architectures from AWS customers, partners, ML Heroes, and AWS experts on how to apply ML to your use case:

Explore more ML stories

Want more news about developments in ML? Check out the following stories:

  • DBS Bank is training 3,000 employees in AI with the help of the AWS DeepRacer League. Watch to see how they are developing ML skills and fostering community for their workforce.
  • Amazon scientist Amaia Salvador shared her views on the challenges facing women in computer vision in a recent Science article.
  • Aella Credit is making banking more accessible using AWS ML.
  • Read about how ML is identifying and tracking pandemics like COVID-19, and check out this VentureBeat Webinar featuring Michelle Lee, VP of the Amazon ML Solutions Lab, on how companies are finding new ways to operate and respond in this pandemic with AI/ML.
  • Build, Train and Deploy a real-world flower classifier of 102 flower types with instruction from this tutorial by AWS ML Community Builder Juv Chan.
  • Earlier this month, we announced that Lena Taupie, a software developer for Blubrr, won the AWS DeepComposer Chartbusters challenge. Learn about Lena’s experience competing in the challenge and how she drew from her own background, having grown up in St. Lucia, an island with a history of oral and folk traditional music, to develop a new custom genre model using Generative AI.

Mark your calendars

Join us for the following exciting ML events:

  • SageMaker Fridays is back with season 2, and it’s going to be bigger. Get started faster with machine learning using Amazon SageMaker as we discuss practical use cases and applications throughout the season. Join us for the season premiere on Oct. 9; we’ll continue to meet every week with a new use case.
  • On October 18, join Quantiphi for a conversation on how you can add AI to your contact centers to drive better customer engagement. Register now!
  • AWS is sponsoring the WSJ Pro AI Forum on October 28. Register to learn strategies for bringing AI and ML to your organization.

About the Author

Laura Jones is a product marketing lead for AWS AI/ML where she focuses on sharing the stories of AWS’s customers and educating organizations on the impact of machine learning. As a Florida native living and surviving in rainy Seattle, she enjoys coffee, attempting to ski and enjoying the great outdoors.


Using Amazon Rekognition Custom Labels and Amazon A2I for detecting pizza slices and augmenting predictions

Customers need machine learning (ML) models to detect objects that are interesting for their business. In most cases, this is hard because these models need thousands of labeled images and deep learning expertise. This data can take months to gather and can require large teams of labelers to prepare it for use. In addition, setting up a workflow for auditing or reviewing model predictions to validate adherence to your requirements can further add to the overall complexity.

With Amazon Rekognition Custom Labels, which is built on the existing capabilities of Amazon Rekognition, you can identify the objects and scenes in images that are specific to your business needs. For example, you can find your logo in social media posts, identify your products on store shelves, classify machine parts in an assembly line, distinguish healthy and infected plants, or detect animated characters in videos.

But what if the custom label model you trained can’t recognize images with a high level of confidence, or you need your team of experts to validate the results from your model during the testing phase or review the results in production? You can easily send predictions from Amazon Rekognition Custom Labels to Amazon Augmented AI (Amazon A2I). Amazon A2I makes it easy to integrate a human review into your ML workflow. This allows you to automatically have humans step into your ML pipeline to review results below a confidence threshold, set up review and auditing workflows, and augment the prediction results to improve model accuracy.

In this post, we show you how to build a custom object detection model trained to detect pepperoni slices on a pizza using Amazon Rekognition Custom Labels, with a dataset labeled using Amazon SageMaker Ground Truth. We then show how to create your own private workforce and set up an Amazon A2I workflow definition to conditionally trigger human loops for review and augmentation tasks. You can use the annotations created by Amazon A2I for model retraining.

We walk you through the following steps using the accompanying Amazon SageMaker Jupyter Notebook:

  1. Complete prerequisites.
  2. Create an Amazon Rekognition custom model.
  3. Set up an Amazon A2I workflow and send predictions to an Amazon A2I human loop.
  4. Evaluate results and retrain the model.

Prerequisites

Before getting started, set up your Amazon SageMaker notebook. Follow the steps in the accompanying Amazon SageMaker Jupyter Notebook to create your human workforce and download the datasets.

  1. Create a notebook instance in Amazon SageMaker.

Make sure your Amazon SageMaker notebook has the necessary AWS Identity and Access Management (IAM) roles and permissions mentioned in the prerequisite section of the notebook.

  2. When the notebook is active, choose Open Jupyter.
  3. On the Jupyter dashboard, choose New, and choose Terminal.
  4. In the terminal, enter the following code:
cd SageMaker
git clone https://github.com/aws-samples/amazon-a2i-sample-jupyter-notebooks
  5. Open the notebook by choosing Amazon-A2I-Rekognition-Custom-Label.ipynb in the root folder.

For this post, you create a private work team and add one user (you) to it.

  6. Create your human workforce by following the appropriate section of the notebook.

Alternatively, you can create your private workforce using Amazon Cognito. For instructions, see Create an Amazon Cognito Workforce Using the Labeling Workforces Page.

  7. After you create the private workforce, find the workforce ARN and enter the ARN in step 4 of the notebook.

 The dataset is composed of 12 images for training that contain pepperoni pizza in the data/images folder and 3 images for testing in the data/testimages folder. We sourced our images from pexels.com. We labeled this dataset for this post using Ground Truth custom labeling workflows. A Ground Truth labeled manifest template is available in data/manifest.

  8. Download the dataset.
  9. Run the notebook steps Download the Amazon SageMaker GroundTruth object detection manifest to Amazon S3 to process and upload the manifest in your Amazon Simple Storage Service (Amazon S3) bucket.

Creating an Amazon Rekognition custom model

To create our custom model, we follow these steps:

  1. Create a project in Amazon Rekognition Custom Labels.
  2. Create a dataset with images containing one or more pizzas and label them by applying bounding boxes.

Because we already have a dataset labeled using Ground Truth, we just point to that labeled dataset in this step. Alternatively, you can label the images using the user interface provided by Amazon Rekognition Custom Labels.

  3. Train the model.
  4. Evaluate the model’s performance.
  5. Test the deployed model using a sample image.

Creating a project

In this step, we create a project to detect pepperoni pizza slices. For instructions on creating a project, see Creating an Amazon Rekognition Custom Labels Project.

  1. On the Amazon Rekognition console, choose Custom Labels.
  2. Choose Get started.
  3. For Project name, enter a2i-rekog-pepperoni-pizza.
  4. Choose Create project.

Creating a dataset

To create your dataset, complete the following steps:

  1. On the Amazon Rekognition Custom Labels console, choose Datasets.
  2. Choose Create dataset.
  3. For Dataset name, enter rekog-a2i-pizza-dataset.
  4. Select Import images labeled by Amazon SageMaker Ground Truth.
  5. For .manifest file location, enter the S3 bucket location of your .manifest file.

  6. Choose Submit.

You should get a prompt to provide the S3 bucket policy when you provide the S3 path. For more information about these steps, see SageMaker Ground Truth.

After you create the dataset, you should see the images with the bounding boxes and labels, as in the following screenshot.

Make sure to upload the images for our dataset (that you downloaded in the Prerequisites section) to the console S3 bucket for Amazon Rekognition Custom Labels.

Training an Amazon Rekognition custom model

You’re now ready to train your model.

  1. On the Amazon Rekognition console, choose Train model.
  2. For Choose project, choose your newly created project.
  3. For Choose training dataset, choose your newly created dataset.
  4. Select Split training dataset.
  5. Choose Train.

The training takes approximately 45 minutes to complete.

Checking training status

Run the following notebook cell to get information about the project you created using the describe-projects API:

!aws rekognition describe-projects
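If you prefer to call the API from Python instead of the AWS CLI, a minimal boto3 sketch (assuming your notebook role has Amazon Rekognition permissions) looks like the following:

import boto3

# List the Amazon Rekognition Custom Labels projects in this account and Region
rekognition = boto3.client('rekognition')
for project in rekognition.describe_projects()['ProjectDescriptions']:
    print(project['ProjectArn'], project['Status'])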

Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

To get the project version ARN of your model using the describe-project-versions API, run the following cell:

#Replace the project-arn below with the project-arn of your project from the describe-projects output above
!aws rekognition describe-project-versions --project-arn "<project-arn>"

Enter the ARN in <project-version-arn-of-your-model> in the following code and run the cell to start your model version using the start-project-version API:

# Copy/paste the ProjectVersionArn for your model from the describe-project-versions cell output above to the --project-version-arn parameter here
!aws rekognition start-project-version \
--project-version-arn <project-version-arn-of-your-model> \
--min-inference-units 1 \
--region us-east-1
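Starting a model version can take several minutes. If you want to block until the model is ready before testing it, a rough sketch (assuming boto3 and the same project and project version ARNs used above) can poll the describe-project-versions API:

import time
import boto3

rekognition = boto3.client('rekognition')
project_arn = '<project-arn>'                               # from describe-projects
project_version_arn = '<project-version-arn-of-your-model>'

# Poll until the model version reports RUNNING (or FAILED)
while True:
    versions = rekognition.describe_project_versions(ProjectArn=project_arn)
    status = next(v['Status'] for v in versions['ProjectVersionDescriptions']
                  if v['ProjectVersionArn'] == project_version_arn)
    print('Model status:', status)
    if status in ('RUNNING', 'FAILED'):
        break
    time.sleep(30)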

Evaluating performance

When training is complete, you can evaluate the performance of the model. To help you, Amazon Rekognition Custom Labels provides summary metrics and evaluation metrics for each label. For information about the available metrics, see Metrics for Evaluating Your Model. To improve your model using metrics, see Improving an Amazon Rekognition Custom Labels Model.

The following screenshot shows our evaluation results and label performance.

Custom Labels determines the assumed threshold for each label based on maximum precision and recall achieved. Your model defaults to returning predictions above that threshold. You can reset this when you start your model. For information about adjusting your assumed threshold (such as looking at predictions either below or above your assumed threshold) and optimizing the model for your use case, see Training a custom single class object detection model with Amazon Rekognition Custom Labels.

For a demonstration of the Amazon Rekognition Custom Labels solution, see Custom Labels Demonstration. The demo shows you how you can analyze an image from your local computer.

Testing the deployed model

Run the following notebook cell to load the test image. Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

from PIL import Image, ImageDraw, ExifTags, ImageColor, ImageFont
image=Image.open('./data/images/pexels_brett_jordan_825661__1_.jpg')
display(image)

Enter the project version ARN from the previous notebook step (aws rekognition start-project-version) and run the following cell to analyze the response from the detect_custom_labels API:

model_arn = '<project-version-arn-of-your-model>'
min_confidence=40
#Call DetectCustomLabels
response = client.detect_custom_labels(Image={'S3Object': {'Bucket': BUCKET, 'Name': PREFIX + test_photo}},
MinConfidence=min_confidence,
ProjectVersionArn=model_arn)

Run the next cell in the notebook to display results (see the following screenshot).
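As a rough illustration of what that display cell does (not the exact notebook code), you can draw the returned boxes on the test image with Pillow. Amazon Rekognition returns bounding box coordinates as ratios of the image dimensions, so they need to be scaled; this sketch assumes the image and response variables from the previous cells refer to the same test image:

from PIL import ImageDraw

draw = ImageDraw.Draw(image)
img_width, img_height = image.size
for label in response['CustomLabels']:
    box = label['Geometry']['BoundingBox']
    left = box['Left'] * img_width
    top = box['Top'] * img_height
    width = box['Width'] * img_width
    height = box['Height'] * img_height
    # Draw the box and annotate it with the label name and confidence
    draw.rectangle([left, top, left + width, top + height], outline='red', width=3)
    draw.text((left, top), f"{label['Name']} {label['Confidence']:.1f}%", fill='red')
display(image)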

The model detected the pepperoni pizza slices in our test image and drew bounding boxes. We can use Amazon A2I to send the prediction results from our model to a human loop consisting of our private workforce to review and audit the results.

Setting up an Amazon A2I human loop

In this section, you set up a human review loop for low-confidence detection in Amazon A2I. It includes the following steps:

  1. Create a human task UI.
  2. Create a worker task template and workflow definition.
  3. Send predictions to Amazon A2I human loops.
  4. Check the human loop status.

Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

Creating a human task UI

Create a human task UI resource with a UI template in liquid HTML. This template is used whenever a human loop is required.

For this post, we take an object detection UI and fill in the object categories in the labels variable in the template. Run the following code to create the human task UI for object detection:

def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']
print(humanTaskUiArn)

For over 70 pre-built UIs, see the Amazon Augmented AI Sample Task UIs GitHub repo.
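The template variable used in the create_human_task_ui call isn’t shown in this post. As an illustration only (not the exact template from the notebook), a minimal bounding box worker template for our single label, modeled on the crowd-bounding-box element from those sample task UIs, could be defined as a Python string like this:

# Hypothetical Liquid HTML worker template assigned to the `template` variable
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-bounding-box
    name="annotatedResult"
    src="{{ task.input.taskObject | grant_read_access }}"
    header="Draw a box around each pepperoni pizza slice in the image."
    labels="['pepperoni pizza slice']"
  >
  </crowd-bounding-box>
</crowd-form>
"""

The name="annotatedResult" attribute matches the annotatedResult key that the post later reads from the Amazon A2I output JSON.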

Creating a worker task template and workflow definition

Workflow definitions allow you to specify the following:

  • The workforce that your tasks are sent to
  • The instructions that your workforce receives

This post uses the following API to create a workflow definition:

create_workflow_definition_response = sagemaker.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn= ROLE,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Identify custom labels in the image",
            "TaskTitle": "Identify custom image"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

Optionally, you can create this workflow definition on the Amazon A2I console. For instructions, see Create a Human Review Workflow.

Sending predictions to Amazon A2I human loops

We now loop through our test images and invoke the trained model endpoint to detect the custom label pepperoni pizza slice using the detect_custom_labels API. If the confidence score is less than 60% for the detected labels in our test dataset, we send that data for human review and start the Amazon A2I human review loop with the start-human-loop API. When using Amazon A2I for a custom task, a human loop starts when StartHumanLoop is called in your application. Use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

You can change the value of the SCORE_THRESHOLD based on what confidence level you want to trigger the human review. See the following code:

import uuid

human_loops_started = []
SCORE_THRESHOLD = 60

folderpath = r"data/testimages" # make sure to put the 'r' in front and provide the folder where your files are
filepaths  = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # do not select hidden directories
for path in filepaths:
    # Call the custom labels model; ignore any detection below 20% confidence (MinConfidence=20)
    response = client.detect_custom_labels(Image={'S3Object': {'Bucket': BUCKET, 'Name': PREFIX+'/'+path}},
        MinConfidence=20,
        ProjectVersionArn=model_arn)    
  
    #Get the custom labels
    labels=response['CustomLabels']
    if labels and labels[0]['Confidence'] < SCORE_THRESHOLD: 
        s3_fname='s3://%s/%s' % (BUCKET, PREFIX+'/'+path)
        print("Images with labels less than 60% confidence score: " + s3_fname)
        humanLoopName = str(uuid.uuid4())
        inputContent = {
            "initialValue": 'null',
            "taskObject": s3_fname
        }
        start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(inputContent)
            }
        )
        human_loops_started.append(humanLoopName)
        print(f'Starting human loop with name: {humanLoopName} \n')

The following screenshot shows the output.

Checking the status of the human loop

Run the steps in the notebook to check the status of the human loop. You can use the accompanying Amazon SageMaker Jupyter notebook to follow the steps in this post.

  1. Run the following notebook cell to get a login link to navigate to the private workforce portal:
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])
  2. Choose the login link to the private worker portal.

After you log in, you can start working on the task assigned.

  3. Draw bounding boxes with respect to the label as required and choose Submit.

  4. To check if the workers have completed the labeling task, enter the following code:
completed_human_loops = []
for human_loop_name in human_loops_started:
    resp = a2i.describe_human_loop(HumanLoopName=human_loop_name)
    print(f'HumanLoop Name: {human_loop_name}')
    print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
    print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
    print('\n')
    
    if resp["HumanLoopStatus"] == "Completed":
        completed_human_loops.append(resp)

The following screenshot shows the output.

Evaluating the results and retraining your model

When the labeling work is complete, your results should be available in the Amazon S3 output path specified in the human review workflow definition. The human answer, label, and bounding box are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3. The following screenshot shows the Amazon A2I annotation output JSON file.

Because we created our training set using Ground Truth, it’s in the form of an output.manifest file in the data/manifest folder. We need to do the following:

  1. Convert the Amazon A2I labeled JSON output to a .manifest file for retraining.
  2. Merge the output.manifest file with the existing training dataset .manifest file in data/manifest.
  3. Train the new model using the augmented file.

Converting the JSON output to an augmented .manifest file

You can loop through all the Amazon A2I output, convert the JSON file, and concatenate them into a JSON Lines file, in which each line represents the results of one image. Run the following code in the Amazon SageMaker Jupyter notebook to convert the Amazon A2I results into an augmented manifest:

object_categories = ['pepperoni pizza slice'] # if you have more labels, add them here
object_categories_dict = {str(i): j for i, j in enumerate(object_categories)}

dsname = 'pepperoni_pizza'

def convert_a2i_to_augmented_manifest(a2i_output):
    annotations = []
    confidence = []
    for i, bbox in enumerate(a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['boundingBoxes']):
        object_class_key = [key for (key, value) in object_categories_dict.items() if value == bbox['label']][0]
        obj = {'class_id': int(object_class_key), 
               'width': bbox['width'],
               'top': bbox['top'],
               'height': bbox['height'],
               'left': bbox['left']}
        annotations.append(obj)
        confidence.append({'confidence': 1})

    # Change the attribute name to the dataset-name_BB for this dataset. This will later be used in setting the training data
    augmented_manifest={'source-ref': a2i_output['inputContent']['taskObject'],
                        dsname+'_BB': {'annotations': annotations,
                                           'image_size': [{'width': a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['inputImageProperties']['width'],
                                                           'depth':3,
                                                           'height': a2i_output['humanAnswers'][0]['answerContent']['annotatedResult']['inputImageProperties']['height']}]},
                        dsname+'_BB-metadata': {'job-name': 'a2i/%s' % a2i_output['humanLoopName'],
                                                    'class-map': object_categories_dict,
                                                    'human-annotated':'yes',
                                                    'objects': confidence,
                                                    'creation-date': a2i_output['humanAnswers'][0]['submissionTime'],
                                                    'type':'groundtruth/object-detection'}}
    return augmented_manifest
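The notebook then applies this function to the output JSON of each completed human loop in Amazon S3 and writes one JSON line per image to a temporary manifest. The following is a hedged sketch of that step, assuming the completed_human_loops list from the earlier status check and the augmented-temp.manifest file name used in the next cell:

import json
import boto3

s3 = boto3.client('s3')

with open('augmented-temp.manifest', 'w') as outfile:
    for resp in completed_human_loops:
        # Each completed human loop writes its answers as a JSON object in S3
        output_uri = resp['HumanLoopOutput']['OutputS3Uri']
        bucket, key = output_uri.replace('s3://', '').split('/', 1)
        a2i_output = json.loads(s3.get_object(Bucket=bucket, Key=key)['Body'].read())
        # Convert to a Ground Truth style augmented manifest entry, one JSON line per image
        outfile.write(json.dumps(convert_a2i_to_augmented_manifest(a2i_output)) + '\n')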

Merging the augmented manifest with the existing training manifest

You now need to merge the output.manifest file, which consists of the existing training dataset manifest in data/manifest, with the Amazon A2I augmented .manifest file you generated from the JSON output.

Run the following notebook cell in the Amazon SageMaker Jupyter notebook to generate this file for training a new model:

f4 = open('./augmented-temp.manifest', 'r')
with open('augmented.manifest', 'w') as outfl:
    for lin1 in f4:
        z_json = json.loads(lin1)
        done_json = json.loads(lin1)
        done_json['source-ref'] = 'a'
        f3 = open('./data/manifest/output.manifest', 'r')
        for lin2 in f3:
            x_json = json.loads(lin2)
            if z_json['source-ref'] == x_json['source-ref']:
                print("replacing the annotations")
                x_json[dsname+'_BB'] = z_json[dsname+'_BB']
                x_json[dsname+'_BB-metadata'] = z_json[dsname+'_BB-metadata']
            elif done_json['source-ref'] != z_json['source-ref']:
                print("This is a net new annotation to augmented file")
                json.dump(z_json, outfl)
                outfl.write('\n')
                print(str(z_json))
                done_json = z_json
            json.dump(x_json, outfl)
            outfl.write('\n')
        f3.close()       
f4.close()    

Training a new model

To train a new model with the augmented manifest, enter the following code:

# upload the manifest file to S3
s3r.meta.client.upload_file('./augmented.manifest', BUCKET, PREFIX+'/'+'data/manifest/augmented.manifest')

We uploaded the augmented .manifest file from Amazon A2I to the S3 bucket. You can train a new model by creating a new dataset with this augmented manifest, following the steps in the Creating a dataset section of this post. The following screenshot shows some of the results from the dataset.

Follow the instructions provided in the section Creating an Amazon Rekognition custom model earlier in this post. After you train a new model using this augmented dataset, you can inspect the model metrics for accuracy improvements.

The following screenshot shows the metrics for the model trained using the Amazon A2I labeled dataset.

We noticed an overall improvement in the metrics after retraining using the augmented dataset. Moreover, we used only 12 images to achieve these performance metrics.

Cleaning up

To avoid incurring unnecessary charges, delete the resources used in this walkthrough when not in use. For instructions, see the following:

Conclusion

This post demonstrated how you can use Amazon Rekognition Custom Labels and Amazon A2I to train models to detect objects and images unique to your business and define conditions to send the predictions to a human workflow with labelers to review and update the results. You can use the human labeled output to augment the training dataset for retraining, which improves model accuracy, or you can send it to downstream applications for analytics and insights.


About the Authors

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML Explainability areas in AI/ML.

 

 

 

Prem Ranga is an Enterprise Solutions Architect based out of Houston, Texas. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an autonomous vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.

 

 

 

Neel Sendas is a Senior Technical Account Manager at Amazon Web Services. Neel works with enterprise customers to design, deploy, and scale cloud applications to achieve their business goals. He has worked on various ML use cases, ranging from anomaly detection to predictive product quality for manufacturing and logistics optimization. When he is not helping customers, he dabbles in golf and salsa dancing.


Building custom language models to supercharge speech-to-text performance for Amazon Transcribe

Amazon Transcribe is a fully managed automatic speech recognition (ASR) service that makes it easy to add speech-to-text capabilities to voice-enabled applications. As our service grows, so does the diversity of our customer base, which now spans domains such as insurance, finance, law, real estate, media, hospitality, and more. Naturally, customers in different market segments have asked Amazon Transcribe for more customization options to further enhance transcription performance.

We’re excited to introduce Custom Language Models (CLM). The new feature allows you to submit a corpus of text data to train custom language models that target domain-specific use cases. Using CLM is easy because it capitalizes on existing data that you already possess (such as marketing assets, website content, and training manuals).

In this post, we show you how to best use your available data to train a custom language model tailored for your speech-to-text use case. Although our walkthrough uses a transcription example from the video gaming industry, you can use CLM to enhance custom speech recognition for any domain of your choosing. This post assumes that you’re already familiar with how to use Amazon Transcribe, and focuses on demonstrating how to use the new CLM feature. Additional resources for using the service are available at the end.

Establishing context for evaluating CLM transcription performance

To evaluate how powerful CLM can be in terms of enhancing transcription accuracy, there are a few steps we want to take. First, we need to establish a baseline. To do that, we recorded an audio sample that contains speech content and lingo commonly found in video game chats. We also have a human-generated transcription that serves as the ground truth. This ground truth transcript is the reference we use to compare the general transcription output of Amazon Transcribe against the output of CLM. The following is a partial snippet of this reference transcript.

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. The Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz.

Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

Next, we want to run the sample audio through Amazon Transcribe using its generic speech engine and compare the text output to the ground truth transcript. We’ve produced a partial snippet of the side-by-side comparison and highlighted errors for visibility. Compared to the ground truth transcript, the default Amazon Transcribe transcript showed a Word Error Rate (WER) of 31.87%. This WER shouldn’t be interpreted as a full representation of the Amazon Transcribe service performance. It is just one instance for a very specific example audio. Note that accuracy is a measure of 100 minus WER. So the lower the WER, the higher the accuracy. For more information about calculating WER, see Word Error Rate on Wikipedia.

*Note that this WER is not a representation of the Amazon Transcribe service performance. It is just one instance for a very specific and limited test example. All figures are for single-case and limited scope illustration purposes.
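If you want to reproduce a WER figure yourself, the metric is the word-level edit distance between the reference transcript and the machine transcript, divided by the number of reference words. The following is a minimal Python sketch (not part of the original walkthrough; production tools typically also normalize punctuation and casing first):

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# Two errors over five reference words -> 40% WER
print(word_error_rate("both systems offer 3d audio",
                      "both systems offer three d audio"))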

The following text is the human-generated ground-truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

The following text is the machine-generated transcript by the Amazon Transcribe generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

Although the Amazon Transcribe generic speech model has done a decent job of transcribing the audio, it’s more likely that a tailored custom model can yield higher transcription accuracy.

Solution overview

In this post, we walk you through how to train a custom language model and evaluate how its transcription output compares to the reference transcript. You complete the following high-level steps:

  1. Prepare your training data.
  2. Train your CLM.
  3. Use your CLM to transcribe audio.
  4. Evaluate the results by comparing the CLM transcript against the generic model transcript’s accuracy.

Preparing your training data

Before we begin, it’s important to distinguish between training data and tuning data.

Training data for CLM typically includes text data that is specific to your domain. Some examples of training data could include relevant text content from your website, training manuals, sales and marketing collateral, or other text sources.

Meanwhile, human-annotated audio transcripts of actual phone calls or media content that are directly relevant to your use case can be used as tuning data. Ideally, both training and tuning data ought to be domain-specific, but in practice you may only have a small amount of audio transcriptions available. In that case, transcriptions should be used as tuning data. If more transcription data is available, it can and should be used as part of the training set as well. For more information about the difference between training and tuning data, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Like many domains, video gaming has its own set of technical jargon, syntax, and speech dynamics that may make the use of a general speech recognition engine suboptimal. To build a custom model, we first need data that’s representative of the domain. For this use case, we want free-form text from the video gaming industry. We use publicly available information about video gaming from a variety of sources. For convenience, we’ve compiled that training and tuning set, which you can download to follow along. Keep in mind that the nature, quality, and quantity of your training data has a dramatic impact on the resultant custom model you build. All else equal, it’s better to have more data than less.

As a general guideline, your training and tuning datasets should meet the following parameters (a short text-preparation sketch follows this list):

  • Is in plain text (it’s not a file such as a Microsoft Word document, CSV file, or PDF).
  • Has a single sentence per line.
  • Is encoded in UTF-8.
  • Doesn’t contain any formatting characters, such as HTML tags.
  • Is less than 2 GB in size if you intend to use the file as training data. You can provide a maximum of 2 GB of training data.
  • Is less than 200 MB in size if you intend to use the file as tuning data. You can provide a maximum of 200 MB of optional tuning data.
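As an optional convenience, you can normalize raw text toward these guidelines with a short script. The following is a rough sketch with hypothetical input and output file names; adapt the naive sentence splitting to your own data:

import re

def clean_training_text(raw_text: str) -> str:
    # Strip anything that looks like an HTML tag
    text = re.sub(r'<[^>]+>', ' ', raw_text)
    # Naive sentence split on ., ?, or ! followed by whitespace; one sentence per line
    sentences = re.split(r'(?<=[.?!])\s+', text)
    return '\n'.join(s.strip() for s in sentences if s.strip())

# Hypothetical file names for illustration
with open('raw_corpus.txt', encoding='utf-8') as f:
    cleaned = clean_training_text(f.read())
with open('training_data.txt', 'w', encoding='utf-8') as f:
    f.write(cleaned)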

The following text is a partial snippet of our example training set:

The PS5 will feature a custom eight-core AMD Zen 2 CPU clocked at 3.5GHz (variable frequency) and a custom GPU based on AMD’s RDNA 2 architecture hardware that promises 10.28 teraflops and 36 compute units clocked at 2.23GHz (also variable frequency).

It’ll also have 16GB of GDDR6 RAM and a custom 825GB SSD that Sony has previously promised will offer super-fast loading times in gameplay, via Eurogamer.

One of the biggest technical updates in the PS5 was already announced last year: a switch to SSD storage for the console’s main hard drive, which Sony says will result in dramatically faster load times.

A previous demo showed Spider-Man loading levels in less than a second on the PS5, compared to the roughly eight seconds it took on a PS4. PlayStation hardware lead Mark Cerny dove into some of the details about those SSD goals at the announcement.

Where it took a PS4 around 20 seconds to load a single gigabyte of data, the goal with the PS5’s SSD was to enable loading five gigabytes of data in a single second.

Training your custom language model

In the following steps, we show you how to train a CLM using base training data and a tuning split. Using a tuning split is entirely optional. If you prefer not to train a CLM using a tuning split, you can skip step 5.

  1. Upload your training data (and/or tuning data) to their respective Amazon Simple Storage Service (Amazon S3) buckets.
  2. On the Amazon Transcribe console, for Name, enter a name for your custom model so you can reference it for later use.
  3. For Base model, choose a model type that matches your use case, based on audio sample rate.
  4. For Training data, enter the appropriate S3 bucket where you previously uploaded your training data.
  5. For Tuning data, enter the appropriate S3 bucket where you uploaded your tuning data. (This step is optional)
  6. For Access permissions, designate the appropriate access permissions.
  7. Choose Train model.

The Amazon Transcribe service does the heavy lifting of automatically training a custom model for you.

To track the progress of the model training, go to the Custom language models page on the Amazon Transcribe console. The status indicates if the training is in progress, complete, or failed.
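If you’d rather start and monitor training programmatically, the CreateLanguageModel and DescribeLanguageModel APIs mirror the console steps above. The following is a hedged boto3 sketch with hypothetical bucket, role, and model names:

import boto3

transcribe = boto3.client('transcribe')

transcribe.create_language_model(
    LanguageCode='en-US',
    BaseModelName='WideBand',        # choose WideBand or NarrowBand to match your audio sample rate
    ModelName='my-video-gaming-clm',
    InputDataConfig={
        'S3Uri': 's3://your-bucket/clm/training-data/',
        'TuningDataS3Uri': 's3://your-bucket/clm/tuning-data/',   # optional
        'DataAccessRoleArn': 'arn:aws:iam::123456789012:role/TranscribeClmAccessRole',
    },
)

# Check the training status (IN_PROGRESS, COMPLETED, or FAILED)
model = transcribe.describe_language_model(ModelName='my-video-gaming-clm')
print(model['LanguageModel']['ModelStatus'])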

Using your custom language model to transcribe audio

When your custom model training is complete, you’re ready to use it. Simply start a typical transcription job as you would using Amazon Transcribe. However, this time, we want to invoke the custom language model we just trained, as opposed to using the default speech engine. For this post, we assume you already have familiarity with how to run a typical transcription job, so we only call out the new component: for Model type, select Custom language model.
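Programmatically, this amounts to passing your custom model name in the ModelSettings parameter when you start the transcription job. A short sketch with hypothetical job, bucket, and model names follows:

import boto3

transcribe = boto3.client('transcribe')

transcribe.start_transcription_job(
    TranscriptionJobName='gaming-clip-with-clm',
    LanguageCode='en-US',
    Media={'MediaFileUri': 's3://your-bucket/audio/gaming-clip.mp4'},
    # Use the custom language model trained earlier instead of the default speech engine
    ModelSettings={'LanguageModelName': 'my-video-gaming-clm'},
)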

Evaluating the results

When your transcription job is complete, it’s time to see how well the CLM performed. We can evaluate the output transcript against the human-annotated reference transcript, just as we had compared the generic Amazon Transcribe machine output against the human-annotated reference transcript. We actually made two CLMs (one without tuning split and one with tuning split), which we showcase as a full summary in the following table.

Transcription Type WER (%) Accuracy (100-WER)
Amazon Transcribe generic model 31.34% 68.66%
Amazon Transcribe CLM with no tuning split 26.19% 73.81%
Amazon Transcribe CLM with tuning split 20.23% 79.77%

 A lower WER is better. These WERs aren’t representative of overall Amazon Transcribe performance. All numbers are relative to demonstrate the point of using custom models over generic models, and are specific only to this singular audio sample.

The WER reductions are pretty significant! As you can see, although Amazon Transcribe’s generic engine performed decently in transcribing the sample audio from the video gaming domain, the CLM we built using training data performed 5% better. And the CLM built using training data and a tuning split performed approximately 11% better. These comparative results are unsurprising because the more relevant training and tuning that a model experiences, the more tailored it is to the specific domain and use case.

To give a qualitative visual comparison, we’ve taken a snippet of the transcript from each CLM’s output and put it against the original human annotated reference transcript to share a qualitative comparison of the different terms that were recognized by each model. We used green highlights to show the progressive accuracy improvements in each iteration.

The following text is the human-generated ground truth reference transcript:

The 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next-gen video game consoles coming out soon. So, what’s the difference in hardware specs between the upcoming Playstation 5 and Xbox Series X? Well, let’s take a look under the hoods of each next-gen gaming console. The PS5 features an AMD Zen 2 CPU with up to 3.5 GHz frequency. It sports an AMD Radeon GPU that touts 10.3 teraflops, running up to 2.23 GHz. Memory and storage respectively dial in at 16 GB and 825 GB. The PS5 supports both PS The PS5 supports both 4K and 8K resolution screens. Meanwhile, the Xbox Series X also features an AMD Zen 2 CPU, but clocks in at 3.8 GHz instead. The console boasts a similar AMD custom GPU with 12 teraflops and 1.825 GHz. Memory is the same as that of the PS5’s, coming in at 16 GB. But the default storage is where the system has an edge, bringing out a massive 1 TB hard drive. Like the PS5, the Series X also supports 4K and 8K resolution screens as well. Those of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out in practice. It’s worth noting that both systems have incorporated ray-tracing technology, something that’s used to make light and shadows look better in-game. Both systems also offer 3D audio output for immersive experiences.

The following text is the machine transcription output by Amazon Transcribe’s generic speech engine:

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in heart respects between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these Consul’s. The PS five features in a M descend to CPU with up to 3.5 gigahertz frequency is sports and AM the radio on GPU that tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dahlin at 16 gigabytes in 825 gigabytes. The PS five supports both PS. The PS five supports both four K and A K resolutions. Meanwhile, the Xbox Series X also features an AM descend to CPU, but clocks in at three point take. It hurts instead, the cost. Almost a similar AMG custom GPU, with 12 teraflops and 1.825 gigahertz memory, is the same as out of the PS. Fives come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five. The Siri’s X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The following text is the machine transcription output by CLM (base training, no tuning split):

the 2020 holiday season is right around the corner. And with the way that the years been going, we can all hope for a little excitement around the next Gen videogame consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hood of each of these consoles. The PS five features an A M D Zen two CPU with up to 3.5 gigahertz frequency it sports and am the radio GPU. That tells 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively, Dahlin at 16 gigabytes and 825 gigabytes, The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an A M D Zen two CPU, but clocks in at 3.8 hertz. Instead. The console boasts a similar a AMD custom GPU, with 12 teraflops and 1.825 hertz. Memory is the same as out of the PS five’s come in at 16 gigabytes, but the default storage is where the system has an edge. Bring out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences

The following text is the machine transcription output by CLM (base training, with tuning split):

the 2020 holiday season is right around the corner. And with the way that the year’s been going, we can all hope for a little excitement around the next Gen video game consoles coming out soon. So what’s the difference in hardware specs between the upcoming PlayStation five and Xbox Series X? Well, let’s take a look under the hoods of each of these consoles. The PS five features an AMG Zen two CPU with up to 3.5 givers frequency it sports an AMD Radeon GPU that touts 10.3 teraflops running up to 2.23 gigahertz memory and storage, respectively. Dial in at 16 gigabytes and 825 gigabytes. The PS five supports both PS. The PS five supports both four K and eight K resolutions. Meanwhile, the Xbox Series X also features an AMG Zen two CPU, but clocks in at 3.8 had hurts. Instead, the console boasts a similar AMD custom. GPU, with 12 teraflops and 1.825 gigahertz memory is the same as out of the PS five’s coming in at 16 gigabytes. But the default storage is where the system has an edge, bringing out a massive one terabyte hard drive. Like the PS five, the Series X also supports four K and eight K resolution screens as well. Those, of course, are just the numbers. Therefore, it remains to be seen exactly how the performance plays out. In practice. It’s worth noting that both systems have incorporated Ray tracing technology, something that’s used to make light and shadows look better in game. Both systems also offer three D audio output for immersive experiences.

The difference in improvement varies according to your use case and the quality of your training data and tuning set. Experimentation is encouraged. In general, the more training and tuning that your custom model undergoes, the better the performance. CLM doesn’t guarantee 100% accuracy, but it can offer significant performance improvements over generic speech recognition models.

Best practices

It’s important to note that the resultant custom language model depends directly on what you use as your training dataset. All else equal, the closer the representation of your training data to real use cases, the more performant your custom model is. Moreover, more data is always preferred. For more information about the general guidelines, see Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Amazon Transcribe doesn’t charge for CLM model training, so feel free to experiment. In a single AWS account, you can train up to 10 custom models to address different domains, use cases, or new training datasets. After you have your CLM, you can choose which transcription jobs use it. You only incur an additional CLM charge for the transcription jobs in which you apply a custom language model.

Conclusion

CLMs can be a powerful capability when it comes to improving transcription accuracy for domain-specific use cases. The new feature is available in all AWS Regions where Amazon Transcribe already operates. At the time of this writing, the feature only supports US English. Additional language support will come with time. Start training your own custom models by visiting Amazon Transcribe and checking out Improving Domain-Specific Transcription Accuracy with Custom Language Models.

Related Resources

For additional resources, see the following:

 


About the Authors

Paul Zhao is Lead Product Manager at AWS Machine Learning. He manages speech recognition services like Amazon Transcribe and Amazon Transcribe Medical. He was formerly a serial entrepreneur, having launched, operated, and exited two successful businesses in the areas of IoT and FinTech, respectively.

 

 

 

Vivek Govindan is Senior Software Development engineer at AWS Machine Learning. Outside of work, Vivek is an ardent soccer fan.

 


Running on-demand, serverless Apache Spark data processing jobs using Amazon SageMaker managed Spark containers and the Amazon SageMaker SDK

Apache Spark is a unified analytics engine for large-scale, distributed data processing. Typically, businesses with Spark-based workloads on AWS use their own stack built on top of Amazon Elastic Compute Cloud (Amazon EC2), or Amazon EMR to run and scale Apache Spark, Hive, Presto, and other big data frameworks. This is useful for persistent workloads, in which you want these Spark clusters to be up and running 24/7; otherwise, you have to come up with an architecture to spin the cluster up and down on a schedule or on demand.

Amazon SageMaker Processing lets you easily run preprocessing, postprocessing, model evaluation, or other fairly generic transform workloads on a fully managed infrastructure. Previously, Amazon SageMaker Processing included a built-in container for Scikit-learn style preprocessing. To use other libraries like Spark, you have the flexibility to bring your own Docker containers. Amazon SageMaker Processing jobs can also be part of your Step Functions workflow for ML involving pre- and post-processing steps. For more information, see AWS Step Functions adds support for Amazon SageMaker Processing.

Several machine learning (ML) workflows involve preprocessing data with Spark (or other libraries) and then passing training data to a training step. The following workflow shows an Extract, Transform, and Load (ETL) step that leads to model training and finally to model endpoint deployment using AWS Step Functions.

Including Spark steps in such workflows requires additional steps to provision and set up these clusters. Alternatively, you can do this using AWS Glue, a fully managed ETL service that makes it easy for customers to write Python or Scala based scripts to preprocess data for ML training.

We’re happy to add a managed Spark container and associated SDK enhancements to Amazon SageMaker Processing, which lets you perform large scale, distributed processing on Spark by simply submitting a PySpark or Java/Scala Spark application. You can use this feature in Amazon SageMaker Studio and Amazon SageMaker notebook instances.

To demonstrate, the following code example runs a PySpark script on Amazon SageMaker Processing by using the PySparkProcessor:

from sagemaker.spark.processing import PySparkProcessor

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
) 

spark_processor.run(
    submit_app_py="./path/to/your/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix],
    spark_event_logs_s3_uri='s3://' + bucket + '/' + prefix + '/spark_event_logs',
    logs=False
)

Let’s look at this example in more detail. The PySpark script, named preprocess.py and shown following, loads a large CSV file from Amazon Simple Storage Service (Amazon S3) into a Spark DataFrame, fits and transforms it into an output DataFrame, and saves the result back to Amazon S3 in CSV format:

import time
import sys
import os
import shutil
import csv

import pyspark
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.sql.types import StructField, StructType, StringType, DoubleType
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql.functions import *


def csv_line(data):
    r = ','.join(str(d) for d in data[1])
    return str(data[0]) + "," + r


def main():
    spark = SparkSession.builder.appName("PySparkApp").getOrCreate()
    
    # Convert command line args into a map of args
    args_iter = iter(sys.argv[1:])
    args = dict(zip(args_iter, args_iter))

    spark.sparkContext._jsc.hadoopConfiguration().set("mapred.output.committer.class",
                                                      "org.apache.hadoop.mapred.FileOutputCommitter")
    
    # Defining the schema corresponding to the input data. The input data does not contain the headers
    schema = StructType([StructField("sex", StringType(), True), 
                         StructField("length", DoubleType(), True),
                         StructField("diameter", DoubleType(), True),
                         StructField("height", DoubleType(), True),
                         StructField("whole_weight", DoubleType(), True),
                         StructField("shucked_weight", DoubleType(), True),
                         StructField("viscera_weight", DoubleType(), True), 
                         StructField("shell_weight", DoubleType(), True), 
                         StructField("rings", DoubleType(), True)])

    # Downloading the data from S3 into a Dataframe
    total_df = spark.read.csv(('s3://' + os.path.join(args['s3_input_bucket'], args['s3_input_key_prefix'],'abalone.csv')), header=False, schema=schema)

    #StringIndexer on the sex column which has categorical value
    sex_indexer = StringIndexer(inputCol="sex", outputCol="indexed_sex")
    
    #one-hot-encoding is being performed on the string-indexed sex column (indexed_sex)
    sex_encoder = OneHotEncoder(inputCol="indexed_sex", outputCol="sex_vec")

    #vector-assembler will bring all the features to a 1D vector for us to save easily into CSV format
    assembler = VectorAssembler(inputCols=["sex_vec", 
                                           "length", 
                                           "diameter", 
                                           "height", 
                                           "whole_weight", 
                                           "shucked_weight", 
                                           "viscera_weight", 
                                           "shell_weight"], 
                                outputCol="features")
    
    # The pipeline comprises the steps added above
    pipeline = Pipeline(stages=[sex_indexer, sex_encoder, assembler])
    
    # This step trains the feature transformers
    model = pipeline.fit(total_df)
    
    # This step transforms the dataset with information obtained from the previous fit
    transformed_total_df = model.transform(total_df)
    
    # Split the overall dataset into 80-20 training and validation
    (train_df, validation_df) = transformed_total_df.randomSplit([0.8, 0.2])
    
    # Convert the train dataframe to RDD to save in CSV format and upload to S3
    train_rdd = train_df.rdd.map(lambda x: (x.rings, x.features))
    train_lines = train_rdd.map(csv_line)
    train_lines.saveAsTextFile('s3://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'train'))
    
    # Convert the validation dataframe to RDD to save in CSV format and upload to S3
    validation_rdd = validation_df.rdd.map(lambda x: (x.rings, x.features))
    validation_lines = validation_rdd.map(csv_line)
    validation_lines.saveAsTextFile('s3://' + os.path.join(args['s3_output_bucket'], args['s3_output_key_prefix'], 'validation'))


if __name__ == "__main__":
    main()

You can easily start a Spark-based processing job by using the PySparkProcessor() class, as shown in the following code:

from sagemaker.spark.processing import PySparkProcessor
from time import gmtime, strftime

# Upload the raw input dataset to S3
timestamp_prefix = strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = 'sagemaker/spark-preprocess-demo/' + timestamp_prefix
input_prefix_abalone = prefix + '/input/raw/abalone'
input_preprocessed_prefix_abalone = prefix + '/input/preprocessed/abalone'
model_prefix = prefix + '/model'

sagemaker_session.upload_data(path='./data/abalone.csv', bucket=bucket, key_prefix=input_prefix_abalone)

# Run the processing job
spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
)

spark_processor.run(
    submit_app_py="./code/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    spark_event_logs_s3_uri='s3://' + bucket + '/' + prefix + '/spark_event_logs',
    logs=False
)

When you run this in Amazon SageMaker Studio or an Amazon SageMaker notebook instance, the output shows the job’s progress:

Job Name:  sm-spark-<...>
Inputs:  [{'InputName': 'code', 'S3Input': {'S3Uri': 's3://<bucketname>/<prefix>/preprocess.py', 'LocalPath': '/opt/ml/processing/input/code', 'S3DataType': 'S3Prefix', 'S3InputMode': 'File', 'S3DataDistributionType': 'FullyReplicated', 'S3CompressionType': 'None'}}]
Outputs:  [{'OutputName': 'output-1', 'S3Output': {'S3Uri': 's3://<bucketname>/<prefix>', 'LocalPath': '/opt/ml/processing/spark-events/', 'S3UploadMode': 'Continuous'}}]

In Amazon SageMaker Studio, you can describe your processing jobs and view relevant details by choosing the processing job name (right-click), and choosing Open in trial details.

You can also track the processing job’s settings, logs, and metrics on the Amazon SageMaker console as shown in the following screenshot.
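You can also retrieve the same details programmatically. The following is a minimal sketch using the boto3 SageMaker client; the job name shown is a placeholder, so use the name printed by the run() call:

import boto3

sm_client = boto3.client("sagemaker")

# Placeholder job name; use the name printed by the run() call above
details = sm_client.describe_processing_job(ProcessingJobName="sm-spark-2020-10-21-00-00-00-000")
print(details["ProcessingJobStatus"])
print(details["AppSpecification"]["ImageUri"])  # the managed Spark container image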

After a job completes, if spark_event_logs_s3_uri was specified in the run() function, you can view the Spark UI by running the history server:

spark_processor.start_history_server()

If you run it from an Amazon SageMaker notebook instance, the output includes a proxy URL where you can access the history server:

Starting history server...
History server is up on https://<your-notebook>.notebook.us-west-2.sagemaker.aws/proxy/15050

Visiting this URL brings you to the history server web interface, as shown in the following screenshot:

You can also specify additional Python and JAR file dependencies in your Spark jobs. For example, if you want to serialize an MLeap model, you can specify these additional dependencies by modifying the call to the run() function of PySparkProcessor:

spark_processor.run(
    submit_app_py="./code/preprocess-mleap.py",
    submit_py_files=["./spark-mleap/mleap-0.15.0.zip"],
    submit_jars=["./spark-mleap/mleap-spark-assembly.jar"],
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    logs=False
)

Finally, overriding the Spark configuration is crucial for several tasks, such as tuning your Spark application or configuring the Hive metastore. You can override Spark, Hive, and Hadoop configurations using the SageMaker Python SDK.

For example, the following code overrides spark.executor.memory and spark.executor.cores:

spark_processor = PySparkProcessor(
    base_job_name="sm-spark",
    framework_version="2.4",
    role=role,
    instance_count=2,
    instance_type="ml.c5.xlarge",
    max_runtime_in_seconds=1200,
)

configuration = [{
  "Classification": "spark-defaults",
  "Properties": {"spark.executor.memory": "2g", "spark.executor.cores": "1"},
}]

spark_processor.run(
    submit_app_py="./code/preprocess.py",
    arguments=['s3_input_bucket', bucket,
               's3_input_key_prefix', input_prefix_abalone,
               's3_output_bucket', bucket,
               's3_output_key_prefix', input_preprocessed_prefix_abalone],
    configuration=configuration,
    logs=False
)

Try out this example on your own by navigating to the examples tab in your Amazon SageMaker notebook instance, or by cloning the Amazon SageMaker examples directory and navigating to the folder with Amazon SageMaker Processing examples.

Additionally, you can set up an end-to-end Spark workflow for your use cases using Amazon SageMaker and other AWS services:

Conclusion

Amazon SageMaker makes extensive use of Docker containers to allow users to build a runtime environment for data preparation, training, and inference code. Amazon SageMaker’s built-in Spark container for Amazon SageMaker Processing provides a managed Spark runtime, including all library components and dependencies needed to run distributed data processing workloads. The example discussed in this post shows how developers and data scientists can take advantage of the built-in Spark container on Amazon SageMaker to focus on the more important aspects of preparing and preprocessing data. Instead of spending time tuning, scaling, or managing Spark infrastructure, developers can focus on core implementation.


About the Authors

Shreyas Subramanian is an AI/ML Specialist Solutions Architect who helps customers solve their business challenges using machine learning on the AWS platform.

Andrew Packer is a Software Engineer in Amazon AI where he is excited about building scalable, distributed machine learning infrastructure for the masses. In his spare time, he likes playing guitar and exploring the PNW.

Vidhi Kastuar is a Sr. Product Manager for Amazon SageMaker, focusing on making machine learning and artificial intelligence simple, easy to use and scalable for all users and businesses. Prior to AWS, Vidhi was Director of Product Management at Veritas Technologies. For fun outside work, Vidhi loves to sketch and paint, work as a career coach, and spend time with her family and friends.

Read More

AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS Inferentia is now available in 11 AWS Regions, with best-in-class performance for running object detection models at scale

AWS has expanded the availability of Amazon EC2 Inf1 instances to four new AWS Regions, bringing the total number of supported Regions to 11: US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Mumbai, Singapore, Sydney, Tokyo), Europe (Frankfurt, Ireland, Paris), and South America (São Paulo).

Amazon EC2 Inf1 instances are powered by AWS Inferentia chips, which are custom-designed to provide you with the lowest cost per inference in the cloud and lower the barriers for everyday developers to use machine learning (ML) at scale. Customers using models such as YOLO v3 and YOLO v4 can get up to 1.85 times higher throughput and up to 40% lower cost per inference compared to the EC2 G4 GPU-based instances.

As you scale your use of deep learning across new applications, you may be bound by the high cost of running trained ML models in production. In many cases, up to 90% of the infrastructure cost spent on developing and running an ML application is on inference, making the need for high-performance, cost-effective ML inference infrastructure critical. Inf1 instances are built from the ground up to deliver faster performance and more cost-effective ML inference than comparable GPU-based instances. This gives you the performance and cost structure you need to confidently deploy your deep learning models across a broad set of applications.

AWS Neuron SDK performance and support for new ML models

You can deploy your ML models to Inf1 instances natively with popular ML frameworks such as TensorFlow, PyTorch, and MXNet. You can deploy your existing models to Amazon EC2 Inf1 instances with minimal code changes by using the AWS Neuron SDK, which is integrated with these popular ML frameworks. This gives you the freedom to maintain hardware portability and take advantage of the latest technologies without being tied to vendor-specific software libraries.
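For example, compiling an existing PyTorch model for Inferentia typically takes a single trace call. The following is a minimal sketch, assuming the torch-neuron package from the Neuron SDK is installed (for example, via the AWS Deep Learning AMI); the model and input shape are illustrative:

import torch
import torch_neuron  # registers the torch.neuron namespace
from torchvision import models

# Load a pretrained model and create an example input with the expected shape
model = models.resnet50(pretrained=True)
model.eval()
example = torch.zeros(1, 3, 224, 224)

# Compile for AWS Inferentia; operators that can't be compiled fall back to CPU
model_neuron = torch.neuron.trace(model, example_inputs=[example])
model_neuron.save("resnet50_neuron.pt")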

Since its launch, the Neuron SDK has seen a dramatic improvement in the breadth of models that deliver best-in-class performance at a fraction of the cost. This includes natural language processing models like the popular BERT, image classification models (ResNet and VGG), and object detection models (OpenPose and SSD). The latest Neuron release (1.8.0) provides optimizations that improve performance of YOLO v3 and v4, VGG16, SSD300, and BERT. It also improves operational deployments of large-scale inference applications, with a session management agent incorporated into all supported ML frameworks and a new neuron tool that allows you to easily scale monitoring of large fleets of inference applications.

Customer success stories

Since the launch of Inf1 instances, a broad spectrum of customers, from large enterprises to startups, as well as Amazon services, have begun using them to run production workloads.

Anthem is one of the nation’s leading health benefits companies, serving the healthcare needs of over 40 million members across dozens of states. They use deep learning to automate the generation of actionable insights from customer opinions via natural language models.

“Our application is computationally intensive and needs to be deployed in a highly performant manner,” says Numan Laanait, PhD, Principal AI/Data Scientist at Anthem. “We seamlessly deployed our deep learning inferencing workload onto Amazon EC2 Inf1 instances powered by the AWS Inferentia processor. The new Inf1 instances provide two times higher throughput compared to GPU-based instances and allowed us to streamline our inference workloads.”

Condé Nast, another AWS customer, has a global portfolio that encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair.

“Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips,” says Paul Fryzel, Principal Engineer in AI Infrastructure at Condé Nast. “This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a 72% reduction in cost compared to the previously deployed GPU instances.”

Getting started

The easiest and quickest way to get started with Inf1 instances is via Amazon SageMaker, a fully managed service for building, training, and deploying ML models. If you prefer to manage your own ML application development platforms, you can get started by either launching Inf1 instances with AWS Deep Learning AMIs, which include the Neuron SDK, or use Inf1 instances via Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Container Service (Amazon ECS) for containerized ML applications.

For more information, see Amazon EC2 Inf1 Instances.


About the Author

Gadi Hutt is a Sr. Director, Business Development at AWS. Gadi has over 20 years’ experience in engineering and business disciplines. He started his career as an embedded software engineer, and later moved to product lead positions. Since 2013, Gadi has led Annapurna Labs technical business development and product management, focused on hardware acceleration software and hardware products like the EC2 FPGA F1 instances and AWS Inferentia, alongside its Neuron SDK, accelerating machine learning in the cloud.

Read More

Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

Moving from notebooks to automated ML pipelines using Amazon SageMaker and AWS Glue

A typical machine learning (ML) workflow involves processes such as data extraction, data preprocessing, feature engineering, model training and evaluation, and model deployment. As data changes over time, when you deploy models to production, you want your model to learn continually from the stream of data. This means supporting the model’s ability to autonomously learn and adapt in production as new data is added.

In practice, data scientists often work with Jupyter notebooks for development work and find it hard to translate from notebooks to automated pipelines. To achieve the two main functions of an ML service in production, namely retraining (retrain the model on newer labeled data) and inference (use the trained model to get predictions), you might primarily use the following:

  • Amazon SageMaker – A fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly
  • AWS Glue – A fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data

In this post, we demonstrate how to orchestrate an ML training pipeline using AWS Glue workflows and train and deploy the models using Amazon SageMaker. For this use case, you use AWS Glue workflows to build an end-to-end ML training pipeline that covers data extraction, data processing, training, and deploying models to Amazon SageMaker endpoints.

Use case

For this use case, we use the DBpedia Ontology classification dataset to build a model that performs multi-class classification. We trained the model using the BlazingText algorithm, which is a built-in Amazon SageMaker algorithm that can classify unstructured text data into multiple classes.

This post doesn’t go into the details of the model, but demonstrates a way to build an ML pipeline that builds and deploys any ML model.

Solution overview

The following diagram summarizes the approach for the retraining pipeline.

The workflow contains the following elements:

  • AWS Glue crawler – You can use a crawler to populate the Data Catalog with tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. ETL jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.
  • AWS Glue triggers – Triggers are Data Catalog objects that you can use to either manually or automatically start one or more crawlers or ETL jobs. You can design a chain of dependent jobs and crawlers by using triggers.
  • AWS Glue job – An AWS Glue job encapsulates a script that connects source data, processes it, and writes it to a target location.
  • AWS Glue workflow – An AWS Glue workflow can chain together AWS Glue jobs, data crawlers, and triggers, and build dependencies between the components. When the workflow is triggered, it follows the chain of operations as described in the preceding image.

The workflow begins by downloading the training data from Amazon Simple Storage Service (Amazon S3), followed by running data preprocessing steps and dividing the data into train, test, and validation sets in AWS Glue jobs. The training step runs as a Python shell job in AWS Glue, which starts a training job in Amazon SageMaker based on a set of hyperparameters, as sketched in the following code.
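The following sketch shows the general shape of that boto3 call as it might appear in glue_scripts/TrainingJob.py; the image URI matches the us-west-2 example used later in this post, while the role, bucket names, and hyperparameters are placeholders, and the actual script may differ:

import boto3
from time import gmtime, strftime

sagemaker_client = boto3.client("sagemaker")
job_name = "blazingtext-dbpedia-" + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

# Placeholder role, buckets, and hyperparameters
sagemaker_client.create_training_job(
    TrainingJobName=job_name,
    AlgorithmSpecification={
        "TrainingImage": "433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    HyperParameters={"mode": "supervised", "epochs": "10"},
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/dbpedia/train",
            "S3DataDistributionType": "FullyReplicated",
        }},
        "ContentType": "text/plain",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/dbpedia/output"},
    ResourceConfig={"InstanceType": "ml.c5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 30},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)

# Block until training finishes so the next Glue job can create the endpoint
waiter = sagemaker_client.get_waiter("training_job_completed_or_stopped")
waiter.wait(TrainingJobName=job_name)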

When the training job is complete, an endpoint is created, which is hosted on Amazon SageMaker. This job in AWS Glue takes a few minutes to complete because it makes sure that the endpoint is in InService status.

At the end of the workflow, a message is sent to an Amazon Simple Queue Service (Amazon SQS) queue, which you can use to integrate with the rest of the application. You can also use the queue to trigger an action to send emails to data scientists that signal the completion of training, add records to management or log tables, and more.

Setting up the environment

To set up the environment, complete the following steps:

  1. Configure the AWS Command Line Interface (AWS CLI) and a profile to use to run the code. For instructions, see Configuring the AWS CLI.
  2. Make sure you have the Unix utility wget installed on your machine to download the DBpedia dataset from the internet.
  3. Download the following code into your local directory.

Organization of code

The code to build the pipeline has the following directory structure:

--Glue workflow orchestration
	--glue_scripts
		--DataExtractionJob.py
		--DataProcessingJob.py
		--MessagingQueueJob.py
		--TrainingJob.py
	--base_resources.template
	--deploy.sh
	--glue_resources.template

The code directory is divided into three parts:

  • AWS CloudFormation templates – The directory has two AWS CloudFormation templates: glue_resources.template and base_resources.template. The glue_resources.template template creates the AWS Glue workflow-related resources, and base_resources.template creates the Amazon S3, AWS Identity and Access Management (IAM), and SQS queue resources. The CloudFormation templates create the resources and write their names and ARNs to AWS Systems Manager Parameter Store, which allows easy and secure access to ARNs further in the workflow.
  • AWS Glue scripts – The folder glue_scripts holds the scripts that correspond to each AWS Glue job. This includes the ETL as well as model training and deploying scripts. The scripts are copied to the correct S3 bucket when the bash script runs.
  • Bash script – A wrapper script deploy.sh is the entry point to running the pipeline. It runs the CloudFormation templates and creates resources in the dev, test, and prod environments. You use the environment name, also referred to as stage in the script, as a prefix to the resource names. The bash script performs other tasks, such as downloading the training data and copying the scripts to their respective S3 buckets. However, in a real-world use case, you can extract the training data from databases as a part of the workflow using crawlers.

Implementing the solution

Complete the following steps:

  1. Go to the deploy.sh file and replace algorithm_image name with <ecr_path> based on your Region.

The following code example is a path for Region us-west-2:

algorithm_image="433757028032.dkr.ecr.us-west-2.amazonaws.com/blazingtext:latest"

For more information about BlazingText parameters, see Common parameters for built-in algorithms.

  2. Enter the following code in your terminal:
    sh deploy.sh -s dev AWS_PROFILE=your_profile_name

This step sets up the infrastructure of the pipeline.

  3. On the AWS CloudFormation console, check that the templates have the status CREATE_COMPLETE.
  4. On the AWS Glue console, manually start the pipeline.

In a production scenario, you can trigger this manually through a UI or automate it by scheduling the workflow to run at the prescribed time. The workflow provides a visual of the chain of operations and the dependencies between the jobs.
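For example, you could start the workflow programmatically with the AWS Glue API. The following is a minimal sketch using boto3; DevMLWorkflow is the workflow name that the dev stage creates:

import boto3

glue = boto3.client("glue")

# Start the dev-stage workflow and print the run ID
run = glue.start_workflow_run(Name="DevMLWorkflow")
print("Started workflow run:", run["RunId"])

# Optionally poll the run status
status = glue.get_workflow_run(Name="DevMLWorkflow", RunId=run["RunId"])
print(status["Run"]["Status"])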

  5. To begin the workflow, in the Workflow section, select DevMLWorkflow.
  6. From the Actions drop-down menu, choose Run.
  7. View the progress of your workflow on the History tab and select the latest RUN ID.

The workflow takes approximately 30 minutes to complete. The following screenshot shows the view of the workflow post-completion.

  8. After the workflow is successful, open the Amazon SageMaker console.
  9. Under Inference, choose Endpoints.

The following screenshot shows that the endpoint the workflow deployed is ready.

Amazon SageMaker also provides details about the model metrics calculated on the validation set in the training job window. You can further enhance model evaluation by invoking the endpoint using a test set and calculating the metrics as necessary for the application.
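For example, you could invoke the deployed endpoint on a few held-out sentences and compute your own metrics. The following is a minimal sketch; the endpoint name is a placeholder (use the one shown on the console), and the JSON payload follows the format BlazingText expects for text classification:

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder endpoint name; use the endpoint created by the workflow
payload = {"instances": ["anarchism is a political philosophy that advocates self-governed societies"]}
response = runtime.invoke_endpoint(
    EndpointName="dev-blazingtext-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read()))  # predicted label(s) and probabilities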

Cleaning up

Make sure to delete the Amazon SageMaker hosting services—endpoints, endpoint configurations, and model artifacts. Delete both CloudFormation stacks to roll back all other resources. See the following code:

    def delete_resources(self):

        endpoint_name = self.endpoint

        try:
            sagemaker.delete_endpoint(EndpointName=endpoint_name)
            print("Deleted Test Endpoint ", endpoint_name)
        except Exception as e:
            print("Model endpoint deletion failed")

        try:
            sagemaker.delete_endpoint_config(EndpointConfigName=endpoint_name)
            print("Deleted Test Endpoint Configuration ", endpoint_name)
        except Exception as e:
            print("Endpoint config deletion failed")

        try:
            sagemaker.delete_model(ModelName=endpoint_name)
            print("Deleted Test Endpoint Model ", endpoint_name)
        except Exception as e:
            print("Model deletion failed")

Conclusion

This post describes a way to build an automated ML pipeline that not only trains and deploys ML models using a managed service such as Amazon SageMaker, but also performs ETL within a managed service such as AWS Glue. A managed service unburdens you from allocating and managing resources, such as Spark clusters, and makes it easy to move from notebook setups to production pipelines.


About the Authors

Sai Sharanya Nalla is a Data Scientist at AWS Professional Services. She works with customers to develop and implement AI and ML solutions on AWS. In her spare time, she enjoys listening to podcasts and audiobooks, long walks, and engaging in outreach activities.

Inchara B Diwakar is a Data Scientist at AWS Professional Services. She designs and engineers ML solutions at scale, with experience across healthcare, manufacturing and retail verticals. Outside of work, she enjoys the outdoors, traveling and a good read.

Read More

BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents

BERT inference on G4 instances using Apache MXNet and GluonNLP: 1 million requests for 20 cents

Bidirectional Encoder Representations from Transformers (BERT) [1] has become one of the most popular models for natural language processing (NLP) applications. BERT can outperform other models in several NLP tasks, including question answering and sentence classification.

Training the BERT model on large datasets is expensive and time consuming, and achieving low latency when performing inference on this model is challenging. Latency and throughput are key factors to deploy a model in production. In this post, we focus on optimizing these factors for BERT inference tasks. We also compare the cost of deploying BERT on different Amazon Elastic Compute Cloud (Amazon EC2) instances.

When running inference on the BERT-base model, the g4dn.xlarge GPU instance achieves between 2.6 and 5 times lower latency (3.8 times on average) than a c5.24xlarge CPU instance. The g4dn.xlarge instance also achieves the best cost per request compared to the c5.xlarge, c5.24xlarge, and m5.xlarge CPU instances. Specifically, the cost of processing 1 million BERT-inference requests with sequence length 128 is $0.20 on g4dn.xlarge, whereas on c5.xlarge (the best of these CPU instances), the cost is $3.31, making the GPU instance 16.5 times more cost-efficient.

We achieved these results after a set of GPU optimizations on MXNet, described in the section Optimizing BERT model performance on MXNET 1.6 and 1.7 of this post.

Amazon EC2 G4 instances

G4 instances are optimized for machine learning application deployments. They’re equipped with NVIDIA T4 GPUs, powered by Tensor Cores, and deliver groundbreaking AI performance: up to 65 TFLOPS in FP16 precision and up to 130 TOPS in INT8 precision.

Amazon EC2 offers a variety of G4 instances with one or multiple GPUs, and with different amounts of vCPU and memory. You can perform BERT inference in under 5 milliseconds on a single T4 GPU with 16 GB, such as on a g4dn.xlarge instance (the cost of this instance at the time of writing is $0.526 per hour on demand in the US East (N. Virginia) Region).

For more information about G4 instances, see Amazon EC2 G4 Instances.

GluonNLP and MXNet

GluonNLP is a deep learning toolkit built on top of MXNet that is specifically designed for NLP applications. It extends MXNet, providing NLP models, datasets, and examples.

GluonNLP includes an efficient implementation of the BERT model, scripts for training and performing inference, and several datasets (such as GLUE benchmark and SQuAD). For more information, see GluonNLP: NLP made easy.

For this post, we use the GluonNLP BERT implementation to perform inference on NLP tasks. Specifically, we use MXNet version 1.7 and GluonNLP version 0.10.0.
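As a quick illustration of the API (separate from the benchmark scripts), you can load the pretrained BERT-base model and its vocabulary in a few lines; the dataset name below refers to the standard English uncased vocabulary shipped with GluonNLP:

import mxnet as mx
import gluonnlp as nlp

ctx = mx.gpu(0)  # use mx.cpu() if no GPU is available

# BERT-base: 12 transformer-encoder layers, hidden size 768, 12 attention heads
model, vocab = nlp.model.get_model(
    "bert_12_768_12",
    dataset_name="book_corpus_wiki_en_uncased",
    pretrained=True,
    ctx=ctx,
    use_classifier=False,
    use_decoder=False,
)
print(model)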

BERT-base inference results

We present results for two different BERT tasks: question answering and classification (sentiment analysis using the Stanford Sentiment Treebank (SST2) dataset). We achieved these results after a set of GPU optimizations on MXNet.

In the following graphs, we compare latency achieved by a single GPU on a g4dn.xlarge instance with FP16 precision vs. the most efficient CPU instance in terms of latency, c5.24xlarge with INT8 precision, MKL BLAS and 24 OpenMP threads.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a classification inference task (SST2 dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.

The following graph shows BERT-base latency on c5.24xlarge (INT8) and g4dn.xlarge (FP16) instances performing a question answering inference task (SQuAD dataset). Different sequence length values (80, 128, 384) and different batch sizes (1, 4, 8, 16, 32, 64, 128, 300) are shown. For sequence length 128, the values are included as labels.

 

In the following two graphs, we present a cost comparison between several instances based on the throughput (sentences/s) and the cost of each instance on demand (cost per hour) in the US East (N. Virginia) Region.
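The dollars-per-million-requests figures in these graphs follow directly from the measured throughput and the hourly on-demand price. The following minimal sketch shows the calculation; the throughput value used here is illustrative, not a benchmark number:

def cost_per_million_requests(price_per_hour, sentences_per_second):
    """Dollars to process 1 million requests at a sustained throughput."""
    requests_per_hour = sentences_per_second * 3600
    return price_per_hour / requests_per_hour * 1_000_000

# Illustrative: a g4dn.xlarge at $0.526/hour sustaining ~730 sentences/s
# works out to roughly $0.20 per 1 million requests.
print(round(cost_per_million_requests(0.526, 730), 2))  # 0.2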

The following graph shows dollars per 1 million sequence classification requests for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The price on demand of each instance per hour is based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

The following graph shows dollars per 1 million question answering requests for different instances, batch size 128, and several sequence lengths (80, 128, and 384). The price on demand of each instance per hour is based on the US East (N. Virginia) Region: $0.192 for m5.xlarge, $0.17 for c5.xlarge, $4.08 for c5.24xlarge, and $0.526 for g4dn.xlarge.

Deploying BERT on G4 instances

You can easily reproduce the results in the preceding section on a g4dn.xlarge instance. You can start from a pretrained model and fine-tune it for a specific task before running inference, or you can download one of the following fine-tuned models:

Then complete the following steps:

  1. To initialize a G4 instance, on the Amazon EC2 console, choose Deep Learning AMI (Ubuntu 18.04) Version 28.1 (or posterior) and a G4 instance.
  2. Connect to the instance and install MXNet 1.7 and GluonNLP 0.10.x:
pip install mxnet-cu102==1.7.0
git clone --branch v0.10.x https://github.com/dmlc/gluon-nlp.git
cd gluon-nlp; pip install -e .; cd scripts/bert
python setup.py install

The command python setup.py install generates a custom graph pass (bertpass_lib.so) that optimizes the graph and therefore improves performance. You can pass it to the inference script as an argument.

  3. If you didn’t download any fine-tuned parameters, you can now fine-tune your model to specify a sequence length and use a GPU.
    • For a question answering task, run the following script (approximately 180 minutes):
python3 finetune_squad.py --max_seq_length 128 --gpu
    • For a classification task, run the following script:
python3 finetune_classifier.py --task_name [task_name] --max_len 128 --gpu 0

In the preceding code, task choices include ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’ (refers to SST2), ‘XNLI’, ‘LCQMC’, and ‘ChnSentiCorp’. Computation time depends on the specific task. For SST, it should take less than 15 minutes.

By default, these scripts run 3 epochs (to achieve the published accuracy in [1]).

They generate an output file, output_dir/net.params, where the fine-tuned parameters are stored and from which they can be loaded at the inference step. The scripts also run a prediction test to check accuracy.

You should get an F1 score of 85 or higher on the question answering task, and a validation metric higher than 0.92 on the SST classification task.

You can now perform inference using validation datasets.

  4. Force MXNet to use FP32 precision in Softmax and LayerNorm layers for better accuracy when using FP16.

These two layers are susceptible to overflow, so we recommend always using FP32. MXNet takes care of it if you set the following:

export MXNET_SAFE_ACCUMULATION=1
  5. Activate True FP16 computation for performance purposes.

General matrix multiply operations don’t present accuracy issues in this model. By default, they’re computed using FP32 accumulation (for more information, see the section Optimizing BERT model performance on MXNET 1.6 and 1.7 in this post), but you can activate the FP16 accumulation setting:

 export MXNET_FC_TRUE_FP16=1
  6. Run inference:
python3 deploy.py --model_parameters [path_to_finetuned_params] --task [_task_] --gpu 0 --dtype float16 --custom_pass=bertpass_lib.so

In the preceding code, the task can be one of ‘QA’, ‘embedding’, ‘MRPC’, ‘QQP’, ‘QNLI’, ‘RTE’, ‘STS-B’, ‘CoLA’, ‘MNLI’, ‘WNLI’, ‘SST’, ‘XNLI’, ‘LCQMC’, or ‘ChnSentiCorp’ [1].

This command exports the model (JSON and parameter files) into the output directory (output_dir/[task_name]), and performs inference using the validation dataset corresponding to each task.

It reports the average latency and throughput.

The second time you run it, you can skip the export step by adding the tag --only_infer and specifying the exported model to use by adding --exported_model followed by the prefix name of the JSON or parameter files.

Optimal latency is achieved on G4 instances with FP16 precision. We recommend adding the flag --dtype float16 and activating MXNET_FC_TRUE_FP16 when performing inference. These flags shouldn’t reduce the final accuracy of your results.

By default, all these scripts use BERT-base (12 transformer-encoder layers). If you want to use BERT-large, use the flag --bert_model bert_24_1024_16 when calling the scripts.

Optimizing BERT model performance on MXNet 1.6 and 1.7

Computationally, the BERT model is mainly dominated by general matrix multiply operations (GEMMs). They represent up to 56% of time consumed when performing inference. The following chart shows the percentage of computational time spent on each operation type performing BERT-base inference (sequence length 128 and batch size 128).

MXNet uses the cuBLAS library to efficiently compute these GEMMs on the GPU. These GEMMs belong to the multi-head self-attention part of the model (4 GEMMs per transformer layer), and the feed-forward network (2 GEMMs per transformer layer).

In this section, we discuss optimizing the most computational-consuming operations.

The following table shows the improvement of each optimization. The performance improvements were achieved by the different GPU BERT optimizations implemented on MXNet and GluonNLP, performing a question answering inference task (SQuAD dataset), and using a sequence length of 128. Speedup achieved is shown for different batch sizes.

LayerNorm, Softmax and AddBias

Although LayerNorm was already optimized for GPUs on MXNet 1.5, the implementation of Softmax was optimized in MXNet 1.6. The new implementation improves inference performance on GPUs by optimizing the device memory accesses and using the CUDA registers and shared memory during reduction operations more efficiently. Additionally, you have the option to apply a max_length mask within the C++ Softmax operator, which removes the need to apply the mask at the Python level.

The addition of bias terms following GEMMs was also optimized. Instead of using an mshadow broadcast summation, a custom CUDA kernel is now attached to the FullyConnected layer, which includes efficient device memory accesses.

Multi-head self-attention

The following equation defines the attention mechanism used in the BERT model [2], where Q represents the query, K the key, V the value, and d_k the inner dimension of these three matrices:
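For reference, the attention function from [2] is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$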

Three different linear projections (FullyConnected: GEMMs and Bias-Addition) are performed to obtain Q, K, and V from the same input (when the same input is employed, the mechanism is denominated self-attention), but with different weights:

    • Q = input · Wq^T
    • K = input · Wk^T
    • V = input · Wv^T

The input size is (BatchSize, SeqLength, EmbeddingDim), and each weight tensor W size is (ProjectionDim, EmbeddingDim).

In multi-head attention, as many projections and attention functions as there are heads are applied to the input, augmenting the dimensions of the weights so that each W has size ((NumHeads x ProjectionDim), EmbeddingDim).

All these projections are independent, so we can compute them in parallel within the same operation, producing an output whose size is (BatchSize, SeqLength, 3 x NumHeads x ProjectionDim). That is, GluonNLP uses a single FullyConnected layer to compute Q, K, and V.

To compute the attention function (the preceding equation), we first need to compute the dot product QKT. We need to perform this computation independently for each head, with m=SeqLength number of Q rows, n=SeqLength number of K columns, and k=ProjectionDim size of vectors in the dot product. We can use a batched dot product operation, where the number of batches is (BatchSize x NumHeads), to compute all the dot products within the same operation.

However, to perform such an operation in cuBLAS, we need to have the batches and heads dimensions contiguous (in order to have a regular pattern to express distance between batches), but that isn’t the case by default (SeqLength dimension is between them). To avoid rearranging Q, K, and V, GluonNLP transposes the input so that its shape is (SeqLength, BatchSize, EmbeddingDim), and Q, K, and V are directly projected into a tensor with shape (SeqLength, BatchSize, 3 x NumHeads x ProjectionDim).

Moreover, to avoid splitting the joint QKV output, we can compute the projections in an interleaved fashion, allocating continuously the applied weights Wq, Wk, Wv of each individual head. The following diagram depicts the interleaved projection operation, where P is the projection size, and we end with a joint QKV output with shape (SeqLength, BatchSize, NumHeads x 3 x ProjectionDim).

This strategy allows us to compute QKT from a unique joint input tensor with cuBLASGEMMStridedBatched, setting the number of batches to (BatchSize x NumHeads) and the stride to (3 x ProjectionDim). We also use a strided batched GEMM to compute the dot product of V (same stride as before) with the output of the Softmax function. We implemented MXNet operators that deal with this cuBLAS configuration.
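The following NumPy sketch illustrates the shapes and the interleaved QKV layout described above; it only mimics the data layout for clarity, whereas the real computation runs in fused cuBLAS kernels on the GPU:

import numpy as np

S, B, E = 128, 8, 768   # SeqLength, BatchSize, EmbeddingDim
H, P = 12, 64           # NumHeads, ProjectionDim (E == H * P for BERT-base)

x = np.random.randn(S, B, E).astype(np.float32)            # transposed input
w_qkv = np.random.randn(H * 3 * P, E).astype(np.float32)   # interleaved Wq, Wk, Wv per head

# Single joint projection: (S, B, H * 3 * P)
qkv = x @ w_qkv.T

# Interleaved layout: for each head, the 3 x P slice holds [Q | K | V]
qkv = qkv.reshape(S, B, H, 3, P)
q, k, v = qkv[..., 0, :], qkv[..., 1, :], qkv[..., 2, :]    # each (S, B, H, P)

# One batched dot product per (batch, head) pair: scores have shape (B, H, S, S)
scores = np.einsum("sbhp,tbhp->bhst", q, k) / np.sqrt(P)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)                    # softmax over the keys

# Weighted sum of V: back to (S, B, H, P)
context = np.einsum("bhst,tbhp->sbhp", attn, v)
print(context.shape)  # (128, 8, 12, 64)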

True FP16

Since MXNet 1.7, you can compute GEMMs completely in FP16 precision. By default, when the data type is FP16, MXNet sets cuBLAS to internally use FP32 accumulation. You can now set the environment variable MXNET_FC_TRUE_FP16 to 1 to force MXNet to use FP16 as the cuBLAS internal computation type.

Pointwise fusion and prearrangement of MHA weights and bias using a custom graph pass

Finally, the feed-forward part of the model, which follows the self-attention block in each transformer layer, uses the Gaussian Error Linear Unit (GELU) as its activation function. This operation follows a feed-forward FullyConnected operation, which includes bias addition. We use the MXNet functionality of custom graph passes to detach the bias addition from the FullyConnected operation and fuse it with GELU through the pointwise fusion mechanism.

In our custom graph pass for BERT, we also prearrange the weights and bias terms for the multi-head self-attention computation so that we avoid any overhead at runtime. As explained earlier, the weights need to be interleaved, and the bias terms need to be joined into a single tensor. We do this before exporting the model. This strategy shows benefits for small batch sizes.

Conclusion

In this post, we presented an efficient solution for performing BERT inference tasks on EC2 G4 GPU instances. We showed how a set of MXNet optimizations boost GPU performance, achieving speeds up to twice as fast in both question answering and classification tasks.

We have shown that g4dn.xlarge instances offer lower latency (below 4 milliseconds with batch size 1) than any EC2 CPU instance, and g4dn.xlarge is 3.8 times better than c5.24xlarge on average. Finally, g4dn.xlarge offers the best cost per million requests ratio—16 times better than CPU instances (c5.xlarge) on average.

 

 

Acknowledgments

We would like to thank Triston Cao, Murat Guney from NVIDIA, Sandeep Krishnamurthy from Amazon, the Amazon-MXNet team, and the NVIDIA MXNet team for their feedback and support.

Disclaimer

The content and opinions in this post are those of the third-party authors and AWS is not responsible for the content or accuracy of this post.

References

  1. Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
  2. Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

About the Authors

Moises Hernandez Fernandez is an AI DevTech Engineer at NVIDIA. He works on accelerating NLP applications on GPUs. Before joining NVIDIA, he conducted research into brain connectivity, optimizing the analysis of diffusion MRI using GPUs. Moises received a PhD in Neurosciences from the University of Oxford.

Haibin Lin is a former Applied Scientist at Amazon Web Services. He worked on distributed systems, deep learning, and NLP. He is a PPMC member and committer of Apache MXNet, and a major contributor to the GluonNLP toolkit. He finished his M.S. in Computer Science at Carnegie Mellon University, advised by Andy Pavlo. Prior to that, he received a B.Eng. in Computer Science jointly from the University of Hong Kong and Shanghai Jiao Tong University.

Przemyslaw Tredak is a senior developer technology engineer on the Deep Learning Frameworks team at NVIDIA. He is a committer of Apache MXNet and leads the MXNet team at NVIDIA.

Anish Mohan is a Machine Learning Architect at Nvidia and the technical lead for ML/DL engagements with key Nvidia customers in the greater Seattle region. Before Nvidia, he was at Microsoft’s AI Division, working to develop and deploy AI/ML algorithms and solutions.

Read More