Translate video captions and subtitles using Amazon Translate

Video is a highly effective way to educate, entertain, and engage users. Your company might have a large collection of videos that include captions or subtitles. To make these videos accessible to a larger audience, you can provide translated captions and subtitles in multiple languages. In this post, we show you how to create an automated and serverless pipeline to translate captions and subtitles using Amazon Translate, without losing their context during translation.

Captions and subtitles make videos accessible to viewers who are deaf or hard of hearing, give users flexibility in noisy or quiet environments, and assist non-native speakers. Captions and subtitles are normally represented in SRT (.srt) or WebVTT (.vtt) format. SRT stands for SubRip Subtitle and is the most common file format for subtitles and captions. WebVTT stands for Web Video Text Tracks and is becoming a popular format for the same purpose.

Multi-language video subtitling and captioning solution

This solution uses Amazon Translate, a neural machine translation service that delivers fast, high-quality, and affordable language translation. Amazon Translate supports the ability to ignore tags and only translate text content in HTML documents. The following diagram illustrates the workflow of our solution.

The workflow includes the following steps:

  1. Extract caption text from a WebVTT or SRT file and create a delimited text file using an HTML tag.
  2. Translate this delimited file using the asynchronous batch processing capability in Amazon Translate.
  3. Recreate the WebVTT or SRT files using the translated delimited file.

We provide a more detailed architecture in the next section.

Solution architecture

This solution is based on an event-driven and serverless pipeline architecture, and uses managed services so that it’s scalable and cost-effective. The following diagram illustrates the serverless pipeline architecture.

The pipeline contains the following steps:

  1. Users upload one or more caption files in the WebVTT (.vtt) or the SRT (.srt) format to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The upload triggers an AWS Lambda function.
  3. The function extracts text captions from each file, creates a corresponding HTML tag delimited text file, and stores them in Amazon S3.
  4. The function invokes Amazon Translate in batch mode to translate the delimited text files into the target language.
  5. An AWS Step Functions-based job poller polls for the translation job to complete.
  6. The Step Functions workflow sends an Amazon Simple Notification Service (Amazon SNS) notification when the translation is complete.
  7. A Lambda function reads the translated delimited files from Amazon S3, creates the caption files in the WebVTT (.vtt) or SRT (.srt) format with the translated text captions, and stores them back in Amazon S3.

We explain Steps 3–7 in more detail in the following sections.

Convert caption files to delimited files

In this architecture, uploading the trigger file (the name specified by the TriggerFileName parameter) triggers the Lambda function <Stack name>-S3CaptionsFileEventProcessor-<Random string>. The function iterates through the WebVTT and SRT files in the input folder. For each file, it extracts the caption text, converts it into a delimited text file using an HTML (<span>) tag, and places the result in the captions-in folder of the Amazon S3 bucket. See the following function code:

try:
    captions = Captions()
    # Filter only the VTT and SRT files in the input folder for processing
    objs = S3Helper().getFilteredFileNames(bucketName, "input/", ["vtt", "srt"])
    for obj in objs:
        try:
            vttObject = {}
            vttObject["Bucket"] = bucketName
            vttObject["Key"] = obj
            captions_list = []
            # Based on the file type, call the method that converts the file into a Python list object
            if obj.endswith("vtt"):
                captions_list = captions.vttToCaptions(vttObject)
            elif obj.endswith("srt"):
                captions_list = captions.srtToCaptions(vttObject)
            # Convert the text captions in the list object to a delimited file
            delimitedFile = captions.ConvertToDemilitedFiles(captions_list)
            fileName = obj.split("/")[-1]
            newObjectKey = "captions-in/{}.delimited".format(fileName)
            S3Helper().writeToS3(str(delimitedFile), bucketName, newObjectKey)
            output = "Output Object: {}/{}".format(bucketName, newObjectKey)
        except Exception as e:
            logger.error("Error processing {}: {}".format(obj, e))
except Exception as e:
    logger.error("Error processing the input folder: {}".format(e))
The solution uses the Python library webvtt-py to load, parse, and generate the WebVTT and SRT file formats. All operations related to the library are abstracted within the Captions module, and all Amazon S3 operations are abstracted within the S3Helper module.
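
The Captions module is the authoritative implementation; the following minimal sketch only illustrates the general idea of using webvtt-py to read a caption file and join the caption text with an HTML delimiter (the file path is a placeholder):

import webvtt

def captions_to_delimited(path, delimiter="<span>"):
    # Load the caption file with webvtt-py; SRT files are read via from_srt
    parsed = webvtt.read(path) if path.endswith(".vtt") else webvtt.from_srt(path)
    # Join the caption text with the HTML tag so Amazon Translate preserves the boundaries
    return delimiter.join(caption.text for caption in parsed)

print(captions_to_delimited("sample.vtt"))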

Batch translation of delimited files

After the delimited files are stored in the captions-in folder of the Amazon S3 bucket, the Lambda function <Stack name>-S3CaptionsFileEventProcessor-<Random string> starts an Amazon Translate batch translation job (StartTextTranslationJob) with the following parameters:

  • The captions-in folder in the S3 bucket is the input location for files to be translated
  • The captions-out folder in the S3 bucket is the output location for translated files
  • Source language code
  • Destination language code
  • An AWS Identity and Access Management (IAM) role ARN with necessary policy permissions to read and write to the S3 bucket

See the following job code:

translateContext = {}
translateContext["sourceLang"] = sourceLanguageCode
translateContext["targetLangList"] = [targetLanguageCode]
translateContext["roleArn"] = access_role 
translateContext["bucket"] = bucketName
translateContext["inputLocation"] = "captions-in/"
translateContext["outputlocation"] = "captions-out/"
translateContext["jobPrefix"] = "TranslateJob-captions"
#Call Amazon Translate to translate the delimited files in the captions-in folder
jobinfo = captions.TranslateCaptions(translateContext)
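
The TranslateCaptions method wraps the batch translation call. The following is a minimal sketch of the underlying StartTextTranslationJob request that such a wrapper might issue with boto3, reusing the variables from the preceding snippet (the job name suffix is illustrative):

import uuid
import boto3

translate_client = boto3.client("translate")

response = translate_client.start_text_translation_job(
    JobName="TranslateJob-captions-" + str(uuid.uuid4())[:8],
    InputDataConfig={
        "S3Uri": "s3://{}/captions-in/".format(bucketName),
        # text/html tells Amazon Translate to ignore the <span> tags and translate only the text
        "ContentType": "text/html",
    },
    OutputDataConfig={"S3Uri": "s3://{}/captions-out/".format(bucketName)},
    DataAccessRoleArn=access_role,
    SourceLanguageCode=sourceLanguageCode,
    TargetLanguageCodes=[targetLanguageCode],
)
print(response["JobId"], response["JobStatus"])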

Poll the Amazon Translate batch translate job

The solution uses a Step Functions workflow to periodically poll the Amazon Translate service for the status of the submitted job using a Lambda function. When the job is complete, the workflow creates an Amazon SNS notification with details of the Amazon Translate job as the notification payload. For more details on the Step Functions job definition and the Lambda code, see Getting a batch job completion message from Amazon Translate.
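
The linked post contains the full state machine definition and Lambda code; at its core, the poller Lambda performs a status check similar to the following sketch (the job ID comes from the state machine input):

import boto3

translate_client = boto3.client("translate")

def get_job_status(job_id):
    # Return the current status of the batch translation job; the Step Functions
    # workflow loops with a Wait state until this is COMPLETED, FAILED, or STOPPED
    job = translate_client.describe_text_translation_job(JobId=job_id)
    return job["TextTranslationJobProperties"]["JobStatus"]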

Create WebVTT and SRT files from the delimited files

The Amazon SNS notification from the job poller step triggers the Lambda function <Stack name>-TranslateCaptionsJobSNSEventProcessor-<Random string>. The function iterates through each of the translated delimited files generated in the captions-out folder, based on the details available in the Amazon SNS notification event. See the following code:

output = ""
logger.info("request: {}".format(request))
up = urlparse(request["s3uri"], allow_fragments=False)
accountid = request["accountId"]
jobid = request["jobId"]
bucketName = up.netloc
objectkey = up.path.lstrip('/')
basePrefixPath = objectkey + accountid + "-TranslateText-" + jobid + "/"
languageCode = request["langCode"]
logger.debug("Base Prefix Path:{}".format(basePrefixPath))
captions = Captions()
# Filter only the files with the .delimited suffix
objs = S3Helper().getFilteredFileNames(bucketName, basePrefixPath, ["delimited"])
for obj in objs:
    try:
        # Read the delimited file contents
        content = S3Helper().readFromS3(bucketName, obj)
        fileName = FileHelper().getFileName(obj)

The solution generates the WebVTT or SRT file using the original WebVTT or SRT file from the input folder for the time markers, but replaces the captions with the translated caption text from the delimited files. See the following code:

logger.debug("SourceFileKey:{}.processed".format(sourceFileName))
soureFileKey = "input/{}.processed".format(sourceFileName)
vttObject = {}
vttObject["Bucket"] = bucketName
vttObject["Key"] = soureFileKey
captions_list = []
# Based on the file format, call the right method to load the file as a Python object
if fileName.endswith("vtt"):
    captions_list = captions.vttToCaptions(vttObject)
elif fileName.endswith("srt"):
    captions_list = captions.srtToCaptions(vttObject)
# Replace the text captions with the translated content
translatedCaptionsList = captions.DelimitedToWebCaptions(captions_list, content, "<span>", 15)
translatedText = ""
# Recreate the caption files in VTT or SRT format
if fileName.endswith("vtt"):
    translatedText = captions.captionsToVTT(translatedCaptionsList)
elif fileName.endswith("srt"):
    translatedText = captions.captionsToSRT(translatedCaptionsList)

The function then writes the new WebVTT or SRT files as S3 objects in the output folder with the following naming convention: TargetLanguageCode-<inputFileName>.vtt or TargetLanguageCode-<inputFileName>.srt. See the following code:

newObjectKey = "output/{}".format(fileName)
# Write the VTT or SRT file into the output S3 folder
S3Helper().writeToS3(str(translatedText),bucketName,newObjectKey)

Solution deployment

You can deploy the solution either by using an AWS CloudFormation template or by cloning the GitHub repository.

Deployment using the CloudFormation template

The CloudFormation template provisions the resources needed for the solution, including the IAM roles, IAM policies, and Amazon SNS topics. The template creates the stack in the us-east-1 Region.

  1. Launch the CloudFormation template by choosing Launch Stack:

  2. For Stack name, enter a unique stack name for this account; for example, translate-captions-stack.
  3. For SourceLanguageCode, enter the language code for the current language of the caption text; for example, en for English.
  4. For TargetLanguageCode, enter the language code that you want your translated text in; for example, es for Spanish.

For more information about supported languages, see Supported Languages and Language Codes.

  5. For TriggerFileName, enter the name of the file that triggers the translation serverless pipeline (the default is triggerfile).
  6. In the Capabilities and transforms section, select the check boxes to acknowledge that CloudFormation will create IAM resources and transform the AWS Serverless Application Model (AWS SAM) template.

AWS SAM templates simplify the definition of resources needed for serverless applications. When deploying AWS SAM templates in AWS CloudFormation, AWS CloudFormation performs a transform to convert the AWS SAM template into a CloudFormation template. For more information, see Transform.

  7. Choose Create stack.

The stack creation may take up to 10 minutes, after which the status changes to CREATE_COMPLETE. You can see the name of the newly created S3 bucket along with other AWS resources created on the Outputs tab.

Deployment using the GitHub repository

To deploy the solution using GitHub, visit the GitHub repo and follow the instructions in the README.md file. The solution uses AWS SAM to make it easy to deploy in your AWS account.

Test the solution

To test the solution, upload one or more WebVTT (.vtt) or SRT (.srt) files to the input folder. Because this is a batch operation, we recommend uploading multiple files at the same time. The following code shows a sample SRT file:

1
00:00:00,500 --> 00:00:07,000
Hello. My name is John Doe. Welcome to the blog demonstrating the ability to

2
00:00:07,000 --> 00:00:11,890
translate from one language to another using Amazon Translate. Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. 

3
00:00:11,890 --> 00:00:16,320
Neural machine translation is a form of language translation automation that uses deep learning models to deliver more accurate and natural-sounding translation than traditional statistical and rule-based translation algorithms.

4
00:00:16,320 --> 00:00:21,580
The translation service is trained on a wide variety of content across different use cases and domains to perform well on many kinds of content.

5
00:00:21,580 --> 00:00:23,880
Its asynchronous batch processing capability enables you to translate a large collection of text or HTML documents with a single API call.

After you upload all the WebVTT or SRT documents, upload the file that triggers the translation workflow. This file can be a zero-byte file, but the filename should match the TriggerFileName parameter in the CloudFormation stack. The default name for the file is triggerfile.

After a short time (15–20 minutes), check the output folder to see the WebVTT or SRT files with the following naming convention: TargetLanguageCode-<inputFileName>.vtt or TargetLanguageCode-<inputFileName>.srt.

The following snippet shows the SRT file translated into Spanish:

1
00:00:00,500 --> 00:00:07,000
Hola. Mi nombre es John Doe. Bienvenido al blog que demuestra la capacidad de

2
00:00:07,000 --> 00:00:11,890
traducir de un idioma a otro utilizando Amazon Translate. Amazon Translate es un servicio de traducción automática neuronal que ofrece traducción de idiomas rápida, de alta calidad y asequible. 

3
00:00:11,890 --> 00:00:16,320
La traducción automática neuronal es una forma de automatización de la traducción de idiomas que utiliza modelos de aprendizaje profundo para ofrecer una traducción más precisa y natural que los algoritmos de traducción basados en reglas y estadísticas tradicionales. 

4
00:00:16,320 --> 00:00:21,579
El servicio de traducción está capacitado en una amplia variedad de contenido en diferentes casos de uso y dominios para funcionar bien en muchos tipos de contenido. 

5
00:00:21,579 --> 00:00:23,879
Su capacidad de procesamiento por lotes asincrónico le permite traducir una gran colección de documentos de texto o HTML con una sola llamada a la API.

You can monitor the progress of the solution pipeline by checking the Amazon CloudWatch logs generated for each Lambda function that is part of the solution. For more information, see Accessing Amazon CloudWatch logs for AWS Lambda.

To do a translation for a different source-target language combination, you can update the SOURCE_LANG_CODE and TARGET_LANG_CODE environment variables for the <Stack name>-S3CaptionsFileEventProcessor-<Random string> function and trigger the solution pipeline by uploading WebVTT or SRT documents and the trigger file into the input folder.

Conclusion

In this post, we demonstrated how to translate video captions and subtitles in WebVTT and SRT file formats using Amazon Translate asynchronous batch processing. This process applies across industry verticals, including education, media and entertainment, travel and hospitality, healthcare, finance, and law, and to any organization with a large collection of subtitled or captioned video assets that it wants to make available to customers in multiple languages.

You can easily integrate the approach into your own pipelines as well as handle large volumes of caption and subtitle text with this scalable architecture. This methodology works for translating captions and subtitles between over 70 languages supported by Amazon Translate (as of this writing). Because this solution uses asynchronous batch processing, you can customize your machine translation output using parallel data. For more information on using parallel data, see Customizing Your Translations with Parallel Data (Active Custom Translation). For a low-latency, low-throughput solution translating smaller caption files, you can perform the translation through the real-time Amazon Translate API. For more information, see Translating documents with Amazon Translate, AWS Lambda, and the new Batch Translate API. If your organization has a large collection of videos that need to be captioned or subtitled, you can use this AWS Subtitling solution.


About the Authors

Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoor activities and watching documentaries.

 

 

Raju Penmatcha is a Senior AI/ML Specialist Solutions Architect at AWS. He works with education, government, and non-profit customers on machine learning and artificial intelligence related projects, helping them build solutions using AWS. Outside of work, he likes exploring new places.

Read More

Active learning workflow for Amazon Comprehend custom classification models – Part 2

This is the second in a two-part series on Amazon Comprehend custom classification models. In Part 1 of this series, we looked at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In Part 2, we look at real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep the model up-to-date with new data and patterns. You can find Part 1 here.

The Amazon Comprehend custom classification API enables you to easily build custom text classification models using your business-specific labels without learning machine learning (ML). For example, your customer support organization can use custom classification to automatically categorize inbound requests by problem type based on how the customer described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

In Part 1 of this series, we looked at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In this post, we cover the real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep it up to date with new data and patterns.

Solution architecture

This post describes a reference architecture for retraining custom classification models. The architecture comprises real-time classification, feedback pipelines, human review workflows using Amazon Augmented AI (Amazon A2I), preparation of new training data from the human review data, and triggering of the model building flow that we covered in Part 1 of this series.

The following diagram illustrates this architecture covering the last three components. In the following sections, we walk you through each step in the workflow.

Real-time classification

To use custom classification in Amazon Comprehend in real time, you need to create an API that calls the custom classification model endpoint with the text that needs to be classified. This stage is represented by Steps 1–3 in the preceding architecture:

  1. The end user application calls an Amazon API Gateway endpoint with a text that needs to be classified.
  2. The API Gateway endpoint then calls an AWS Lambda function configured to call an Amazon Comprehend endpoint.
  3. The Lambda function calls the Amazon Comprehend endpoint, which returns the unlabeled text classification and a confidence score.
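
The following is a minimal sketch of the classification Lambda function in Step 3, assuming the endpoint ARN is supplied through an environment variable named CLASSIFIER_ENDPOINT_ARN (the variable name is illustrative):

import json
import os
import boto3

comprehend = boto3.client("comprehend")

def lambda_handler(event, context):
    body = json.loads(event["body"])
    # Call the custom classification endpoint with the unlabeled text
    result = comprehend.classify_document(
        Text=body["sentence"],
        EndpointArn=os.environ["CLASSIFIER_ENDPOINT_ARN"],
    )
    # Return the predicted classes and confidence scores to API Gateway
    return {"statusCode": 200, "body": json.dumps(result["Classes"])}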

Feedback collection

When the endpoint returns the classification and the confidence score during the real-time classification, you can send instances with low confidence scores to human review. This type of feedback is called implicit feedback.

  4. The Lambda function sends the implicit feedback to an Amazon Kinesis Data Firehose delivery stream.

The other type of feedback is called explicit feedback, and comes from the application’s end users that use the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send explicit feedback either in real time through an API or a batch process.

  5. End users of the application submit explicit real-time feedback through an API Gateway endpoint.
  6. The Lambda function backing the API endpoint transforms the data into a standard feedback format and writes it to the Kinesis Data Firehose delivery stream.
  7. End users of the application can also submit explicit feedback as a batch file by uploading it to an S3 bucket.
  8. A trigger configured on the S3 bucket invokes a Lambda function.
  9. The Lambda function transforms the data into a standard feedback format and writes it to the delivery stream.
  10. Both the implicit and explicit feedback data get sent to the delivery stream in a standard format. All this data is buffered and written to an S3 bucket.
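
The following sketch shows how a Lambda function might write one standardized feedback record to the delivery stream; the field names and the stream name variable are illustrative, not the solution's exact feedback format:

import json
import boto3

firehose = boto3.client("firehose")

def send_feedback(delivery_stream_name, classifier, sentence, predicted_label, score):
    record = {
        "classifier": classifier,
        "sentence": sentence,
        "predicted_label": predicted_label,
        "score": score,
    }
    # Firehose buffers the records and writes them to the feedback S3 bucket
    firehose.put_record(
        DeliveryStreamName=delivery_stream_name,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )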

Human classification

The human classification stage includes the following steps:

  11. A trigger configured on the feedback bucket in Step 10 invokes a Lambda function.
  12. The Lambda function creates Amazon A2I human review tasks for all the feedback data received.
  13. Workers assigned to the classification jobs log in to the human review portal and either approve the classification by the model or classify the text with the right labels.
  14. After the human review, all these instances are stored in an S3 bucket and used for retraining the models.
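
Creating a human review task amounts to starting an Amazon A2I human loop for each feedback item, as in the following sketch; the loop name prefix and the InputContent keys must match your worker task template and are shown here only as assumptions:

import json
import uuid
import boto3

a2i_runtime = boto3.client("sagemaker-a2i-runtime")

def create_review_task(flow_definition_arn, sentence, predicted_label):
    a2i_runtime.start_human_loop(
        HumanLoopName="classification-review-" + str(uuid.uuid4()),
        FlowDefinitionArn=flow_definition_arn,
        HumanLoopInput={
            "InputContent": json.dumps(
                {"taskObject": sentence, "predictedLabel": predicted_label}
            )
        },
    )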

Retraining workflow

The retraining workflow stage includes the following steps:

  15. A trigger configured on the human-reviewed data bucket in Step 14 invokes a Lambda function.
  16. The function transforms the human-reviewed data payload to a comma-separated training data format, required by Amazon Comprehend custom classification models. After transformation, this data is written to a Firehose delivery stream, which acts as an accumulator.
  17. Depending on the time frame set for retraining models, the delivery stream flushes the data into the training bucket that was created in Part 1 of this series. For this post, we set the buffer conditions to 1 MiB or 60 seconds. For your own use case, you might want to adjust these settings so model retraining occurs according to your time or size requirements. This completes the active learning loop, and starts the Step Functions workflow for retraining models.

Solution overview

The next few sections of the post go over how to set up this architecture in your AWS account. We classify news into four categories: World, Sports, Business, and Sci/Tech, using the AG News dataset for custom classification, and set up the implicit and explicit feedback loop. You need to complete two manual steps:

  1. Create an Amazon Comprehend custom classifier and an endpoint.
  2. Create an Amazon SageMaker private workforce, worker task template, and human review workflow.

After this, you run the provided AWS CloudFormation template to set up the rest of the architecture.

Prerequisites

If you’re continuing from Part 1 of this series, you can skip to the step Create a private workforce, worker task template, and human review workflow.

Create a custom classifier and an endpoint

Before you get started, download the dataset and upload it to Amazon S3. This dataset comprises a collection of news articles and their corresponding category labels. We have created a training dataset called train.csv from the original dataset and made it available for download.

The following screenshot shows a sample of the train.csv file.

After you download the train.csv file, upload it to an S3 bucket in your account for reference during training. For more information about uploading files, see How do I upload files and folders to an S3 bucket?

To create your classifier for classifying news, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. Choose Train classifier.
  3. For Name, enter news-classifier-demo.
  4. Select Using Multi-class mode.
  5. For Training data S3 location, enter the path for train.csv in your S3 bucket, for example, s3://<your-bucketname>/train.csv.
  6. For Output data S3 location, enter the S3 bucket path where you want the output, such as s3://<your-bucketname>/.
  7. For IAM role, select Create an IAM role.
  8. For Permissions to access, choose Input and output (if specified) S3 bucket.
  9. For Name suffix, enter ComprehendCustom.

  10. Scroll down and choose Train classifier to start the training process.

The training takes some time to complete. You can either wait to create an endpoint or come back to this step later after finishing the steps in the section Create a private workforce, worker task template, and human review workflow.

Create a custom classifier real-time endpoint

To create your endpoint, complete the following steps:

  1. On the Amazon Comprehend console, choose Custom Classification.
  2. From the Classifiers list, select the custom model for which you want to create the endpoint (news-classifier-demo).
  3. On the Actions drop-down menu, choose Create endpoint.
  4. For Endpoint name, enter classify-news-endpoint and give it one inference unit.
  5. Choose Create endpoint.
  6. Copy the endpoint ARN as shown in the following screenshot. You use it when running the CloudFormation template in a future step.

Create a private workforce, worker task template, and human review workflow

This section walks you through creating a private workforce in SageMaker, a worker task template, and your human review workflow.

Create a labeling workforce

For this post, you create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console).

After the user accepts the invitation, you add them to the workforce. For instructions, see the Add a Worker to a Work Team section of Manage a Workforce (Amazon SageMaker Console).

Create a worker task template

To create a worker task template, complete the following steps:

  1. On the Amazon A2I console, choose Worker task templates.
  2. Choose Create template.
  3. For Template name, enter custom-classification-template.
  4. For Template type, choose Custom.
  5. In the Template editor, enter the following GitHub UI template code.
  6. Choose Create.

Create a human review workflow

To create your human review workflow, complete the following steps:

  1. On the Amazon A2I console, choose Human review workflows.
  2. Choose Create human review workflow.
  3. For Name, enter classify-workflow.
  4. Create an S3 bucket to store the human review output. Make a note of this bucket, because we use it later in this post.
  5. Specify an S3 bucket to store output: s3://<your bucketname>/. Use the bucket created earlier.
  6. For IAM role, select Create a new role.
  7. For Task type, choose Custom.
  8. Under Worker task template creation, select the custom classification template you created.
  9. For Task description, enter Read the instructions and review the document.
  10. Under Workers, select Private.
  11. Use the drop-down list to choose the private team that you created.
  12. Choose Create.
  13. Copy the workflow ARN (see the following screenshot). You will use it when initializing the CloudFormation template parameters in a later step.

Deploy the CloudFormation template to set up active learning feedback

Now that you have completed the manual steps, you can run the CloudFormation template to set up this architecture’s building blocks, including the real-time classification, feedback collection, and the human classification.

Before deploying the CloudFormation template, make sure you have the following to pass as parameters:

  • Custom classifier endpoint ARN
  • Amazon A2I workflow ARN
  1. Choose Launch Stack:

  2. For BucketFromBlogPart1, enter the name of the bucket that was created for storing training data in Part 1 of this blog series. Set this parameter only if you're continuing from Part 1 of this series.
  3. For ComprehendEndpointParameterKey, enter /<<StackName of Part1 Blog>>/CURRENT_CLASSIFIER_ENDPOINT. Set this parameter only if you're continuing from Part 1; it can be found in the Parameter Store section of Systems Manager.
  4. For ComprehendEndpointARN, enter the endpoint ARN of your Amazon Comprehend custom classification model. You're not required to set this parameter if you're continuing from Part 1.
  5. For HumanReviewWorkflowARN, enter the workflow ARN you copied.
  6. For ComrehendClassificationScoreThreshold, enter 0.5, which means a 50% threshold for low confidence scores.

  7. Choose Next until you reach the Capabilities and transforms section.
  8. Select the check box to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.

For more information about these resources, see AWS IAM resources.

  9. Choose Create stack.

Wait until the status of the stack changes from CREATE_IN_PROGRESS to CREATE_COMPLETE.

  10. On the Outputs tab of the stack (see the following screenshot), copy the values for BatchUploadS3Bucket, FeedbackAPIGatewayID, and TextClassificationAPIGatewayID to interact with the feedback loop.

Both the TextClassificationAPI and FeedbackAPI require an API key to interact with them. The CloudFormation stack output ApiGWKey refers to the name of the API key. As of this writing, this API key is associated with a usage plan that allows 2,000 requests per month.

  11. On the API Gateway console, choose either the TextClassificationAPI or the FeedbackAPI.
  12. In the navigation pane, choose API Keys.
  13. Expand the API key section and copy the value.

You can manage the usage plan by following the instructions on Create, configure, and test usage plans with the API Gateway console.

You can also add fine-grained authentication and authorization to your APIs. For more information on securing your APIs, see Controlling and managing access to a REST API in API Gateway.

Enable the trigger to start the retraining workflow

The last step of the process is to add a trigger to the S3 bucket that we created earlier to store the human-reviewed output. The trigger invokes the Lambda function that begins the payload transformation from the Amazon A2I human review output format to a CSV format required for training Amazon Comprehend custom classification models.

  1. Open the Lambda function HumanReviewTrainingDataTransformerFunction, created by running the CloudFormation template.
  2. In the Trigger configuration section, choose S3.
  3. For Bucket, enter the bucket you created earlier in Step 4 of the Create a human review workflow section.

Test the feedback loop

In this section, we walk you through testing your feedback loop, including real-time classification, implicit and explicit feedback, and human review tasks.

Real-time classification

To interact with and test these APIs, you need to download Postman.

The API Gateway endpoint receives an unlabeled text document from a client application and internally calls the custom classification endpoint, which returns the predicted label and a confidence score.

  1. Open Postman, enter the TextClassificationAPIGateway URL, and select the POST method.
  2. In the Headers section, configure the API key: x-api-key : << Your API key >>.
  3. In the text field, enter the following JSON code (make sure you have JSON selected and enable raw):
    {"classifier":"<your custom classifier name>", "sentence":"MS Dhoni retires and a billion people had mixed feelings."}

  4. Choose Send.
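
If you prefer to call the API programmatically instead of through Postman, the following sketch uses the Python requests library; the invoke URL shown is a placeholder, so take the actual URL for your stage and resource from the API Gateway console:

import requests

api_url = "https://<TextClassificationAPIGatewayID>.execute-api.us-east-1.amazonaws.com/<stage>/<resource>"
headers = {"x-api-key": "<Your API key>", "Content-Type": "application/json"}
payload = {
    "classifier": "news-classifier-demo",
    "sentence": "MS Dhoni retires and a billion people had mixed feelings.",
}

response = requests.post(api_url, headers=headers, json=payload)
print(response.json())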

You get a response back with a confidence score and class, as seen in the following screenshot.

Implicit feedback

When the endpoint returns the classification and the confidence score during the real-time classification, you can route all the instances where the confidence score doesn’t meet the threshold to human review. This type of feedback is called implicit feedback. For this post, we set the threshold to 0.5 as an input to the CloudFormation stack parameter.

You can change this threshold when deploying the CloudFormation template based on your needs.

Explicit feedback

The explicit feedback comes from the end users of the application that uses the custom classification feature. This type of feedback comprises the instances of text where the user wasn’t happy with the prediction. You can send explicit feedback on the model’s predicted labels through the following methods:

  • Real time through an API, which is usually triggered through a like/dislike button on a UI
  • Batch process, where a file with a collection of misclassified utterances is put together based on a user survey conducted by the customer outreach team

Invoke the explicit real-time feedback loop

To test the Feedback API, complete the following steps:

  1. Open Postman, enter the FeedbackAPIGatewayID value from your CloudFormation stack output as the URL, and select the POST method.
  2. In the Headers section, configure the API key: x-api-key : << Your API key >>.
  3. In the text field, enter the following JSON code (for classifier, enter the classifier you created, such as news-classifier-demo, and make sure you have JSON selected and enable raw):
    {"classifier":"<your custom classifier name>","sentence":"Sachin is Indian Cricketer."}

  4. Choose Send.

We recommend that you submit at least four test samples that will result in a confidence score lower than your set threshold.

Submit explicit feedback as a batch file

Download the following test feedback JSON file, populate it with your data, and upload it into the BatchUploadS3Bucket created when you deployed your CloudFormation template. We recommend that you submit at least four feedback entries in this file. The following code shows some sample data in the file:

{
   "classifier":"news-classifier-demo",
   "sentences":[
      "US music firms take legal action against 754 computer users alleged to illegally swap music online.",
      "A gamer spends $26,500 on a virtual island that exists only in a PC role-playing game."
   ]
}

Uploading the file triggers the Lambda function that starts your human review loop.

Human review tasks

All the feedback collected through the implicit and explicit methods is sent for human classification. The labeling workforce can include Amazon Mechanical Turk, private teams, or AWS Marketplace vendors. For this post, we create a private workforce. The URL to the labeling portal is located on the SageMaker console, on the Labeling workforces page, on the Private tab.

After you log in, you can see the human review tasks assigned to you. Select the task to complete and choose Start working.

You see the tasks displayed based on the worker template used when creating the human workflow.

After you complete the human classification and submit the tasks, the human-reviewed data is stored in the S3 bucket you configured when creating the human review workflow. This bucket is located under Output location on the workflow details page.

This human-reviewed data is used to retrain the custom classification model to learn newer patterns and improve its overall accuracy. The following screenshot shows the human-annotated output file output.json in the S3 bucket.

This human-reviewed data is then converted to a custom classification model training data format, and transferred to the training bucket that was created in Part 1 of this series, which starts the Step Functions workflow for retraining models. The process of retraining the models with human-reviewed data, selecting the best model, and automatically deploying the new endpoints completes the active learning workflow.

Cleanup

To remove all resources created throughout this process and prevent additional costs, complete the following steps:

  1. On the Amazon S3 console, delete the S3 bucket that contains the training dataset.
  2. On the Amazon Comprehend console, delete the endpoint and the classifier.
  3. On the Amazon A2I console, delete the human review workflow, worker template, and private workforce.
  4. On the AWS CloudFormation console, delete the stack you created. (This removes the resources the CloudFormation template created.)

Conclusion

Amazon Comprehend helps you build scalable and accurate natural language processing capabilities without any ML experience. This post provides a reusable pattern and infrastructure for active learning workflows for custom classification models. The feedback pipelines and human review workflow help the custom classifier learn new data patterns continuously. To learn more about automatic model building, selection, and deployment of custom classification models, you can refer to Active learning workflow for Amazon Comprehend custom classification models – Part 1.

For more information, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.

 

 

Joyson Neville Lewis obtained his master’s in Information Technology from Rutgers University in 2018. He worked as a software/data engineer before diving into the conversational AI domain in 2019, where he works with companies to connect the dots between business and AI using voice and chatbot solutions. Joyson joined Amazon Web Services in February of 2018 as a Big Data Consultant for the AWS Professional Services team in NYC.

Read More

Active learning workflow for Amazon Comprehend custom classification models – Part 1

This is the first in a two-part series on Amazon Comprehend custom classification models. In Part 1 of this series, we look at how to build an AWS Step Functions workflow to automatically build, test, and deploy Amazon Comprehend custom classification models and endpoints. In Part 2, we will look at real-time classification APIs, feedback loops, and human review workflows that help with continuous model training to keep the model up-to-date with new data and patterns. You can find Part 2 here.

The Amazon Comprehend custom classification API enables you to easily build custom text classification models using your business-specific labels without learning ML. For example, your customer support organization can use custom classification to automatically categorize inbound requests by problem type based on how the customer has described the issue. You can use custom classifiers to automatically label support emails with appropriate issue types, route customer phone calls to the right agents, and categorize social media posts into user segments.

For custom classification, you start by creating a training job with a ground truth dataset comprising a collection of text and corresponding category labels. When the job is complete, you have a classifier that can classify any new text into one or more named categories. When the custom classification model classifies a new unlabeled text document, it makes predictions based on what it learned from the training data. Sometimes you may not have a training dataset with varied language patterns, or after you deploy the model, you start seeing completely new data patterns. In these cases, the model may not be able to classify these new data patterns accurately. How can we ensure continuous model training to keep it up to date with new data and patterns?

Feedback loops play a pivotal role in keeping the models up to date. This feedback helps the models learn from their misclassifications and arrive at the right classifications. This process of teaching the models continuously through feedback and deploying them is called active learning.

Solution architecture

In this two-part series, we discuss an architecture pattern that allows you to build an active learning workflow for Amazon Comprehend custom classification models. This first post covers an AWS Step Functions workflow that automates model building, selecting the best model, and deploying an endpoint for the chosen model. The second post describes a workflow comprising real-time classification, feedback pipelines, and human review workflows using Amazon Augmented AI (Amazon A2I).

Step Functions workflow

The following diagram shows the Step Functions workflow for automatic model building, endpoint creation, and deploying Amazon Comprehend custom classification models.

In the following sections, we discuss the six steps in more detail:

  • Model building: Steps 1–2
  • Model selection: Step 3
  • Model deployment: Steps 4–6

Model building

Steps 1–2 in the workflow cover model building, which includes incorporating new data into the ground truth dataset and retraining the model. If the model is being built for the first time, the new dataset is marked as the ground truth dataset and the model selection step is skipped. This new data can come from different sources, including feedback data that was human reviewed and reclassified, as discussed in Part 2 of this series. The new data is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, which starts a workflow that includes merging the new data with the ground truth dataset and starting a custom classification model training job that uses the newly merged dataset.

Model selection

Step 3 covers model selection, which includes testing the newly created model with a validation dataset, computing the test results, comparing the results of the new model with the current model in production, and finally selecting the model that performs best with respect to a chosen metric like accuracy, precision, recall, or F1 score. All these steps are orchestrated using the same Step Functions workflow after the model is built.
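
The CustomClassificationModelSelection Lambda function implements the actual comparison; the following sketch only illustrates the idea of scoring both models on the same validation labels and keeping the better one, using weighted F1 as the example metric:

from sklearn.metrics import f1_score

def is_new_model_better(y_true, y_pred_new, y_pred_current):
    # Score the newly trained classifier and the current production classifier
    # on the same validation set, then promote the new model only if it wins
    new_score = f1_score(y_true, y_pred_new, average="weighted")
    current_score = f1_score(y_true, y_pred_current, average="weighted")
    return new_score > current_score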

Model deployment

Steps 4–6 cover model deployment. If the new model outperforms the current model in production, the Step Functions workflow continues: it creates an endpoint for the newly trained custom classification model, updates the AWS Systems Manager Parameter Store values with the new classifier ARN and endpoint ARN used by the real-time classification API, promotes the newly merged training dataset as the primary training dataset, and deletes the endpoint of the previous production model. If the newer model doesn’t perform well in production, you can roll back to the previous model and endpoint by manually updating the Parameter Store values to refer to the earlier model and endpoint ARNs.
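
These deployment steps map to a handful of API calls, sketched below with boto3; the endpoint name and ARNs are placeholders, the Parameter Store key follows the /<stack name>/CURRENT_CLASSIFIER_ENDPOINT convention referenced in Part 2, and the sketch is not the workflow's exact Lambda code:

import boto3

comprehend = boto3.client("comprehend")
ssm = boto3.client("ssm")

new_classifier_arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier/news-classifier-demo"  # placeholder
old_endpoint_arn = "arn:aws:comprehend:us-east-1:123456789012:document-classifier-endpoint/classify-news-endpoint"  # placeholder

# Create an endpoint for the newly selected classifier
endpoint = comprehend.create_endpoint(
    EndpointName="news-classifier-endpoint-v2",
    ModelArn=new_classifier_arn,
    DesiredInferenceUnits=1,
)

# Point the real-time classification API at the new model via Parameter Store
ssm.put_parameter(
    Name="/<stack name>/CURRENT_CLASSIFIER_ENDPOINT",
    Value=endpoint["EndpointArn"],
    Type="String",
    Overwrite=True,
)

# Delete the endpoint of the previous production model
comprehend.delete_endpoint(EndpointArn=old_endpoint_arn)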

Deploying the AWS CloudFormation template

You can deploy this architecture using the provided AWS CloudFormation template in us-east-1.

  1. Choose Launch Stack:

  2. For Stack Name, enter the name of your CloudFormation stack.
  3. For StepFunctionName, enter the Step Functions name for automatic model building, endpoint creation, and deploying Amazon Comprehend custom classification models (can be left at the default value of ComprehendModelStepFunction).
  4. For TestThresholdParameterName, choose Accuracy, Precision, Recall, or F1score. (For this post, we leave it at the default value of F1 Score).

We use this metric to check if the newer model is better than the previous model.

  5. Choose Next.
  6. Choose Next again.
  7. In the Capabilities and transforms section, select all three check boxes to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.
  8. Choose Create stack.

This process might take 15 minutes or more to complete, and creates the following resources:

  • Systems Manager parameters to store intermediate parameters, like the current classifier endpoint ARN
  • An S3 bucket for the custom classification model training data
  • An S3 bucket for the custom classification model prediction job on the test data
  • Amazon DynamoDB tables for model predictions on the test data and status of custom classification prediction jobs
  • AWS Lambda functions:
    • StartCustomClassificationModelBuilding – Starts the custom classification model building
    • GetCustomClassificationModelStatus – Gets the status of the model building stage
    • StartCustomClassificationJob – Starts the prediction job using the test dataset
    • GetCustomClassificationJobStatus – Gets the status of the prediction job
    • CustomClassificationModelSelection – Starts the custom classification model selection stage
    • StartCustomClassificationEndpointBuilding – Starts the custom classification model endpoint building stage
    • GetCustomClassificationEndpointStatus – Gets the status of model endpoint building stage
    • DeleteCustomClassificationEndpoint – Deletes the old model endpoint
  • Step Functions to automate the workflow of building, testing, selecting, and deleting the models
  • IAM roles:
    • A Lambda execution role to run Amazon Comprehend custom classification jobs
    • A Lambda execution role to trigger Step Functions
    • A Step Functions role to trigger Lambda functions
    • An Amazon Comprehend data access role to give Amazon Comprehend access to the training data in the S3 bucket

Testing

To test this solution, you can use your own training dataset or download the news dataset and upload it to Amazon S3. The news dataset comprises a collection of news articles and their corresponding category labels.

  1. Find the S3 bucket created by the CloudFormation stack to store the training data. You can find it in the Resources section of the stack by looking up ComprehendInputDataS3Bucket.
  2. On the Amazon S3 console, inside the input data bucket, create a folder named train.
  3. Upload the training data to the train folder.
  4. On the Step Functions console, choose the new state machine you created.
  5. In the Executions section, choose the latest run.

The following screenshot shows the Graph inspector view. On the Details tab, you can check that Step Functions ran successfully.

It takes approximately 1 hour for the Step Functions state machine to complete.

  6. After the Step Functions state machine has run successfully, on the Systems Manager console, in the navigation pane, under Application Management, choose Parameter Store.

You can check the updated classifier and endpoint from the Parameter Store.

Cleaning up

To avoid incurring any charges in the future, delete the CloudFormation stack. This removes all the resources you created as part of this post.

Conclusion

Active learning in custom classification models ensures that your models are kept up to date with new data and patterns. This two-part series provides you with a reference architecture to build an active learning workflow comprising real-time classification APIs, feedback loops, a human review workflow, model building, model selection, and model deployment. For more information about the feedback loops and human review workflow, see the second part of this blog series, Active learning workflow for Amazon Comprehend custom classification models – Part 2.

For more information about custom classification in Amazon Comprehend, see Custom Classification. You can discover other Amazon Comprehend features and get inspiration from other AWS blog posts about how to use Amazon Comprehend beyond classification.


About the Authors

Shanthan Kesharaju is a Senior Architect in the AWS ProServe team. He helps our customers with AI/ML strategy, architecture, and developing products with a purpose. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.

 

 

Marty Jiang is a Conversational AI Consultant with AWS Professional Services. Outside of work, he loves spending time outdoors with his family and exploring new technologies.

Read More

Introducing a new API to stop in-progress workflows in Amazon Forecast

Amazon Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.

To start generating forecasts through Forecast, you follow three steps: import your data, train and evaluate a predictor, and generate forecasts. Starting today, you can stop an in-progress Forecast resource workflow if you have mistakenly started a job or misconfigured a workflow before starting, giving you more flexibility to manage your Forecast workflows and to experiment.

Previously, because you couldn’t stop APIs in progress, you had to wait for the job to complete and would incur charges for the job. You can now easily stop the following Forecast resource workflows: dataset import, predictor training, predictor backtest export, forecast generation, and forecast export.

In this post, we walk through the steps to stop workflows on the Forecast console. To review the steps through the APIs, refer to the following notebook in our GitHub repo.
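
The notebook covers the API flow in detail; at its simplest, stopping a workflow is a single StopResource call with the ARN of the in-progress resource, as in the following minimal boto3 sketch (the ARN is a placeholder):

import boto3

forecast = boto3.client("forecast")

# Stop an in-progress resource workflow, for example a predictor training job
forecast.stop_resource(
    ResourceArn="arn:aws:forecast:us-east-1:123456789012:predictor/my_predictor"
)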

Stop a resource job that is importing your datasets

You have two options to stop importing a dataset. One method is via the dataset details page on the Forecast console. On the Datasets page, choose your dataset and in the Dataset imports section, select the import job that you want to stop and choose Stop.

You can also stop a data import job from the job details page. In the Data imports section of your dataset, choose the import job to go to its details page. Then choose Stop.

Stop a resource job that is training a predictor

You have two options to stop a resource job that is training your predictor. One method is on the Predictors page for your dataset group, where you can select a predictor and choose Stop.

Alternatively, you can select the predictor and choose View details. Here you can stop the resource job that is training the predictor by choosing Stop.

Stop a resource job that is exporting backtest forecasts

Backtest forecasts are the forecasted values from the Forecast internal testing method of splitting the data into training and backtest data groups to compare forecasts versus observed data. When training a model, Forecast automatically splits the historical demand datasets into training and backtesting dataset groups. Forecast trains a model on the training dataset and forecasts at different specified stocking levels for the backtesting period, comparing to the observed values in the backtesting dataset group.

To stop a resource that is exporting these backtest results, select a predictor on the Predictors page of your dataset group. In the Predictor backtest exports section, select an export and choose Stop.

Stop a resource job that is generating forecasts

You have two options to stop a resource job that is generating your forecasts. One method is on the Forecasts page of the dataset group, where you can select a forecast and choose Stop.

Alternatively, you can select a forecast and choose View details. You can then stop the resource job that is generating the forecast by choosing Stop.

Stop a resource job that is exporting forecasts

Lastly, you can stop a resource job that is exporting your forecasts. You have two options to do so. One option is to select the forecast export job listed in the Forecast details section and choose Stop.

The second option is to choose the export job to view its details, and then choose Stop.

Important considerations

Stopping a resource halts the resource job workflow but doesn’t delete the resource. All your resources are still retained, and you can continue to call the Describe operation or access them as part of List APIs. After a resource is marked for stopping, it doesn’t count towards your Max Parallel In Progress limits. If you’re already at the limit, it allows you to submit a new job.

We allow only three resources of a given resource type at any time to be in the Stopping state, and you have to wait for one of the resources to go into the STOPPED state before you can stop more resources. After you initiate a stop, you can’t cancel it. You also can’t resume a stopped job. When you stop a predictor training or forecast generation job, you’re billed for the resources used up to the point when the job stopped.

Conclusion

You now have more flexibility in managing your Forecast workflows with the ability to stop in-progress resource workflows that may have been started unintentionally. To get started with this capability, see Stopping Resources and go through the notebook in our GitHub repo that walks you through how to use the Forecast Stop Resource APIs. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.


About the Authors

Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code and low-code ML services. Outside of AWS, she frequently advises startups and is raising a puppy named Imli.

 

 

Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end users. In her free time, she enjoys playing Sudoku and Minesweeper.

 

 

Punit Jain works as SDE on the Amazon Forecast team. His current work includes building large-scale distributed systems to solve complex machine learning problems with high availability and low latency as a major focus. In his spare time, he enjoys hiking and cycling.

 

 

Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.

Read More

Multimodal deep learning approach for event detection in sports using Amazon SageMaker

Have you ever thought about how artificial intelligence could be used to detect events during live sports broadcasts? With machine learning (ML) techniques, we introduce a scalable multimodal solution for event detection on sports video data. Recent developments in deep learning show that event detection algorithms are performing well on sports data [1]; however, they’re dependent upon the quality and amount of data used in model development. This post explains a deep learning-based approach developed by the Amazon Machine Learning Solutions Lab for sports event detection using Amazon SageMaker. This approach minimizes the impact of low-quality data in terms of labeling and image quality while improving the performance of event detection. Our solution uses a multimodal architecture utilizing video, static images, audio, and optical flow data to develop and fine-tune a model, followed by boosting and a postprocessing algorithm.

We used sports video data that included static 2D images, frames over time, and audio, which enabled us to train separate models in parallel. The outlined approach also enhances event detection performance by consolidating the models' outputs into one decision-maker using a boosting technique.

In this post, we first give an overview of the data. We then explain the preprocessing workflow, modeling strategy, and postprocessing, and finally present the results.

Dataset

In this exploratory research study, we used the Sports-1 Million dataset [2], which includes 400 classes of short video clips of sports. The videos include the audio channel, enabling us to extract audio samples for multimodal model development. Among the sports in the dataset, we selected the most frequently occurring sports based on their number of data samples, resulting in 89 sports.

We then consolidated similar sports into shared categories, resulting in 25 overall classes. The final list of sports selected for modeling is:

['americanfootball', 'athletics', 'badminton', 'baseball', 'basketball', 'bowling', 'boxing', 'cricket', 'cycling', 'fieldhockey', 'football', 'formula1', 'golf', 'gymnastics', 'handball', 'icehockey', 'lacrosse', 'rugby', 'skiing', 'soccer', 'swimming', 'tabletennis', 'tennis', 'volleyball', 'wrestling']

The following graph shows the number of video samples per sports category. Each video is cut into 1-second intervals.

Data processing pipeline

The temporal modeling in this solution uses 1-second-long video clips. Therefore, we first extracted 1-second clips from each data example. The average length of videos in the dataset is around 20 seconds, resulting in approximately 190,000 1-second video clips. We passed each second-level clip through a frame extraction pipeline and, depending on the frames per second (fps) of the clip, extracted the corresponding number of frames and stored them in an Amazon Simple Storage Service (Amazon S3) bucket. The total number of frames extracted was around 3.8 million. We used multiprocessing on a SageMaker notebook using an Amazon Elastic Compute Cloud (Amazon EC2) ml.c5.large instance with 64 cores to parallelize the I/O-heavy clip extraction, which reduced the extraction time from hours to minutes.
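
The extraction code isn't shown in this post; the following is a minimal sketch, assuming OpenCV and Python's multiprocessing, of how each video can be split into per-second groups of frames. The paths, JPEG output format, and pool size are illustrative.

```python
import glob
import os
from multiprocessing import Pool

import cv2  # pip install opencv-python

def extract_frames_per_second(video_path, out_root="frames"):
    """Split one video into 1-second groups of frames (one folder per second)."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS))) or 30  # fall back if fps metadata is missing
    name = os.path.splitext(os.path.basename(video_path))[0]
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        second = frame_idx // fps  # which 1-second clip this frame belongs to
        clip_dir = os.path.join(out_root, f"{name}_sec{second:04d}")
        os.makedirs(clip_dir, exist_ok=True)
        cv2.imwrite(os.path.join(clip_dir, f"frame{frame_idx % fps:03d}.jpg"), frame)
        frame_idx += 1
    cap.release()

if __name__ == "__main__":
    videos = glob.glob("videos/*.mp4")  # local copies, e.g. downloaded from Amazon S3
    with Pool(processes=16) as pool:    # parallelize the I/O-heavy extraction
        pool.starmap(extract_frames_per_second, [(v,) for v in videos])
```

The resulting per-second frame folders can then be uploaded back to Amazon S3 for training.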

To train the ML algorithms, we split the data using stratified sampling on the original clips, which prevented potential information leakage down the pipeline. In a classification setting, stratifying helps ensure that the training, validation, and test sets have approximately the same percentage of samples of each target class as the complete set. We split the data into 80/10/10 portions for training, validation, and test sets, respectively. We then reflected this splitting pattern at the level of the 1-second video clips and the corresponding extracted frames.
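
The post doesn't specify the splitting tool; one simple way to do a stratified split at the original-video level is scikit-learn's train_test_split, sketched below with synthetic placeholders for the per-video IDs and labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical per-video metadata: one ID and one class label per original video
video_ids = np.arange(1000)
labels = np.random.randint(0, 25, size=1000)  # 25 sport classes

# First carve off 20% of the videos, then split that 20% in half -> 80/10/10 overall
train_ids, rest_ids, train_y, rest_y = train_test_split(
    video_ids, labels, test_size=0.20, stratify=labels, random_state=42
)
val_ids, test_ids, _, _ = train_test_split(
    rest_ids, rest_y, test_size=0.50, stratify=rest_y, random_state=42
)

# The split is then propagated to the 1-second clips and extracted frames by
# video ID, so no clip or frame from one video appears in more than one set
```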

Next, we fine-tuned the ResNet50 architecture using the extracted frames. Additionally, we trained a ResNet50 architecture using dense optical flow features extracted from the frames of each 1-second clip. Finally, we extracted audio features from the 1-second clips and implemented an audio model. Each approach represents a modality in the final multimodal technique. The following diagram illustrates the architecture of the data processing pipeline.

The rest of this section details each modality.

Computer vision

We used two separate computer vision-based approaches to fit the data. First, we used the ResNet50 architecture to fine-tune a multi-class classification algorithm on RGB frames. Second, we used the ResNet50 architecture with the same fine-tuning strategy on optical flow frames. ResNet50 is a strong, widely used image classifier that has been remarkably successful in business applications.

We used a two-step fine-tuning approach: we first unfroze the last layer, added two flattened layers to the network, and fine-tuned for 10 epochs; we then saved the weights of this model, unfroze all the layers, and trained the entire network on the sports data for 30 epochs. We used TensorFlow with Horovod for training on AWS Deep Learning AMI (DLAMI) instances. You can also use SageMaker Pipe mode to set up Horovod.
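
The exact layer configuration isn't listed in this post, so the following Keras sketch only illustrates the two-stage pattern: train a new classification head on a frozen backbone first, then unfreeze everything and fine-tune end to end. The head sizes, learning rates, and synthetic placeholder dataset are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 25

# Placeholder input pipeline; substitute tf.data pipelines over the extracted frames
images = tf.random.uniform([64, 224, 224, 3])
onehot = tf.one_hot(tf.random.uniform([64], maxval=NUM_CLASSES, dtype=tf.int32), NUM_CLASSES)
train_ds = tf.data.Dataset.from_tensor_slices((images, onehot)).batch(8)

# Stage 1: freeze the ImageNet backbone and train only the new classification head
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      input_shape=(224, 224, 3))
base.trainable = False
x = layers.GlobalAveragePooling2D()(base.output)
x = layers.Dense(512, activation="relu")(x)                  # hypothetical head
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(base.input, outputs)

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=10)
model.save_weights("stage1.weights.h5")

# Stage 2: unfreeze all layers and fine-tune the entire network at a lower learning rate
base.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=30)
```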

Horovod, an open-source framework for distributed deep learning, is available for use with most popular deep learning toolkits, like TensorFlow, Keras, PyTorch, and Apache MXNet. It uses the all-reduce algorithm for fast distributed training rather than using a parameter server approach, and it includes multiple optimization methods to make distributed training faster.
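
For context, the typical Horovod-with-Keras wiring looks like the following sketch; this isn't the exact training script used in this work, and the learning rate scaling is the library's commonly recommended pattern rather than a value from the post.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()  # one process per GPU, launched with horovodrun or mpirun

# Pin each worker process to a single GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Scale the learning rate with the number of workers and wrap the optimizer so that
# gradients are averaged with ring all-reduce instead of through a parameter server
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.compile(optimizer=opt, ...) and model.fit(..., callbacks=callbacks) as usual
```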

Since completing this project, SageMaker has introduced a new data parallelism library optimized for AWS, which allows you to use existing Horovod APIs. For more information, see New – Managed Data Parallelism in Amazon SageMaker Simplifies Training on Large Datasets.

Optical flow

For the second modality, we used an optical flow approach. Training a classifier such as ResNet50 on individual image frames only addresses relationships among objects within the same frame and disregards time information. A model trained this way assumes that frames are independent and unrelated.

To capture the relationships between consecutive frames, such as for recognizing human actions, we can use optical flow. Optical flow is the motion of objects between consecutive frames of a sequence, caused by the relative movement between the objects and the camera. We ran a dense optical flow algorithm on the images extracted from each 1-second clip, using OpenCV's implementation of Gunnar Farnebäck's algorithm, which is described in his 2003 article "Two-Frame Motion Estimation Based on Polynomial Expansion" [3].
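
As a sketch of this step, the following uses OpenCV's calcOpticalFlowFarneback on consecutive frames of a 1-second clip and encodes the flow field as an HSV image (hue for direction, value for magnitude). The parameter values are typical defaults and the HSV encoding is a common convention, not necessarily the exact settings used in this work.

```python
import cv2
import numpy as np

def dense_flow_images(frame_paths):
    """Farnebäck dense optical flow between consecutive frames, encoded as BGR images."""
    prev = cv2.cvtColor(cv2.imread(frame_paths[0]), cv2.COLOR_BGR2GRAY)
    flows = []
    for path in frame_paths[1:]:
        nxt = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, nxt, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
        )
        mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        hsv = np.zeros((*prev.shape, 3), dtype=np.uint8)
        hsv[..., 0] = ang * 180 / np.pi / 2                              # direction -> hue
        hsv[..., 1] = 255                                                # full saturation
        hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # magnitude -> value
        flows.append(cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR))
        prev = nxt
    return flows
```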

Audio event detection

ML-based audio modeling formed the third stream of our multimodal event detection solution. Audio samples were extracted from the 1-second videos, resulting in audio segments in M4A format.

To explore the performance of audio models, we extracted two types of features broadly used in digital signal processing from the audio samples: the Mel spectrogram (MelSpec) and Mel-frequency cepstral coefficients (MFCC). A modified version of MobileNet, a state-of-the-art architecture for audio data classification, was employed for the model development [4].

The audio processing pipeline consists of three steps: MelSpec feature extraction, MFCC feature extraction, and MobileNetV2 model development (a feature extraction sketch follows the list):

  • First, MelSpec refers to the spectrogram of an audio segment, obtained with the fast Fourier transform and mapped onto the Mel scale. Because the human auditory system distinguishes frequencies non-linearly, the Mel scale equalizes the perceptual distance between frequency bands audible to a human. For our use case, we computed MelSpec features with 128 points for model development.
  • Second, MFCC is similar to MelSpec, except that a linear cosine transformation is applied to the MelSpec feature; research has shown that such a transformation can improve classification performance for audible sound. We extracted MFCC features with 33 points from the audio data; however, models based on this feature couldn't compete with MelSpec-based models, which is consistent with the observation that MFCC often performs better with sequence models.
  • Finally, we adopted MobileNetV2 as the audio model, initialized it with pre-trained ImageNet weights, and trained it for 100 epochs on our data. MobileNetV2 [5] is a convolutional neural network architecture designed to perform well on mobile devices. It's based on an inverted residual structure, where the residual connections occur between the bottleneck layers.
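
The post doesn't name the audio library used; the following sketch uses librosa to compute the 128-band MelSpec and 33-coefficient MFCC features for a 1-second clip. The sample rate and file path are assumptions, and the log-scaled MelSpec array is what would be fed to the MobileNetV2-style classifier.

```python
import librosa
import numpy as np

def audio_features(path, sr=22050):
    """MelSpec (128 bands) and MFCC (33 coefficients) for a 1-second audio clip."""
    y, sr = librosa.load(path, sr=sr, duration=1.0)         # decodes M4A via ffmpeg/audioread
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
    mel_db = librosa.power_to_db(mel, ref=np.max)            # log-scaled MelSpec
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=33)        # cosine-transformed variant
    return mel_db, mfcc

# Hypothetical usage:
# mel_db, mfcc = audio_features("clips/sample_sec0001.m4a")
```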

Postprocessing

The postprocessing step employs a boosting algorithm to do the following:

  • Obtain video-level performance from the frame level
  • Incorporate the outputs of the three models into the decision-making process
  • Enhance prediction performance using a class-model strategy derived from the validation sets and applied to the test sets

First, the postprocessing module generated 1-second-level predicted classes and their probabilities for the RGB, optical flow, and audio models. We then used a majority voting algorithm to assign the predicted class at the 1-second level during inference.

Next, the 1-second-level computer vision and audio labels were converted to video-level results. The results on the validation sets were compared to create a class-model table that defines which model to trust for each class during multimodal prediction on the test sets.

In the final stage, testing sets were passed through the prediction module, resulting in three labels and probabilities.

In this work, the RGB models resulted in the highest performance for all classes except badminton, where the audio model gave the best performance. The optical flow models didn’t compete with the other two models, although some research has shown that optical flow-based models could generate better results for certain datasets. The final prediction was performed by incorporating all three labels based on the predefined table to output the most probable classes.

The boosting algorithm of the prediction module is described as follows (a sketch of the voting logic appears after the list):

  1. Split videos into 1-second segments.
  2. Extract frames and audio signals.
  3. Prepare RGB frames and MelSpec features.
  4. Pass RGB frames through the trained ResNet50 by RGB samples and obtain prediction labels per frame.
  5. Pass MelSpec features through the trained MobileNet by audio samples and obtain prediction labels for each 1-second audio sample.
  6. Calculate 1-second-level RGB labels and probabilities.
  7. Use a predefined table (obtained from validation results).
  8. If the badminton class appears among the two labels associated with a 1-second sample, vote for the audio model (take the label and probability from the audio model). Otherwise, vote for the RGB model (take the label and probability from the RGB model).
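
The following sketch, referenced above, expresses steps 6–8 in code: a majority vote over per-frame RGB predictions for each 1-second sample, and the class-model table reduced to its single exception in this work. The (label, probability) prediction format is an assumption for illustration.

```python
from collections import Counter

def second_level_rgb_prediction(frame_preds):
    """Majority vote over per-frame (label, probability) RGB predictions for one
    1-second sample; ties are broken by the higher average probability."""
    counts = Counter(label for label, _ in frame_preds)
    top = max(counts, key=lambda lbl: (counts[lbl],
              sum(p for l, p in frame_preds if l == lbl) / counts[lbl]))
    probs = [p for l, p in frame_preds if l == top]
    return top, sum(probs) / len(probs)

def fuse(rgb_pred, audio_pred):
    """Class-model table reduced to its single exception in this work:
    trust the audio model whenever badminton shows up, otherwise trust RGB."""
    if "badminton" in (rgb_pred[0], audio_pred[0]):
        return audio_pred
    return rgb_pred

# Hypothetical predictions for one 1-second sample
frame_preds = [("tennis", 0.7), ("tennis", 0.8), ("badminton", 0.6)]
rgb_pred = second_level_rgb_prediction(frame_preds)   # -> ('tennis', 0.75)
audio_pred = ("badminton", 0.9)
print(fuse(rgb_pred, audio_pred))                     # -> ('badminton', 0.9)
```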

Results

The following graph shows the averaged frame-level F1 scores of the three models against two validation datasets; the error bars represent the standard deviations.

Similarly, the following graph compares the F1 scores for three models per class measured for two testing datasets before postprocessing (average and standard deviation as error bars).

After applying the multimodal prediction module to the testing datasets to convert frame-level and 1-second-level predictions into video-level predictions, we produced the postprocessed video-level metrics (see the following graph), which show a significant improvement from the frame-level single-modality outputs to the video-level multimodal outputs.

As previously mentioned, the class-model table was prepared by comparing the three models on the validation sets.

The analysis demonstrated that the multimodal approach improved multi-class event detection performance by 5.10%, 55.68%, and 34.2% relative to the single RGB, optical flow, and audio models, respectively. In addition, the confusion matrices for the postprocessed testing datasets, shown in the following figures, indicate that the multimodal approach can predict most classes in a challenging 25-class event detection task.

The following figure shows the video-level confusion matrix of the first testing dataset after postprocessing.

The following figure shows the video-level confusion matrix of the second testing dataset after postprocessing.

The modeling workflow explained in this post assumes that the data examples in the dataset are all relevant, are all labeled correctly, and have similar distributions within each class. However, our manual inspection of the data sometimes revealed substantial differences in video footage between samples of the same class. Therefore, one area of improvement that could have a great impact on model performance is to further prune the dataset to include only relevant training examples and to provide better labeling.

We used the multimodal model prediction against the testing dataset to generate the following demo for 25 sports, where the bars demonstrate the probability of each class per second (we called it 1-second-level prediction).

Conclusion

This post outlined a multimodal event detection approach that combines RGB, optical flow, and audio models built on the robust ResNet50 and MobileNet architectures and implemented on SageMaker. The results of this study demonstrated that, by using parallel model development, the multimodal approach improved performance on a challenging 25-class event detection task in sports.

A dynamic postprocessing module enables you to update predictions after new training to enhance the model’s performance against new data.

About Amazon ML Solutions Lab

The Amazon ML Solutions Lab pairs your team with ML experts to help you identify and implement your organization’s highest value ML opportunities. If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.

Disclaimer

Editor’s note: The dataset used in this post is for non-commercial demonstration and exploratory research.

References

[1] Vats, Kanav, Mehrnaz Fani, Pascale Walters, David A. Clausi, and John Zelek. “Event Detection in Coarsely Annotated Sports Videos via Parallel Multi-Receptive Field 1D Convolutions.” In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 882-883. 2020.

[2] Karpathy, Andrej, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. “Large-scale video classification with convolutional neural networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725-1732. 2014.

[3] Farnebäck, Gunnar. “Two-frame motion estimation based on polynomial expansion.” In Scandinavian Conference on Image Analysis, pp. 363-370. Springer, Berlin, Heidelberg, 2003.

[4] Adapa, Sainath. “Urban sound tagging using convolutional neural networks.” arXiv preprint arXiv:1909.12699 (2019).

[5] Sandler, Mark, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. “Mobilenetv2: Inverted residuals and linear bottlenecks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510-4520. 2018.


About the Authors

Saman Sarraf is a Data Scientist at the Amazon ML Solutions Lab. His background is in applied machine learning, including deep learning, computer vision, and time series data prediction.

Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals and helps them accelerate their cloud migration journey and solve their ML problems using state-of-the-art solutions and technologies.

Read More