Organize your machine learning journey with Amazon SageMaker Experiments and Amazon SageMaker Pipelines
Building a machine learning (ML) model is an iterative process that continues until you find a candidate model that performs well and is ready to be deployed. As data scientists iterate through that process, they need a reliable method to easily track experiments to understand how each model version was built and how it performed.
Amazon SageMaker allows teams to take advantage of a broad range of features to quickly prepare, build, train, deploy, and monitor ML models. Amazon SageMaker Pipelines provides a repeatable process for iterating through model build activities, and is integrated with Amazon SageMaker Experiments. By default, every SageMaker pipeline is associated with an experiment, and every run of that pipeline is tracked as a trial in that experiment. Then your iterations are automatically tracked without any additional steps.
In this post, we take a closer look at the motivation behind having an automated process to track experiments with Experiments and the native capabilities built into Pipelines.
Why is it important to keep your experiments organized?
Let’s take a step back for a moment and try to understand why it’s important to have experiments organized for machine learning. When data scientists approach a new ML problem, they have to answer many different questions, from data availability to how they will measure model performance.
At the start, the process is full of uncertainty and is highly iterative. As a result, this experimentation phase can produce multiple models, each created from their own inputs (datasets, training scripts, and hyperparameters) and producing their own outputs (model artifacts and evaluation metrics). The challenge then is to keep track of all these inputs and outputs of each iteration.
Data scientists typically train many different model versions until they find the combination of data transformation, algorithm, and hyperparameters that results in the best performing version of a model. Each of these unique combinations is a single experiment. With a traceable record of the inputs, algorithms, and hyperparameters used by each trial, the data science team can easily reproduce their steps.
Having an automated process in place to track experiments improves the ability to reproduce as well as deploy specific model versions that are performing well. The Pipelines native integration with Experiments makes it easy to automatically track and manage experiments across pipeline runs.
Benefits of SageMaker Experiments
SageMaker Experiments allows data scientists to organize, track, compare, and evaluate their training iterations.
Let’s start first with an overview of what you can do with Experiments:
- Organize experiments – Experiments structures experimentation with a top-level entity called an experiment that contains a set of trials. Each trial contains a set of steps called trial components. Each trial component is a combination of datasets, algorithms, and parameters. You can picture experiments as the top-level folder for organizing your hypotheses, your trials as the subfolders for each group test run, and your trial components as your files for each instance of a test run.
- Track experiments – Experiments allows data scientists to track experiments. SageMaker jobs can be automatically assigned to a trial through simple configuration or via the tracking SDKs.
- Compare and evaluate experiments – The integration of Experiments with Amazon SageMaker Studio makes it easy to produce data visualizations and compare different trials. You can also access the trial data via the Python SDK to generate your own visualization using your preferred plotting libraries.
To learn more about Experiments APIs and SDKs, we recommend the following documentation: CreateExperiment and Amazon SageMaker Experiments Python SDK.
If you want to dive deeper, we recommend looking into the amazon-sagemaker-examples/sagemaker-experiments GitHub repository for further examples.
Integration between Pipelines and Experiments
The model building pipelines that are part of Pipelines are purpose-built for ML and allow you to orchestrate your model build tasks using a pipeline tool that includes native integrations with other SageMaker features as well as the flexibility to extend your pipeline with steps run outside SageMaker. Each step defines an action that the pipeline takes. The dependencies between steps are defined by a directed acyclic graph (DAG) built using the Pipelines Python SDK. You can build a SageMaker pipeline programmatically via the same SDK. After a pipeline is deployed, you can optionally visualize its workflow within Studio.
Pipelines integrates with Experiments by automatically creating an experiment and a trial for every pipeline run before the steps start, unless you specify one or both of these inputs yourself. While running the pipeline’s SageMaker jobs, the pipeline associates the trial with the experiment, and associates with the trial every trial component that is created by the job. Specifying your own experiment or trial programmatically allows you to fine-tune how your experiments are organized.
The workflow we present in this example consists of a series of steps: a preprocessing step to split our input dataset into train, test, and validation datasets; a tuning step to tune our hyperparameters and kick off training jobs to train a model using the XGBoost built-in algorithm; and finally a model step to create a SageMaker model from the best trained model artifact. Pipelines also offers several natively supported step types outside of what is discussed in this post. We also illustrate how you can track your pipeline workflow and generate metrics and comparison charts. Furthermore, we show how to associate the new trial generated to an existing experiment that might have been created before the pipeline was defined.
SageMaker Pipelines code
You can review and download the notebook from the GitHub repository associated with this post. We look at the Pipelines-specific code to understand it better.
Pipelines enables you to pass parameters at run time. Here we define the processing and training instance types and counts at run time with preset defaults:
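The following is a minimal sketch of how such run-time parameters can be declared with the SageMaker Python SDK; the parameter names and default values are illustrative assumptions, not the notebook's exact values.

```python
from sagemaker.workflow.parameters import ParameterInteger, ParameterString

# Run-time pipeline parameters with preset defaults (names and values are illustrative)
processing_instance_type = ParameterString(
    name="ProcessingInstanceType", default_value="ml.m5.xlarge"
)
processing_instance_count = ParameterInteger(name="ProcessingInstanceCount", default_value=1)
training_instance_type = ParameterString(name="TrainingInstanceType", default_value="ml.m5.xlarge")
training_instance_count = ParameterInteger(name="TrainingInstanceCount", default_value=1)
```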
Next, we set up a processing script that downloads and splits the input dataset into train, test, and validation parts. We use `SKLearnProcessor` to run this preprocessing step. To do so, we define a processor object with the instance type and count needed to run the processing job.

Pipelines allows us to achieve data versioning in a programmatic way by using execution-specific variables like `ExecutionVariables.PIPELINE_EXECUTION_ID`, which is the unique ID of a pipeline run. We can, for example, create a unique key for storing the output datasets in Amazon Simple Storage Service (Amazon S3) that ties them to a specific pipeline run. For the full list of variables, refer to Execution Variables.
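The following is a hedged sketch of such a processing step; `role`, `bucket`, `input_data_uri`, and the script name are assumptions defined elsewhere in the notebook.

```python
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.execution_variables import ExecutionVariables
from sagemaker.workflow.functions import Join

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,  # assumed to be defined earlier
    instance_type=processing_instance_type,
    instance_count=processing_instance_count,
)

step_process = ProcessingStep(
    name="PreprocessData",
    processor=sklearn_processor,
    inputs=[ProcessingInput(source=input_data_uri, destination="/opt/ml/processing/input")],
    outputs=[
        ProcessingOutput(
            output_name="train",
            source="/opt/ml/processing/train",
            # Tie the output location to this specific pipeline run
            destination=Join(
                on="/",
                values=["s3:/", bucket, "preprocess",
                        ExecutionVariables.PIPELINE_EXECUTION_ID, "train"],
            ),
        ),
    ],
    code="preprocessing.py",  # hypothetical script name
)
```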
Then we move on to create an estimator object to train an XGBoost model. We set some static hyperparameters that are commonly used with XGBoost:
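A sketch of such an estimator follows; the container version, output path, and hyperparameter values are illustrative assumptions.

```python
from sagemaker import image_uris
from sagemaker.estimator import Estimator

xgb_image_uri = image_uris.retrieve("xgboost", region=region, version="1.2-1")

xgb_estimator = Estimator(
    image_uri=xgb_image_uri,
    role=role,
    instance_type=training_instance_type,
    instance_count=training_instance_count,
    output_path=f"s3://{bucket}/training-output",  # illustrative location
)

# Static hyperparameters commonly used with XGBoost (values are illustrative)
xgb_estimator.set_hyperparameters(
    objective="reg:squarederror",
    num_round=100,
    max_depth=5,
    eta=0.2,
    subsample=0.8,
)
```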
We perform hyperparameter tuning of the models we create by using a `ContinuousParameter` range for `lambda`. Choosing one metric as the objective metric tells the tuner (the instance that runs the hyperparameter tuning jobs) to evaluate the training jobs based on that specific metric. The tuner returns the best combination based on this objective metric, meaning the combination that minimizes the root mean square error (RMSE).
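The following sketch shows how such a tuner and tuning step can be wired up; the parameter range bounds, job counts, and input channel names are assumptions rather than the notebook's exact configuration.

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TuningStep

tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={"lambda": ContinuousParameter(0.01, 10, scaling_type="Logarithmic")},
    max_jobs=10,
    max_parallel_jobs=3,
)

step_tuning = TuningStep(
    name="TuneXGBoost",
    tuner=tuner,
    inputs={
        # The train channel consumes the output of the processing step;
        # a validation channel would typically be added the same way.
        "train": TrainingInput(
            s3_data=step_process.properties.ProcessingOutputConfig.Outputs["train"].S3Output.S3Uri,
            content_type="text/csv",
        ),
    },
)
```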
The tuning step runs multiple trials with the goal of determining the best model among the parameter ranges tested. With the method `get_top_model_s3_uri`, we rank the top 50 performing versions of the model artifact S3 URI and extract only the best-performing version (we specify `k=0` for the best) to create a SageMaker model.
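A minimal sketch of creating the SageMaker model from the best tuning artifact, assuming the estimator image and role defined earlier:

```python
from sagemaker.model import Model
from sagemaker.inputs import CreateModelInput
from sagemaker.workflow.steps import CreateModelStep

best_model = Model(
    image_uri=xgb_image_uri,
    # top_k=0 retrieves the best-performing training job's model artifact
    model_data=step_tuning.get_top_model_s3_uri(top_k=0, s3_bucket=bucket),
    role=role,
)

step_create_model = CreateModelStep(
    name="CreateBestModel",
    model=best_model,
    inputs=CreateModelInput(instance_type="ml.m5.large"),  # illustrative instance type
)
```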
When the pipeline runs, it creates trial components for each hyperparameter tuning job and each SageMaker job created by the pipeline steps.
You can further configure the integration of Pipelines with Experiments by creating a `PipelineExperimentConfig` object and passing it to the pipeline object. Its two parameters define the name of the experiment that will be created and the name of the trial that refers to the whole run of the pipeline.

If you want to associate a pipeline run with an existing experiment, you can pass that experiment's name, and Pipelines associates the new trial with it. You can prevent the creation of an experiment and trial for a pipeline run by setting `pipeline_experiment_config` to `None`.
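For illustration, a `PipelineExperimentConfig` that reuses an existing experiment (the experiment name below is hypothetical) and names each trial after the run ID could look like the following:

```python
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.execution_variables import ExecutionVariables

# Reuse an existing experiment and name each trial after the pipeline execution ID
pipeline_experiment_config = PipelineExperimentConfig(
    experiment_name="customer-churn-experiment",  # hypothetical existing experiment
    trial_name=ExecutionVariables.PIPELINE_EXECUTION_ID,
)
```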
We pass in the instance types and counts as parameters, and chain the preceding steps in order as follows. The pipeline workflow is implicitly defined by the outputs of one step being the inputs of another step.
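A sketch of the pipeline definition, including registering it with an execution role and starting a run; the pipeline name is illustrative.

```python
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="experiments-demo-pipeline",  # hypothetical name
    parameters=[
        processing_instance_type,
        processing_instance_count,
        training_instance_type,
        training_instance_count,
    ],
    steps=[step_process, step_tuning, step_create_model],
    pipeline_experiment_config=pipeline_experiment_config,
)

# Create (or update) the pipeline definition with an execution role, then start a run
pipeline.upsert(role_arn=role)
execution = pipeline.start()
```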
The full-fledged pipeline is now created and ready to go. We add an execution role to the pipeline and start it. From here, we can go to the SageMaker Studio Pipelines console and visually track every step. You can also access the linked logs from the console to debug a pipeline.
A successfully completed pipeline run appears in green in the Studio console. We can obtain the metrics of one trial from a run of the pipeline with the following code:
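For example, a hedged sketch using `ExperimentAnalytics` could look like the following; the experiment name and the pipeline execution ID are placeholders for the values shown on the Experiments console.

```python
from sagemaker.analytics import ExperimentAnalytics

# The trial created by Pipelines is named after the pipeline execution;
# the experiment name and execution ID below are placeholders.
trial_analytics = ExperimentAnalytics(
    experiment_name="experiments-demo-pipeline",
    search_expression={
        "Filters": [
            {"Name": "Parents.TrialName", "Operator": "Equals", "Value": "<pipeline-execution-id>"}
        ]
    },
)
metrics_df = trial_analytics.dataframe()
print(metrics_df.head())
```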
Compare the metrics for each trial component
You can plot the results of hyperparameter tuning in Studio or via other Python plotting libraries. We show both ways of doing this.
Explore the training and evaluation metrics in Studio
Studio provides an interactive user interface where you can generate interactive plots. The steps are as follows:
- Choose Experiments and Trials from the SageMaker resources icon on the left sidebar.
- Choose your experiment to open it.
- Choose (right-click) the trial of interest.
- Choose Open in trial component list.
- Press Shift to select the trial components representing the training jobs.
- Choose Add chart.
- Choose New chart and customize it to plot the collected metrics that you want to analyze. For our use case, choose the following:
- For Data type, select Summary Statistics.
- For Chart type, select Scatter Plot.
- For X-axis, choose `lambda`.
- For Y-axis, choose `validation:rmse_last`.
The new chart appears at the bottom of the window.
You can include more or fewer training jobs by pressing Shift and choosing the eye icon for a more interactive experience.
Analytics with SageMaker Experiments
When the pipeline run is complete, we can quickly visualize how different variations of the model compare in terms of the metrics collected during training. Earlier, we exported all trial metrics to a pandas `DataFrame` using `ExperimentAnalytics`. We can reproduce the plot obtained in Studio by using the Matplotlib library.
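A minimal Matplotlib sketch, assuming the DataFrame columns exported earlier carry the tuned `lambda` value and the final validation RMSE (column names are assumptions):

```python
import matplotlib.pyplot as plt

# Reproduce the Studio scatter plot: tuned lambda value vs. final validation RMSE.
# The column names assume the metrics exported earlier via ExperimentAnalytics.
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(metrics_df["lambda"], metrics_df["validation:rmse - Last"])
ax.set_xlabel("lambda")
ax.set_ylabel("validation:rmse (last)")
ax.set_title("Hyperparameter tuning results")
plt.show()
```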
Conclusion
The native integration between SageMaker Pipelines and SageMaker Experiments allows data scientists to automatically organize, track, and visualize experiments during model development activities. You can create experiments to organize all your model development work, such as the following:
- A business use case you’re addressing, such as creating an experiment to predict customer churn
- An experiment owned by the data science team regarding marketing analytics, for example
- A specific data science and ML project
In this post, we dove into Pipelines to show how you can use it in tandem with Experiments to organize a fully automated end-to-end workflow.
As a next step, you can use these three SageMaker features – Studio, Experiments and Pipelines – for your next ML project.
Suggested readings
- Amazon SageMaker now supports cross-account lineage tracking and multi-hop lineage querying
- Announcing Amazon SageMaker Inference Recommender
- Introducing the Well-Architected Framework for Machine Learning
- Machine Learning Lens: AWS Well-Architected Framework
- Roundup of re:Invent 2021 Amazon SageMaker announcements
About the authors
Paolo Di Francesco is a solutions architect at AWS. He has experience in telecommunications and software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Mario Bourgoin is a Senior Partner Solutions Architect for AWS, an AI/ML specialist, and the global tech lead for MLOps. He works with enterprise customers and partners deploying AI solutions in the cloud. He has more than 30 years of experience doing machine learning and AI at startups and in enterprises, starting with creating one of the first commercial machine learning systems for big data. Mario spends his free time playing with his three Belgian Tervurens, cooking dinner for his family, and learning about mathematics and cosmology.
Ganapathi Krishnamoorthi is a Senior ML Solutions Architect at AWS. Ganapathi provides prescriptive guidance to startup and enterprise customers helping them to design and deploy cloud applications at scale. He is specialized in machine learning and is focused on helping customers leverage AI/ML for their business outcomes. When not at work, he enjoys exploring outdoors and listening to music.
Valerie Sounthakith is a Solutions Architect for AWS, working in the Gaming Industry and with Partners deploying AI solutions. She is aiming to build her career around Computer Vision. In her free time, Valerie enjoys traveling, discovering new food spots, and redecorating her home interiors.
Build taxonomy-based contextual targeting using AWS Media Intelligence and Hugging Face BERT
As new data privacy regulations like the General Data Protection Regulation (GDPR) have come into effect, customers are under increased pressure to monetize media assets while abiding by the new rules. Monetizing media while respecting privacy regulations requires the ability to automatically extract granular metadata from assets like text, images, video, and audio files at internet scale. It also requires a scalable way to map media assets to industry taxonomies that facilitate discovery and monetization of content. This use case is particularly significant for the advertising industry as data privacy rules cause a shift from behavioral targeting using third-party cookies.
Third-party cookies help enable personalized ads for web users, and allow advertisers to reach their intended audience. A traditional solution to serve ads without third-party cookies is contextual advertising, which places ads on webpages based on the content published on the pages. However, contextual advertising poses the challenge of extracting context from media assets at scale, and likewise using that context to monetize the assets.
In this post, we discuss how you can build a machine learning (ML) solution that we call Contextual Intelligence Taxonomy Mapper (CITM) to extract context from digital content and map it to standard taxonomies in order to generate value. Although we apply this solution to contextual advertising, you can use it to solve other use cases. For example, education technology companies can use it to map their content to industry taxonomies in order to facilitate adaptive learning that delivers personalized learning experiences based on students’ individual needs.
Solution overview
The solution comprises two components: AWS Media Intelligence (AWS MI) capabilities for context extraction from content on web pages, and CITM for intelligent mapping of content to an industry taxonomy. You can access the solution’s code repository for a detailed view of how we implement its components.
AWS Media Intelligence
AWS MI capabilities enable automatic extraction of metadata that provides contextual understanding of a webpage’s content. You can combine ML techniques like computer vision, speech to text, and natural language processing (NLP) to automatically generate metadata from text, videos, images, and audio files for use in downstream processing. Managed AI services such as Amazon Rekognition, Amazon Transcribe, Amazon Comprehend, and Amazon Textract make these ML techniques accessible using API calls. This eliminates the overhead needed to train and build ML models from scratch. In this post, you see how using Amazon Comprehend and Amazon Rekognition for media intelligence enables metadata extraction at scale.
Contextual Intelligence Taxonomy Mapper
After you extract metadata from media content, you need a way to map that metadata to an industry taxonomy in order to facilitate contextual targeting. To do this, you build Contextual Intelligence Taxonomy Mapper (CITM), which is powered by a BERT sentence transformer from Hugging Face.
The BERT sentence transformer enables CITM to categorize web content with contextually related keywords. For example, it can categorize a web article about healthy living with keywords from the industry taxonomy, such as “Healthy Cooking and Eating,” “Running and Jogging,” and more, based on the text written and the images used within the article. CITM also provides the ability to choose the mapped taxonomy terms to use for your ad bidding process based on your criteria.
The following diagram illustrates the conceptual view of the architecture with CITM.
The IAB (Interactive Advertising Bureau) Content Taxonomy
For this post, we use the IAB Tech Lab’s Content Taxonomy as the industry standard taxonomy for the contextual advertising use case. By design, the IAB taxonomy helps content creators more accurately describe their content, and it provides a common language for all parties in the programmatic advertising process. The use of a common terminology is crucial because the selection of ads for a webpage a user visits has to happen within milliseconds. The IAB taxonomy serves as a standardized way to categorize content from various sources while also being an industry protocol that real-time bidding platforms use for ad selection. It has a hierarchical structure, which provides granularity of taxonomy terms and enhanced context for advertisers.
Solution workflow
The following diagram illustrates the solution workflow.
The steps are as follows:
- Amazon Simple Storage Service (Amazon S3) stores the IAB content taxonomy and extracted web content.
- Amazon Comprehend performs topic modeling to extract common themes from the collection of articles.
- The Amazon Rekognition object label API detects labels in images.
- CITM maps content to a standard taxonomy.
- Optionally, you can store content to taxonomy mapping in a metadata store.
In the following sections, we walk through each step in detail.
Amazon S3 stores the IAB content taxonomy and extracted web content
We store extracted text and images from a collection of web articles in an S3 bucket. We also store the IAB content taxonomy. As a first step, we concatenate different tiers of the taxonomy to create combined taxonomy terms. This approach helps maintain the taxonomy’s hierarchical structure when the BERT sentence transformer creates embeddings for each keyword. See the following code:
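The following is a minimal pandas sketch of this tier concatenation; the taxonomy file name and tier column names are assumptions.

```python
import pandas as pd

# Load the IAB Content Taxonomy (file name and column names are illustrative)
taxonomy_df = pd.read_csv("iab_content_taxonomy.csv")

# Concatenate the tier columns so each keyword keeps its hierarchical context,
# e.g. "Automotive > Auto Type > Concept Cars"
tier_columns = ["Tier 1", "Tier 2", "Tier 3", "Tier 4"]
taxonomy_df["combined_term"] = taxonomy_df[tier_columns].fillna("").apply(
    lambda row: " > ".join(tier for tier in row if tier), axis=1
)
combined_taxonomy_terms = taxonomy_df["combined_term"].tolist()
```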
The following diagram illustrates the IAB content taxonomy with combined tiers.
Amazon Comprehend performs topic modeling to extract common themes from the collection of articles
With the Amazon Comprehend topic modeling API, you analyze all the article texts using the Latent Dirichlet Allocation (LDA) model. The model examines each article in the corpus and groups keywords into the same topic based on the context and frequency in which they appear across the entire collection of articles. To ensure the LDA model detects highly coherent topics, you perform a preprocessing step prior to calling the Amazon Comprehend API. You can use the gensim library’s CoherenceModel to determine the optimal number of topics to detect from the collection of articles or text files. See the following code:
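A sketch of this preprocessing step with gensim, assuming the articles have already been tokenized into `tokenized_articles` (a list of token lists):

```python
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# tokenized_articles: list of token lists, one per article (prepared upstream)
dictionary = corpora.Dictionary(tokenized_articles)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_articles]

# Evaluate coherence over a range of candidate topic counts (range is illustrative)
coherence_per_k = {}
for num_topics in range(2, 20):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, random_state=1)
    coherence = CoherenceModel(
        model=lda, texts=tokenized_articles, dictionary=dictionary, coherence="c_v"
    ).get_coherence()
    coherence_per_k[num_topics] = coherence

optimal_num_topics = max(coherence_per_k, key=coherence_per_k.get)
```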
After you get the optimal number of topics, you use that value for the Amazon Comprehend topic modeling job. Providing different values for the NumberOfTopics parameter in the Amazon Comprehend StartTopicsDetectionJob operation results in a variation in the distribution of keywords placed in each topic group. An optimized value for the NumberOfTopics parameter represents the number of topics that provide the most coherent grouping of keywords with higher contextual relevance. You can store the topic modeling output from Amazon Comprehend in its raw format in Amazon S3.
The Amazon Rekognition object label API detects labels in images
You analyze each image extracted from all webpages using the Amazon Rekognition DetectLabels operation. For each image, the operation provides a JSON response with all labels detected within the image, coupled with a confidence score for each. For our use case, we arbitrarily select a confidence score of 60% or higher as the threshold for object labels to use in the next step. You store object labels in their raw format in Amazon S3. See the following code:
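A minimal boto3 sketch of this call; the helper name is hypothetical, and the 60% threshold mirrors the description above.

```python
import boto3

rekognition = boto3.client("rekognition")

def detect_image_labels(bucket, image_key, min_confidence=60):
    """Return Rekognition object labels for an image stored in S3 (illustrative helper)."""
    response = rekognition.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": image_key}},
        MinConfidence=min_confidence,
    )
    return [(label["Name"], label["Confidence"]) for label in response["Labels"]]
```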
CITM maps content to a standard taxonomy
CITM compares extracted content metadata (topics from text and labels from images) with keywords on the IAB taxonomy, and then maps the content metadata to keywords from the taxonomy that are semantically related. For this task, CITM completes the following three steps:
- Generate neural embeddings for the content taxonomy, topic keywords, and image labels using Hugging Face’s BERT sentence transformer. We access the sentence transformer model from Amazon SageMaker. In this post, we use the paraphrase-MiniLM-L6-v2 model, which maps keywords and labels to a 384-dimensional dense vector space.
- Compute the cosine similarity score between taxonomy keywords and topic keywords using their embeddings. We also compute cosine similarity between the taxonomy keywords and the image object labels. Cosine similarity serves as the scoring mechanism to find semantically similar matches between the content metadata and the taxonomy (see the sketch after this list).
- Identify pairings with similarity scores above a user-defined threshold and use them to map the content to semantically related keywords on the content taxonomy. In our test, we select all keywords from pairings that have a cosine similarity score of 0.5 or higher. See the following code:
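The following combined sketch covers steps 2 and 3. For illustration it loads the sentence transformer locally with the sentence-transformers library rather than through a SageMaker endpoint, and the `topic_keywords` and `image_labels` lists are assumed to hold the Amazon Comprehend and Amazon Rekognition outputs.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-MiniLM-L6-v2")

# Embed the combined taxonomy terms and the extracted content metadata
taxonomy_embeddings = model.encode(combined_taxonomy_terms, convert_to_tensor=True)
topic_embeddings = model.encode(topic_keywords, convert_to_tensor=True)   # from Amazon Comprehend
label_embeddings = model.encode(image_labels, convert_to_tensor=True)     # from Amazon Rekognition

# Cosine similarity between every taxonomy term and every topic keyword / image label
topic_scores = util.cos_sim(topic_embeddings, taxonomy_embeddings)
label_scores = util.cos_sim(label_embeddings, taxonomy_embeddings)

# Keep only pairings above the user-defined threshold (0.5 in our test)
THRESHOLD = 0.5
mapped_terms = set()
for scores in (topic_scores, label_scores):
    for row in scores:
        for idx, score in enumerate(row):
            if float(score) >= THRESHOLD:
                mapped_terms.add(combined_taxonomy_terms[idx])
```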
A common challenge when working with internet-scale language representation (such as in this use case) is that you need a model that can fit most of the content—in this case, words in the English language. Hugging Face’s BERT transformer has been pre-trained using a large corpus of Wikipedia articles in the English language to represent the semantic meaning of words in relation to one another. You fine-tune the pre-trained model using your specific dataset of topic keywords, image labels, and taxonomy keywords. When you place all embeddings in the same feature space and visualize them, you see that BERT logically represents semantic similarity between terms.
The following example visualizes IAB content taxonomy keywords for the class Automotive represented as vectors using BERT. BERT places Automotive keywords from the taxonomy close to semantically similar terms.
The feature vectors allow CITM to compare the metadata labels and taxonomy keywords in the same feature space. In this feature space, CITM calculates cosine similarity between each feature vector for taxonomy keywords and each feature vector for topic keywords. In a separate step, CITM compares taxonomy feature vectors and feature vectors for image labels. Pairings with cosine scores closest to 1 are identified as semantically similar. Note that a pairing can either be a topic keyword and a taxonomy keyword, or an object label and a taxonomy keyword.
The following screenshot shows example pairings of topic keywords and taxonomy keywords using cosine similarity calculated with BERT embeddings.
To map content to taxonomy keywords, CITM selects keywords from pairings with cosine scores that meet a user-defined threshold. These are the keywords that will be used on real-time bidding platforms to select ads for the webpage’s inventory. The result is a rich mapping of online content to the taxonomy.
Optionally store content to taxonomy mapping in a metadata store
After you identify contextually similar taxonomy terms from CITM, you need a way for low-latency APIs to access this information. In programmatic bidding for advertisements, low response time and high concurrency play an important role in monetizing the content. The schema for the data store needs to be flexible to accommodate additional metadata when needed to enrich bid requests. Amazon DynamoDB can match the data access patterns and operational requirements for such a service.
Conclusion
In this post, you learned how to build a taxonomy-based contextual targeting solution using Contextual Intelligence Taxonomy Mapper (CITM). You learned how to use Amazon Comprehend and Amazon Rekognition to extract granular metadata from your media assets. Then, using CITM you mapped the assets to an industry standard taxonomy to facilitate programmatic ad bidding for contextually related ads. You can apply this framework to other use cases that require use of a standard taxonomy to enhance the value of existing media assets.
To experiment with CITM, you can access its code repository and use it with a text and image dataset of your choice.
We recommend learning more about the solution components introduced in this post. Discover more about AWS Media Intelligence to extract metadata from media content. Also, learn more about how to use Hugging Face models for NLP using Amazon SageMaker.
About the Authors
Aramide Kehinde is a Sr. Partner Solution Architect at AWS in Machine Learning and AI. Her career journey has spanned the areas of Business Intelligence and Advanced Analytics across multiple industries. She works to enable partners to build solutions with AWS AI/ML services that serve customers’ needs for innovation. She also enjoys exploring the intersection of AI and the creative arts and spending time with her family.
Anuj Gupta is a Principal Solutions Architect working with hyper-growth companies on their cloud native journey. He is passionate about using technology to solve challenging problems and has worked with customers to build highly distributed and low latency applications. He contributes to open-source Serverless and Machine Learning solutions. Outside of work, he loves traveling with his family and writing poems and philosophical blogs.
Localize content into multiple languages using AWS machine learning services
Over the last few years, online education platforms have seen an increase in adoption of, and demand for, video-based learning because it offers an effective medium to engage learners. To expand to international markets and address a culturally and linguistically diverse population, businesses are also looking at diversifying their learning offerings by localizing content into multiple languages. These businesses are looking for reliable and cost-effective ways to solve their localization use cases.
Localizing content mainly includes translating original voices into new languages and adding visual aids such as subtitles. Traditionally, this process is cost-prohibitive and manual, and takes a lot of time, including work with localization specialists. With the power of AWS machine learning (ML) services such as Amazon Transcribe, Amazon Translate, and Amazon Polly, you can create a viable and cost-effective localization solution. You can use Amazon Transcribe to create a transcript of your existing audio and video streams, and then translate this transcript into multiple languages using Amazon Translate. You can then use Amazon Polly, a text-to-speech service, to convert the translated text into natural-sounding human speech.
The next step of localization is to add subtitles to the content, which can improve accessibility and comprehension, and help viewers understand the videos better. Creating subtitles for video content can be challenging because the translated speech doesn’t match the original speech timing. Synchronization between audio and subtitles is critical because a mismatch might disconnect the audience from your content. Amazon Polly addresses this challenge by providing speech marks, which you can use to create a subtitle file that can be synced with the generated speech output.
In this post, we review a localization solution using AWS ML services where we use an original English video and convert it into Spanish. We also focus on using speech marks to create a synced subtitle file in Spanish.
Solution overview
The following diagram illustrates the solution architecture.
The solution takes a video file and the target language settings as input and uses Amazon Transcribe to create a transcription of the video. We then use Amazon Translate to translate the transcript to the target language. The translated text is provided as an input to Amazon Polly to generate the audio stream and speech marks in the target language. Amazon Polly returns speech mark output in a line-delimited JSON stream, which contains fields such as time, type, start, end, and value. The value can vary depending on the type of speech mark requested in the input, such as SSML, viseme, word, or sentence. For the purpose of our example, we requested the speech mark type `word`. With this option, Amazon Polly breaks a sentence into its individual words and provides their start and end times in the audio stream. With this metadata, the speech marks are then processed to generate the subtitles for the corresponding audio stream generated by Amazon Polly.
Finally, we use AWS Elemental MediaConvert to render the final video with the translated audio and corresponding subtitles.
The following video demonstrates the final outcome of the solution:
AWS Step Functions workflow
We use AWS Step Functions to orchestrate this process. The following figure shows a high-level view of the Step Functions workflow (some steps are omitted from the diagram for better clarity).
The workflow steps are as follows:
- A user uploads the source video file to an Amazon Simple Storage Service (Amazon S3) bucket.
- The S3 event notification triggers the AWS Lambda function state_machine.py (not shown in the diagram), which invokes the Step Functions state machine.
- The first step, Transcribe audio, invokes the Lambda function transcribe.py, which uses Amazon Transcribe to generate a transcript of the audio from the source video.
The following sample code demonstrates how to create a transcription job using the Amazon Transcribe Boto3 Python SDK:
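A hedged sketch of such a call; the job name, media location, and output prefix are placeholders.

```python
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName=job_name,                       # unique per invocation (placeholder)
    Media={"MediaFileUri": f"s3://{bucket}/{video_key}"},
    MediaFormat="mp4",
    LanguageCode="en-US",
    OutputBucketName=bucket,
    OutputKey=f"{uid}/transcriptionOutput/",             # illustrative prefix
)
```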
After the job is complete, the output files are saved into the S3 bucket and the process continues to the next step of translating the content.
- The Translate transcription step invokes the Lambda function translate.py, which uses Amazon Translate to translate the transcript to the target language. Here, we use the synchronous/real-time translation via the translate_text function:
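A minimal sketch of the call, assuming the transcript text is already loaded into `transcript_text`:

```python
import boto3

translate = boto3.client("translate")

response = translate.translate_text(
    Text=transcript_text,          # transcript produced in the previous step
    SourceLanguageCode="en",
    TargetLanguageCode="es",       # default target language in this solution
)
translated_text = response["TranslatedText"]
```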
Synchronous translation has limits on the document size it can translate; as of this writing, it’s set to 5,000 bytes. For larger document sizes, consider using an asynchronous route of creating the job using start_text_translation_job and checking the status via describe_text_translation_job.
- The next step is a Step Functions Parallel state, where we create parallel branches in our state machine.
- In the first branch, we invoke the Lambda function generate_polly_audio.py to generate our Amazon Polly audio stream. Here we use the start_speech_synthesis_task method of the Amazon Polly Python SDK to trigger the speech synthesis task that creates the Amazon Polly audio. We set the `OutputFormat` to `mp3`, which tells Amazon Polly to generate an audio stream for this API call.
- In the second branch, we invoke the Lambda function generate_speech_marks.py to generate the speech marks output. We again use the start_speech_synthesis_task method but specify `OutputFormat` as `json`, which tells Amazon Polly to generate speech marks for this API call, as sketched after this list.
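The two branches can be sketched as follows; the voice, engine, and S3 locations are illustrative assumptions rather than the solution's exact configuration.

```python
import boto3

polly = boto3.client("polly")

# Branch 1: synthesize the translated text to an MP3 audio stream in S3
polly.start_speech_synthesis_task(
    Text=translated_text,
    OutputFormat="mp3",
    VoiceId="Lupe",                       # a US Spanish voice; adjust for other target languages
    Engine="neural",
    OutputS3BucketName=bucket,
    OutputS3KeyPrefix=f"{uid}/synthesisOutput/",
)

# Branch 2: generate word-level speech marks as line-delimited JSON
polly.start_speech_synthesis_task(
    Text=translated_text,
    OutputFormat="json",
    SpeechMarkTypes=["word"],
    VoiceId="Lupe",
    Engine="neural",
    OutputS3BucketName=bucket,
    OutputS3KeyPrefix=f"{uid}/synthesisOutput/",
)
```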
In the next step of the second branch, we invoke the Lambda function generate_subtitles.py, which implements the logic to generate a subtitle file from the speech marks output.

It uses the Python module in the file webvtt_utils.py. This module has multiple utility functions to create the subtitle file; one such method, `get_phrases_from_speechmarks`, is responsible for parsing the speech marks file. The speech marks JSON structure provides just the start time for each word individually. To create the subtitle timing required for the SRT file, we first create phrases of about n (where n=10) words from the list of words in the speech marks file. Then we write them into the SRT file format, taking the start time from the first word in the phrase, and for the end time using the start time of the (n+1)th word minus 1 to create the sequenced entry. The following function creates the phrases in preparation for writing them to the SRT file:
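The exact implementation lives in webvtt_utils.py; the following is only an illustrative sketch of the phrase-building logic described above.

```python
def get_phrases_from_speechmarks(speech_marks, words_per_phrase=10):
    """Group word-level speech marks into phrases for SRT entries (illustrative sketch)."""
    phrases = []
    for i in range(0, len(speech_marks), words_per_phrase):
        chunk = speech_marks[i : i + words_per_phrase]
        start_ms = chunk[0]["time"]
        if i + words_per_phrase < len(speech_marks):
            # End the phrase 1 ms before the first word of the next phrase
            end_ms = speech_marks[i + words_per_phrase]["time"] - 1
        else:
            end_ms = chunk[-1]["time"] + 1000  # assumed padding for the last phrase
        phrases.append(
            {
                "start": start_ms,
                "end": end_ms,
                "text": " ".join(mark["value"] for mark in chunk),
            }
        )
    return phrases
```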
- The final step, Media Convert, invokes the Lambda function create_mediaconvert_job.py to combine the audio stream from Amazon Polly and the subtitle file with the source video file to generate the final output file, which is then stored in an S3 bucket. This step uses `MediaConvert`, a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand content and combines advanced video and audio capabilities with a simple web interface. Here again we use the Python Boto3 SDK to create a `MediaConvert` job:
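A hedged boto3 sketch of creating such a job; the job settings template and role ARN are placeholders, not the solution's actual configuration.

```python
import json

import boto3

# MediaConvert requires an account-specific endpoint
endpoints = boto3.client("mediaconvert").describe_endpoints()
mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoints["Endpoints"][0]["Url"])

# Hypothetical job template describing the video input, Polly audio, and caption file
with open("job_settings.json") as settings_file:
    job_settings = json.load(settings_file)

mediaconvert.create_job(
    Role=mediaconvert_role_arn,   # IAM role that MediaConvert assumes (placeholder)
    Settings=job_settings,
)
```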
Prerequisites
Before getting started, you must have the following prerequisites:
- An AWS account
- AWS Cloud Development Kit (AWS CDK)
Deploy the solution
To deploy the solution using the AWS CDK, complete the following steps:
- Clone the repository:
- To make sure the AWS CDK is bootstrapped, run the command `cdk bootstrap` from the root of the repository.
- Change the working directory to the root of the repository and run the following command:

By default, the target audio settings are set to US Spanish (`es-US`). If you plan to test it with a different target language, use the following command:
The process takes a few minutes to complete, after which it displays a link that you can use to view the target video file with the translated audio and translated subtitles.
Test the solution
To test this solution, we used a small portion of the following AWS re:Invent 2017 video from YouTube, where Amazon Transcribe was first introduced. You can also test the solution with your own video. The original language of our test video is English. When you deploy this solution, you can specify the target audio settings or you can use the default target audio settings, which uses Spanish for generating audio and subtitles. The solution creates an S3 bucket that can be used to upload the video file to.
- On the Amazon S3 console, navigate to the bucket `PollyBlogBucket`.
- Choose the bucket, navigate to the `/inputVideo` directory, and upload the video file (the solution is tested with videos of type mp4). At this point, an S3 event notification triggers the Lambda function, which starts the state machine.
- On the Step Functions console, browse to the state machine (`ProcessAudioWithSubtitles`).
). - Choose one of the runs of the state machine to locate the Graph Inspector.
This shows the run results for each state. The Step Functions workflow takes a few minutes to complete, after which you can verify that all the steps completed successfully.
Review the output
To review the output, open the Amazon S3 console and check if the audio file (.mp3) and the speech mark file (.marks) are stored in the S3 bucket under `<ROOT_S3_BUCKET>/<UID>/synthesisOutput/`.
The following is a sample of the speech mark file generated from the translated text:
In this output, each part of the text is broken out in terms of speech marks:
- time – The timestamp in milliseconds from the beginning of the corresponding audio stream
- type – The type of speech mark (sentence, word, viseme, or SSML)
- start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
- end – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
- value – Individual words in the sentence
The generated subtitle file is written back to the S3 bucket. You can find the file under `<ROOT_S3_BUCKET>/<UID>/subtitlesOutput/`. Inspect the subtitle file; the content should be similar to the following text:
After the subtitle file and audio file are generated, the final video file is created using MediaConvert. Check the MediaConvert console to verify that the job status is `COMPLETE`.

When the MediaConvert job is complete, the final video file is generated and saved back to the S3 bucket, which can be found under `<ROOT_S3_BUCKET>/<UID>/convertedAV/`.
As part of this deployment, the final video is distributed through an Amazon CloudFront (CDN) link and displayed in the terminal or in the AWS CloudFormation console.
Open the URL in a browser to view the original video with additional options for audio and subtitles. You can verify that the translated audio and subtitles are in sync.
Conclusion
In this post, we discussed how to create new language versions of video files without the need of manual intervention. Content creators can use this process to synchronize the audio and subtitles of their videos and reach a global audience.
You can easily integrate this approach into your own production pipelines to handle large volumes and scale according to your needs. Amazon Polly uses Neural TTS (NTTS) to produce natural and human-like text-to-speech voices. It also supports generating speech from SSML, which gives you additional control over how Amazon Polly generates speech from the text provided. Amazon Polly also provides a variety of different voices in multiple languages to support your needs.
Get started with AWS machine learning services by visiting the product page, or refer to the Amazon Machine Learning Solutions Lab page, where you can collaborate with experts to bring machine learning solutions to your organization.
Additional resources
For more information about the services used in this solution, refer to the following:
- Amazon Transcribe Developer Guide
- Amazon Translate Developer Guide
- Amazon Polly Developer Guide
- AWS Step Functions Developer Guide
- AWS Elemental MediaConvert User Guide
- Languages Supported by Amazon Polly
- Languages Supported by Amazon Transcribe
About the authors
Reagan Rosario works as a solutions architect at AWS focusing on education technology companies. He loves helping customers build scalable, highly available, and secure solutions in the AWS Cloud. He has more than a decade of experience working in a variety of technology roles, with a focus on software engineering and architecture.
Anil Kodali is a Solutions Architect with Amazon Web Services. He works with AWS EdTech customers, guiding them with architectural best practices for migrating existing workloads to the cloud and designing new workloads with a cloud-first approach. Prior to joining AWS, he worked with large retailers to help them with their cloud migrations.
Prasanna Saraswathi Krishnan is a Solutions Architect with Amazon Web Services working with EdTech customers. He helps them drive their cloud architecture and data strategy using best practices. His background is in distributed computing, big data analytics, and data engineering. He is passionate about machine learning and natural language processing.
Identify rooftop solar panels from satellite imagery using Amazon Rekognition Custom Labels
Renewable resources like sunlight provide a sustainable and carbon-neutral mechanism to generate power. Governments in many countries are providing incentives and subsidies to households to install solar panels as part of small-scale renewable energy schemes. This has created a huge demand for solar panels. Reaching out to potential customers at the right time, through the right channel, and with attractive offers is crucial for solar and energy companies. They’re looking for cost-efficient approaches and tools to conduct targeted marketing to proactively reach out to potential customers. By identifying the suburbs that have low coverage of solar panel installations at scale, they can focus their marketing initiatives on those areas to maximize the return on their marketing investment.
In this post, we discuss how you can identify solar panels on rooftops from satellite imagery using Amazon Rekognition Custom Labels.
The problem
High-resolution satellite imagery of urban areas provides an aerial view of rooftops. You can use these images to identify solar panel installations. However, automatically identifying solar panels with high accuracy, at low cost, and at scale is a challenging task.
With rapid development in computer vision technology, several third-party tools use computer vision to analyze satellite images and identify objects (like solar panels) automatically. However, these tools are expensive and increase the overall cost of marketing. Many organizations have also successfully implemented state-of-the-art computer vision applications to identify the presence of solar panels on the rooftops from the satellite images.
But the reality is that you need to build your own data science teams that have the specific expertise and experience to build a production machine learning (ML) application for your specific use case. It generally takes months for teams to build a computer vision solution that they can use in production. This leads to an increased cost in building and maintaining such a system.
Is there a simpler and cost-effective solution that helps solar companies quickly build effective computer vision models without building a dedicated data science team for that purpose? Yes, Rekognition Custom Labels is the answer to this question.
Solution overview
Rekognition Custom Labels is a feature of Amazon Rekognition that takes care of the heavy lifting of computer vision model development for you, so no computer vision experience is required. You simply provide images with the appropriate labels, train the model, and deploy without having to build the model and fine-tune it. Rekognition Custom Labels has the capability to build highly accurate models with fewer labeled images. This takes away the heavy lifting of model development and helps you focus on developing value-added products and applications to your customers.
In this post, we show how to label, train, and build a computer vision model to detect rooftops and solar panels from satellite images. We use Amazon Simple Storage Service (Amazon S3) for storing satellite images, Amazon SageMaker Ground Truth for labeling the images with the appropriate labels of interest, and Rekognition Custom Labels for model training and hosting. To test the model output, we use a Jupyter notebook to run Python code to detect custom labels in a supplied image by calling Amazon Rekognition APIs.
The following diagram illustrates the architecture using AWS services to label the images, and train and host the ML model.
The solution workflow is as follows:
- Store satellite imagery data in Amazon S3 as the input source.
- Use a Ground Truth labeling job to label the images.
- Use Amazon Rekognition to train the model with custom labels.
- Fine-tune the trained model.
- Start the model and analyze the image with the trained model using the Rekognition Custom Labels API.
Store satellite imagery data in Amazon S3 as an input source
The satellite images of rooftops with and without solar panels are captured from the satellite imagery data providers and stored in an S3 bucket. For our post, we use the images of New South Wales (NSW), Australia, provided by the Spatial Services, Department of Customer Service NSW. We have taken the screenshots of the rooftops from this portal and stored those images in the source S3 bucket. These images are labeled using a Ground Truth labeling job, as explained in the next step.
Use a Ground Truth labeling job to label the images
Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML tasks. It has three options:
- Amazon Mechanical Turk, which uses a public workforce to label the data
- Private, which allows you to create a private workforce from internal teams
- Vendor, which uses third-party resources for the labeling task
In this example, we use a private workforce to perform the data labeling job. Refer to Use Amazon SageMaker Ground Truth to Label Data for instructions on creating a private workforce and configuring Ground Truth for a labeling job with bounding boxes.
The following is an example image of the labeling job. The labeler can draw bounding boxes of the targets with the selected labels indicated by different colors. We used three labels on the images: rooftop, rooftop-panel, and panel to signify rooftops without solar panels, rooftops with solar panels, and just solar panels, respectively.
When the labeling job is complete, an output.manifest file is generated and stored in the S3 output location that you specified when creating the labeling job. The following code is an example of one image labeling output in the manifest file:
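The actual manifest is written as one JSON object per line; the following representative entry is pretty-printed for readability, and the bucket, job name, coordinates, and confidence values are illustrative.

```json
{
  "source-ref": "s3://solar-panel-images/rooftop-001.png",
  "rooftop-labeling-job": {
    "image_size": [{"width": 512, "height": 512, "depth": 3}],
    "annotations": [
      {"class_id": 0, "left": 104, "top": 62, "width": 130, "height": 95},
      {"class_id": 2, "left": 140, "top": 80, "width": 36, "height": 28}
    ]
  },
  "rooftop-labeling-job-metadata": {
    "class-map": {"0": "rooftop", "1": "rooftop-panel", "2": "panel"},
    "type": "groundtruth/object-detection",
    "human-annotated": "yes",
    "objects": [{"confidence": 0.95}, {"confidence": 0.92}],
    "creation-date": "2022-06-01T00:00:00.000000",
    "job-name": "labeling-job/rooftop-labeling-job"
  }
}
```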
The output manifest file is what we need for the Amazon Rekognition training job. In the next section, we provide step-by-step instructions to create a high-performance ML model to detect objects of interest.
Use Amazon Rekognition to train the model with custom labels
We now create a project for a custom object detection model, and provide the labeled images to Rekognition Custom Labels to train the model.
- On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane.
- In the navigation pane, choose Projects.
- Choose Create project.
- For Project name, enter a unique name.
- Choose Create project.
Next, we create a dataset for the training job.
- In the navigation pane, choose Datasets.
- Create a dataset based on the manifest file generated by the Ground Truth labeling job.
We’re now ready to train a new model.
- Select the project you created and choose Train new model.
- Choose the training dataset you created and choose the test dataset.
- Choose Train to start training the model.
You can create a new test dataset or split the training dataset to use 20% of the training data as the test dataset and the remaining 80% as the training dataset. However, if you split the training dataset, the training and test datasets are randomly selected from the whole dataset every time you train a new model. In this example, we create a separate test dataset to evaluate the trained models.
We collected an additional 50 satellite images and labeled them using Ground Truth. We then used the output manifest file of this labeling job to create the test dataset.
This allows us to compare the evaluation metrics of different versions of the model that are trained based on different input datasets. The first training dataset consists of 160 images; the test dataset has 50 images.
When the model training process is complete, we can access the evaluation metrics on the Evaluate tab on the model page. Our training job was able to achieve an F1 score of 0.934. The model evaluation metrics are reasonably good considering the number of training images we used and the number of images used for validating the model.
Although the model evaluation metrics are reasonable, it’s important to understand which images the model incorrectly labels, so as to further fine-tune the model’s performance to make it more robust to handle real-world challenges. In the next section, we describe the process of evaluating the images that have inaccurate labels and retraining the model to achieve better performance.
Fine-tune the trained model
Evaluating incorrect labels inferred by the trained model is a crucial step to further fine-tune the model. To check the detailed test results, we can choose the training job and choose View test results to evaluate the images that the model inaccurately labeled. Evaluating the model performance on test data can help you identify any potential labeling or data source-related issues. For example, the following test image shows an example of false-positive labeling of a rooftop.
As you can determine from the preceding image, the identified rooftop is correct—it’s the rooftop of a smaller home built on the property. Based on the source image name, we can go back to the dataset to check the labels. We can do this via the Amazon Rekognition console. In the following screenshot, we can determine that the source image wasn’t labeled correctly. The labeler missed labeling that rooftop.
To correct the labeling issue, we don’t need to rerun the Ground Truth job or run an adjustment job on the whole dataset. We can easily verify or adjust individual images via the Rekognition Custom Labels console.
- On the dataset page, choose Start labeling.
- Select the image file that needs adjustment and choose Draw bounding box.
On the labeling page, we can draw or update the bounding boxes on this image.
- Draw the bounding box around the smaller building and label it as rooftop.
- Choose Done to save the changes or choose Next or Previous to navigate through additional images that require adjustments.
In some situations, you might have to provide more images with examples of the rooftops that the model failed to identify correctly. We can collect more images, label them, and retrain the model so that the model can learn the special cases of rooftops and solar panels.
To add more training images to the training dataset, you can create another Ground Truth job if the number of added images is large and if you need to create a labeling workforce team to label the images. When the labeling job is finished, we get a new manifest file, which contains the bounding box information for the newly added images. Then we need to manually merge the manifest file of the newly added images to the existing manifest file. We use the combined manifest file to create a new training dataset on the Rekognition Custom Labels console and train a more robust model.
Another option is to add the images directly to the current training dataset if the number of new images isn’t large and one person is sufficient to finish the labeling task on the Amazon Rekognition console. In this project, we directly add another 30 images to the original training dataset and perform labeling on the console.
After we complete the label verification and add more images of different rooftop and panel types, we have a second model trained with 190 training images that we evaluate on the same test dataset. The second version of the trained model achieved an F1 score of 0.964, which is an improvement from the earlier score of 0.934. Based on your business requirement, you can further fine-tune the model.
Deploy the model and analyze the images using the Rekognition Custom Labels API
Now that we have trained a model with satisfactory evaluation results, we deploy the model to an endpoint for real-time inference. We analyze a few images using Python code via the Amazon Rekognition API on this deployed model. After you start the model, its status shows as Running.
Now the model is ready to detect the labels on the new satellite images. We can test the model by running the provided sample Python API code. On the model page, choose API Code.
Select Python to review the sample code to start the model, analyze images, and stop the model.
Copy the Python code in the Analyze image section into a Jupyter notebook, which can be running on your laptop.
To set up the environment to run the code, we need to install the AWS SDKs that we want to use and configure the security credentials to access the AWS resources. For instructions, refer to Set Up the AWS CLI and AWS SDKs.
Upload a test image to an S3 bucket. In the Analyze image Python code, substitute the variable MY_BUCKET with the bucket name that has the test image and replace MY_IMAGE_KEY with the file name of the test image.
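A hedged sketch of the Analyze image call with boto3; the model ARN variable and the confidence threshold are placeholders.

```python
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn=model_arn,   # ARN of the running Rekognition Custom Labels model (placeholder)
    Image={"S3Object": {"Bucket": MY_BUCKET, "Name": MY_IMAGE_KEY}},
    MinConfidence=50,              # illustrative threshold
)

for custom_label in response["CustomLabels"]:
    print(custom_label["Name"], round(custom_label["Confidence"], 2), custom_label.get("Geometry", {}))
```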
The following screenshot shows a sample response of running the Python code.
The following output image shows that the model has successfully detected three labels: rooftop, rooftop-panel, and panel.
Clean up
After testing, we can stop the model to avoid any unnecessary charges incurred to run the model.
Conclusion
In this post, we showed you how to detect rooftops and solar panels from the satellite imagery by building custom computer vision models with Rekognition Custom Labels. We demonstrated how Rekognition Custom Labels manages the model training by taking care of the deep learning complexities behind the scenes. We also demonstrated how to use Ground Truth to label the training images at scale. Furthermore, we discussed mechanisms to improve model accuracy by correcting the labeling of the images on the fly and retraining the model with the dataset. Power utility companies can use this solution to detect houses without solar panels to send offers and promotions to achieve efficient targeted marketing.
To learn more about how Rekognition Custom Labels can help your business, visit Amazon Rekognition Custom Labels or contact AWS Sales.
About the Authors
Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers build solutions leveraging state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.
Santosh Kulkarni is a Solutions Architect at Amazon Web Services. He works closely with enterprise customers to accelerate their Cloud journey. He is also passionate about building large-scale distributed applications to solve business problems using his knowledge in Machine Learning, Big Data, and Software Development.
Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with customers using data-driven methodology on the cloud, and he has been leading projects in challenging areas including robotics computer vision, time series forecasting, price optimization, predictive maintenance, pharmaceutical development, product recommendation system, etc. In his spare time he enjoys traveling and hanging out with family.
Generate synchronized closed captions and audio using the Amazon Polly subtitle generator
Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
As our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.
Although subtitles and captions are often used interchangeably, including in this post, there are subtle differences among them:
- Subtitles – The text displayed on the screen is in a different language from the audio and doesn’t include non-dialogue elements such as significant sounds. The primary objective is to reach the audience that doesn’t speak the audio language in the video.
- Captions (closed/open) – Captions display the dialogues being spoken in the audio in the same language. Its primary purpose is to increase accessibility in cases where the audio can’t be heard by the end consumer due to a range of issues. Closed captions are part of a different file than the audio/video source and can be turned off and on at the user’s discretion, whereas open captions are part of the video file and can’t be turned off by the user.
Benefits of using Amazon Polly to generate audio with subtitles or closed captions
Imagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when Amazon Polly generates your audio, captions are included that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.
There are a multitude of use cases for captions, such as advertisements in social spaces, gymnasiums, coffee shops, and other places where typically there is something on a television with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing co-passengers; and several more.
Irrespective of the field of application, closed captioning can help with the following:
- Accessibility – People with hearing impairments can better consume your content.
- Retention – Online learning is easier for e-learners to grasp and retain when more human senses are involved.
- Reachability – Your content can reach people that have competing priorities, such as gaming and watching news simultaneously, or people who have a different native language than the audio language.
- Searchability – The content is searchable by search engines. Whereas videos can’t be searched optimally by most search engines, search engines can use the caption text files and make your content more discoverable.
- Social courtesy – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.
- Comprehension – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.
Solution overview
The library presented in this post uses Amazon Polly to generate sound and closed captions for an input text. You can easily integrate this library in your text-to-speech applications. It supports several audio formats, and captions in both VTT and SRT file formats, which are the most commonly used across the industry.
In this post, we focus on the `PollyVTT()` syntax and options, and offer a few examples that demonstrate how to use the Python `SubtitleGeneratorForPolly` to simultaneously generate synchronous audio and subtitle files for a given text input. The output audio file format can be PCM (WAV), OGG, or MP3, and the subtitle file format can be VTT or SRT. Furthermore, `SubtitleGeneratorForPolly` supports all Amazon Polly `synthesize_speech` parameters and adds to the rich Amazon Polly feature set.
The `polly-vtt` library and its dependencies are available on GitHub.
Install and use the function
Before we look at some examples of using `PollyVTT()`, the function that powers `SubtitleGeneratorForPolly`, let’s look at its installation and syntax.
Install the library using the following code:
To run it from the command line, you simply run `polly-vtt`:
The following code shows your options:
Let’s look at a few examples now.
Example 1
This example generates a PCM audio file along with an SRT caption file for two simple sentences:
Example 2
This example demonstrates how to use a paragraph of text as input. This generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. The following example creates six files for the given input text:
pcm_testfile.wav
pcm_testfile.wav.vtt
mp3_testfile.mp3
mp3_testfile.mp3.vtt
ogg_testfile.ogg
ogg_testfile.ogg.srt
See the following code:
Example 3
In most cases, however, you want to pass the text as an input file. The following is a Python example of this, with the same output as the previous example:
The following is a testimonial from the AWS internal training team on using Amazon Polly with closed captions:
The following video offers a short demo of how the internal training team at AWS uses PollyVTT()
:
Conclusion
In this post, we shared a method to generate audio and subtitles at the same time for a given text. The `PollyVTT()` function and `SubtitleGeneratorForPolly` address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.
For more tutorials and information about Amazon Polly, check out the AWS Machine Learning Blog.
About the Authors
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.
Dan McKee uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.
Orlando Karam is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.