Localize content into multiple languages using AWS machine learning services
Over the last few years, online education platforms have seen increased adoption of and demand for video-based learning because it offers an effective medium to engage learners. To expand to international markets and address a culturally and linguistically diverse population, businesses are also looking at diversifying their learning offerings by localizing content into multiple languages. These businesses are looking for reliable and cost-effective ways to solve their localization use cases.
Localizing content mainly includes translating original voices into new languages and adding visual aids such as subtitles. Traditionally, this process is cost-prohibitive, manual, and takes a lot of time, including working with localization specialists. With the power of AWS machine learning (ML) services such as Amazon Transcribe, Amazon Translate, and Amazon Polly, you can create a viable and cost-effective localization solution. You can use Amazon Transcribe to create a transcript of your existing audio and video streams, and then translate this transcript into multiple languages using Amazon Translate. You can then use Amazon Polly, a text-to-speech service, to convert the translated text into natural-sounding human speech.
The next step of localization is to add subtitles to the content, which can improve accessibility and comprehension, and help viewers understand the videos better. Subtitle creation for video content can be challenging because the translated speech doesn’t match the original speech timing. Synchronization between audio and subtitles is critical, because the audience can become disconnected from your content if the two are out of sync. Amazon Polly offers a solution to this challenge through speech marks, which you can use to create a subtitle file that can be synced with the generated speech output.
In this post, we review a localization solution using AWS ML services where we use an original English video and convert it into Spanish. We also focus on using speech marks to create a synced subtitle file in Spanish.
Solution overview
The following diagram illustrates the solution architecture.
The solution takes a video file and the target language settings as input and uses Amazon Transcribe to create a transcription of the video. We then use Amazon Translate to translate the transcript to the target language. The translated text is provided as an input to Amazon Polly to generate the audio stream and speech marks in the target language. Amazon Polly returns speech mark output in a line-delimited JSON stream, which contains fields such as time, type, start, end, and value. The value can vary depending on the type of speech mark requested in the input, such as SSML, viseme, word, or sentence. For the purpose of our example, we requested the speech mark type word. With this option, Amazon Polly breaks a sentence into its individual words and provides their start and end times in the audio stream. With this metadata, the speech marks are then processed to generate the subtitles for the corresponding audio stream generated by Amazon Polly.
Finally, we use AWS Elemental MediaConvert to render the final video with the translated audio and corresponding subtitles.
The following video demonstrates the final outcome of the solution:
AWS Step Functions workflow
We use AWS Step Functions to orchestrate this process. The following figure shows a high-level view of the Step Functions workflow (some steps are omitted from the diagram for better clarity).
The workflow steps are as follows:
- A user uploads the source video file to an Amazon Simple Storage Service (Amazon S3) bucket.
- The S3 event notification triggers the AWS Lambda function state_machine.py (not shown in the diagram), which invokes the Step Functions state machine.
- The first step, Transcribe audio, invokes the Lambda function transcribe.py, which uses Amazon Transcribe to generate a transcript of the audio from the source video.
The following sample code demonstrates how to create a transcription job using the Amazon Transcribe Boto3 Python SDK:
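A minimal sketch of such a call looks like the following (bucket, key, and job names are placeholders; the repository's transcribe.py may differ):

```python
# Minimal sketch: start an Amazon Transcribe job for the uploaded video.
# Bucket, key, and job names below are placeholders.
import boto3

transcribe = boto3.client("transcribe")

transcribe.start_transcription_job(
    TranscriptionJobName="localization-demo-job",            # hypothetical job name
    LanguageCode="en-US",                                     # source language of the video
    MediaFormat="mp4",
    Media={"MediaFileUri": "s3://my-input-bucket/inputVideo/source.mp4"},
    OutputBucketName="my-output-bucket",                      # where the transcript JSON is written
    OutputKey="transcripts/source.json",
)
```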
After the job is complete, the output files are saved into the S3 bucket and the process continues to the next step of translating the content.
- The Translate transcription step invokes the Lambda function translate.py, which uses Amazon Translate to translate the transcript to the target language. Here, we use synchronous, real-time translation via the translate_text function:
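A minimal sketch of this call (the transcript text and language codes are placeholders; the repository's translate.py may differ):

```python
# Minimal sketch: translate the transcript text synchronously with Amazon Translate.
import boto3

translate = boto3.client("translate")

transcript_text = "Hello and welcome to re:Invent."   # placeholder transcript from Amazon Transcribe

response = translate.translate_text(
    Text=transcript_text,
    SourceLanguageCode="en",
    TargetLanguageCode="es",      # target language; es-US audio is generated later by Amazon Polly
)
translated_text = response["TranslatedText"]
```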
Synchronous translation has limits on the document size it can translate; as of this writing, it’s set to 5,000 bytes. For larger document sizes, consider using an asynchronous route of creating the job using start_text_translation_job and checking the status via describe_text_translation_job.
- The next step is a Step Functions Parallel state, where we create parallel branches in our state machine.
- In the first branch, we invoke the Lambda function generate_polly_audio.py to generate our Amazon Polly audio stream. Here we use the start_speech_synthesis_task method of the Amazon Polly Python SDK to trigger the speech synthesis task that creates the Amazon Polly audio. We set the OutputFormat to mp3, which tells Amazon Polly to generate an audio stream for this API call.
- In the second branch, we invoke the Lambda function generate_speech_marks.py to generate the speech marks output. We again use the start_speech_synthesis_task method, but specify the OutputFormat as json, which tells Amazon Polly to generate speech marks for this API call.
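A minimal sketch of these two calls (bucket, prefix, and voice are placeholders; the repository's Lambda functions may differ):

```python
# Minimal sketch of the two parallel Amazon Polly calls.
import boto3

polly = boto3.client("polly")
translated_text = "Hola y bienvenidos a re:Invent."   # placeholder translated text

# Branch 1: synthesize the translated text to an MP3 audio stream in Amazon S3.
polly.start_speech_synthesis_task(
    Engine="neural",
    LanguageCode="es-US",
    VoiceId="Lupe",                            # an es-US voice; any supported voice works
    OutputFormat="mp3",
    OutputS3BucketName="my-output-bucket",
    OutputS3KeyPrefix="synthesisOutput/audio",
    Text=translated_text,
)

# Branch 2: request word-level speech marks as line-delimited JSON for subtitle timing.
polly.start_speech_synthesis_task(
    Engine="neural",
    LanguageCode="es-US",
    VoiceId="Lupe",
    OutputFormat="json",
    SpeechMarkTypes=["word"],
    OutputS3BucketName="my-output-bucket",
    OutputS3KeyPrefix="synthesisOutput/marks",
    Text=translated_text,
)
```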
In the next step of the second branch, we invoke the Lambda function generate_subtitles.py, which implements the logic to generate a subtitle file from the speech marks output.
It uses the Python module in the file webvtt_utils.py. This module has multiple utility functions to create the subtitle file; one such method, get_phrases_from_speechmarks, is responsible for parsing the speech marks file. The speech marks JSON structure provides just the start time for each word individually. To create the subtitle timing required for the SRT file, we first create phrases of about n (where n=10) words from the list of words in the speech marks file. Then we write them into the SRT file format, taking the start time from the first word in the phrase; for the end time, we use the start time of the (n+1) word and subtract 1 from it to create the sequenced entry. The following function creates the phrases in preparation for writing them to the SRT file:
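A minimal sketch of that phrase-grouping logic (the dictionary field names follow the speech mark format shown later in this post; the final-phrase padding is an assumption, and the repository's webvtt_utils.py may differ):

```python
# Minimal sketch of grouping word speech marks into subtitle phrases.
# Each speech mark is a dict parsed from one JSON line, for example:
# {"time": 6, "type": "word", "start": 2, "end": 4, "value": "Hola"}
def get_phrases_from_speechmarks(speech_marks, words_per_phrase=10):
    phrases = []
    for i in range(0, len(speech_marks), words_per_phrase):
        chunk = speech_marks[i:i + words_per_phrase]
        start_ms = chunk[0]["time"]
        next_index = i + words_per_phrase
        if next_index < len(speech_marks):
            # End the phrase 1 ms before the next word starts.
            end_ms = speech_marks[next_index]["time"] - 1
        else:
            end_ms = chunk[-1]["time"] + 1000   # assumed padding for the final phrase
        text = " ".join(mark["value"] for mark in chunk)
        phrases.append({"start_ms": start_ms, "end_ms": end_ms, "text": text})
    return phrases
```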
- The final step, Media Convert, invokes the Lambda function create_mediaconvert_job.py to combine the audio stream from Amazon Polly and the subtitle file with the source video file to generate the final output file, which is then stored in an S3 bucket. This step uses MediaConvert, a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand content and combines advanced video and audio capabilities with a simple web interface. Here again we use the Python Boto3 SDK to create a MediaConvert job:
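A minimal sketch of submitting such a job (the job settings are typically built from a JSON template; the role ARN and file names are placeholders, and the repository's create_mediaconvert_job.py may differ):

```python
# Minimal sketch: submit an AWS Elemental MediaConvert job.
import json
import boto3

# MediaConvert requires an account-specific endpoint.
endpoint = boto3.client("mediaconvert").describe_endpoints()["Endpoints"][0]["Url"]
mediaconvert = boto3.client("mediaconvert", endpoint_url=endpoint)

# Job settings (inputs: source video, Polly MP3 audio, subtitle file; outputs: final MP4),
# usually exported as a JSON template from the MediaConvert console.
with open("job_settings.json") as f:
    job_settings = json.load(f)

mediaconvert.create_job(
    Role="arn:aws:iam::123456789012:role/MediaConvertRole",   # placeholder role ARN
    Settings=job_settings,
)
```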
Prerequisites
Before getting started, you must have the following prerequisites:
- An AWS account
- AWS Cloud Development Kit (AWS CDK)
Deploy the solution
To deploy the solution using the AWS CDK, complete the following steps:
- Clone the repository:
- To make sure the AWS CDK is bootstrapped, run the command cdk bootstrap from the root of the repository.
- Change the working directory to the root of the repository and run the following command:
By default, the target audio settings are set to US Spanish (es-US). If you plan to test it with a different target language, use the following command:
The process takes a few minutes to complete, after which it displays a link that you can use to view the target video file with the translated audio and translated subtitles.
Test the solution
To test this solution, we used a small portion of the following AWS re:Invent 2017 video from YouTube, where Amazon Transcribe was first introduced. You can also test the solution with your own video. The original language of our test video is English. When you deploy this solution, you can specify the target audio settings or use the default target audio settings, which use Spanish for generating audio and subtitles. The solution creates an S3 bucket to which you can upload the video file.
- On the Amazon S3 console, navigate to the bucket PollyBlogBucket.
- Choose the bucket, navigate to the /inputVideo directory, and upload the video file (the solution is tested with videos of type mp4). At this point, an S3 event notification triggers the Lambda function, which starts the state machine.
- On the Step Functions console, browse to the state machine (ProcessAudioWithSubtitles).
- Choose one of the runs of the state machine to locate the Graph Inspector.
This shows the run results for each state. The Step Functions workflow takes a few minutes to complete, after which you can verify if all the steps successfully completed.
Review the output
To review the output, open the Amazon S3 console and check if the audio file (.mp3) and the speech mark file (.marks) are stored in the S3 bucket under <ROOT_S3_BUCKET>/<UID>/synthesisOutput/.
The following is a sample of the speech mark file generated from the translated text:
In this output, each part of the text is broken out in terms of speech marks:
- time – The timestamp in milliseconds from the beginning of the corresponding audio stream
- type – The type of speech mark (sentence, word, viseme, or SSML)
- start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
- end – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
- value – Individual words in the sentence
The generated subtitle file is written back to the S3 bucket. You can find the file under <ROOT_S3_BUCKET>/<UID>/subtitlesOutput/. Inspect the subtitle file; the content should be similar to the following text:
After the subtitle file and audio file are generated, the final video file is created using MediaConvert. Check the MediaConvert console to verify that the job status is COMPLETE.
When the MediaConvert job is complete, the final video file is generated and saved back to the S3 bucket, which can be found under <ROOT_S3_BUCKET>/<UID>/convertedAV/.
As part of this deployment, the final video is distributed through an Amazon CloudFront (CDN) link and displayed in the terminal or in the AWS CloudFormation console.
Open the URL in a browser to view the original video with additional options for audio and subtitles. You can verify that the translated audio and subtitles are in sync.
Conclusion
In this post, we discussed how to create new language versions of video files without the need for manual intervention. Content creators can use this process to synchronize the audio and subtitles of their videos and reach a global audience.
You can easily integrate this approach into your own production pipelines to handle large volumes and scale according to your needs. Amazon Polly uses Neural TTS (NTTS) to produce natural and human-like text-to-speech voices. It also supports generating speech from SSML, which gives you additional control over how Amazon Polly generates speech from the text provided. Amazon Polly also provides a variety of different voices in multiple languages to support your needs.
Get started with AWS machine learning services by visiting the product page, or refer to the Amazon Machine Learning Solutions Lab page, where you can collaborate with experts to bring machine learning solutions to your organization.
Additional resources
For more information about the services used in this solution, refer to the following:
- Amazon Transcribe Developer Guide
- Amazon Translate Developer Guide
- Amazon Polly Developer Guide
- AWS Step Functions Developer Guide
- AWS Elemental MediaConvert User Guide
- Languages Supported by Amazon Polly
- Languages Supported by Amazon Transcribe
About the authors
Reagan Rosario works as a solutions architect at AWS focusing on education technology companies. He loves helping customers build scalable, highly available, and secure solutions in the AWS Cloud. He has more than a decade of experience working in a variety of technology roles, with a focus on software engineering and architecture.
Anil Kodali is a Solutions Architect with Amazon Web Services. He works with AWS EdTech customers, guiding them with architectural best practices for migrating existing workloads to the cloud and designing new workloads with a cloud-first approach. Prior to joining AWS, he worked with large retailers to help them with their cloud migrations.
Prasanna Saraswathi Krishnan is a Solutions Architect with Amazon Web Services working with EdTech customers. He helps them drive their cloud architecture and data strategy using best practices. His background is in distributed computing, big data analytics, and data engineering. He is passionate about machine learning and natural language processing.
Identify rooftop solar panels from satellite imagery using Amazon Rekognition Custom Labels
Renewable resources like sunlight provide a sustainable, carbon-neutral way to generate power. Governments in many countries are providing incentives and subsidies to households to install solar panels as part of small-scale renewable energy schemes, which has created a huge demand for solar panels. Reaching potential customers at the right time, through the right channel, and with attractive offers is crucial for solar and energy companies, and they're looking for cost-efficient approaches and tools to conduct targeted marketing. By identifying, at scale, the suburbs that have low coverage of solar panel installations, they can focus their marketing initiatives on those areas and maximize the return on their marketing investment.
In this post, we discuss how you can identify solar panels on rooftops from satellite imagery using Amazon Rekognition Custom Labels.
The problem
High-resolution satellite imagery of urban areas provides an aerial view of rooftops. You can use these images to identify solar panel installations. But automatically identifying solar panels with high accuracy, at low cost, and in a scalable way is a challenging task.
With rapid development in computer vision technology, several third-party tools use computer vision to analyze satellite images and identify objects (like solar panels) automatically. However, these tools are expensive and increase the overall cost of marketing. Many organizations have also successfully implemented state-of-the-art computer vision applications to identify the presence of solar panels on the rooftops from the satellite images.
But the reality is that you need to build your own data science teams that have the specific expertise and experience to build a production machine learning (ML) application for your specific use case. It generally takes months for teams to build a computer vision solution that they can use in production. This leads to an increased cost in building and maintaining such a system.
Is there a simpler and cost-effective solution that helps solar companies quickly build effective computer vision models without building a dedicated data science team for that purpose? Yes, Rekognition Custom Labels is the answer to this question.
Solution overview
Rekognition Custom Labels is a feature of Amazon Rekognition that takes care of the heavy lifting of computer vision model development for you, so no computer vision experience is required. You simply provide images with the appropriate labels, train the model, and deploy without having to build the model and fine-tune it. Rekognition Custom Labels has the capability to build highly accurate models with fewer labeled images. This takes away the heavy lifting of model development and helps you focus on developing value-added products and applications to your customers.
In this post, we show how to label, train, and build a computer vision model to detect rooftops and solar panels from satellite images. We use Amazon Simple Storage Service (Amazon S3) for storing satellite images, Amazon SageMaker Ground Truth for labeling the images with the appropriate labels of interest, and Rekognition Custom Labels for model training and hosting. To test the model output, we use a Jupyter notebook to run Python code to detect custom labels in a supplied image by calling Amazon Rekognition APIs.
The following diagram illustrates the architecture using AWS services to label the images, and train and host the ML model.
The solution workflow is as follows:
- Store satellite imagery data in Amazon S3 as the input source.
- Use a Ground Truth labeling job to label the images.
- Use Amazon Rekognition to train the model with custom labels.
- Fine-tune the trained model.
- Start the model and analyze the image with the trained model using the Rekognition Custom Labels API.
Store satellite imagery data in Amazon S3 as an input source
The satellite images of rooftops with and without solar panels are captured from the satellite imagery data providers and stored in an S3 bucket. For our post, we use the images of New South Wales (NSW), Australia, provided by the Spatial Services, Department of Customer Service NSW. We have taken the screenshots of the rooftops from this portal and stored those images in the source S3 bucket. These images are labeled using a Ground Truth labeling job, as explained in the next step.
Use a Ground Truth labeling job to label the images
Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML tasks. It has three options:
- Amazon Mechanical Turk, which uses a public workforce to label the data
- Private, which allows you to create a private workforce from internal teams
- Vendor, which uses third-party resources for the labeling task
In this example, we use a private workforce to perform the data labeling job. Refer to Use Amazon SageMaker Ground Truth to Label Data for instructions on creating a private workforce and configuring Ground Truth for a labeling job with bounding boxes.
The following is an example image of the labeling job. The labeler can draw bounding boxes of the targets with the selected labels indicated by different colors. We used three labels on the images: rooftop, rooftop-panel, and panel to signify rooftops without solar panels, rooftops with solar panels, and just solar panels, respectively.
When the labeling job is complete, an output.manifest file is generated and stored in the S3 output location that you specified when creating the labeling job. The following code is an example of one image labeling output in the manifest file:
The output manifest file is what we need for the Amazon Rekognition training job. In the next section, we provide step-by-step instructions to create a high-performance ML model to detect objects of interest.
Use Amazon Rekognition to train the model with custom labels
We now create a project for a custom object detection model, and provide the labeled images to Rekognition Custom Labels to train the model.
- On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane.
- In the navigation pane, choose Projects.
- Choose Create project.
- For Project name, enter a unique name.
- Choose Create project.
Next, we create a dataset for the training job.
- In the navigation pane, choose Datasets.
- Create a dataset based on the manifest file generated by the Ground Truth labeling job.
We’re now ready to train a new model.
- Select the project you created and choose Train new model.
- Choose the training dataset you created and choose the test dataset.
- Choose Train to start training the model.
You can create a new test dataset or split the training dataset to use 20% of the training data as the test dataset and the remaining 80% as the training dataset. However, if you split the training dataset, the training and test datasets are randomly selected from the whole dataset every time you train a new model. In this example, we create a separate test dataset to evaluate the trained models.
We collected an additional 50 satellite images and labeled them using Ground Truth. We then used the output manifest file of this labeling job to create the test dataset.
This allows us to compare the evaluation metrics of different versions of the model that are trained based on different input datasets. The first training dataset consists of 160 images; the test dataset has 50 images.
When the model training process is complete, we can access the evaluation metrics on the Evaluate tab on the model page. Our training job was able to achieve an F1 score of 0.934. The model evaluation metrics are reasonably good considering the number of training images we used and the number of images used for validating the model.
Although the model evaluation metrics are reasonable, it’s important to understand which images the model incorrectly labels, so as to further fine-tune the model’s performance to make it more robust to handle real-world challenges. In the next section, we describe the process of evaluating the images that have inaccurate labels and retraining the model to achieve better performance.
Fine-tune the trained model
Evaluating incorrect labels inferred by the trained model is a crucial step to further fine-tune the model. To check the detailed test results, we can choose the training job and choose View test results to evaluate the images that the model inaccurately labeled. Evaluating the model performance on test data can help you identify any potential labeling or data source-related issues. For example, the following test image shows an example of false-positive labeling of a rooftop.
As you can determine from the preceding image, the identified rooftop is correct—it’s the rooftop of a smaller home built on the property. Based on the source image name, we can go back to the dataset to check the labels. We can do this via the Amazon Rekognition console. In the following screenshot, we can determine that the source image wasn’t labeled correctly. The labeler missed labeling that rooftop.
To correct the labeling issue, we don’t need to rerun the Ground Truth job or run an adjustment job on the whole dataset. We can easily verify or adjust individual images via the Rekognition Custom Labels console.
- On the dataset page, choose Start labeling.
- Select the image file that needs adjustment and choose Draw bounding box.
On the labeling page, we can draw or update the bounding boxes on this image.
- Draw the bounding box around the smaller building and label it as rooftop.
- Choose Done to save the changes or choose Next or Previous to navigate through additional images that require adjustments.
In some situations, you might have to provide more images with examples of the rooftops that the model failed to identify correctly. We can collect more images, label them, and retrain the model so that the model can learn the special cases of rooftops and solar panels.
To add more training images to the training dataset, you can create another Ground Truth job if the number of added images is large and if you need to create a labeling workforce team to label the images. When the labeling job is finished, we get a new manifest file, which contains the bounding box information for the newly added images. Then we need to manually merge the manifest file of the newly added images to the existing manifest file. We use the combined manifest file to create a new training dataset on the Rekognition Custom Labels console and train a more robust model.
Another option is to add the images directly to the current training dataset if the number of new images isn’t large and one person is sufficient to finish the labeling task on the Amazon Rekognition console. In this project, we directly add another 30 images to the original training dataset and perform labeling on the console.
After we complete the label verification and add more images of different rooftop and panel types, we have a second model trained with 190 training images that we evaluate on the same test dataset. The second version of the trained model achieved an F1 score of 0.964, which is an improvement from the earlier score of 0.934. Based on your business requirement, you can further fine-tune the model.
Deploy the model and analyze the images using the Rekognition Custom Labels API
Now that we have trained a model with satisfactory evaluation results, we deploy the model to an endpoint for real-time inference. We analyze a few images using Python code via the Amazon Rekognition API on this deployed model. After you start the model, its status shows as Running.
Now the model is ready to detect the labels on the new satellite images. We can test the model by running the provided sample Python API code. On the model page, choose API Code.
Select Python to review the sample code to start the model, analyze images, and stop the model.
Copy the Python code in the Analyze image section into a Jupyter notebook, which can be running on your laptop.
To set up the environment to run the code, we need to install the AWS SDKs that we want to use and configure the security credentials to access the AWS resources. For instructions, refer to Set Up the AWS CLI and AWS SDKs.
Upload a test image to an S3 bucket. In the Analyze image Python code, substitute the variable MY_BUCKET with the bucket name that has the test image and replace MY_IMAGE_KEY with the file name of the test image.
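A minimal sketch of that analyze call (the project version ARN, bucket, and key are placeholders; the console-generated sample code may differ slightly):

```python
# Minimal sketch: call DetectCustomLabels against the running model version.
import boto3

rekognition = boto3.client("rekognition")

response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/solar/version/1/1234567890",  # placeholder
    Image={"S3Object": {"Bucket": "MY_BUCKET", "Name": "MY_IMAGE_KEY"}},
    MinConfidence=50,
)

for label in response["CustomLabels"]:
    print(label["Name"], round(label["Confidence"], 2), label.get("Geometry", {}).get("BoundingBox"))
```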
The following screenshot shows a sample response of running the Python code.
The following output image shows that the model has successfully detected three labels: rooftop, rooftop-panel, and panel.
Clean up
After testing, we can stop the model to avoid any unnecessary charges incurred to run the model.
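A minimal sketch of stopping the model (the project version ARN is a placeholder):

```python
# Minimal sketch: stop the running model version to avoid ongoing inference charges.
import boto3

rekognition = boto3.client("rekognition")
rekognition.stop_project_version(
    ProjectVersionArn="arn:aws:rekognition:us-east-1:123456789012:project/solar/version/1/1234567890"
)
```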
Conclusion
In this post, we showed you how to detect rooftops and solar panels from the satellite imagery by building custom computer vision models with Rekognition Custom Labels. We demonstrated how Rekognition Custom Labels manages the model training by taking care of the deep learning complexities behind the scenes. We also demonstrated how to use Ground Truth to label the training images at scale. Furthermore, we discussed mechanisms to improve model accuracy by correcting the labeling of the images on the fly and retraining the model with the dataset. Power utility companies can use this solution to detect houses without solar panels to send offers and promotions to achieve efficient targeted marketing.
To learn more about how Rekognition Custom Labels can help your business, visit Amazon Rekognition Custom Labels or contact AWS Sales.
About the Authors
Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.
Santosh Kulkarni is a Solutions Architect at Amazon Web Services. He works closely with enterprise customers to accelerate their Cloud journey. He is also passionate about building large-scale distributed applications to solve business problems using his knowledge in Machine Learning, Big Data, and Software Development.
Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with customers using data-driven methodology on the cloud, and he has been leading projects in challenging areas including robotics computer vision, time series forecasting, price optimization, predictive maintenance, pharmaceutical development, product recommendation system, etc. In his spare time he enjoys traveling and hanging out with family.
Generate synchronized closed captions and audio using the Amazon Polly subtitle generator
Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.
As our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.
Although subtitles and captions are often used interchangeably, including in this post, there are subtle differences between them:
- Subtitles – With subtitles, the text displayed on the screen is in a different language from the audio and doesn’t include non-dialogue elements such as significant sounds. The primary objective is to reach an audience that doesn’t speak the audio language of the video.
- Captions (closed/open) – Captions display the dialogue being spoken in the audio, in the same language. Their primary purpose is to increase accessibility in cases where the end consumer can’t hear the audio due to a range of issues. Closed captions are part of a different file than the audio/video source and can be turned off and on at the user’s discretion, whereas open captions are part of the video file and can’t be turned off by the user.
Benefits of using Amazon Polly to generate audio with subtitles or closed captions
Imagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when Amazon Polly generates your audio, captions are included that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.
There are a multitude of use cases for captions, such as advertisements in social spaces, gymnasiums, coffee shops, and other places where typically there is something on a television with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing co-passengers; and several more.
Irrespective of the field of application, closed captioning can help with the following:
- Accessibility – People with hearing impairments can better consume your content.
- Retention – Online learning is easier for e-learners to grasp and retain when more human senses are involved.
- Reachability – Your content can reach people that have competing priorities, such as gaming and watching news simultaneously, or people who have a different native language than the audio language.
- Searchability – The content is searchable by search engines. Whereas videos can’t be searched optimally by most search engines, search engines can use the caption text files and make your content more discoverable.
- Social courtesy – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.
- Comprehension – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.
Solution overview
The library presented in this post uses Amazon Polly to generate sound and closed captions for an input text. You can easily integrate this library in your text-to-speech applications. It supports several audio formats, and captions in both VTT and SRT file formats, which are the most commonly used across the industry.
In this post, we focus on the PollyVTT() syntax and options, and offer a few examples that demonstrate how to use the Python SubtitleGeneratorForPolly to simultaneously generate synchronous audio and subtitle files for a given text input. The output audio file format can be PCM (WAV), OGG, or MP3, and the subtitle file format can be VTT or SRT. Furthermore, SubtitleGeneratorForPolly supports all Amazon Polly synthesize_speech parameters and adds to the rich Amazon Polly feature set.
The polly-vtt library and its dependencies are available on GitHub.
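The library's actual interface is documented in its repository; the following is only a rough sketch of the underlying approach it automates, using the standard Amazon Polly synthesize_speech API to produce an audio file plus word-level speech marks and then writing a basic VTT file (the file names, voice, and one-cue-per-word simplification are assumptions, not the library's behavior):

```python
# Rough sketch of the approach polly-vtt automates (this is not the library's API):
# one synthesize_speech call for audio, one for word speech marks, then a basic VTT file.
import json
import boto3

polly = boto3.client("polly")
text = "Hello from Amazon Polly."          # placeholder input text
voice = "Joanna"

# 1. Audio stream
audio = polly.synthesize_speech(Text=text, VoiceId=voice, OutputFormat="mp3")
with open("output.mp3", "wb") as f:
    f.write(audio["AudioStream"].read())

# 2. Word-level speech marks (line-delimited JSON)
marks_resp = polly.synthesize_speech(
    Text=text, VoiceId=voice, OutputFormat="json", SpeechMarkTypes=["word"]
)
marks = [json.loads(line) for line in marks_resp["AudioStream"].read().decode().splitlines() if line]

def ms_to_vtt(ms):
    # Convert milliseconds to the HH:MM:SS.mmm timestamp format VTT expects.
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02}.{ms:03}"

# 3. Write one cue per word (real captions would group words into phrases).
with open("output.mp3.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for cur, nxt in zip(marks, marks[1:] + [None]):
        end = nxt["time"] - 1 if nxt else cur["time"] + 1000
        f.write(f"{ms_to_vtt(cur['time'])} --> {ms_to_vtt(end)}\n{cur['value']}\n\n")
```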
Install and use the function
Before we look at some examples of using PollyVTT(), the function that powers SubtitleGeneratorForPolly, let's review its installation and syntax.
Install the library using the following code:
To run from the command line, you simply run polly-vtt:
The following code shows your options:
Let’s look at a few examples now.
Example 1
This example generates a PCM audio file along with an SRT caption file for two simple sentences:
Example 2
This example demonstrates how to use a paragraph of text as input. This generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. The following example creates six files for the given input text:
pcm_testfile.wav
pcm_testfile.wav.vtt
mp3_testfile.mp3
mp3_testfile.mp3.vtt
ogg_testfile.ogg
ogg_testfile.ogg.srt
See the following code:
Example 3
In most cases, however, you want to pass the text as an input file. The following is a Python example of this, with the same output as the previous example:
The following is a testimonial from the AWS internal training team on using Amazon Polly with closed captions:
The following video offers a short demo of how the internal training team at AWS uses PollyVTT():
Conclusion
In this post, we shared a method to generate audio and subtitles at the same time for a given text. The PollyVTT() function and SubtitleGeneratorForPolly address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.
For more tutorials and information about Amazon Polly, check out the AWS Machine Learning Blog.
About the Authors
Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.
Dan McKee uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.
Orlando Karam is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.
Accelerate your identity verification projects using AWS Amplify and Amazon Rekognition sample implementations
Amazon Rekognition allows you to mitigate fraudulent attacks and minimize onboarding friction for legitimate customers through a streamlined identity verification process. This can result in an increase in customer trust and safety. Key capabilities of this solution include:
- Register a new user using a selfie
- Register a new user after face match against an ID card and ID card data extraction
- Authenticate returning user
Amazon Rekognition offers pre-trained facial recognition capabilities that you can quickly add to your user onboarding and authentication workflows to verify opted-in users’ identities online. No machine learning (ML) expertise is required to use this service.
In a previous post, we described a typical identity verification workflow and showed you how to build an identity verification solution using various Amazon Rekognition APIs. In this post, we have added a facial identity-based authentication user interface to show a complete end-to-end identity verification solution. We provide a complete sample implementation in our GitHub repository.
Solution overview
The following reference architecture shows how you can use Amazon Rekognition, along with other AWS services, to implement identity verification.
The architecture includes the following components:
- Users access the front-end web portal hosted within AWS Amplify. Amplify is an end-to-end solution that enables front-end web developers to build and deploy secure, scalable full stack applications.
- Applications invoke Amazon API Gateway to route requests to the correct AWS Lambda function depending on the user flow. There are four major actions in this solution: authenticate, register, register with ID card, and update.
- API Gateway uses a service integration to run the AWS Step Functions express state machine corresponding to the specific endpoint called from API Gateway. Within each step, Lambda functions are responsible for triggering the correct set of calls to and from Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3), along with the relevant Amazon Rekognition APIs.
- DynamoDB holds face IDs (face-id), S3 path URIs, and unique IDs (for example, an employee ID number) for each face-id. Amazon S3 stores all the face images.
- The final major component of the solution is Amazon Rekognition. Each flow (authenticate, register, register with ID card, and update) calls different Amazon Rekognition APIs depending on the task.
Before we deploy the solution, it’s important to know the following concepts and API descriptions:
- Collections – Amazon Rekognition stores information about detected faces in server-side containers known as collections. You can use the facial information that’s stored in a collection to search for known faces in images, stored videos, and streaming videos. You can use collections in a variety of scenarios. For example, you might create a face collection to store scanned badge images by using the IndexFaces operation. When an employee enters the building, an image of the employee’s face may be captured and sent to the SearchFacesByImage operation. If the face match produces a sufficiently high similarity score (say 99%), you can authenticate the employee.
- DetectFaces API – This API detects faces within an image provided as input and returns information about faces. In a user registration workflow, this operation may help you screen images before moving to the next step. For example, you can check if a photo contains a face, if the person identified is in the right orientation, and if they’re not wearing a face blocker such as sunglasses or a cap.
- IndexFaces API – This API detects faces in the input image and adds them to the specified collection. This operation is used to add a screened image to a collection for future queries.
- SearchFacesByImage API – For a given input image, the API first detects the largest face in the image, and then searches the specified collection for matching faces. The operation compares the features of the input face with face features in the specified collection.
- CompareFaces API – This API compares a face in the source input image with each of the 100 largest faces detected in the target input image. If the source image contains multiple faces, the service detects the largest face and compares it with each face detected in the target image. For our use case, we expect both the source and target image to contain a single face.
- DeleteFaces API – This API deletes faces from a collection. You specify a collection ID and an array of face IDs to remove.
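The following minimal Boto3 sketch shows how a few of these calls fit together (the collection name, bucket, and keys are placeholders; the sample repository's code may differ):

```python
# Minimal sketch of the collection-based calls described above.
import boto3

rekognition = boto3.client("rekognition")
collection_id = "riv-demo-collection"   # hypothetical collection name

rekognition.create_collection(CollectionId=collection_id)

# Register a screened face image in the collection.
rekognition.index_faces(
    CollectionId=collection_id,
    Image={"S3Object": {"Bucket": "my-face-bucket", "Name": "registration/user123.jpg"}},
    ExternalImageId="user123",
    MaxFaces=1,
    QualityFilter="AUTO",
)

# Later, search the collection with a new selfie.
matches = rekognition.search_faces_by_image(
    CollectionId=collection_id,
    Image={"S3Object": {"Bucket": "my-face-bucket", "Name": "login/selfie.jpg"}},
    FaceMatchThreshold=99,
    MaxFaces=1,
)["FaceMatches"]
```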
Workflows
The solution provides a sample of workflows to enable user registration, authentication, and updates to the user profile image. We detail each workflow in this section.
Register a new user using a face selfie
The following figure shows the workflow of a new user registration. Typical steps in this process are:
- A user captures a selfie image.
- A quality check of the selfie image is performed.
Note: A liveness detection check can also be performed after this step. For more details, please read this blog.
- The selfie is checked against a database of existing user faces.
The following image illustrates the Step Functions workflow for new user registration.
Three functions are called in this workflow: detect-faces, search-faces, and index-faces. The detect-faces function calls the Amazon Rekognition DetectFaces
API to determine if a face is detected in an image and is usable. Some of the quality checks include determining that only one face is present in the image, ensuring the face isn’t obscured by sunglasses or a hat, and confirming that the face isn’t rotated by using the pose dimension. If the image passes the quality check, the search-faces function searches for an existing face match in the Amazon Rekognition collections by confirming the FaceMatchThreshold confidence score meets your threshold objective. For more information, refer to Using similarity thresholds to match faces. If the face image doesn’t exist in the collections, the index-faces function is called to index the face in the collections. The face image metadata is stored in the DynamoDB table and the face images are stored in an S3 bucket.
If the new user registration succeeds, the face image attribute information is added in DynamoDB. You can customize the flow according to the business process. It often contains some or all of the steps presented in the preceding diagram. You can choose to run all the steps synchronously (wait for one step to complete before moving on to the next step). Alternately, you can run some of the steps asynchronously (don’t wait for that step to complete) to speed up the user registration process and improve the customer experience. If the steps aren’t successful, you must roll back the user registration.
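A minimal sketch of the kind of quality checks the detect-faces function performs (the thresholds and the specific attributes checked here are assumptions, not the repository's exact logic):

```python
# Minimal sketch: basic face quality checks with the DetectFaces API.
import boto3

rekognition = boto3.client("rekognition")

resp = rekognition.detect_faces(
    Image={"S3Object": {"Bucket": "my-face-bucket", "Name": "registration/user123.jpg"}},
    Attributes=["ALL"],
)
faces = resp["FaceDetails"]

assert len(faces) == 1, "exactly one face expected in the image"
face = faces[0]
assert not face["Sunglasses"]["Value"], "face must not be obscured by sunglasses"
# Illustrative pose check: reject faces rotated too far from the camera.
assert abs(face["Pose"]["Yaw"]) < 45 and abs(face["Pose"]["Pitch"]) < 45, "face is rotated too far"
```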
Register a new user after face match against an ID card with ID card data extraction
In addition to user registration with image, this workflow allows users to register with an identification card like driver’s license. The steps to register a new user with an ID card are similar to the steps for registering a new user.
The following image illustrates the Step Functions workflow for new user registration with ID.
Four functions are called in this workflow: detect-faces, search-faces, index-faces and compare-faces. The sequence of operations in this workflow is similar to the user registration workflow with the addition of compare-faces. After verifying the quality of the selfie image and ensuring the face image is not present in the collection, the compare-faces function is invoked to verify the selfie image matches the face image in the ID card. If the images match, the relevant properties are extracted from the ID card. You can extract key-value pairs from identity documents using the newly launched Amazon Textract AnalyzeID
API (for US regions) or Amazon Rekognition DetectText
API (non-US regions and non-English languages). The extracted properties from the ID card are merged and the user’s face is indexed in the collection via the index-faces function.
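A minimal sketch of the AnalyzeID extraction step (the bucket, key, and printed fields are placeholders; the repository's implementation may differ):

```python
# Minimal sketch: extract key-value pairs from a US driver's license with Amazon Textract AnalyzeID.
import boto3

textract = boto3.client("textract")

resp = textract.analyze_id(
    DocumentPages=[{"S3Object": {"Bucket": "my-face-bucket", "Name": "registration/id-card.jpg"}}]
)

# Map the normalized field types to their detected values.
fields = {
    f["Type"]["Text"]: f["ValueDetection"]["Text"]
    for f in resp["IdentityDocuments"][0]["IdentityDocumentFields"]
}
print(fields.get("FIRST_NAME"), fields.get("LAST_NAME"), fields.get("DATE_OF_BIRTH"))
```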
The face image metadata is stored in the DynamoDB table and the face images are stored in an S3 bucket.
If the images don’t match or a duplicate registration is detected, the user receives a login failure. Login failures can be logged using an Amazon CloudWatch event, and actions can be triggered using Amazon Simple Notification Service (Amazon SNS) to notify security operations for monitoring and tracking failed logins. For more information, refer to Monitoring Amazon SNS topics using CloudWatch.
Authenticate returning user
Another common flow is an existing or returning user login. In this flow, a check of the user face (selfie) is performed against a previously registered face. Typical steps in this process include user face capture (selfie), check of the selfie image quality, and search and compare of the selfie against the faces database. The following diagram shows a possible flow.
The following image illustrates the workflow for authenticating an existing user.
This Step Functions workflow calls three functions: detect-faces, compare-faces, and search-faces. After the detect-faces function verifies that the captured face image is valid, the compare-faces function checks the link in the DynamoDB table for a face image in the S3 bucket that matches an existing user. If a match is found, the user authenticates successfully. If a match isn’t found, the search-faces function is called to search for the face image in the collections. The user is verified and the authentication process completes if their face image exists in the collections. Otherwise, the user’s access is denied.
Prerequisites
Before you get started, complete the following prerequisites:
- Create an AWS account.
- Install the AWS Command Line Interface (AWS CLI) version 2 on your local machine. For instructions, refer to Installing or updating the latest version of the AWS CLI.
- Set up the AWS CLI.
- Install Node.js on your local machine.
- Clone the sample repo on your local machine:
Deploy the solution
Choose the appropriate CloudFormation stack to provision the solution in your AWS account in your preferred Region. This solution deploys API Gateway integrated with Step Functions and Amazon Rekognition APIs to run the identity verification workflows.
Choosing one of the following launch buttons provisions the solution in your AWS account in the corresponding Region.
Run the following steps on your local machine to deploy the front-end application:
Invoke the web UI
The web portal is deployed with Amplify. On the Amplify console, locate the hosted web application environment and the URL. Copy the URL and access it from your browser.
Register a new user using a face selfie
Register yourself as a user with the following steps:
- Open the web URL provided from Amplify.
- Choose Register
- Enable your camera and capture a face image.
- Enter your user name and details.
- Choose Signup to register your account.
Authenticate returning user
After you’re registered, you log in using the face ID as an authentication mechanism.
- Open the web URL provided by Amplify
- Capture your face ID.
- Enter your user ID.
- Choose Login.
You get a “Login successful” message after your face ID is verified with the registration image.
Register a new user after face match against an ID card with ID card data extraction
To test user registration with an ID, complete the following steps:
- Open the web URL provided by Amplify.
- Choose Register with ID
- Enable your camera and capture a face image.
- Drag and drop your ID card
- Choose Register.
The following screenshot shows an example. The application supports ID card images of up to 256 KB.
You receive a “Successfully Registered User” message.
Clean up
To prevent accruing additional charges in your AWS account, delete the resources you provisioned by navigating to the AWS CloudFormation console and deleting the Riv-Prod stack.
Deleting the stack doesn’t delete the S3 bucket you created. This bucket stores all the face images. If you want to delete the S3 bucket, navigate to the Amazon S3 console, empty the bucket, and then confirm you want to permanently delete it.
Conclusion
Amazon Rekognition makes it easy to add image analysis to your identity verification applications using proven, highly scalable, deep learning technology that requires no ML expertise to use. Amazon Rekognition provides face detection and comparison capabilities. With a combination of the DetectFaces, CompareFaces, IndexFaces, SearchFacesByImage, DetectText, and AnalyzeID APIs, you can implement the common flows around new user registration and existing user logins.
Amazon Rekognition collections provide a method to store information about detected faces in server-side containers. You can then use the facial information stored in a collection to search for known faces in images. When using collections, you don’t need to store original photos after you index faces in the collection. Amazon Rekognition collections don’t persist actual images. Instead, the underlying detection algorithm detects the faces in the input image, extracts facial features into a feature vector for each face, and stores it in the collection.
To start your journey towards identity verification, visit Identity Verification using Amazon Rekognition.
About the authors
Vineet Kacchawaha is a Solutions Architect at AWS with expertise in Machine Learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.
Ramesh Thiagarajan is a Senior Solutions Architect based out of San Francisco. He holds a Bachelor of Science in Applied Sciences and a master’s in Cyber Security. He specializes in cloud migration, cloud security, compliance, and risk management. Outside of work, he is a passionate gardener, and has an avid interest in real estate and home improvement projects.
Amit Gupta is an AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.
Tim Murphy is a Senior Solutions Architect for AWS, working with enterprise financial service customers building business cloud centric solutions. He has spent the last decade working with startups, non-profits, commercial enterprise, and government agencies, deploying infrastructure at scale. In his spare time when he isn’t tinkering with technology, you’ll most likely find him in far flung areas of the earth hiking mountains, surfing waves, or biking through a new city.
Nate Bachmeier is an AWS Senior Solutions Architect that nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing applications. Besides this, Nate is a full-time student and has two kids.
Jessie-Lee Fry is a Senior AI/ML Specialist with a focus on Computer Vision at AWS. She helps organizations leverage Machine Learning and AI to combat fraud and drive innovation on behalf of their customers. Outside of work, she enjoys spending time with her family, traveling, and reading about Responsible AI.
Build a news-based real-time alert system with Twitter, Amazon SageMaker, and Hugging Face
Today, social media is a huge source of news. Users rely on platforms like Facebook and Twitter to consume news. For certain industries such as insurance companies, first responders, law enforcement, and government agencies, being able to quickly process news about relevant events can help them take action while those events are still unfolding.
It’s not uncommon for organizations trying to extract value from text data to look for a solution that doesn’t involve the training of a complex NLP (natural language processing) model. For those organizations, using a pre-trained NLP model is more practical. Furthermore, if the chosen model doesn’t satisfy their success metrics, organizations want to be able to easily pick another model and reassess.
At present, it’s easier than ever to extract information from text data thanks to the following:
- The rise of state-of-the-art, general-purpose NLP architectures such as transformers
- The ability that developers and data scientists have to quickly build, train, and deploy machine learning (ML) models at scale on the cloud with services like Amazon SageMaker
- The availability of thousands of pre-trained NLP models in hundreds of languages and with support for multiple frameworks provided by the community in platforms like Hugging Face Hub
In this post, we show you how to build a real-time alert system that consumes news from Twitter and classifies the tweets using a pre-trained model from the Hugging Face Hub. You can use this solution for zero-shot classification, meaning you can classify tweets at virtually any set of categories, and deploy the model with SageMaker for real-time inference.
Alternatively, if you’re looking for insights into your customers’ conversations and want to deepen brand awareness by analyzing social media interactions, we encourage you to check out the AI-Driven Social Media Dashboard. The solution uses Amazon Comprehend, a fully managed NLP service that uncovers valuable insights and connections in text without requiring machine learning experience.
Zero-shot learning
The fields of NLP and natural language understanding (NLU) have rapidly evolved to address use cases involving text classification, question answering, summarization, text generation, and more. This evolution has been possible, in part, thanks to the rise of state-of-the-art, general-purpose architectures such as transformers, but also because more and better-quality text corpora are available for training such models.
The transformer architecture is a complex neural network that requires domain expertise and a huge amount of data in order to be trained from scratch. A common practice is to take a pre-trained state-of-the-art transformer like BERT, RoBERTa, T5, GPT-2, or DistilBERT and fine-tune (transfer learning) the model to a specific use case.
Nevertheless, even performing transfer learning on a pre-trained NLP model can often be a challenging task, requiring large amounts of labeled text data and a team of experts to curate the data. This complexity prevents most organizations from using these models effectively, but zero-shot learning helps ML practitioners and organizations overcome this shortcoming.
Zero-shot learning is a specific ML task in which a classifier learns on one set of labels during training, and then during inference is evaluated on a different set of labels that the classifier has never seen before. In NLP, you can use a zero-shot sequence classifier trained on a natural language inference (NLI) task to classify text without any fine-tuning. In this post, we use the popular NLI BART model bart-large-mnli to classify tweets. This is a large pre-trained model (1.6 GB), available on the Hugging Face model hub.
Hugging Face is an AI company that manages an open-source platform (Hugging Face Hub) with thousands of pre-trained NLP models (transformers) in more than 100 different languages and with support for different frameworks such as TensorFlow and PyTorch. The transformers library helps developers and data scientists get started in complex NLP and NLU tasks such as classification, information extraction, question answering, summarization, translation, and text generation.
AWS and Hugging Face have been collaborating to simplify and accelerate the adoption of NLP models. A set of Deep Learning Containers (DLCs) for training and inference in PyTorch or TensorFlow, and Hugging Face estimators and predictors for the SageMaker Python SDK are now available. These capabilities help developers with all levels of expertise get started with NLP easily.
Overview of solution
We provide a working solution that fetches tweets in real time from selected Twitter accounts. For the demonstration of our solution, we use three accounts, Amazon Web Services (@awscloud), AWS Security (@AWSSecurityInfo), and Amazon Science (@AmazonScience), and classify their content into one of the following categories: security, database, compute, storage, and machine learning. If the model returns a category with a confidence score greater than 40%, a notification is sent.
In the following example, the model classified a tweet from Amazon Web Services in the machine learning category, with a confidence score of 97%, generating an alert.
The solution relies on a Hugging Face pre-trained transformer model (from the Hugging Face Hub) to classify tweets based on a set of labels that are provided at inference time—the model doesn’t need to be trained. The following screenshots show more examples and how they were classified.
We encourage you to try the solution for yourself. Simply download the source code from the GitHub repository and follow the deployment instructions in the README file.
Solution architecture
The solution keeps an open connection to Twitter’s endpoint and, when a new tweet arrives, sends a message to a queue. A consumer reads messages from the queue, calls the classification endpoint, and, depending on the results, notifies the end user.
The following is the architecture diagram of the solution.
The solution workflow consists of the following components:
- The solution relies on Twitter’s Stream API to get tweets that match the configured rules (tweets from the accounts of interest) in real time. To do so, an application running inside a container keeps an open connection to Twitter’s endpoint. Refer to Twitter API for more details.
- The container runs on Amazon Elastic Container Service (Amazon ECS), a fully managed container orchestration service that makes it easy for you to deploy, manage, and scale containerized applications. A single task runs on a serverless infrastructure managed by AWS Fargate.
- The Twitter Bearer token is securely stored in AWS Systems Manager Parameter Store, a capability of AWS Systems Manager that provides secure, hierarchical storage for configuration data and secrets. The container image is hosted on Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry offering high-performance hosting.
- Whenever a new tweet arrives, the container application puts the tweet into an Amazon Simple Queue Service (Amazon SQS) queue. Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
- The logic of the solution resides in an AWS Lambda function. Lambda is a serverless, event-driven compute service. The function consumes new tweets from the queue and classifies them by calling an endpoint; a minimal sketch of such a function is shown after this list.
- The endpoint relies on a Hugging Face model and is hosted on SageMaker. The endpoint runs the inference and outputs the class of the tweet.
- Depending on the classification, the function generates a notification through Amazon Simple Notification Service (Amazon SNS), a fully managed messaging service. You can subscribe to the SNS topic, and multiple destinations can receive that notification (see Amazon SNS event destinations). For instance, you can deliver the notification to inboxes as email messages (see Email notifications).
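The following is a minimal sketch of the Lambda function described above, provided for illustration only; the environment variable names, the SQS message format, and the endpoint name are assumptions rather than code from the repository.

```python
import json
import os

import boto3

# Assumed environment variables and constants (names are illustrative, not from the repository)
ENDPOINT_NAME = os.environ["ENDPOINT_NAME"]
TOPIC_ARN = os.environ["TOPIC_ARN"]
CONFIDENCE_THRESHOLD = 0.4
CANDIDATE_LABELS = ["security", "database", "compute", "storage", "machine learning"]

sagemaker_runtime = boto3.client("sagemaker-runtime")
sns = boto3.client("sns")


def handler(event, context):
    """Consume tweets from the SQS queue, classify them, and notify through SNS."""
    for record in event["Records"]:
        # Assumes the producer writes each tweet as JSON with a "text" field
        tweet_text = json.loads(record["body"])["text"]

        payload = {
            "inputs": tweet_text,
            "parameters": {"candidate_labels": CANDIDATE_LABELS, "multi_class": False},
        }
        response = sagemaker_runtime.invoke_endpoint(
            EndpointName=ENDPOINT_NAME,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        result = json.loads(response["Body"].read())
        top_label, top_score = result["labels"][0], result["scores"][0]

        # Notify only when the model is sufficiently confident
        if top_score > CONFIDENCE_THRESHOLD:
            sns.publish(
                TopicArn=TOPIC_ARN,
                Subject=f"New tweet classified as {top_label}",
                Message=f"{tweet_text}\n\nCategory: {top_label} ({top_score:.0%})",
            )
```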
Deploy Hugging Face models with SageMaker
You can select any of the over 10,000 publicly available models from the Hugging Face Model Hub and deploy them with SageMaker by using Hugging Face Inference DLCs.
When using AWS CloudFormation, you select one of the publicly available Hugging Face Inference Containers and configure the model and the task. This solution uses the facebook/bart-large-mnli model and the zero-shot-classification task, but you can choose any of the models under Zero-Shot Classification on the Hugging Face Model Hub. You configure those by setting the HF_MODEL_ID and HF_TASK environment variables in your CloudFormation template, as in the following code:
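The snippet below is a minimal sketch of that configuration; the resource names, the execution role, and the container image parameter are placeholders rather than values taken from the repository’s template.

```yaml
Resources:
  ZeroShotClassificationModel:
    Type: AWS::SageMaker::Model
    Properties:
      ExecutionRoleArn: !GetAtt SageMakerExecutionRole.Arn  # placeholder IAM role resource
      PrimaryContainer:
        # One of the publicly available Hugging Face Inference Container image URIs
        Image: !Ref HuggingFaceInferenceImageUri
        Environment:
          HF_MODEL_ID: facebook/bart-large-mnli
          HF_TASK: zero-shot-classification
```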
Alternatively, if you’re not using AWS CloudFormation, you can achieve the same results with a few lines of code. Refer to Deploy models to Amazon SageMaker for more details.
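For reference, a deployment along those lines with the SageMaker Python SDK might look like the following sketch; the framework versions and the instance type are illustrative choices, not values from the repository.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()

# The same two environment variables configure the model and the task
hub = {
    "HF_MODEL_ID": "facebook/bart-large-mnli",
    "HF_TASK": "zero-shot-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.17",  # illustrative versions; use a supported combination
    pytorch_version="1.10",
    py_version="py38",
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```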
To classify the content, you just call the SageMaker endpoint. The following is a Python code snippet:
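The sketch below assumes the predictor from the deployment example above, and the input text is made up; in the Lambda function, the equivalent request goes through the SageMaker runtime API instead.

```python
payload = {
    "inputs": "AWS announces new Amazon SageMaker capabilities for training ML models.",
    "parameters": {
        "candidate_labels": ["security", "database", "compute", "storage", "machine learning"],
        "multi_class": False,
    },
}

# The predictor serializes the request as JSON and returns the labels sorted by score
result = predictor.predict(payload)
print(result["labels"][0], result["scores"][0])
```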
Note the False value for the multi_class parameter; it indicates that the probabilities across all the classes add up to 1.
Solution improvements
You can enhance the solution proposed here by storing the tweets and the model results. Amazon Simple Storage Service (Amazon S3), an object storage service, is one option. You can write tweets, results, and other metadata as JSON objects into an S3 bucket. You can then perform ad hoc queries against that content using Amazon Athena, an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
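As a sketch of this improvement (the bucket name, the key layout, and the record fields are assumptions), the Lambda function could persist each classified tweet as a JSON object that Athena can later query:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "tweet-classification-results"  # hypothetical bucket name


def store_result(tweet_id, text, label, score):
    """Write one JSON object per classified tweet; a date prefix keeps Athena scans small."""
    now = datetime.now(timezone.utc)
    record = {
        "tweet_id": tweet_id,
        "text": text,
        "label": label,
        "score": score,
        "classified_at": now.isoformat(),
    }
    key = f"tweets/dt={now:%Y-%m-%d}/{tweet_id}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(record))
```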
You can use the history not only to extract insights but also to train a custom model. You can use the Hugging Face support in SageMaker to train a model with your own data. Learn more in Run training on Amazon SageMaker.
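As an illustration of that path (the entry point script, the hyperparameters, and the S3 locations are placeholders), launching a training job with the SageMaker Hugging Face estimator looks roughly like this:

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

# A minimal training-job sketch; train.py and the S3 paths are placeholders
huggingface_estimator = HuggingFace(
    entry_point="train.py",
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",
    instance_count=1,
    role=role,
    transformers_version="4.17",
    pytorch_version="1.10",
    py_version="py38",
    hyperparameters={"epochs": 3, "model_name": "distilbert-base-uncased"},
)

huggingface_estimator.fit({
    "train": "s3://my-bucket/train",
    "test": "s3://my-bucket/test",
})
```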
Real-world use cases
Customers are already experimenting with Hugging Face models on SageMaker. Seguros Bolívar, a Colombian financial and insurance company founded in 1939, is an example.
“We developed a threat notification solution for customers and insurance brokers. We use Hugging Face pre-trained NLP models to classify tweets from relevant accounts to generate notifications for our customers in near-real time as a prevention strategy to help mitigate claims. A claim occurs because customers are not aware of the level of risk they are exposed to. The solution allows us to generate awareness in our customers, turning risk into something measurable in concrete situations.”
– Julian Rico, Chief of Research and Knowledge at Seguros Bolívar.
Seguros Bolívar worked with AWS to re-architect their solution; it now relies on SageMaker and resembles the one described in this post.
Conclusion
Zero-shot classification is ideal when you have little data to train a custom text classifier or when you can’t afford to train a custom NLP model. For specialized use cases, when text is based on specific words or terms, it’s better to go with a supervised classification model based on a custom training set.
In this post, we showed you how to build a news classifier using a Hugging Face zero-shot model on AWS. We used Twitter as our news source, but you can choose a news source that is more suitable to your specific needs. Furthermore, you can easily change the model; simply specify your chosen model in the CloudFormation template.
For the source code, refer to the GitHub repository. It includes the full setup instructions. You can clone, change, deploy, and run it yourself. You can also use it as a starting point and customize the categories and the alert logic, or build another solution for a similar use case.
Please give it a try, and let us know what you think. As always, we’re looking forward to your feedback. You can send it to your usual AWS Support contacts or post it in the AWS Forum for SageMaker.
About the authors
David Laredo is a Prototyping Architect at AWS Envision Engineering in LATAM, where he has helped develop multiple machine learning prototypes. Previously, he worked as a machine learning engineer and has been doing machine learning for over 5 years. His areas of interest are NLP, time series, and end-to-end ML.
Rafael Werneck is a Senior Prototyping Architect at AWS Envision Engineering, based in Brazil. Previously, he worked as a Software Development Engineer on Amazon.com.br and Amazon RDS Performance Insights.
Vikram Elango is an AI/ML Specialist Solutions Architect at Amazon Web Services, based in Virginia, USA. Vikram helps financial and insurance industry customers with design and thought leadership to build and deploy machine learning applications at scale. He is currently focused on natural language processing, responsible AI, inference optimization, and scaling ML across the enterprise. In his spare time, he enjoys traveling, hiking, cooking, and camping with his family.
Filtering out “forbidden” documents during information retrieval
New method optimizes the twin demands of retrieving relevant content and filtering out bad content.Read More
Amazon scientists Mike Hicks and René Vidal honored
Hicks wins 2022 ACM SIGPLAN Distinguished Service Award for career contributions; Vidal wins IEEE Signal Processing Magazine Best Paper Award.Read More