Simplified Transfer Learning for Chest Radiography Model Development

Every year, nearly a billion chest X-ray (CXR) images are taken globally to aid in the detection and management of health conditions ranging from collapsed lungs to infectious diseases. Generally, CXRs are cheaper and more accessible than other forms of medical imaging. However, existing challenges continue to impede the optimal use of CXRs. For example, in some areas, trained radiologists who can accurately interpret CXR images are in short supply. In addition, interpretation variability between experts, workflow differences between institutions, and the presence of rare conditions familiar only to subspecialists all contribute to making high-quality CXR interpretation a challenge.

Recent research has leveraged machine learning (ML) to explore potential solutions for some of these challenges. There is significant interest and effort devoted to building deep learning models that detect abnormalities in CXRs and improve access to, and the accuracy and efficiency of, identifying diseases and conditions that affect the heart and lungs. However, building robust CXR models requires large labeled training datasets, which can be prohibitively expensive and time-consuming to create. In some cases, such as working with underrepresented populations or studying rare medical conditions, only limited data are available. Additionally, CXR images vary in quality across populations, geographies, and institutions, making it difficult to build robust models that perform well globally.

In “Simplified Transfer Learning for Chest Radiography Models Using Less Data”, published in the journal Radiology, we describe how Google Health utilizes advanced ML methods to generate pre-trained “CXR networks” that can convert CXR images to embeddings (i.e., information-rich numerical vectors) to enable the development of CXR models using less data and fewer computational resources. We demonstrate that even with less data and compute, this approach has enabled performance comparable to state-of-the-art deep learning models across various prediction tasks. We are also excited to announce the release of CXR Foundation, a tool that utilizes our CXR-specific network to enable developers to create custom embeddings for their CXR images. We believe this work will help accelerate the development of CXR models, aiding in disease detection and contributing to more equitable health access throughout the world.

Developing a Chest X-ray Network
A common approach to building medical ML models is to pre-train a model on a generic task using non-medical datasets and then refine the model on a target medical task. This process of transfer learning may improve the target task performance or at least speed up convergence by applying the understanding of natural images to medical images. However, transfer learning may still require large labeled medical datasets for the refinement step.

Expanding on this standard approach, our system supports modeling CXR-specific tasks through a three-step model training setup composed of (1) generic image pre-training similar to traditional transfer learning, (2) CXR-specific pre-training, and (3) task-specific training. The first and third steps are common in ML: first pre-training on a large dataset and labels that are not specific to the desired task, and then fine-tuning on the task of interest.

We built a CXR-specific image classifier that employs supervised contrastive learning (SupCon). SupCon pulls together representations of images that have the same label (e.g., abnormal) and pushes apart representations of images that have a different label (e.g., one normal image and one abnormal image). We pre-trained this model on de-identified CXR datasets of over 800,000 images generated in partnership with Northwestern Medicine and Apollo Hospitals in the US and India, respectively. We then leveraged noisy abnormality labels from natural language processing of radiology reports to build our “CXR-specific” network.
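
To make this concrete, the following is a minimal, illustrative PyTorch sketch of a supervised contrastive objective in the spirit of SupCon. It is not the code used in this work, and the batch of embeddings and binary abnormality labels at the end are hypothetical:

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull together embeddings that share a label; push apart the rest."""
    z = F.normalize(embeddings, dim=1)                      # (N, D) unit vectors
    logits = z @ z.T / temperature                          # pairwise similarities
    n = z.size(0)
    not_self = ~torch.eye(n, dtype=torch.bool, device=z.device)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    # Log-probability of each pair, normalized over all other examples in the batch.
    logits = logits.masked_fill(~not_self, float("-inf"))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-probability over each anchor's positives; anchors without positives are skipped.
    pos_counts = positives.sum(dim=1)
    has_pos = pos_counts > 0
    mean_pos_log_prob = log_prob.masked_fill(~positives, 0.0).sum(dim=1)[has_pos] / pos_counts[has_pos]
    return -mean_pos_log_prob.mean()

# Hypothetical batch: 128 image embeddings with noisy normal/abnormal labels.
embeddings = torch.randn(128, 256)
labels = torch.randint(0, 2, (128,))
loss = supervised_contrastive_loss(embeddings, labels)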

This network creates embeddings (i.e., information-rich numerical vectors that can be used to distinguish classes from each other) that can more easily train models for specific medical prediction tasks, such as image finding (e.g., airspace opacity), clinical condition (e.g., tuberculosis), or patient outcome (e.g., hospitalization). For example, the CXR network can generate embeddings for every image in a given CXR dataset. For these images, the generated embeddings and the labels for the desired target task (such as tuberculosis) are used as examples to train a small ML model.
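
To illustrate how lightweight this downstream step can be, the following sketch fits a logistic-regression classifier on precomputed embeddings. The file names, array shapes, and choice of scikit-learn are illustrative assumptions rather than the exact setup used in the paper:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical precomputed CXR embeddings and binary labels for the target task (e.g., tuberculosis).
train_embeddings = np.load("train_embeddings.npy")   # shape: (n_train, embedding_dim)
train_labels = np.load("train_labels.npy")           # shape: (n_train,)
test_embeddings = np.load("test_embeddings.npy")
test_labels = np.load("test_labels.npy")

# With informative embeddings, a small linear model is often sufficient.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_embeddings, train_labels)

auc = roc_auc_score(test_labels, classifier.predict_proba(test_embeddings)[:, 1])
print(f"Test AUC: {auc:.3f}")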

Left: Training a CXR model for a given task generally requires a large number of labeled images and a significant amount of computational resources to create a foundation of neural network layers. Right: With the CXR network and tool providing this foundation, each new task requires only a fraction of the labeled images, computational resources, and neural network parameters compared to rebuilding the entire network from scratch.

Effects of CXR Pre-training
We visualized these embedding layers at each step of the process using airspace opacity as an example (see the figure below). Before SupCon-based pre-training, there was poor separation of normal and abnormal CXR embeddings. After SupCon-based pre-training, the positive examples were grouped more closely together, as were the negative examples, indicating that the model had learned that images within each category resemble one another.

Visualizations of the t-distributed stochastic neighbor embedding for generic vs. CXR-specific network embeddings. Embeddings are information-rich numerical vectors that alone can distinguish classes from each other, in this case, airspace opacity positive vs. negative.

Our research suggests that adding the second stage of pre-training enables high-quality models to be trained with up to 600-fold less data in comparison to traditional transfer learning approaches that leverage pre-trained models on generic, non-medical datasets. We found this to be true regardless of model architecture (e.g., ResNet or EfficientNet) or dataset used for natural image pre-training (e.g., ImageNet or JFT-300M). With this approach, researchers and developers can significantly reduce dataset size requirements.

Top: In a deep learning model, the neural network contains multiple layers of artificial neurons, with the first layer taking the CXR image as input, intermediate layers doing additional computation, and the final layer making the classification (e.g., airspace opacity: present vs. absent). The embedding layer is usually one of the last layers. Bottom left: The traditional transfer learning approach involves a two-step training setup where a generic pre-trained network is optimized directly on a prediction task of interest. Our proposed three-step training setup generates a CXR network using a SupCon ML technique (step 2) before optimization for prediction tasks of interest (step 3). Bottom right: Using the embeddings involves either training smaller models (the first two strategies) or fine-tuning the whole network if there are sufficient data (strategy 3).

Results
After training the initial model, we measured performance using the area under the curve (AUC) metric with both linear and non-linear models applied to CXR embeddings, as well as with a non-linear model produced by fine-tuning the entire network. On public datasets, such as ChestX-ray14 and CheXpert, our work substantially and consistently improved the data-accuracy tradeoff for models developed across a range of training dataset sizes and several findings. When evaluating the tool’s ability to develop tuberculosis models, data efficiency gains were even more striking: models trained on the embeddings of just 45 images achieved non-inferiority to radiologists in detecting tuberculosis on an external validation dataset. For both tuberculosis and severe COVID-19 outcomes, we show that non-linear classifiers trained on frozen embeddings outperformed a model that was fine-tuned on the entire dataset.

Comparing CXR-specific networks for transfer learning (red) with a baseline transfer learning approach (blue) across a variety of CXR abnormalities (top left), tuberculosis (bottom left), and COVID-19 outcomes (bottom right). This approach improves performance at the same dataset size, or reduces the dataset size required to reach the same performance. Interestingly, using the CXR network with simpler ML models that are faster to train (red) performs better than training the full network (black) at dataset sizes up to 85 images.

Conclusion and Future Work
To accelerate CXR modeling efforts with low data and computational requirements, we are releasing our CXR Foundation tool, along with scripts to train linear and nonlinear classifiers. Via these embeddings, this tool will allow researchers to jump-start CXR modeling efforts using simpler transfer learning methods. This approach can be particularly useful for predictive modeling using small datasets, and for adapting CXR models when there are distribution shifts in patient populations (whether over time or across different institutions). We are excited to continue working with partners, such as Northwestern Medicine and Apollo Hospitals, to explore the impact of this technology further. By enabling researchers with limited data and compute to develop CXR models, we’re hoping more developers can solve the most impactful problems for their populations.

Acknowledgements
Key contributors to this project at Google include Christina Chen, Yun Liu, Dilip Krishnan, Zaid Nabulsi, Atilla Kiraly, Arnav Agharwal, Eric Wu, Yuanzhen Li, Aaron Maschinot, Aaron Sarna, Jenny Huang, Marilyn Zhang, Charles Lau, Neeral Beladia, Daniel Tse, Krish Eswaran, and Shravya Shetty. Significant contributions and input were also made by collaborators Sreenivasa Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia-Vicente, and David Melnick. For the ChestX-ray14 dataset, we thank the NIH Clinical Center for making it publicly available. The authors would also like to acknowledge many members of the Google Health Radiology and labeling software teams. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study; Jonny Wong for coordinating the imaging annotation work; Craig Mermel and Akinori Mitani for providing feedback on the manuscript; Nicole Linton and Lauren Winer for feedback on the blogpost; and Tom Small for the animation.

Read More

Localize content into multiple languages using AWS machine learning services

Over the last few years, online education platforms have seen an increase in adoption of, and an uptick in demand for, video-based learning because it offers an effective medium to engage learners. To expand to international markets and address a culturally and linguistically diverse population, businesses are also looking at diversifying their learning offerings by localizing content into multiple languages. These businesses are looking for reliable and cost-effective ways to solve their localization use cases.

Localizing content mainly includes translating original voices into new languages and adding visual aids such as subtitles. Traditionally, this process is cost-prohibitive, manual, and takes a lot of time, including working with localization specialists. With the power of AWS machine learning (ML) services such as Amazon Transcribe, Amazon Translate, and Amazon Polly, you can create a viable and cost-effective localization solution. You can use Amazon Transcribe to create a transcript of your existing audio and video streams, and then translate this transcript into multiple languages using Amazon Translate. You can then use Amazon Polly, a text-to-speech service, to convert the translated text into natural-sounding human speech.

The next step of localization is to add subtitles to the content, which can improve accessibility and comprehension, and help viewers understand the videos better. Subtitle creation on video content can be challenging because the translated speech doesn’t match the original speech timing. Synchronizing the audio and subtitles is critical because the audience might disconnect from your content if the two are out of sync. Amazon Polly offers a solution to this challenge by enabling speech marks, which you can use to create a subtitle file that can be synced with the generated speech output.

In this post, we review a localization solution using AWS ML services where we use an original English video and convert it into Spanish. We also focus on using speech marks to create a synced subtitle file in Spanish.

Solution overview

The following diagram illustrates the solution architecture.

The solution takes a video file and the target language settings as input and uses Amazon Transcribe to create a transcription of the video. We then use Amazon Translate to translate the transcript to the target language. The translated text is provided as an input to Amazon Polly to generate the audio stream and speech marks in the target language. Amazon Polly returns speech mark output in a line-delimited JSON stream, which contains fields such as time, type, start, end, and value. The value can vary depending on the type of speech mark requested in the input, such as SSML, viseme, word, or sentence. For the purpose of our example, we requested the speech mark type word. With this option, Amazon Polly breaks a sentence into its individual words and reports each word’s start and end times in the audio stream. With this metadata, the speech marks are then processed to generate the subtitles for the corresponding audio stream generated by Amazon Polly.

Finally, we use AWS Elemental MediaConvert to render the final video with the translated audio and corresponding subtitles.

The following video demonstrates the final outcome of the solution:

AWS Step Functions workflow

We use AWS Step Functions to orchestrate this process. The following figure shows a high-level view of the Step Functions workflow (some steps are omitted from the diagram for better clarity).

The workflow steps are as follows:

  1. A user uploads the source video file to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. The S3 event notification triggers the AWS Lambda function state_machine.py (not shown in the diagram), which invokes the Step Functions state machine.
  3. The first step, Transcribe audio, invokes the Lambda function transcribe.py, which uses Amazon Transcribe to generate a transcript of the audio from the source video.

The following sample code demonstrates how to create a transcription job using the Amazon Transcribe Boto3 Python SDK:

import boto3

# Amazon Transcribe client (assumed to be created once per Lambda invocation)
transcribe = boto3.client('transcribe')

response = transcribe.start_transcription_job(
    TranscriptionJobName=jobName,
    MediaFormat=media_format,
    Media={"MediaFileUri": "s3://" + bucket + "/" + mediaconvert_video_filename},
    OutputBucketName=bucket,
    OutputKey=outputKey,
    IdentifyLanguage=True
)

After the job is complete, the output files are saved into the S3 bucket and the process continues to the next step of translating the content.

  4. The Translate transcription step invokes the Lambda function translate.py, which uses Amazon Translate to translate the transcript to the target language. Here, we use synchronous (real-time) translation via the translate_text function:
    # Amazon Translate client (assumed to be created once per Lambda invocation)
    translate = boto3.client('translate')

    # Real-time translation
    response = translate.translate_text(
        Text=transcribe_text,
        SourceLanguageCode=source_language_code,
        TargetLanguageCode=target_language_code,
    )
    

    Synchronous translation has limits on the document size it can translate; as of this writing, it’s set to 5,000 bytes. For larger document sizes, consider using an asynchronous route of creating the job using start_text_translation_job and checking the status via describe_text_translation_job.
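
    A minimal sketch of that asynchronous route might look like the following; the job name, S3 URIs, and data access role ARN are placeholders rather than values created by this solution:

    # Batch translation job for documents larger than the synchronous limit
    response = translate.start_text_translation_job(
        JobName="localization-batch-translation",
        InputDataConfig={"S3Uri": "s3://my-bucket/translate-input/", "ContentType": "text/plain"},
        OutputDataConfig={"S3Uri": "s3://my-bucket/translate-output/"},
        DataAccessRoleArn="arn:aws:iam::111122223333:role/TranslateDataAccessRole",
        SourceLanguageCode=source_language_code,
        TargetLanguageCodes=[target_language_code],
    )

    # Check the job status until it reaches COMPLETED
    job = translate.describe_text_translation_job(JobId=response["JobId"])
    job_status = job["TextTranslationJobProperties"]["JobStatus"]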

  5. The next step is a Step Functions Parallel state, where we create parallel branches in our state machine.
    1. In the first branch, we invoke the Lambda function generate_polly_audio.py to generate our Amazon Polly audio stream:
      # Set up the Amazon Polly client
      client = boto3.client('polly')
      
      # Use the translated text to create the synthesized speech
      response = client.start_speech_synthesis_task(
                   Engine="standard", LanguageCode="es", OutputFormat="mp3",
                   SampleRate="22050", Text=polly_text, VoiceId="Miguel",
                   TextType="text",
                   OutputS3BucketName="S3-bucket-name",
                   OutputS3KeyPrefix="-polly-recording")
      audio_task_id = response['SynthesisTask']['TaskId']

      Here we use the start_speech_synthesis_task method of the Amazon Polly Python SDK to trigger the speech synthesis task that creates the Amazon Polly audio. We set the OutputFormat to mp3, which tells Amazon Polly to generate an audio stream for this API call.

    2. In the second branch, we invoke the Lambda function generate_speech_marks.py to generate speech marks output:
      ....
      # Use the translated text to create the speech marks
      response = client.start_speech_synthesis_task(
                   Engine="standard", LanguageCode="es", OutputFormat="json",
                   SampleRate="22050", Text=polly_text, VoiceId="Miguel",
                   TextType="text", SpeechMarkTypes=['word'],
                   OutputS3BucketName="S3-bucket-name", 
                   OutputS3KeyPrefix="-polly-speech-marks")
      speechmarks_task_id = response['SynthesisTask']['TaskId']

      We again use the start_speech_synthesis_task method, but set OutputFormat to json, which tells Amazon Polly to generate speech marks for this API call.
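
      Both branches create asynchronous Amazon Polly synthesis tasks, so the workflow waits for them to finish before continuing. The actual solution may implement this wait as a Step Functions loop; the underlying status check, shown here as a minimal sketch, uses the task IDs captured above:

      # Poll an Amazon Polly synthesis task for completion
      task = client.get_speech_synthesis_task(TaskId=speechmarks_task_id)
      task_status = task["SynthesisTask"]["TaskStatus"]  # scheduled | inProgress | completed | failed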

In the next step of the second branch, we invoke the Lambda function generate_subtitles.py, which implements the logic to generate a subtitle file from the speech marks output.

It uses the Python module in the file webvtt_utils.py. This module has multiple utility functions to create the subtitle file; one such method, get_phrases_from_speechmarks, is responsible for parsing the speech marks file. The speech marks JSON structure provides only the start time of each individual word. To create the subtitle timing required for the SRT file, we first build phrases of about n (where n=10) words from the list of words in the speech marks file. We then write them in the SRT file format, taking the start time from the first word in the phrase and, for the end time, using the start time of the (n+1)th word and subtracting 1 from it to create the sequenced entry. The following function creates the phrases in preparation for writing them to the SRT file:

def get_phrases_from_speechmarks(words, transcript):
    # ..... (initialization of phrases, phrase, n_phrase, and the word counter x
    # is elided in the original listing)

    for c, word in enumerate(words):
        # if it is a new phrase, then get the start_time of the first word
        if n_phrase:
            phrase["start_time"] = get_time_code(word["start_time"])
            n_phrase = False

        else:
            if c == len(words) - 1:
                phrase["end_time"] = get_time_code(word["start_time"])
            else:
                phrase["end_time"] = get_time_code(words[c + 1]["start_time"] - 1)

        # in either case, append the word to the phrase...
        phrase["words"].append(word)
        x += 1

        # once the phrase reaches 10 words (or the input is exhausted),
        # add it to phrases and start a new phrase
        if x == 10 or c == (len(words) - 1):
            if c == (len(words) - 1):
                if phrase["end_time"] == '':
                    start_time = words[c]["start_time"]
                    end_time = int(start_time) + 500
                    phrase["end_time"] = get_time_code(end_time)

            phrases.append(phrase)
            phrase = new_phrase()
            n_phrase = True
            x = 0

        # .....

    return phrases
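
Once the phrases are assembled, writing them out as SRT entries is mostly string formatting. The following is a minimal sketch, assuming each phrase dictionary holds the start_time and end_time time codes plus the list of word records built above; this illustrative helper is not part of webvtt_utils.py:

def write_srt(phrases, srt_path):
    # Each SRT entry: a sequence number, "start --> end" time codes, then the phrase text.
    with open(srt_path, "w", encoding="utf-8") as srt_file:
        for index, phrase in enumerate(phrases, start=1):
            text = " ".join(word["value"] for word in phrase["words"])
            srt_file.write(f"{index}\n")
            srt_file.write(f"{phrase['start_time']} --> {phrase['end_time']}\n")
            srt_file.write(f"{text}\n\n")
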
  6. The final step, Media Convert, invokes the Lambda function create_mediaconvert_job.py to combine the audio stream from Amazon Polly and the subtitle file with the source video file to generate the final output file, which is then stored in an S3 bucket. This step uses MediaConvert, a file-based video transcoding service with broadcast-grade features. It allows you to easily create video-on-demand content and combines advanced video and audio capabilities with a simple web interface. Here again we use the Python Boto3 SDK to create a MediaConvert job:
    ……
    job_metadata = {'asset_id': asset_id, 'application': "createMediaConvertJob"}
    mc_client = boto3.client('mediaconvert', region_name=region)
    endpoints = mc_client.describe_endpoints()
    mc_endpoint_url = endpoints['Endpoints'][0]['Url']
    
    mc = boto3.client('mediaconvert', region_name=region, endpoint_url=mc_endpoint_url, verify=True)
    
    mc.create_job(Role=mediaconvert_role_arn, UserMetadata=job_metadata, Settings=mc_data["Settings"])

Prerequisites

Before getting started, you must have the following prerequisites:

Deploy the solution

To deploy the solution using the AWS CDK, complete the following steps:

  1. Clone the repository:
    git clone https://github.com/aws-samples/localize-content-using-aws-ml-services.git 

  2. To make sure the AWS CDK is bootstrapped, run the command cdk bootstrap from the root of the repository:
    $ cdk bootstrap
    ⏳ Bootstrapping environment aws://<acct#>/<region>...
    Trusted accounts for deployment: (none)
    Trusted accounts for lookup: (none)
    Using default execution policy of 'arn:aws:iam::aws:policy/AdministratorAccess'. Pass '--cloudformation-execution-policies' to customize.
    ✅ Environment aws://<acct#>/<region> bootstrapped (no changes).

  3. Change the working directory to the root of the repository and run the following command:
    cdk deploy

By default, the target audio settings are set to US Spanish (es-US). If you plan to test it with a different target language, use the following command:

cdk deploy --parameters pollyLanguageCode=<pollyLanguageCode> 
           --parameters pollyVoiceId=<pollyVoiceId>
           --parameters pollyEngine=<pollyEngine> 
           --parameters mediaConvertLangShort=<mediaConvertLangShort>
           --parameters mediaConvertLangLong=<mediaConvertLangLong>
           --parameters targetLanguageCode=<targetLanguageCode>

The process takes a few minutes to complete, after which it displays a link that you can use to view the target video file with the translated audio and translated subtitles.

Test the solution

To test this solution, we used a small portion of the following AWS re:Invent 2017 video from YouTube, where Amazon Transcribe was first introduced. You can also test the solution with your own video. The original language of our test video is English. When you deploy this solution, you can specify the target audio settings or use the default target audio settings, which use Spanish for generating audio and subtitles. The solution creates an S3 bucket to which you can upload the video file.

  1. On the Amazon S3 console, navigate to the bucket PollyBlogBucket.
  2. Choose the bucket, navigate to the /inputVideo directory, and upload the video file (the solution is tested with videos of type mp4). At this point, an S3 event notification triggers the Lambda function, which starts the state machine.
  3. On the Step Functions console, browse to the state machine (ProcessAudioWithSubtitles).
  4. Choose one of the runs of the state machine to locate the Graph Inspector.

This shows the run results for each state. The Step Functions workflow takes a few minutes to complete, after which you can verify that all the steps completed successfully.

Review the output

To review the output, open the Amazon S3 console and check if the audio file (.mp3) and the speech mark file (.marks) are stored in the S3 bucket under <ROOT_S3_BUCKET>/<UID>/synthesisOutput/.

The following is a sample of the speech mark file generated from the translated text:

{"time":6,"type":"word","start":2,"end":6,"value":"Qué"}
{"time":109,"type":"word","start":7,"end":10,"value":"tal"}
{"time":347,"type":"word","start":11,"end":13,"value":"el"}
{"time":453,"type":"word","start":14,"end":20,"value":"idioma"}
{"time":1351,"type":"word","start":22,"end":24,"value":"Ya"}
{"time":1517,"type":"word","start":25,"end":30,"value":"sabes"}
{"time":2240,"type":"word","start":32,"end":38,"value":"hablé"}
{"time":2495,"type":"word","start":39,"end":44,"value":"antes"}
{"time":2832,"type":"word","start":45,"end":50,"value":"sobre"}
{"time":3125,"type":"word","start":51,"end":53,"value":"el"}
{"time":3227,"type":"word","start":54,"end":59,"value":"hecho"}
{"time":3464,"type":"word","start":60,"end":62,"value":"de"}

In this output, each part of the text is broken out in terms of speech marks:

  • time – The timestamp in milliseconds from the beginning of the corresponding audio stream
  • type – The type of speech mark (sentence, word, viseme, or SSML)
  • start – The offset in bytes (not characters) of the start of the object in the input text (not including viseme marks)
  • end – The offset in bytes (not characters) of the object’s end in the input text (not including viseme marks)
  • value – Individual words in the sentence

The generated subtitle file is written back to the S3 bucket. You can find the file under <ROOT_S3_BUCKET>/<UID>/subtitlesOutput/. Inspect the subtitle file; the content should be similar to the following text:

1
00:00:00,006 --> 00:00:03,226
¿Qué tal el idioma? Ya sabes, hablé antes sobre el

2
00:00:03,227 --> 00:00:06,065
hecho de que el año pasado lanzamos Polly y Lex,

3
00:00:06,066 --> 00:00:09,263
pero hay muchas otras cosas que los constructores quieren hacer

4
00:00:09,264 --> 00:00:11,642
con el lenguaje. Y una de las cosas que ha

5
00:00:11,643 --> 00:00:14,549
sido interesante es que ahora hay tantos datos que están

After the subtitles file and audio file are generated, the final video file is created using MediaConvert. Check the MediaConvert console to verify that the job status is COMPLETE.

When the MediaConvert job is complete, the final video file is generated and saved back to the S3 bucket, which can be found under <ROOT_S3_BUCKET>/<UID>/convertedAV/.

As part of this deployment, the final video is distributed through an Amazon CloudFront (CDN) link and displayed in the terminal or in the AWS CloudFormation console.

Open the URL in a browser to view the original video with additional options for audio and subtitles. You can verify that the translated audio and subtitles are in sync.

Conclusion

In this post, we discussed how to create new language versions of video files without the need for manual intervention. Content creators can use this process to synchronize the audio and subtitles of their videos and reach a global audience.

You can easily integrate this approach into your own production pipelines to handle large volumes and scale according to your needs. Amazon Polly uses Neural TTS (NTTS) to produce natural and human-like text-to-speech voices. It also supports generating speech from SSML, which gives you additional control over how Amazon Polly generates speech from the text provided. Amazon Polly also provides a variety of different voices in multiple languages to support your needs.

Get started with AWS machine learning services by visiting the product page, or refer to the Amazon Machine Learning Solutions Lab page, where you can collaborate with experts to bring machine learning solutions to your organization.

Additional resources

For more information about the services used in this solution, refer to the following:


About the authors

Reagan Rosario works as a solutions architect at AWS focusing on education technology companies. He loves helping customers build scalable, highly available, and secure solutions in the AWS Cloud. He has more than a decade of experience working in a variety of technology roles, with a focus on software engineering and architecture.

Anil Kodali is a Solutions Architect with Amazon Web Services. He works with AWS EdTech customers, guiding them with architectural best practices for migrating existing workloads to the cloud and designing new workloads with a cloud-first approach. Prior to joining AWS, he worked with large retailers to help them with their cloud migrations.

Prasanna Saraswathi Krishnan is a Solutions Architect with Amazon Web Services working with EdTech customers. He helps them drive their cloud architecture and data strategy using best practices. His background is in distributed computing, big data analytics, and data engineering. He is passionate about machine learning and natural language processing.

Read More

Identify rooftop solar panels from satellite imagery using Amazon Rekognition Custom Labels

Renewable resources like sunlight provide a sustainable and carbon-neutral mechanism to generate power. Governments in many countries are providing incentives and subsidies to households to install solar panels as part of small-scale renewable energy schemes. This has created a huge demand for solar panels. Reaching out to potential customers at the right time, through the right channel, and with attractive offers is crucial for solar and energy companies. They’re looking for cost-efficient approaches and tools to conduct targeted marketing and proactively reach out to potential customers. By identifying, at scale, the suburbs that have low coverage of solar panel installations, they can focus their marketing initiatives on those places and maximize the return on their marketing investment.

In this post, we discuss how you can identify solar panels on rooftops from satellite imagery using Amazon Rekognition Custom Labels.

The problem

High-resolution satellite imagery of urban areas provides an aerial view of rooftops. You can use these images to identify solar panel installations. But automatically identifying solar panels with high accuracy, at low cost, and in a scalable way is a challenging task.

With rapid development in computer vision technology, several third-party tools use computer vision to analyze satellite images and identify objects (like solar panels) automatically. However, these tools are expensive and increase the overall cost of marketing. Many organizations have also successfully implemented state-of-the-art computer vision applications to identify the presence of solar panels on the rooftops from the satellite images.

But the reality is that you need to build your own data science teams that have the specific expertise and experience to build a production machine learning (ML) application for your specific use case. It generally takes months for teams to build a computer vision solution that they can use in production. This leads to an increased cost in building and maintaining such a system.

Is there a simpler and cost-effective solution that helps solar companies quickly build effective computer vision models without building a dedicated data science team for that purpose? Yes, Rekognition Custom Labels is the answer to this question.

Solution overview

Rekognition Custom Labels is a feature of Amazon Rekognition that takes care of the heavy lifting of computer vision model development for you, so no computer vision experience is required. You simply provide images with the appropriate labels, train the model, and deploy without having to build the model and fine-tune it. Rekognition Custom Labels has the capability to build highly accurate models with fewer labeled images. This takes away the heavy lifting of model development and helps you focus on developing value-added products and applications to your customers.

In this post, we show how to label, train, and build a computer vision model to detect rooftops and solar panels from satellite images. We use Amazon Simple Storage Service (Amazon S3) for storing satellite images, Amazon SageMaker Ground Truth for labeling the images with the appropriate labels of interest, and Rekognition Custom Labels for model training and hosting. To test the model output, we use a Jupyter notebook to run Python code to detect custom labels in a supplied image by calling Amazon Rekognition APIs.

The following diagram illustrates the architecture using AWS services to label the images, and train and host the ML model.

architecture diagram

The solution workflow is as follows:

  1. Store satellite imagery data in Amazon S3 as the input source.
  2. Use a Ground Truth labeling job to label the images.
  3. Use Amazon Rekognition to train the model with custom labels.
  4. Fine-tune the trained model.
  5. Start the model and analyze the image with the trained model using the Rekognition Custom Labels API.

Store satellite imagery data in Amazon S3 as an input source

The satellite images of rooftops with and without solar panels are captured from satellite imagery data providers and stored in an S3 bucket. For our post, we use images of New South Wales (NSW), Australia, provided by Spatial Services, Department of Customer Service NSW. We took screenshots of rooftops from this portal and stored those images in the source S3 bucket. These images are labeled using a Ground Truth labeling job, as explained in the next step.

Use a Ground Truth labeling job to label the images

Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML tasks. It has three options:

  • Amazon Mechanical Turk, which uses a public workforce to label the data
  • Private, which allows you to create a private workforce from internal teams
  • Vendor, which uses third-party resources for the labeling task

In this example, we use a private workforce to perform the data labeling job. Refer to Use Amazon SageMaker Ground Truth to Label Data for instructions on creating a private workforce and configuring Ground Truth for a labeling job with bounding boxes.

The following is an example image of the labeling job. The labeler can draw bounding boxes of the targets with the selected labels indicated by different colors. We used three labels on the images: rooftop, rooftop-panel, and panel to signify rooftops without solar panels, rooftops with solar panels, and just solar panels, respectively.

When the labeling job is complete, an output.manifest file is generated and stored in the S3 output location that you specified when creating the labeling job. The following code is an example of one image labeling output in the manifest file:

{
"source-ref": "s3://<your-bucket-name>/blog-images/source-image/03-09-2021/source-image-001.png",
"Rekognition-solar-panel-labeling": {
    "image_size": [{"width":644,"height":560,"depth":3}],
    "annotations": [
        {"class_id":1,"top":51,"left":188,"height":175,"width":236},
        {"class_id":2,"top":58,"left":276,"height":32,"width":105},
        {"class_id":0,"top":332,"left":150,"height":192,"width":151},
        {"class_id":0,"top":271,"left":354,"height":79,"width":121}
    ]},
"Rekognition-solar-panel-labeling-metadata": {
    "objects": [{"confidence":0},{"confidence":0},{"confidence":0},{"confidence":0}],
    "class-map":{
        "1":"rooftop-panel",
        "2":"panel",
        "0":"rooftop"},
    "type": "groundtruth/object-detection",
    "human-annotated": "yes",
    "creation-date": "2021-09-14T11:30:41.125196",
    "job-name": "labeling-job/Rekognition-solar-panel-labeling"
    }
}

The output manifest file is what we need for the Amazon Rekognition training job. In the next section, we provide step-by-step instructions to create a high-performance ML model to detect objects of interest.

Use Amazon Rekognition to train the model with custom labels

We now create a project for a custom object detection model, and provide the labeled images to Rekognition Custom Labels to train the model.

  1. On the Amazon Rekognition console, choose Use Custom Labels in the navigation pane.
  2. In the navigation pane, choose Projects.
  3. Choose Create project.
  4. For Project name, enter a unique name.
  5. Choose Create project.

Next, we create a dataset for the training job.

  1. In the navigation pane, choose Datasets.
  2. Create a dataset based on the manifest file generated by the Ground Truth labeling job.

We’re now ready to train a new model.

  1. Select the project you created and choose Train new model.
  2. Choose the training dataset you created and choose the test dataset.
  3. Choose Train to start training the model.

You can create a new test dataset or split the training dataset to use 20% of the training data as the test dataset and the remaining 80% as the training dataset. However, if you split the training dataset, the training and test datasets are randomly selected from the whole dataset every time you train a new model. In this example, we create a separate test dataset to evaluate the trained models.

We collected an additional 50 satellite images and labeled them using Ground Truth. We then used the output manifest file of this labeling job to create the test dataset.

This allows us to compare the evaluation metrics of different versions of the model that are trained based on different input datasets. The first training dataset consists of 160 images; the test dataset has 50 images.

When the model training process is complete, we can access the evaluation metrics on the Evaluate tab on the model page. Our training job was able to achieve an F1 score of 0.934. The model evaluation metrics are reasonably good considering the number of training images we used and the number of images used for validating the model.

Although the model evaluation metrics are reasonable, it’s important to understand which images the model labels incorrectly so that we can further fine-tune the model and make it more robust to real-world challenges. In the next section, we describe the process of evaluating the images that have inaccurate labels and retraining the model to achieve better performance.

Fine-tune the trained model

Evaluating incorrect labels inferred by the trained model is a crucial step to further fine-tune the model. To check the detailed test results, we can choose the training job and choose View test results to evaluate the images that the model inaccurately labeled. Evaluating the model performance on test data can help you identify any potential labeling or data source-related issues. For example, the following test image shows an example of false-positive labeling of a rooftop.

As you can determine from the preceding image, the identified rooftop is correct—it’s the rooftop of a smaller home built on the property. Based on the source image name, we can go back to the dataset to check the labels. We can do this via the Amazon Rekognition console. In the following screenshot, we can determine that the source image wasn’t labeled correctly. The labeler missed labeling that rooftop.

To correct the labeling issue, we don’t need to rerun the Ground Truth job or run an adjustment job on the whole dataset. We can easily verify or adjust individual images via the Rekognition Custom Labels console.

  1. On the dataset page, choose Start labeling.
  2. Select the image file that needs adjustment and choose Draw bounding box.

On the labeling page, we can draw or update the bounding boxes on this image.

  1. Draw the bounding box around the smaller building and label it as rooftop.
  2. Choose Done to save the changes or choose Next or Previous to navigate through additional images that require adjustments.

In some situations, you might have to provide more images with examples of the rooftops that the model failed to identify correctly. We can collect more images, label them, and retrain the model so that the model can learn the special cases of rooftops and solar panels.

To add more training images to the training dataset, you can create another Ground Truth job if the number of added images is large and you need a labeling workforce team to label the images. When the labeling job is finished, we get a new manifest file, which contains the bounding box information for the newly added images. We then need to manually merge the manifest file of the newly added images into the existing manifest file. We use the combined manifest file to create a new training dataset on the Rekognition Custom Labels console and train a more robust model.
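
Because Ground Truth output manifests are JSON Lines files (one JSON object per image), merging them can be as simple as concatenating the two files; in practice, you may also need to reconcile the label attribute names and class maps between the two jobs. The following is a minimal sketch with placeholder file names:

# Concatenate two Ground Truth output manifests (JSON Lines, one record per image).
manifest_paths = ["original-job/output.manifest", "additional-job/output.manifest"]

with open("combined.manifest", "w") as combined:
    for manifest_path in manifest_paths:
        with open(manifest_path) as manifest:
            for line in manifest:
                if line.strip():
                    combined.write(line.rstrip("\n") + "\n")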

Another option is to add the images directly to the current training dataset if the number of new images isn’t large and one person is sufficient to finish the labeling task on the Amazon Rekognition console. In this project, we directly add another 30 images to the original training dataset and perform labeling on the console.

After we complete the label verification and add more images of different rooftop and panel types, we have a second model trained with 190 training images that we evaluate on the same test dataset. The second version of the trained model achieved an F1 score of 0.964, which is an improvement from the earlier score of 0.934. Based on your business requirement, you can further fine-tune the model.

Deploy the model and analyze the images using the Rekognition Custom Labels API

Now that we have trained a model with satisfactory evaluation results, we deploy the model to an endpoint for real-time inference. We analyze a few images using Python code via the Amazon Rekognition API on this deployed model. After you start the model, its status shows as Running.

Now the model is ready to detect the labels on the new satellite images. We can test the model by running the provided sample Python API code. On the model page, choose API Code.

Select Python to review the sample code to start the model, analyze images, and stop the model.

Copy the Python code in the Analyze image section into a Jupyter notebook, which can run on your laptop.

To set up the environment to run the code, we need to install the AWS SDKs that we want to use and configure the security credentials to access the AWS resources. For instructions, refer to Set Up the AWS CLI and AWS SDKs.

Upload a test image to an S3 bucket. In the Analyze image Python code, substitute the variable MY_BUCKET with the bucket name that has the test image and replace MY_IMAGE_KEY with the file name of the test image.
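
For reference, the analysis itself is a single Rekognition API request. The following minimal sketch uses a placeholder project version ARN and the same MY_BUCKET and MY_IMAGE_KEY placeholders as the sample code:

import boto3

rekognition = boto3.client("rekognition")

# Detect custom labels (rooftop, rooftop-panel, panel) in a test image stored in Amazon S3.
response = rekognition.detect_custom_labels(
    ProjectVersionArn="arn:aws:rekognition:<region>:<account-id>:project/<project>/version/<version>/<timestamp>",
    Image={"S3Object": {"Bucket": "MY_BUCKET", "Name": "MY_IMAGE_KEY"}},
    MinConfidence=50,
)

for custom_label in response["CustomLabels"]:
    print(custom_label["Name"], round(custom_label["Confidence"], 2))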

The following screenshot shows a sample response of running the Python code.

The following output image shows that the model has successfully detected three labels: rooftop, rooftop-panel, and panel.

Clean up

After testing, we can stop the model to avoid any unnecessary charges incurred to run the model.
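
Stopping the model is a single API call; the following minimal sketch uses a placeholder project version ARN. You can start the model again with start_project_version when you need to run inference later:

import boto3

rekognition = boto3.client("rekognition")

# Stop the running model version so you aren't billed for idle inference capacity.
rekognition.stop_project_version(
    ProjectVersionArn="arn:aws:rekognition:<region>:<account-id>:project/<project>/version/<version>/<timestamp>"
)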

Conclusion

In this post, we showed you how to detect rooftops and solar panels from satellite imagery by building custom computer vision models with Rekognition Custom Labels. We demonstrated how Rekognition Custom Labels manages model training by taking care of the deep learning complexities behind the scenes. We also demonstrated how to use Ground Truth to label the training images at scale. Furthermore, we discussed mechanisms to improve model accuracy by correcting the labeling of images on the fly and retraining the model with the updated dataset. Power utility companies can use this solution to detect houses without solar panels and send those households offers and promotions, enabling efficient targeted marketing.

To learn more about how Rekognition Custom Labels can help your business, visit Amazon Rekognition Custom Labels or contact AWS Sales.


About the Authors

Melanie Li is a Senior AI/ML Specialist TAM at AWS based in Sydney, Australia. She helps enterprise customers to build solutions leveraging the state-of-the-art AI/ML tools on AWS and provides guidance on architecting and implementing machine learning solutions with best practices. In her spare time, she loves to explore nature outdoors and spend time with family and friends.

Santosh Kulkarni is a Solutions Architect at Amazon Web Services. He works closely with enterprise customers to accelerate their Cloud journey. He is also passionate about building large-scale distributed applications to solve business problems using his knowledge in Machine Learning, Big Data, and Software Development.

Dr. Baichuan Sun is a Senior Data Scientist at AWS AI/ML. He is passionate about solving strategic business problems with customers using data-driven methodology on the cloud, and he has been leading projects in challenging areas including robotics computer vision, time series forecasting, price optimization, predictive maintenance, pharmaceutical development, product recommendation system, etc. In his spare time he enjoys traveling and hanging out with family.

Read More

What Every User Should Know About Mixed Precision Training in PyTorch

Efficient training of modern neural networks often relies on using lower precision data types. Peak float16 matrix multiplication and convolution performance is 16x faster than peak float32 performance on A100 GPUs. And since the float16 and bfloat16 data types are only half the size of float32, they can double the performance of bandwidth-bound kernels and reduce the memory required to train a network, allowing for larger models, larger batches, or larger inputs. Using a module like torch.amp (short for “Automatic Mixed Precision”) makes it easy to get the speed and memory usage benefits of lower precision data types while preserving convergence behavior.

Going faster and using less memory is always advantageous – deep learning practitioners can test more model architectures and hyperparameters, and larger, more powerful models can be trained. Training very large models like those described in Narayanan et al. and Brown et al. (which take thousands of GPUs months to train even with expert handwritten optimizations) is infeasible without using mixed precision.

We’ve talked about mixed precision techniques before (here, here, and here), and this blog post is a summary of those techniques and an introduction if you’re new to mixed precision.

Mixed Precision Training in Practice

Mixed precision training techniques – the use of the lower precision float16 or bfloat16 data types alongside the float32 data type – are broadly applicable and effective. See Figure 1 for a sampling of models successfully trained with mixed precision, and Figures 2 and 3 for example speedups using torch.amp.

Figure 1: Sampling of DL Workloads Successfully Trained with float16 (Source).

Figure 2: Performance of mixed precision training using torch.amp on NVIDIA 8xV100 vs. float32 training on 8xV100 GPU. Bars represent the speedup factor of torch.amp over float32.
(Higher is better.) (Source).

Figure 3. Performance of mixed precision training using torch.amp on NVIDIA 8xA100 vs. 8xV100 GPU. Bars represent the speedup factor of A100 over V100.
(Higher is Better.) (Source).

See the NVIDIA Deep Learning Examples repository for more sample mixed precision workloads.

Similar performance charts can be seen in 3D medical image analysis, gaze estimation, video synthesis, conditional GANs, and convolutional LSTMs. Huang et al. showed that mixed precision training is 1.5x to 5.5x faster than float32 on V100 GPUs, and an additional 1.3x to 2.5x faster on A100 GPUs on a variety of networks. On very large networks the need for mixed precision is even more evident. Narayanan et al. reports that it would take 34 days to train GPT-3 175B on 1024 A100 GPUs (with a batch size of 1536), but it’s estimated it would take over a year using float32!

Getting Started With Mixed Precision Using torch.amp

torch.amp, introduced in PyTorch 1.6, makes it easy to leverage mixed precision training using the float16 or bfloat16 dtypes. See this blog post, tutorial, and documentation for more details. Figure 4 shows an example of applying AMP with grad scaling to a network.

import torch
# Creates once at the beginning of training
scaler = torch.cuda.amp.GradScaler()

for data, label in data_iter:
   optimizer.zero_grad()
   # Casts operations to mixed precision
   with torch.amp.autocast(device_type="cuda", dtype=torch.float16):
      loss = model(data)

   # Scales the loss, and calls backward()
   # to create scaled gradients
   scaler.scale(loss).backward()

   # Unscales gradients and calls
   # or skips optimizer.step()
   scaler.step(optimizer)

   # Updates the scale for next iteration
   scaler.update()

Figure 4: AMP recipe

Picking The Right Approach

Out-of-the-box mixed precision training with either float16 or bfloat16 is effective at speeding up the convergence of many deep learning models, but some models may require more careful numerical accuracy management. Here are some options:

  • Full float32 precision. Floating point tensors and modules are created in float32 precision by default in PyTorch, but this is a historic artifact not representative of training most modern deep learning networks. It’s rare that networks need this much numerical accuracy.
  • Enabling TensorFloat32 (TF32) mode. On Ampere and later CUDA devices, matrix multiplications and convolutions can use the TensorFloat32 (TF32) mode for faster but slightly less accurate computations. See the Accelerating AI Training with NVIDIA TF32 Tensor Cores blog post for more details. By default, PyTorch enables TF32 mode for convolutions but not matrix multiplications, and unless a network requires full float32 precision we recommend enabling this setting for matrix multiplications, too (see the documentation here for how to do so, and the short sketch after this list). It can significantly speed up computations with typically negligible loss of numerical accuracy.
  • Using torch.amp with bfloat16 or float16. Both these low precision floating point data types are usually comparably fast, but some networks may only converge with one vs the other. If a network requires more precision it may need to use float16, and if a network requires more dynamic range it may need to use bfloat16, whose dynamic range is equal to that of float32. If overflows are observed, for example, then we suggest trying bfloat16.
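
For example, the following minimal sketch enables TF32 for matrix multiplications (TF32 for cuDNN convolutions is already on by default); both flags are global settings that affect the whole process:

import torch

# Allow TF32 tensor cores for float32 matrix multiplications and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True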

There are even more advanced options than those presented here, like using torch.amp’s autocasting for only parts of a model, or managing mixed precision directly. These topics are largely beyond the scope of this blog post, but see the “Best Practices” section below.

Best Practices

We strongly recommend using mixed precision with torch.amp or the TF32 mode (on Ampere and later CUDA devices) whenever possible when training a network. If one of those approaches doesn’t work, however, we recommend the following:

  • High Performance Computing (HPC) applications, regression tasks, and generative networks may simply require full float32 IEEE precision to converge as expected.
  • Try selectively applying torch.amp. In particular we recommend first disabling it on regions performing operations from the torch.linalg module or when doing pre- or post-processing. These operations are often especially sensitive. Note that TF32 mode is a global switch and can’t be used selectively on regions of a network. Enable TF32 first to check if a network’s operators are sensitive to the mode, otherwise disable it.
  • If you encounter type mismatches while using torch.amp we don’t suggest inserting manual casts to start. This error is indicative of something being off with the network, and it’s usually worth investigating first.
  • Figure out by experimentation if your network is sensitive to range and/or precision of a format. For example fine-tuning bfloat16-pretrained models in float16 can easily run into range issues in float16 because of the potentially large range from training in bfloat16, so users should stick with bfloat16 fine-tuning if the model was trained in bfloat16.
  • The performance gain of mixed precision training can depend on multiple factors (e.g., compute-bound vs. memory-bound problems), and users should use the tuning guide to remove other bottlenecks in their training scripts. Although they have similar theoretical performance benefits, bfloat16 and float16 can have different speeds in practice. It’s recommended to try both formats and use the one with the best speed while maintaining the desired numeric behavior.

For more details, refer to the AMP Tutorial, Training Neural Networks with Tensor Cores, and see the post “More In-Depth Details of Floating Point Precision” on PyTorch Dev Discussion.

Conclusion

Mixed precision training is an essential tool for training deep learning models on modern hardware, and it will become even more important in the future as the performance gap between lower precision operations and float32 continues to grow on newer hardware, as reflected in Figure 5.

Figure 5: Relative peak throughput of float16 (FP16) vs float32 matrix multiplications on Volta and Ampere GPUs. On Ampere relative peak throughput for the TensorFloat32 (TF32) mode and bfloat16 matrix multiplications are shown, too. The relative peak throughput of low precision data types like float16 and bfloat16 vs. float32 matrix multiplications is expected to grow as new hardware is released.

PyTorch’s torch.amp module makes it easy to get started with mixed precision, and we highly recommend using it to train faster and reduce memory usage. torch.amp supports both float16 and bfloat16 mixed precision.

There are still some networks that are tricky to train with mixed precision, and for these networks we recommend trying TF32 accelerated matrix multiplications on Ampere and later CUDA hardware. Networks are rarely so precision sensitive that they require full float32 precision for every operation.

If you have questions or suggestions for torch.amp or mixed precision support in PyTorch then let us know by posting to the mixed precision category on the PyTorch Forums or filing an issue on the PyTorch GitHub page.

Read More

The virtuous cycle of AI research

We recently caught up with Petar Veličković, a research scientist at DeepMind. Along with his co-authors, Petar is presenting his paper The CLRS Algorithmic Reasoning Benchmark at ICML 2022 in Baltimore, Maryland, USA.

Read More

Google at ICML 2022

Google is a leader in machine learning (ML) research with groups innovating across virtually all aspects of the field, from theory to application. We build machine learning systems to solve deep scientific and engineering challenges in areas of language, music, visual processing, algorithm development, and more. Core to our approach is to actively engage with the broader research community by open-sourcing datasets and models, publishing our discoveries, and actively participating in leading conferences.

Google is proud to be a Diamond Sponsor of the thirty-ninth International Conference on Machine Learning (ICML 2022), a premier annual conference, which is being held this week in Baltimore, Maryland. Google has a strong presence at this year’s conference with over 100 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive ML research and expanding our partnership with the broader ML research community.

Registered for ICML 2022? We hope you’ll visit the Google booth to learn more about the exciting work, creativity, and fun that goes into solving a portion of the field’s most interesting challenges. Take a look below to learn more about the Google research being presented at ICML 2022 (Google affiliations in bold).

Organizing Committee

Tutorial Chairs include: Hanie Sedghi

Emeritus Members include: Andrew McCallum

Board Members include: Hugo Larochelle, Csaba Szepesvari, Corinna Cortes

Publications

Individual Preference Stability for Clustering
Saba Ahmadi, Pranjal Awasthi, Samir Khuller, Matthäus Kleindessner, Jamie Morgenstern, Pattara Sukprasert, Ali Vakilian

Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael Mozer

H-Consistency Bounds for Surrogate Loss Minimizers
Pranjal Awasthi, Anqi Mao, Mehryar Mohri, Yutao Zhong

Cooperative Online Learning in Stochastic and Adversarial MDPs
Tal Lancewicki, Aviv Rosenberg, Yishay Mansour

Do More Negative Samples Necessarily Hurt in Contrastive Learning?
Pranjal Awasthi, Nishanth Dikkala, Pritish Kamath

Deletion Robust Submodular Maximization Over Matroids
Paul Dütting, Federico Fusco*, Silvio Lattanzi, Ashkan Norouzi-Fard, Morteza Zadimoghaddam

Tight and Robust Private Mean Estimation with Few Users
Hossein Esfandiari, Vahab Mirrokni, Shyam Narayanan*

Generative Trees: Adversarial and Copycat
Richard Nock, Mathieu Guillame-Bert

Agnostic Learnability of Halfspaces via Logistic Loss
Ziwei Ji*, Kwangjun Ahn*, Pranjal Awasthi, Satyen Kale, Stefani Karp

Adversarially Trained Actor Critic for Offline Reinforcement Learning
Ching-An Cheng, Tengyang Xie, Nan Jiang, Alekh Agarwal

Unified Scaling Laws for Routed Language Models
Aidan Clark, Diego de Las Casas, Aurelia Guy, Arthur Mensch, Michela Paganini, Jordan Hoffmann, Bogdan Damoc, Blake Hechtman, Trevor Cai, Sebastian Borgeaud, George van den Driessche, Eliza Rutherford, Tom Hennigan, Matthew Johnson, Albin Cassirer, Chris Jones, Elena Buchatskaya, David Budden, Laurent Sifre, Simon Osindero, Oriol Vinyals, Marc’Aurelio Ranzato, Jack Rae, Erich Elsen, Koray Kavukcuoglu, Karen Simonyan

Large Batch Experience Replay
Thibault Lahire, Matthieu Geist, Emmanuel Rachelson

Robust Training of Neural Networks Using Scale Invariant Architectures
Zhiyuan Li*, Srinadh Bhojanapalli, Manzil Zaheer, Sashank J. Reddi, Sanjiv Kumar

The Poisson Binomial Mechanism for Unbiased Federated Learning with Secure Aggregation
Wei-Ning Chen, Ayfer Ozgur, Peter Kairouz

Global Optimization Networks
Sen Zhao, Erez Louidor, Maya Gupta

A Joint Exponential Mechanism for Differentially Private Top-k
Jennifer Gillenwater, Matthew Joseph, Andres Munoz Medina, Mónica Ribero

On the Practicality of Deterministic Epistemic Uncertainty
Janis Postels, Mattia Segu, Tao Sun, Luc Van Gool, Fisher Yu, Federico Tombari

Balancing Discriminability and Transferability for Source-Free Domain Adaptation
Jogendra Nath Kundu, Akshay Kulkarni, Suvaansh Bhambri, Deepesh Mehta, Shreyas Kulkarni, Varun Jampani, Venkatesh Babu Radhakrishnan

Transfer and Marginalize: Explaining Away Label Noise with Privileged Information
Mark Collier, Rodolphe Jenatton, Efi Kokiopoulou, Jesse Berent

In Defense of Dual-Encoders for Neural Ranking
Aditya Menon, Sadeep Jayasumana, Ankit Singh Rawat, Seungyeon Kim, Sashank Jakkam Reddi, Sanjiv Kumar

Surrogate Likelihoods for Variational Annealed Importance Sampling
Martin Jankowiak, Du Phan

Translatotron 2: High-Quality Direct Speech-to-Speech Translation with Voice Preservation (see blog post)
Ye Jia, Michelle Tadmor Ramanovich, Tal Remez, Roi Pomerantz

Differentially Private Approximate Quantiles
Haim Kaplan, Shachar Schnapp, Uri Stemmer

Continuous Control with Action Quantization from Demonstrations
Robert Dadashi, Léonard Hussenot, Damien Vincent, Sertan Girgin, Anton Raichuk, Matthieu Geist, Olivier Pietquin

Data Scaling Laws in NMT: The Effect of Noise and Architecture
Yamini Bansal*, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat

Debiaser Beware: Pitfalls of Centering Regularized Transport Maps
Aram-Alexandre Pooladian, Marco Cuturi, Jonathan Niles-Weed

A Context-Integrated Transformer-Based Neural Network for Auction Design
Zhijian Duan, Jingwu Tang, Yutong Yin, Zhe Feng, Xiang Yan, Manzil Zaheer, Xiaotie Deng

Algorithms for the Communication of Samples
Lucas Theis, Noureldin Yosri

Being Properly Improper
Tyler Sypherd, Richard Nock, Lalitha Sankar

Guarantees for Epsilon-Greedy Reinforcement Learning with Function Approximation
Chris Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error
Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, Shixiang Shane Gu

Public Data-Assisted Mirror Descent for Private Model Training
Ehsan Amid, Arun Ganesh*, Rajiv Mathews, Swaroop Ramaswamy, Shuang Song, Thomas Steinke, Vinith M. Suriyakumar*, Om Thakkar, Abhradeep Thakurta

Deep Hierarchy in Bandits
Joey Hong, Branislav Kveton, Sumeet Katariya, Manzil Zaheer, Mohammad Ghavamzadeh

Scalable Deep Reinforcement Learning Algorithms for Mean Field Games
Mathieu Lauriere, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Perolat, Romuald Elie, Olivier Pietquin, Matthieu Geist

Faster Privacy Accounting via Evolving Discretization
Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi

HyperPrompt: Prompt-Based Task-Conditioning of Transformers
Yun He*, Huaixiu Steven Zheng, Yi Tay, Jai Gupta, Yu Du, Vamsi Aribandi, Zhe Zhao, YaGuang Li, Zhao Chen, Donald Metzler, Heng-Tze Cheng, Ed H. Chi

Blocks Assemble! Learning to Assemble with Large-Scale Structured Reinforcement Learning
Seyed Kamyar Seyed Ghasemipour, Daniel Freeman, Byron David, Shixiang Shane Gu, Satoshi Kataoka, Igor Mordatch

Latent Diffusion Energy-Based Model for Interpretable Text Modelling
Peiyu Yu, Sirui Xie, Xiaojian Ma, Baoxiong Jia, Bo Pang, Ruiqi Gao, Yixin Zhu, Song-Chun Zhu, Ying Nian Wu

On the Optimization Landscape of Neural Collapse Under MSE Loss: Global Optimality with Unconstrained Features
Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, Zhihui Zhu

Efficient Reinforcement Learning in Block MDPs: A Model-Free Representation Learning Approach
Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun

Robust Training Under Label Noise by Over-Parameterization
Sheng Liu, Zhihui Zhu, Qing Qu, Chong You

FriendlyCore: Practical Differentially Private Aggregation
Eliad Tsfadia, Edith Cohen, Haim Kaplan, Yishay Mansour, Uri Stemmer

Adaptive Data Analysis with Correlated Observations
Aryeh Kontorovich, Menachem Sadigurschi, Uri Stemmer

A Resilient Distributed Boosting Algorithm
Yuval Filmus, Idan Mehalel, Shay Moran

On Learning Mixture of Linear Regressions in the Non-Realizable Setting
Avishek Ghosh, Arya Mazumdar, Soumyabrata Pal, Rajat Sen

Online and Consistent Correlation Clustering
Vincent Cohen-Addad, Silvio Lattanzi, Andreas Maggiori, Nikos Parotsidis

From Block-Toeplitz Matrices to Differential Equations on Graphs: Towards a General Theory for Scalable Masked Transformers
Krzysztof Choromanski, Han Lin, Haoxian Chen, Tianyi Zhang, Arijit Sehanobish, Valerii Likhosherstov, Jack Parker-Holder, Tamas Sarlos, Adrian Weller, Thomas Weingarten

Parsimonious Learning-Augmented Caching
Sungjin Im, Ravi Kumar, Aditya Petety, Manish Purohit

General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR
Curtis Hawthorne, Andrew Jaegle, Cătălina Cangea, Sebastian Borgeaud, Charlie Nash, Mateusz Malinowski, Sander Dieleman, Oriol Vinyals, Matthew Botvinick, Ian Simon, Hannah Sheahan, Neil Zeghidour, Jean-Baptiste Alayrac, Joao Carreira, Jesse Engel

Conformal Prediction Sets with Limited False Positives
Adam Fisch, Tal Schuster, Tommi Jaakkola, Regina Barzilay

Dialog Inpainting: Turning Documents into Dialogs
Zhuyun Dai, Arun Tejasvi Chaganty, Vincent Zhao, Aida Amini, Qazi Mamunur Rashid, Mike Green, Kelvin Guu

Benefits of Overparameterized Convolutional Residual Networks: Function Approximation Under Smoothness Constraint
Hao Liu, Minshuo Chen, Siawpeng Er, Wenjing Liao, Tong Zhang, Tuo Zhao

Congested Bandits: Optimal Routing via Short-Term Resets
Pranjal Awasthi, Kush Bhatia, Sreenivas Gollapudi, Kostas Kollias

Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance
Zhuoning Yuan, Yuexin Wu, Zihao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, Tianbao Yang

Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Biao Zhang*, Behrooz Ghorbani, Ankur Bapna, Yong Cheng, Xavier Garcia, Jonathan Shen, Orhan Firat

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (see blog post)
Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathy Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, Claire Cui

How to Leverage Unlabeled Data in Offline Reinforcement Learning?
Tianhe Yu, Aviral Kumar, Yevgen Chebotar, Karol Hausman, Chelsea Finn, Sergey Levine

Distributional Hamilton-Jacobi-Bellman Equations for Continuous-Time Reinforcement Learning
Harley Wiltzer, David Meger, Marc G. Bellemare

On the Robustness of CountSketch to Adaptive Inputs
Edith Cohen, Xin Lyu, Jelani Nelson, Tamás Sarlós, Moshe Shechner, Uri Stemmer

Model Selection in Batch Policy Optimization
Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai

The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning
Wei-Ning Chen, Christopher A. Choquette-Choo, Peter Kairouz, Ananda Theertha Suresh

Linear-Time Gromov Wasserstein Distances Using Low Rank Couplings and Costs
Meyer Scetbon, Gabriel Peyré, Marco Cuturi*

Active Sampling for Min-Max Fairness
Jacob Abernethy, Pranjal Awasthi, Matthäus Kleindessner, Jamie Morgenstern, Chris Russell, Jie Zhang

Making Linear MDPs Practical via Contrastive Representation Learning
Tianjun Zhang, Tongzheng Ren, Mengjiao Yang, Joseph E. Gonzalez, Dale Schuurmans, Bo Dai

Achieving Minimax Rates in Pool-Based Batch Active Learning
Claudio Gentile, Zhilei Wang, Tong Zhang

Private Adaptive Optimization with Side Information
Tian Li, Manzil Zaheer, Sashank J. Reddi, Virginia Smith

Self-Supervised Learning With Random-Projection Quantizer for Speech Recognition
Chung-Cheng Chiu, James Qin, Yu Zhang, Jiahui Yu, Yonghui Wu

Wide Bayesian Neural Networks Have a Simple Weight Posterior: Theory and Accelerated Sampling
Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

The State of Sparse Training in Deep Reinforcement Learning
Laura Graesser, Utku Evci, Erich Elsen, Pablo Samuel Castro

Constrained Discrete Black-Box Optimization Using Mixed-Integer Programming
Theodore P. Papalexopoulos, Christian Tjandraatmadja, Ross Anderson, Juan Pablo Vielma, David Belanger

Massively Parallel k-Means Clustering for Perturbation Resilient Instances
Vincent Cohen-Addad, Vahab Mirrokni, Peilin Zhong

What Language Model Architecture and Pre-training Objective Works Best for Zero-Shot Generalization?
Thomas Wang, Adam Roberts, Daniel Hesslow, Teven Le Scao, Hyung Won Chung, Iz Beltagy, Julien Launay, Colin Raffel

Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, Ludwig Schmidt

Synergy and Symmetry in Deep Learning: Interactions Between the Data, Model, and Inference Algorithm
Lechao Xiao, Jeffrey Pennington

Fast Finite Width Neural Tangent Kernel
Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

The Combinatorial Brain Surgeon: Pruning Weights that Cancel One Another in Neural Networks
Xin Yu, Thiago Serra, Srikumar Ramalingam, Shandian Zhe

Bayesian Imitation Learning for End-to-End Mobile Manipulation
Yuqing Du, Daniel Ho, Alexander A. Alemi, Eric Jang, Mohi Khansari

HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning
Andrey Zhmoginov, Mark Sandler, Max Vladymyrov

Marginal Distribution Adaptation for Discrete Sets via Module-Oriented Divergence Minimization
Hanjun Dai, Mengjiao Yang, Yuan Xue, Dale Schuurmans, Bo Dai

Correlated Quantization for Distributed Mean Estimation and Optimization
Ananda Theertha Suresh, Ziteng Sun, Jae Hun Ro, Felix Yu

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
Wenlong Huang, Pieter Abbeel, Deepak Pathak, Igor Mordatch

Only Tails Matter: Average-Case Universality and Robustness in the Convex Regime
Leonardo Cunha, Gauthier Gidel, Fabian Pedregosa, Damien Scieur, Courtney Paquette

Learning Iterative Reasoning through Energy Minimization
Yilun Du, Shuang Li, Josh Tenenbaum, Igor Mordatch

Interactive Correlation Clustering with Existential Cluster Constraints
Rico Angell, Nicholas Monath, Nishant Yadav, Andrew McCallum

Building Robust Ensembles via Margin Boosting
Dinghuai Zhang, Hongyang Zhang, Aaron Courville, Yoshua Bengio, Pradeep Ravikumar, Arun Sai Suggala

Probabilistic Bilevel Coreset Selection
Xiao Zhou, Renjie Pi, Weizhong Zhang, Yong Lin, Tong Zhang

Model Agnostic Sample Reweighting for Out-of-Distribution Learning
Xiao Zhou, Yong Lin, Renjie Pi, Weizhong Zhang, Renzhe Xu, Peng Cui, Tong Zhang

Sparse Invariant Risk Minimization
Xiao Zhou, Yong Lin, Weizhong Zhang, Tong Zhang

RUMs from Head-to-Head Contests
Matteo Almanza, Flavio Chierichetti, Ravi Kumar, Alessandro Panconesi, Andrew Tomkins

A Parametric Class of Approximate Gradient Updates for Policy Optimization
Ramki Gummadi, Saurabh Kumar, Junfeng Wen, Dale Schuurmans

On Implicit Bias in Overparameterized Bilevel Optimization
Paul Vico, Jonathan Lorraine, Fabian Pedregosa, David Duvenaud, Roger Grosse

Feature and Parameter Selection in Stochastic Linear Bandits
Ahmadreza Moradipari, Berkay Turan, Yasin Abbasi-Yadkori, Mahnoosh Alizadeh, Mohammad Ghavamzadeh

Neural Network Poisson Models for Behavioural and Neural Spike Train Data
Moein Khajehnejad, Forough Habibollahi, Richard Nock, Ehsan Arabzadeh, Peter Dayan and Amir Dezfouli

Deep Equilibrium Networks are Sensitive to Initialization Statistics
Atish Agarwala, Samuel Schoenholz

A Regret Minimization Approach to Multi-Agent Control
Udaya Ghai, Udari Madhushani, Naomi Leonard, Elad Hazan

Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

Workshops

Shift Happens: Crowdsourcing Metrics and Test Datasets Beyond ImageNet
Organizing Committee includes: Roland S. Zimmerman
Invited Speakers include: Chelsea Finn, Lucas Beyer

Machine Learning for Audio Synthesis
Organizing Committee includes: Yu Zhang
Invited Speakers include: Chris Donahue

New Frontiers in Adversarial Machine Learning
Organizing Committee includes: Sanmi Koyejo

Spurious Correlations, Invariance, and Stability (SIC)
Organizing Committee includes: Victor Veitch

DataPerf: Benchmarking Data for Data-Centric AI
Organizing Committee includes: Lora Aroyo, Peter Mattson, Praveen Paritosh
DataPerf Speakers include: Lora Aroyo, Peter Mattson, Praveen Paritosh
Invited Speakers include: Jordi Pont-Tuset

Machine Learning for Astrophysics
Invited Speakers include: Dustin Tran

Dynamic Neural Networks
Organizing Committee includes: Carlos Riquelme
Panel Chairs include: Neil Houlsby

Interpretable Machine Learning in Healthcare (IMLH)
Organizing Committee includes: Ramin Zabih
Invited Speakers include: Been Kim

Human-Machine Collaboration and Teaming
Invited Speakers include: Fernanda Viégas, Martin Wattenberg, Yuhuai (Tony) Wu

Pre-training: Perspectives, Pitfalls, and Paths Forward
Organizing Committee includes: Hugo Larochelle, Chelsea Finn
Invited Speakers include: Hanie Sedghi, Charles Sutton

Responsible Decision Making in Dynamic Environments
Invited Speakers include: Craig Boutilier

Principles of Distribution Shift (PODS)
Organizing Committee includes: Hossein Mobahi

Hardware-Aware Efficient Training (HAET)
Invited Speakers include: Tien-Ju Yang

Updatable Machine Learning
Invited Speakers include: Chelsea Finn, Nicolas Papernot
Organizing Committee includes: Ananda Theertha Suresh, Badih Ghazi, Chiyuan Zhang, Kate Donahue, Peter Kairouz, Ziteng Sun

Knowledge Retrieval and Language Models
Invited Speakers include: Fernando Diaz, Quoc Le, Kenton Lee, Ellie Pavlick
Organizing Committee includes: Urvashi Khandelwal, Chiyuan Zhang

Theory and Practice of Differential Privacy
Organizing Committee includes: Badih Ghazi, Matthew Joseph, Peter Kairouz, Om Thakkar, Thomas Steinke, Ziteng Sun

Beyond Bayes: Paths Towards Universal Reasoning Systems
Invited Speakers include: Charles Sutton
Spotlight Talk: Language Model Cascades | David Dohan, Winnie Xu, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-Dickstein, Kevin Murphy, Charles Sutton

Safe Learning for Autonomous Driving (SL4AD)
Invited Speakers include: Chelsea Finn



*Work done while at Google.  

Read More

Generate synchronized closed captions and audio using the Amazon Polly subtitle generator

Amazon Polly, an AI-powered text-to-speech service, enables you to automate and scale your interactive voice solutions, helping to improve productivity and reduce costs.

As our customers continue to use Amazon Polly for its rich set of features and ease of use, we have observed a demand for the ability to simultaneously generate synchronized audio and subtitles or closed captions for a given text input. At AWS, we continuously work backward from our customer asks, so in this post, we outline a method to generate audio and subtitles at the same time for a given text.

Although subtitles and captions are often used interchangeably, including in this post, there are subtle differences between them:

  • Subtitles – Subtitles display on-screen text in a language different from the audio language, and they don’t represent non-dialogue audio such as significant sounds. The primary objective is to reach an audience that doesn’t speak the audio language in the video.
  • Captions (closed/open) – Captions display the dialogue being spoken in the audio, in the same language. Their primary purpose is to increase accessibility in cases where the end consumer can’t hear the audio due to a range of issues. Closed captions are stored in a file separate from the audio/video source and can be turned off and on at the user’s discretion, whereas open captions are part of the video file and can’t be turned off by the user.

Benefits of using Amazon Polly to generate audio with subtitles or closed captions

Imagine the following use case: you prepare a slide-based presentation for an online learning portal. Each slide includes onscreen content and narration. The onscreen content is a basic outline, and the narration goes into detail. Instead of recording a human voice, which can be cumbersome and inconsistent, you can use Amazon Polly to generate the narration. Amazon Polly produces high-quality, consistent voices. There’s no need for post-production. In the future, if you need to update a portion of the presentation, you only need to update the affected slides. The voice matches the original slides. Additionally, when Amazon Polly generates your audio, captions are included that appear in time with the audio. You save time because there’s no manual recording involved, and save additional time when updates are needed. Your presentation also delivers more value because captions help students consume the content. It’s a win-win-win solution.

There are a multitude of use cases for captions, such as advertisements in social spaces, gymnasiums, coffee shops, and other places where typically there is something on a television with the audio muted and music in the background; online training and classes; virtual meetings; public electronic announcements; watching videos while commuting without headphones and without disturbing co-passengers; and several more.

Irrespective of the field of application, closed captioning can help with the following:

  • Accessibility – People with hearing impairments can better consume your content.
  • Retention – Online learning is easier for e-learners to grasp and retain when more human senses are involved.
  • Reachability – Your content can reach people that have competing priorities, such as gaming and watching news simultaneously, or people who have a different native language than the audio language.
  • Searchability – The content is searchable by search engines. Although most search engines can’t search video content optimally, they can use the caption text files to make your content more discoverable.
  • Social courtesy – Sometimes it may be rude to play audio because of your surroundings, or the audio could be difficult to hear because of the noise of your environment.
  • Comprehension – The content is easier to comprehend irrespective of the accent of the speaker, native language of the speaker, or speed of speech. You can also take notes without repeatedly watching the same scene.

Solution overview

The library presented in this post uses Amazon Polly to generate audio and closed captions for an input text. You can easily integrate this library into your text-to-speech applications. It supports several audio formats and captions in both VTT and SRT file formats, which are the most commonly used across the industry.

In this post, we focus on the PollyVTT() syntax and options, and offer a few examples that demonstrate how to use the Python SubtitleGeneratorForPolly to simultaneously generate synchronous audio and subtitle files for a given text input. The output audio file format can be PCM(wav), OGG, or MP3, and the subtitle file format can be VTT or SRT. Furthermore, SubtitleGeneratorForPolly supports all Amazon Polly synthesize_speech parameters and adds to the rich Amazon Polly feature set.

The polly-vtt library and its dependencies are available on GitHub.

Install and use the function

Before we look at some examples of using PollyVTT(), the function that powers SubtitleGeneratorForPolly, let’s look at the installation and syntax of it.

Install the library using the following code:

pip install polly-vtt

To run from the command line, you simply run polly-vtt:

Usage: polly-vtt [OPTIONS] BASE_FILENAME VOICE_ID OUTPUT_FORMAT TEXT

The following code shows your options:

--caption-format TEXT 'srt' or 'vtt'
--help Show this message and exit. 

BASE_FILENAME: Base filename for both the audio and caption files 
VOICE_ID: Polly voice to use (Case-sensitive)
OUTPUT_FORMAT: Amazon Polly output format: pcm, mp3, ogg_vorbis 
TEXT: Full text to be digitized 
Caption format: srt or vtt

Let’s look at a few examples now.

Example 1

This example generates a PCM audio file along with an SRT caption file for two simple sentences:

$ polly-vtt testfile Joanna pcm "this is a test. this is a second sentence." --caption-format srt 

testfile.wav written successfully.
testfile.wav.srt written successfully.
Total Audio Length: 0:00:03.017500 
# of Sentences: 2

Example 2

This example demonstrates how to use a paragraph of text as input. This generates audio files in WAV, MP3, and OGG, and subtitles in SRT and VTT. The following example creates six files for the given input text:

  • pcm_testfile.wav
  • pcm_testfile.wav.vtt
  • mp3_testfile.mp3
  • mp3_testfile.mp3.vtt
  • ogg_testfile.ogg
  • ogg_testfile.ogg.srt

See the following code:

from polly_vtt import PollyVTT 

text = "News content is shaped by its own unique characteristics. Sentences and paragraphs are usually short and highly informative because writers have to compress information into a limited space. Depending on the theme, news articles may contain relevant terminology, place names, abbreviations, people’s names, and quotes. Excellent news writing is clear, precise, and avoids ambiguity. The writing is dynamic, especially in online articles, because content may get updated multiple times per day as new information becomes available."

polly_vtt = PollyVTT()

# pcm with VTT captions
polly_vtt.generate(
    "pcm_testfile",
    Text=text,
    VoiceId="Joanna",
    OutputFormat="pcm",
)

# mp3 with VTT captions
polly_vtt.generate(
    "mp3_testfile",
    Text=text,
    VoiceId="Joanna",
    OutputFormat="mp3",
)

# ogg with SRT captions
polly_vtt.generate(
    "ogg_testfile",
    "srt",
    Text=text,
    VoiceId="Joanna",
    OutputFormat="ogg_vorbis",
)

Example 3

In most cases, however, you want to pass the text as an input file. The following is a Python example of this, with the same output as the previous example:

from polly_vtt import PollyVTT

polly_vtt = PollyVTT()

# pcm with VTT captions
try:
    with open("input.txt", "r") as f:
        print("file is opened")
        polly_vtt.generate(
            "pcm_testfile",
            Text=f.read(),
            VoiceId="Joanna",
            OutputFormat="pcm",
        )
except Exception:
    print("error occurred while converting to PCM")
print("end of file")

# mp3 with VTT captions
try:
    with open("input.txt", "r") as f:
        print("file is opened")
        polly_vtt.generate(
            "mp3_testfile",
            Text=f.read(),
            VoiceId="Joanna",
            OutputFormat="mp3",
        )
except Exception:
    print("error occurred while converting to MP3")
print("end of file")

# ogg with SRT captions
try:
    with open("input.txt", "r") as f:
        print("file is opened")
        polly_vtt.generate(
            "ogg_testfile",
            "srt",
            Text=f.read(),
            VoiceId="Joanna",
            OutputFormat="ogg_vorbis",
        )
except Exception:
    print("error occurred while converting to OGG")
print("end of file")

The following is a testimonial from the AWS internal training team on using Amazon Polly with closed captions:

The following video offers a short demo of how the internal training team at AWS uses PollyVTT():

Conclusion

In this post, we shared a method to generate audio and subtitles at the same time for a given text. The PollyVTT() function and SubtitleGeneratorForPolly address a common requirement for subtitles in an efficient and effective manner. The Amazon Polly team continues to invent and offer simplified solutions to complex customer requirements.

For more tutorials and information about Amazon Polly, check out the AWS Machine Learning Blog.


About the Authors

Abhishek Soni is a Partner Solutions Architect at AWS. He works with customers to provide technical guidance for the best outcome of workloads on AWS.

Dan McKee uses audio, video, and coffee to distill content into targeted, modular, and structured courses. In his role as Curriculum Developer Project Manager for the NetSec Domain at Amazon Web Services, he leverages his experience in Data Center Networking to help subject matter experts bring ideas to life.

Orlando Karam is a Technical Curriculum Developer at Amazon Web Services, which means he gets to play with cool new technologies and then talk about it. Occasionally, he also uses those cool technologies to make his job easier.

Read More

Accelerate your identity verification projects using AWS Amplify and Amazon Rekognition sample implementations

Amazon Rekognition allows you to mitigate fraudulent attacks and minimize onboarding friction for legitimate customers through a streamlined identity verification process. This can result in an increase in customer trust and safety. Key capabilities of this solution include:

  • Register a new user using a selfie
  • Register a new user after face match against an ID card and ID card data extraction
  • Authenticate returning user

Amazon Rekognition offers pre-trained facial recognition capabilities that you can quickly add to your user onboarding and authentication workflows to verify opted-in users’ identities online. No machine learning (ML) expertise is required to use this service.

In a previous post, we described a typical identity verification workflow and showed you how to build an identity verification solution using various Amazon Rekognition APIs. In this post, we have added a facial identity-based authentication user interface to show a complete end-to-end identity verification solution. We provide a complete sample implementation in our GitHub repository.

Solution overview

The following reference architecture shows how you can use Amazon Rekognition, along with other AWS services, to implement identity verification.

The architecture includes the following components:

  1. Users access the front-end web portal hosted within AWS Amplify. Amplify is an end-to-end solution that enables front-end web developers to build and deploy secure, scalable full stack applications.
  2. Applications invoke Amazon API Gateway to route requests to the correct AWS Lambda function depending on the user flow. There are four major actions in this solution: authenticate, register, register with ID card, and update.
  3. API Gateway uses a service integration to run the AWS Step Functions express state machine corresponding to the specific endpoint called from API Gateway. Within each step, Lambda functions are responsible for triggering the correct set of calls to and from Amazon DynamoDB and Amazon Simple Storage Service (Amazon S3), along with the relevant Amazon Rekognition APIs.
  4. DynamoDB holds face IDs (face-id), S3 path URIs, and unique IDs (for example, an employee ID number) for each face-id. Amazon S3 stores all the face images.
  5. The final major component of the solution is Amazon Rekognition. Each flow (authenticate, register, register with ID card, and update) calls different Amazon Rekognition APIs depending on the task.

Before we deploy the solution, it’s important to know the following concepts and API descriptions:

  • Collections – Amazon Rekognition stores information about detected faces in server-side containers known as collections. You can use the facial information that’s stored in a collection to search for known faces in images, stored videos, and streaming videos. You can use collections in a variety of scenarios. For example, you might create a face collection to store scanned badge images by using the IndexFaces operation. When an employee enters the building, an image of the employee’s face may be captured and sent to the SearchFacesByImage operation. If the face match produces a sufficiently high similarity score (say 99%), you can authenticate the employee.
  • DetectFaces API – This API detects faces within an image provided as input and returns information about faces. In a user registration workflow, this operation may help you screen images before moving to the next step. For example, you can check if a photo contains a face, if the person identified is in the right orientation, and if they’re not wearing a face blocker such as sunglasses or a cap.
  • IndexFaces API – This API detects faces in the input image and adds them to the specified collection. This operation is used to add a screened image to a collection for future queries.
  • SearchFacesByImage API – For a given input image, the API first detects the largest face in the image, and then searches the specified collection for matching faces. The operation compares the features of the input face with face features in the specified collection.
  • CompareFaces API – This API compares a face in the source input image with each of the 100 largest faces detected in the target input image. If the source image contains multiple faces, the service detects the largest face and compares it with each face detected in the target image. For our use case, we expect both the source and target image to contain a single face.
  • DeleteFaces API – This API deletes faces from a collection. You specify a collection ID and an array of face IDs to remove.
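
To make these descriptions concrete, here is a rough boto3 sketch of the registration-style calls; the collection ID, bucket, and object keys are hypothetical placeholders rather than values from this solution:

import boto3

rekognition = boto3.client("rekognition")

COLLECTION_ID = "riv-demo-collection"  # hypothetical collection name
SELFIE = {"S3Object": {"Bucket": "riv-demo-bucket", "Name": "selfies/user1.jpg"}}  # hypothetical image

# Screen the image: DetectFaces returns attributes such as pose and sunglasses
faces = rekognition.detect_faces(Image=SELFIE, Attributes=["ALL"])

# Look for an existing match in the collection
matches = rekognition.search_faces_by_image(
    CollectionId=COLLECTION_ID,
    Image=SELFIE,
    FaceMatchThreshold=99,
    MaxFaces=1,
)

# If no match was found, index the face for future queries
if not matches["FaceMatches"]:
    rekognition.index_faces(
        CollectionId=COLLECTION_ID,
        Image=SELFIE,
        ExternalImageId="user1",
    )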

Workflows

The solution provides a sample of workflows to enable user registration, authentication, and updates to the user profile image. We detail each workflow in this section.

Register a new user using a face selfie

The following figure shows the workflow of a new user registration. Typical steps in this process are:

  1. A user captures a selfie image.
  2. A quality check of the selfie image is performed.
    Note: A liveness detection check can also be performed after this step. For more details, please read this blog.
  3. The selfie is checked against a database of existing user faces.

The following image illustrates the Step Functions workflow for new user registration.

Three functions are called in this workflow: detect-faces, search-faces, and index-faces. The detect-faces function calls the Amazon Rekognition DetectFaces API to determine if a face is detected in an image and is usable. Some of the quality checks include determining that only one face is present in the image, ensuring the face isn’t obscured by sunglasses or a hat, and confirming that the face isn’t rotated by using the pose dimension. If the image passes the quality check, the search-faces function searches for an existing face match in the Amazon Rekognition collections by confirming the FaceMatchThreshold confidence score meets your threshold objective. For more information, refer to Using similarity thresholds to match faces. If the face image doesn’t exist in the collections, the index-faces function is called to index the face in the collections. The face image metadata is stored in the DynamoDB table and the face images are stored in an S3 bucket.
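
As a hedged sketch of what such a quality check might look like inside the detect-faces function, the helper below inspects a DetectFaces response; the field names follow the Amazon Rekognition response schema, but the thresholds are illustrative assumptions, not the values used by this solution:

def selfie_passes_quality_check(detect_faces_response):
    """Illustrative checks: one face, no sunglasses, roughly frontal pose."""
    face_details = detect_faces_response["FaceDetails"]
    if len(face_details) != 1:  # exactly one face must be present
        return False
    face = face_details[0]
    if face["Sunglasses"]["Value"]:  # face must not be obscured by sunglasses
        return False
    pose = face["Pose"]  # reject strongly rotated faces (thresholds are assumptions)
    if any(abs(pose[axis]) > 30 for axis in ("Yaw", "Pitch", "Roll")):
        return False
    return True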

If the new user registration succeeds, the face image attribute information is added in DynamoDB. You can customize the flow according to the business process. It often contains some or all of the steps presented in the preceding diagram. You can choose to run all the steps synchronously (wait for one step to complete before moving on to the next step). Alternately, you can run some of the steps asynchronously (don’t wait for that step to complete) to speed up the user registration process and improve the customer experience. If the steps aren’t successful, you must roll back the user registration.

Register a new user after face match against an ID card with ID card data extraction

In addition to user registration with image, this workflow allows users to register with an identification card like driver’s license. The steps to register a new user with an ID card are similar to the steps for registering a new user.

The following image illustrates the Step Functions workflow for new user registration with ID.

Four functions are called in this workflow: detect-faces, search-faces, index-faces, and compare-faces. The sequence of operations in this workflow is similar to the user registration workflow with the addition of compare-faces. After verifying the quality of the selfie image and ensuring the face image is not present in the collection, the compare-faces function is invoked to verify that the selfie image matches the face image on the ID card. If the images match, the relevant properties are extracted from the ID card. You can extract key-value pairs from identity documents using the newly launched Amazon Textract AnalyzeID API (for US regions) or Amazon Rekognition DetectText API (non-US regions and non-English languages). The extracted properties from the ID card are merged and the user’s face is indexed in the collection via the index-faces function.
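
For illustration, a compare-faces step combined with ID data extraction might look roughly like the following boto3 sketch; the bucket, object keys, and similarity threshold are hypothetical assumptions:

import boto3

rekognition = boto3.client("rekognition")
textract = boto3.client("textract")

SELFIE = {"S3Object": {"Bucket": "riv-demo-bucket", "Name": "selfies/user1.jpg"}}   # hypothetical
ID_CARD = {"S3Object": {"Bucket": "riv-demo-bucket", "Name": "ids/user1-id.jpg"}}   # hypothetical

# Verify the selfie matches the face on the ID card
comparison = rekognition.compare_faces(
    SourceImage=SELFIE,
    TargetImage=ID_CARD,
    SimilarityThreshold=95,  # illustrative threshold
)

if comparison["FaceMatches"]:
    # Extract key-value pairs from the ID document (AnalyzeID, US regions)
    id_data = textract.analyze_id(DocumentPages=[ID_CARD])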

The face image metadata is stored in the DynamoDB table and the face images are stored in an S3 bucket.

If the images don’t match or a duplicate registration is detected, the user receives a login failure. Login failures can be logged using an Amazon CloudWatch event, and actions can be triggered using Amazon Simple Notification Service (Amazon SNS) to notify security operations for monitoring and tracking failed logins. For more information, refer to Monitoring Amazon SNS topics using CloudWatch.

Authenticate returning user

Another common flow is an existing or returning user login. In this flow, a check of the user face (selfie) is performed against a previously registered face. Typical steps in this process include user face capture (selfie), check of the selfie image quality, and search and compare of the selfie against the faces database. The following diagram shows a possible flow.

The following image illustrates the workflow for authenticating an existing user.

This Step Functions workflow calls three functions: detect-faces, compare-faces, and search-faces. After the detect-faces function verifies that the captured face image is valid, the compare-faces function checks the link in the DynamoDB table for a face image in the S3 bucket that matches an existing user. If a match is found, the user authenticates successfully. If a match isn’t found, the search-faces function is called to search for the face image in the collections. The user is verified and the authentication process completes if their face image exists in the collections. Otherwise, the user’s access is denied.
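
A hedged sketch of this fallback logic is shown below; the function name and thresholds are illustrative assumptions rather than the exact implementation in the sample repository:

def authenticate_user(rekognition, selfie, stored_image, collection_id):
    """Try a direct comparison first, then fall back to a collection search."""
    result = rekognition.compare_faces(
        SourceImage=selfie,
        TargetImage=stored_image,
        SimilarityThreshold=95,  # illustrative threshold
    )
    if result["FaceMatches"]:
        return True
    # Fall back to searching the whole collection for the captured face
    result = rekognition.search_faces_by_image(
        CollectionId=collection_id,
        Image=selfie,
        FaceMatchThreshold=99,  # illustrative threshold
        MaxFaces=1,
    )
    return bool(result["FaceMatches"])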

Prerequisites

Before you get started, complete the following prerequisites:

  1. Create an AWS account.
  2. Install the AWS Command Line Interface (AWS CLI) version 2 on your local machine. For instructions, refer to Installing or updating the latest version of the AWS CLI.
  3. Set up the AWS CLI.
  4. Install Node.js on your local machine.
  5. Clone the sample repo on your local machine:
git clone https://github.com/aws-samples/rekognition-identity-verification.git

Deploy the solution

Choose the appropriate CloudFormation stack to provision the solution in your AWS account in your preferred Region. This solution deploys API Gateway integrated with Step Functions and Amazon Rekognition APIs to run the identity verification workflows.

Clicking one of the following launch buttons provisions the solution into your AWS account in the corresponding Region:

  • N. Virginia (us-east-1)
  • Oregon (us-west-2)

Run the following steps on your local machine to deploy the Front-end application:

cd rekognition-identity-verification 
./fe-deployment.sh

Invoke the web UI

The web portal is deployed with Amplify. On the Amplify console, locate the hosted web application environment and the URL. Copy the URL and access it from your browser.

Register a new user using a face selfie

Register yourself as a user with the following steps:

  1. Open the web URL provided from Amplify.
  2. Choose Register.
  3. Enable your camera and capture a face image.
  4. Enter your user name and details.
  5. Choose Signup to register your account.

Authenticate returning user

After you’re registered, you log in using the face ID as an authentication mechanism.

  1. Open the web URL provided by Amplify.
  2. Capture your face ID.
  3. Enter your user ID.
  4. Choose Login.

You get a “Login successful” message after your face ID is verified with the registration image.

Register a new user after face match against an ID card with ID card data extraction

To test user registration with an ID, complete the following steps:

  1. Open the web URL provided by Amplify.
  2. Choose Register with ID.
  3. Enable your camera and capture a face image.
  4. Drag and drop your ID card.
  5. Choose Register.

The following screenshot shows an example. The application supports ID card images of up to 256 KB.

You receive a “Successfully Registered User” message.

Clean up

To prevent accruing additional charges in your AWS account, delete the resources you provisioned by navigating to the AWS CloudFormation console and deleting the Riv-Prod stack.

Deleting the stack doesn’t delete the S3 bucket you created. This bucket stores all the face images. If you want to delete the S3 bucket, navigate to the Amazon S3 console, empty the bucket, and then confirm you want to permanently delete it.
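
If you prefer to script the cleanup, a minimal boto3 sketch such as the following can empty and remove the bucket; the bucket name is a placeholder you would replace with your own:

import boto3

# Placeholder name: substitute the face-image bucket created by the stack
bucket = boto3.resource("s3").Bucket("your-face-image-bucket")

bucket.objects.all().delete()  # empty the bucket first
bucket.delete()                # then delete the now-empty bucket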

Conclusion

Amazon Rekognition makes it easy to add image analysis to your identity verification applications using proven, highly scalable, deep learning technology that requires no ML expertise to use. Amazon Rekognition provides face detection and comparison capabilities. With a combination of the DetectFaces, CompareFaces, IndexFaces, SearchFacesByImage, DetectText, and AnalyzeID APIs, you can implement the common flows around new user registration and existing user logins.

Amazon Rekognition collections provide a method to store information about detected faces in server-side containers. You can then use the facial information stored in a collection to search for known faces in images. When using collections, you don’t need to store original photos after you index faces in the collection. Amazon Rekognition collections don’t persist actual images. Instead, the underlying detection algorithm detects the faces in the input image, extracts facial features into a feature vector for each face, and stores it in the collection.

To start your journey towards identity verification, visit Identity Verification using Amazon Rekognition.


About the authors

Vineet Kacchawaha is a Solutions Architect at AWS with expertise in Machine Learning. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS.

Ramesh Thiagarajan is a Senior Solutions Architect based out of San Francisco. He holds a Bachelor of Science in Applied Sciences and a master’s in Cyber Security. He specializes in cloud migration, cloud security, compliance, and risk management. Outside of work, he is a passionate gardener, and has an avid interest in real estate and home improvement projects.

Amit Gupta is an AI Services Solutions Architect at AWS. He is passionate about enabling customers with well-architected machine learning solutions at scale.

Tim Murphy is a Senior Solutions Architect for AWS, working with enterprise financial service customers building business cloud centric solutions. He has spent the last decade working with startups, non-profits, commercial enterprise, and government agencies, deploying infrastructure at scale. In his spare time when he isn’t tinkering with technology, you’ll most likely find him in far flung areas of the earth hiking mountains, surfing waves, or biking through a new city.

Nate Bachmeier is an AWS Senior Solutions Architect that nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing applications. Besides this, Nate is a full-time student and has two kids.

Jessie-Lee Fry is a Senior AI/ML Specialist with a focus on Computer Vision at AWS. She helps organizations leverage Machine Learning and AI to combat fraud and drive innovation on behalf of their customers. Outside of work, she enjoys spending time with her family, traveling, and reading all about Responsible AI.

Read More