Characters for good, created by artificial intelligence

As it becomes easier to create hyper-realistic digital characters using artificial intelligence, much of the conversation around these tools has centered on misleading and potentially dangerous deepfake content. But the technology can also be used for positive purposes — to revive Albert Einstein to teach a physics class, talk through a career change with your older self, or anonymize people while preserving facial communication.

To encourage the technology’s positive possibilities, MIT Media Lab researchers and their collaborators at the University of California at Santa Barbara and Osaka University have compiled an open-source, easy-to-use character generation pipeline that combines AI models for facial gestures, voice, and motion and can be used to create a variety of audio and video outputs. 

The pipeline also marks the resulting output with a traceable, as well as human-readable, watermark to distinguish it from authentic video content and to show how it was generated — an addition to help prevent its malicious use.

By making this pipeline easily available, the researchers hope to inspire teachers, students, and health-care workers to explore how such tools can help them in their respective fields. If more students, educators, health-care workers, and therapists have a chance to build and use these characters, the results could improve health and well-being and contribute to personalized education, the researchers write in Nature Machine Intelligence.

“It will be a strange world indeed when AIs and humans begin to share identities. This paper does an incredible job of thought leadership, mapping out the space of what is possible with AI-generated characters in domains ranging from education to health to close relationships, while giving a tangible roadmap on how to avoid the ethical challenges around privacy and misrepresentation,” says Jeremy Bailenson, founding director of the Stanford Virtual Human Interaction Lab, who was not associated with the study.

Although the world mostly knows the technology from deepfakes, “we see its potential as a tool for creative expression,” says the paper’s first author Pat Pataranutaporn, a PhD student in professor of media technology Pattie Maes’ Fluid Interfaces research group.  

Other authors on the paper include Maes; Fluid Interfaces master’s student Valdemar Danry and PhD student Joanne Leong; Media Lab Research Scientist Dan Novy; Osaka University Assistant Professor Parinya Punpongsanon; and University of California at Santa Barbara Assistant Professor Misha Sra.

Deeper truths and deeper learning

Generative adversarial networks, or GANs, a combination of two neural networks that compete against each other, have made it easier to create photorealistic images, clone voices, and animate faces. Pataranutaporn, with Danry, first explored the technology’s possibilities in a project called Machinoia, where he generated multiple alternative representations of himself — as a child, as an old man, as female — to have a self-dialogue of life choices from different perspectives. The unusual deepfaking experience made him aware of his “journey as a person,” he says. “It was deep truth — to uncover something about yourself you’ve never thought of before, using your own data on your own self.”

Self-exploration is only one of the positive applications of AI-generated characters, the researchers say. Experiments show, for instance, that these characters can make students more enthusiastic about learning and improve cognitive task performance. The technology offers a way for instruction to be “personalized to your interest, your idols, your context, and can be changed over time,” Pataranutaporn explains, as a complement to traditional instruction.

For instance, the MIT researchers used their pipeline to create a synthetic version of Johann Sebastian Bach, which had a live conversation with renowned cellist Yo-Yo Ma in Media Lab Professor Tod Machover’s musical interfaces class — to the delight of both the students and Ma.

Other applications might include characters who help deliver therapy, to alleviate a growing shortage of mental health professionals and reach the estimated 44 percent of Americans with mental health issues who never receive counseling, or AI-generated content that delivers exposure therapy to people with social anxiety. In a related use case, the technology can be used to anonymize faces in video while preserving facial expressions and emotions, which may be useful for sessions where people want to share personally sensitive information such as health and trauma experiences, or for whistleblowers and witness accounts.

But there are also more artistic and playful use cases. In this fall’s Experiments in Deepfakes class, led by Maes and research affiliate Roy Shilkrot, students used the technology to animate the figures in a historical Chinese painting and to create a dating “breakup simulator,” among other projects.

Legal and ethical challenges

Many of the applications of AI-generated characters raise legal and ethical issues that must be discussed as the technology evolves, the researchers note in their paper. For instance, how will we decide who has the right to digitally recreate a historical character? Who is legally liable if an AI clone of a famous person promotes harmful behavior online? And is there any danger that we will prefer interacting with synthetic characters over humans?

“One of our goals with this research is to raise awareness about what is possible, ask questions and start public conversations about how this technology can be used ethically for societal benefit. What technical, legal, policy and educational actions can we take to promote positive use cases while reducing the possibility for harm?” states Maes.

By sharing the technology widely, while clearly labeling it as synthesized, Pataranutaporn says, “we hope to stimulate more creative and positive use cases, while also educating people about the technology’s potential benefits and harms.”


Q&A: Cathy Wu on developing algorithms to safely integrate robots into our world

Cathy Wu is the Gilbert W. Winslow Assistant Professor of Civil and Environmental Engineering and a member of the MIT Institute for Data, Systems, and Society. As an undergraduate, Wu won MIT’s toughest robotics competition, and as a graduate student took the University of California at Berkeley’s first-ever course on deep reinforcement learning. Now back at MIT, she’s working to improve the flow of robots in Amazon warehouses under the Science Hub, a new collaboration between the tech giant and the MIT Schwarzman College of Computing. Outside of the lab and classroom, Wu can be found running, drawing, pouring lattes at home, and watching YouTube videos on math and infrastructure via 3Blue1Brown and Practical Engineering. She recently took a break from all of that to talk about her work.

Q: What put you on the path to robotics and self-driving cars?

A: My parents always wanted a doctor in the family. However, I’m bad at following instructions and became the wrong kind of doctor! Inspired by my physics and computer science classes in high school, I decided to study engineering. I wanted to help as many people as a medical doctor could.

At MIT, I looked for applications in energy, education, and agriculture, but the self-driving car was the first to grab me. It has yet to let go! Ninety-four percent of serious car crashes are caused by human error and could potentially be prevented by self-driving cars. Autonomous vehicles could also ease traffic congestion, save energy, and improve mobility.

I first learned about self-driving cars from Seth Teller during his guest lecture for the course Mobile Autonomous Systems Lab (MASLAB), in which MIT undergraduates compete to build the best full-functioning robot from scratch. Our ball-fetching bot, Putzputz, won first place. From there, I took more classes in machine learning, computer vision, and transportation, and joined Teller’s lab. I also competed in several mobility-related hackathons, including one sponsored by Hubway, now known as Blue Bike.

Q: You’ve explored ways to help humans and autonomous vehicles interact more smoothly. What makes this problem so hard?

A: Both systems are highly complex, and our classical modeling tools are woefully insufficient. Integrating autonomous vehicles into our existing mobility systems is a huge undertaking. For example, we don’t know whether autonomous vehicles will cut energy use by 40 percent, or double it. We need more powerful tools to cut through the uncertainty. My PhD thesis at Berkeley tried to do this. I developed scalable optimization methods in the areas of robot control, state estimation, and system design. These methods could help decision-makers anticipate future scenarios and design better systems to accommodate both humans and robots.

Q: How is deep reinforcement learning, combining deep and reinforcement learning algorithms, changing robotics?

A: I took John Schulman and Pieter Abbeel’s reinforcement learning class at Berkeley in 2015, shortly after DeepMind published their breakthrough paper in Nature. They had trained an agent via deep learning and reinforcement learning to play “Space Invaders” and a suite of Atari games at superhuman levels. That created quite some buzz. A year later, I started to incorporate reinforcement learning into problems involving mixed traffic systems, in which only some cars are automated. I realized that classical control techniques couldn’t handle the complex nonlinear control problems I was formulating.

Deep RL is now mainstream but it’s by no means pervasive in robotics, which still relies heavily on classical model-based control and planning methods. Deep learning continues to be important for processing raw sensor data like camera images and radio waves, and reinforcement learning is gradually being incorporated. I see traffic systems as gigantic multi-robot systems. I’m excited for an upcoming collaboration with Utah’s Department of Transportation to apply reinforcement learning to coordinate cars with traffic signals, reducing congestion and thus carbon emissions.

Q: You’ve talked about the MIT course, 6.003 (Signals and Systems), and its impact on you. What about it spoke to you?

A: The mindset. That problems that look messy can be analyzed with common, and sometimes simple, tools. Signals are transformed by systems in various ways, but what do these abstract terms mean, anyway? A mechanical system can take a signal like gears turning at some speed and transform it into a lever turning at another speed. A digital system can take binary digits and turn them into other binary digits or a string of letters or an image. Financial systems can take news and transform it via millions of trading decisions into stock prices. People take in signals every day through advertisements, job offers, gossip, and so on, and translate them into actions that in turn influence society and other people. This humble class on signals and systems linked mechanical, digital, and societal systems and showed me how foundational tools can cut through the noise.

Q: In your project with Amazon you’re training warehouse robots to pick up, sort, and deliver goods. What are the technical challenges?

A: This project involves assigning robots to a given task and routing them there. [Professor] Cynthia Barnhart’s team is focused on task assignment, and mine, on path planning. Both problems are considered combinatorial optimization problems because the solution involves a combination of choices. As the number of tasks and robots increases, the number of possible solutions grows exponentially. It’s called the curse of dimensionality. Both problems are what we call NP-hard; there may not be an efficient algorithm to solve them. Our goal is to devise a shortcut.

Routing a single robot for a single task isn’t difficult. It’s like using Google Maps to find the shortest path home. It can be solved efficiently with several algorithms, including Dijkstra’s. But warehouses resemble small cities with hundreds of robots. When traffic jams occur, customers can’t get their packages as quickly. Our goal is to develop algorithms that find the most efficient paths for all of the robots.
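For the single-robot, single-task case, here is a minimal Dijkstra sketch in Python; the warehouse graph, node names, and edge costs are made up for illustration.

import heapq

def dijkstra(graph, start):
    """Shortest-path cost from `start` to every node in a weighted graph.

    `graph` maps each node to a list of (neighbor, edge_cost) pairs.
    """
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist

# Hypothetical warehouse aisles as a weighted graph
warehouse = {
    "dock": [("aisle_1", 2), ("aisle_2", 4)],
    "aisle_1": [("shelf_A", 3)],
    "aisle_2": [("shelf_A", 1)],
    "shelf_A": [],
}
print(dijkstra(warehouse, "dock"))  # {'dock': 0, 'aisle_1': 2, 'aisle_2': 4, 'shelf_A': 5}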

Q: Are there other applications?

A: Yes. The algorithms we test in Amazon warehouses might one day help to ease congestion in real cities. Other potential applications include controlling planes on runways, swarms of drones in the air, and even characters in video games. These algorithms could also be used for other robotic planning tasks like scheduling and routing.

Q: AI is evolving rapidly. Where do you hope to see the big breakthroughs coming?

A: I’d like to see deep learning and deep RL used to solve societal problems involving mobility, infrastructure, social media, health care, and education. Deep RL now has a toehold in robotics and industrial applications like chip design, but we still need to be careful in applying it to systems with humans in the loop. Ultimately, we want to design systems for people. Currently, we simply don’t have the right tools.

Q: What worries you most about AI taking on more and more specialized tasks?

A: AI has the potential for tremendous good, but it could also help to accelerate the widening gap between the haves and the have-nots. Our political and regulatory systems could help to integrate AI into society and minimize job losses and income inequality, but I worry that they’re not equipped yet to handle the firehose of AI.

Q: What’s the last great book you read?

A: “How to Avoid a Climate Disaster,” by Bill Gates. I absolutely loved the way that Gates was able to take an overwhelmingly complex topic and distill it down into words that everyone can understand. His optimism inspires me to keep pushing on applications of AI and robotics to help avoid a climate disaster.


Build custom Amazon SageMaker PyTorch models for real-time handwriting text recognition

In many industries, including financial services, banking, healthcare, legal, and real estate, automating document handling is an essential part of the business and customer service. In addition, strict compliance regulations make it necessary for businesses to handle sensitive documents, especially customer data, properly. Documents can come in a variety of formats, including digital forms or scanned documents (either PDF or images), and can include typed, handwritten, or embedded forms and tables. Manually extracting data and insight from these documents can be error-prone, expensive, time-consuming, and not scalable to a high volume of documents.

Optical character recognition (OCR) technology for recognizing typed characters has been around for years. Still, many companies extract data from scanned documents such as PDFs, images, tables, and forms either manually or through simple OCR software that requires manual configuration, and often reconfiguration whenever a form changes.

A digital document is often a scan or image of a physical one, so you can use machine learning (ML) models to automatically extract text and information (such as tables, images, captions, and key-value pairs) from it. For example, Amazon Textract, an API-based AI-enabled service, offers such capabilities with built-in trained models that you can use in applications without the need for any ML skills. At the same time, custom ML models use computer vision (CV) techniques to automate text extraction from images; this is particularly helpful when handwritten text needs to be extracted, which is a challenging problem. This is also known as handwriting recognition (HWR), or handwritten text recognition (HTR). HTR can make documents with handwritten content searchable and allow the content of older documents and forms to be stored in modern databases.
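As a quick illustration of the managed-service route, the following is a minimal sketch that calls Amazon Textract’s detect_document_text API on a local image; the file name is a placeholder.

import boto3

textract = boto3.client("textract")

# Read a scanned page from disk (placeholder file name)
with open("scanned_page.png", "rb") as f:
    image_bytes = f.read()

# Detect lines and words of text in the image
response = textract.detect_document_text(Document={"Bytes": image_bytes})

for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"])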

Unlike standard text recognition that can be trained on documents with typed content or synthetic datasets that are easy to generate and inexpensive to obtain, HWR comes with many challenges. These challenges include variability in writing styles, low quality of old scanned documents, and collecting good quality labeled training datasets, which can be expensive or hard to collect.

In this post, we share the processes, scripts, and best practices to develop a custom ML model in Amazon SageMaker that applies deep learning (DL) techniques based on the concept outlined in the paper GNHK: A Dataset for English Handwriting in the Wild to transcribe text in images of handwritten passages into strings. If you have your own data, you can use this solution to label your data and train a new model with it. The solution also deploys the trained models as endpoints that you can use to perform inference on actual documents and convert handwriting script to text. We explain how you can create a secure public gateway to your endpoint by using Amazon API Gateway.

Prerequisites

To try out the solution in your own account, make sure that you have the following in place:

We recommend using the JumpStart solution, which creates the resources properly set up and configured to successfully run the solution.

You can also use your own data to train the models, in which case you need to have images of handwritten text stored in Amazon Simple Storage Service (Amazon S3).

Solution overview

In the next sections, we walk you through each step of creating the resources outlined in the following architecture. However, launching the solution with SageMaker JumpStart in your account automatically creates these resources for you.

Launching this solution creates multiple resources in your account, including seven sample notebooks, multiple accompanying custom scripts that we use in training models and inference, and two pre-built demo endpoints that you can use for real-time inference if you don’t want to do the end-to-end training and hosting. The notebooks are as follows:

  • Demo notebook – Shows you how to use the demo endpoints for real-time handwritten text recognition
  • Introduction – Explains the architecture and the different stages of the solution
  • Labeling your own data – Shows you how to use Amazon SageMaker Ground Truth to label your own dataset
  • Data visualization – Visualizes the outcomes of data labeling
  • Model training – Trains custom PyTorch models with GNHK data
  • Model training with your own data – Allows you to use your own labeled data for training models
  • Endpoints – Creates SageMaker endpoints with custom trained models

GNHK data overview

This solution uses the GoodNotes Handwriting Kollection (GNHK) dataset released by GoodNotes under the CC-BY-4.0 license. The dataset was presented in a paper titled GNHK: A Dataset for English Handwriting in the Wild at the International Conference on Document Analysis and Recognition (ICDAR) in 2021, with the following citation:

@inproceedings{Lee2021,
  author={Lee, Alex W. C. and Chung, Jonathan and Lee, Marco},
  booktitle={International Conference of Document Analysis and Recognition (ICDAR)},
  title={GNHK: A Dataset for English Handwriting in the Wild},
  year={2021},
}

The GNHK dataset includes images of English handwritten text to allow ML practitioners and researchers to investigate new handwritten text recognition techniques. You can download the data for SageMaker training and testing in manifest format, which includes images, bounding box coordinates, and text strings for each bounding box. The following figure shows one of the images that is part of the training dataset.

Use your own labeled dataset

If you don’t want to use the GNHK dataset for training, you can train the models with your own data. If your data is labeled with the bounding box coordinates, you can create a custom manifest training file with the following format and readily use it for training the models. In this manifest file format, each line is a JSON that includes the following content:

{"source-ref": "FILE_NAME.jpg",
 "annotations":
  {"texts":
   [{"text": "FIRST_BOUNDING_BOX_CONTENT_TEXT",
     "polygon": [{"x": 178, "y": 253},
                 {"x": 172, "y": 350},
                 {"x": 627, "y": 313},
                 {"x": 615, "y": 421}]},
    {"text": "SECOND_BOUNDING_BOX_CONTENT_TEXT",
     "polygon": [{"x": 713, "y": 307},
                 {"x": 990, "y": 322},
                 {"x": 710, "y": 404},
                 {"x": 950, "y": 413}]},
    ...]}}
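For reference, a manifest in this format can be consumed line by line; the following is a minimal sketch, assuming the manifest is stored as JSON Lines under a placeholder file name.

import json

# Each line of the manifest is a standalone JSON object (placeholder file name)
with open("train_manifest.jsonl") as f:
    for line in f:
        record = json.loads(line)
        image_file = record["source-ref"]
        for text_box in record["annotations"]["texts"]:
            label = text_box["text"]
            polygon = text_box["polygon"]  # four {x, y} corner points
            print(image_file, label, polygon)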

Label your raw data using Ground Truth

If you don’t have a labeled training dataset, you can use Ground Truth to label your data using your private workforce or external resources such as Amazon Mechanical Turk. Ground Truth is a fully managed data labeling service that makes it easy to build highly accurate training datasets for ML. Ground Truth offers built-in workflows that support a variety of use cases, including text, images, and video. In addition, Ground Truth offers automatic data labeling, which uses an ML model to label your data. The following figure illustrates how Ground Truth works.

The JumpStart solution that is launched in your account creates a sample notebook (label_own_data.ipynb) that allows you to create Ground Truth labeling jobs to label your data using your private workforce. For details on how to set up labeling jobs for images as well as training and tutorial resources, see SageMaker Ground Truth Data Labeling Resources.

When data labeling is complete, you can use the accompanying data_visualization.ipynb notebook to visualize the results of the data labeling.

Train word segmentation and handwriting text recognition models

Now that the labeled data is prepared, you can use that to train a model that can recognize handwritten passages and return the text string equivalents. In this section, we walk you through this process and explain each step of building and training the models. We use PyTorch to take advantage of state-of-the-art frameworks for object detection and text recognition. You can also develop the same approach using other deep learning frameworks, such as TensorFlow or MXNet. SageMaker provides pre-built Docker images that include deep learning framework libraries and other dependencies needed for training and inference. For a complete list of pre-built Docker images, see Available Deep Learning Containers Images.

Train and test datasets

Before we get started with model training, we need to have a training dataset and a test (or validation) dataset to validate the trained model performance. The GNHK dataset already offers two separate datasets for training and testing in SageMaker manifest format, and this solution uses these datasets. If you want to use your own labeled dataset, the easiest way is to split a labeled data manifest file into train and test sets (for example, 80% training and 20% test).

SageMaker reads the training and test datasets from Amazon S3. After splitting the data, you need to store the manifest files and the corresponding images in Amazon S3, and then use the S3 URIs in the training scripts, as outlined in the following sections.
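The following is a minimal sketch of that split-and-upload step, assuming a JSON Lines manifest and placeholder file and bucket names; it produces the train_dataset_s3_uri and test_dataset_s3_uri values referenced by the training snippets below.

import random

import boto3

# Split a labeled manifest into train and test sets (80/20); file and bucket names are placeholders
with open("labeled_manifest.jsonl") as f:
    lines = f.readlines()

random.shuffle(lines)
split = int(0.8 * len(lines))
train_lines, test_lines = lines[:split], lines[split:]

with open("train.manifest", "w") as f:
    f.writelines(train_lines)
with open("test.manifest", "w") as f:
    f.writelines(test_lines)

# Upload the manifests to S3 and keep the URIs for the training scripts
s3 = boto3.client("s3")
bucket = "<YOUR-S3-BUCKET>"
s3.upload_file("train.manifest", bucket, "manifests/train.manifest")
s3.upload_file("test.manifest", bucket, "manifests/test.manifest")

train_dataset_s3_uri = f"s3://{bucket}/manifests/train.manifest"
test_dataset_s3_uri = f"s3://{bucket}/manifests/test.manifest"

The images referenced by each source-ref entry must also be uploaded to the same bucket.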

Train the word segmentation model

For inference on images of handwritten text that consist of multiple lines, each containing multiple words, we need to create two models. The first model segments the image into single words by using bounding box prediction (or localization); the second model runs text recognition on each segment separately. Each model is hosted on a SageMaker inference endpoint for real-time inference. Both models use PyTorch framework containers for version 1.6.0. For more details on training and deploying models with PyTorch, including requirements for training and inference scripts, see Use PyTorch with the SageMaker Python SDK. For training purposes, we use the SageMaker PyTorch estimator class. For more details, see Create an Estimator. For training, you need a custom training script as the entry point. When launching this JumpStart solution in your account, SageMaker automatically adds all accompanying custom training and inference code to your files. For the localization model, we use the custom 1_train_localisation.py code under the src_localisation folder. The estimator uses one GPU-based instance for training.

In the following code snippet, we define model performance metrics and create a PyTorch estimator class with the entry point directing to the training script directory in the code repository. At the end, we launch the training by calling the .fit method on the estimator with the training dataset and validation on the test dataset.

# Define model performance metrics
metric_definitions = [
    {
        "Name": "iter",
        "Regex": r".*iter:\s([0-9\.]+)\s*"
    },
    {
        "Name": "total_loss",
        "Regex": r".*total_loss:\s([0-9\.]+)\s*"
    }
]

# Define PyTorch estimator class, and then call .fit method to launch training
import sagemaker
from sagemaker.pytorch import PyTorch

# sagemaker_config and output_path_s3_url are defined by the JumpStart notebook
session = sagemaker.session.Session()
role = sagemaker_config["SageMakerIamRole"]

localisation_estimator = PyTorch(entry_point='1_train_localisation.py',
                                 source_dir='src_localisation',
                                 dependencies=['utils', 'configs'],
                                 role=role,
                                 instance_type=sagemaker_config["SageMakerTrainingInstanceType"],
                                 instance_count=1,
                                 output_path=output_path_s3_url,
                                 framework_version='1.6.0',
                                 py_version='py3',
                                 metric_definitions=metric_definitions,
                                 base_job_name='htr-word-segmentation',
                                 sagemaker_session=session
                                )

localisation_estimator.fit({"train": train_dataset_s3_uri,
                            "test": test_dataset_s3_uri},
                           wait=False)

Train the handwriting text recognition model

After the word segments are determined by the previous model, the next piece of the inference pipeline is to run handwriting recognition on each segment. The process is the same, but this time we use a different custom training script, 2_train_recogniser.py under src_recognition, as the entry point for the estimator, and train a new model. Like the previous model, this one is trained on the training dataset and evaluated on the test dataset. If you launch the JumpStart solution in your account, SageMaker automatically adds these source codes to your files in your Studio domain.

# Define model performance metrics
metric_definitions = [
    {'Name': 'Iteration', 'Regex': 'Iteration ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},
    {'Name': 'train_loss', 'Regex': 'Train loss ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'},
    {'Name': 'test_loss',  'Regex': 'Test loss ([-+]?[0-9]*[.]?[0-9]+([eE][-+]?[0-9]+)?)'}
]

# Define PyTorch estimator class, and then call .fit method to launch training 
recognition_estimator = PyTorch(entry_point='2_train_recogniser.py',
                                source_dir='src_recognition',
                                dependencies=['utils', 'configs'],
                                role=role,
                                instance_type=sagemaker_config["SageMakerTrainingInstanceType"],
                                instance_count=1,
                                output_path=output_path_s3_url,
                                framework_version='1.6.0',
                                py_version='py3',
                                metric_definitions=metric_definitions,
                                base_job_name='htr-text-recognition',
                                sagemaker_session=session
                               )

recognition_estimator.fit({"train": train_dataset_s3_uri,
                           "test": test_dataset_s3_uri},
                          wait=False)

Next, we attach the estimators to the training jobs and wait until training is complete before deploying the models. Attaching works as follows: if the training job has already completed, the estimator can be deployed immediately to create a SageMaker endpoint and return a predictor; if the training job is still in progress, attach blocks and streams the job’s log messages until it finishes. Each training job may take around an hour to complete.

localisation_estimator = PyTorch.attach(training_job_name=localisation_estimator.latest_training_job.name,
                                        sagemaker_session=session)
recognition_estimator = PyTorch.attach(training_job_name=recognition_estimator.latest_training_job.name,
                                       sagemaker_session=session)

When both model trainings are complete, you can move on to the next step, which is creating an endpoint for real-time inference on images using the two models we just trained.

Create SageMaker endpoints for real-time inference

The next step in building this solution is to create endpoints with the trained models that we can use for real-time inference on handwritten text. We walk you through the steps of downloading the model artifacts, creating model containers, deploying the containers, and finally using the deployed models to make real-time inference on a demo image or your own image.

First we need to parse trained model artifacts from Amazon S3. After each training job, SageMaker stores the trained model in the form of a tar ball (.tar.gz) on Amazon S3 that you can download to utilize inside or outside of SageMaker. For this purpose, the following code snippet uses three utility functions (get_latest_training_job, get_model_data, and parse_model_data) from the sm_utils folder that is automatically added to your files in Studio when you launch the JumpStart solution in your account. The script shows how to download the PyTorch word segmentation (or localization) model data, compress it into a tar ball, and copy it to Amazon S3 for building the model later. You can repeat this process for the text recognition model.

import os

from utils.sm_utils import get_latest_training_job, get_model_data, parse_model_data

# Download the word segmentation model and rename it for packaging
os.mkdir("model_word_seg")

word_seg_training_job = get_latest_training_job('htr-word-segmentation')
word_seg_s3 = get_model_data(word_seg_training_job)
parse_model_data(word_seg_s3, "model_word_seg")

os.rename("model_word_seg/mask_rcnn/model_final.pth", "model_word_seg/mask_rcnn/model.pth")

# Repackage the model and copy it to S3 for building the model later
!tar -czvf model.tar.gz --directory=model_word_seg/mask_rcnn/ model.pth
!aws s3 cp model.tar.gz s3://<YOUR-S3-BUCKET>/custom_data/artifacts/word-seg/model.tar.gz

Now that we have the trained model files, it’s easy to create a model container in SageMaker. Because we trained the model with the PyTorch estimator class, we can use the PyTorch model class to create a model container that uses a custom inference script. See Deploy PyTorch Models for more details. After we create the model, we can create a predictor by deploying the model as an endpoint for real-time inference. You can change the number of instances for your endpoint or select a different accelerated computing (GPU) instance from the list of available instances for real-time inference. The PyTorch model class uses the inference.py script for each model that is added to your files when you launch the JumpStart solution in your Studio domain. In the following code, we create the word segmentation model. You can follow the same approach for creating the text recognition model.

from sagemaker.pytorch import PyTorchModel

# Create word segmentation model (model_name is assumed to be defined earlier in the notebook)
seg_model = PyTorchModel(model_data='s3://<YOUR-S3-BUCKET>/custom_data/artifacts/word-seg/model.tar.gz',
                         role=role,
                         source_dir="src_localisation",
                         entry_point="inference.py",
                         framework_version="1.6.0",
                         name=model_name,
                         py_version="py3"
                        )

Now we can build the endpoint by calling the .deploy method on the model and creating a predictor. Then we attach the serializer and deserializer to the endpoint. You can follow the same approach for the second model, for text recognition.

# Deploy word segmentation model to an endpoint
localisation_predictor = seg_model.deploy(instance_type=sagemaker_config["SageMakerInferenceInstanceType"],
                                          endpoint_name='word-segmentation-endpoint',
                                          initial_instance_count=1,
                                          deserializer=sagemaker.deserializers.JSONDeserializer(),
                                          serializer=sagemaker.serializers.JSONSerializer(),
                                          wait=False)

Endpoint creation should take about 6–7 minutes to complete. The following code creates a waiter that blocks until the endpoint is in service.

import boto3

client = boto3.client('sagemaker')
waiter = client.get_waiter('endpoint_in_service')
waiter.wait(EndpointName='word-segmentation-endpoint')

When the model deployments are complete, you can send an image of a handwritten passage to the first endpoint to get the bounding boxes and their coordinates for each word. Then use those coordinates to crop each word, send the crops to the second endpoint individually, and get back the recognized text string for each one. You can then take the outputs of the two endpoints and overlay the bounding boxes and the text on the raw image, or use the outputs in your downstream processes.
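The following is a minimal sketch of that two-step inference flow using the SageMaker runtime client. The payload and response shapes are simplifying assumptions (the real formats are defined by the solution’s inference.py scripts), and the text recognition endpoint name is hypothetical.

import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

def invoke(endpoint_name, payload):
    """Send a JSON payload to a SageMaker endpoint and return the parsed JSON reply."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())

# Load and base64-encode the handwritten page (placeholder file name)
with open("handwritten_page.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# 1) Word segmentation endpoint returns one bounding box per word
#    (request/response shapes shown here are assumptions)
segmentation = invoke("word-segmentation-endpoint", {"image": image_b64})

# 2) Text recognition endpoint transcribes each word crop
#    (endpoint name is hypothetical; it is whatever you chose when deploying the second model)
words = []
for box in segmentation.get("boxes", []):
    result = invoke("text-recognition-endpoint", {"image": image_b64, "box": box})
    words.append(result.get("text", ""))

print(" ".join(words))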

The following diagram illustrates the overall process workflow.

Extensions

Now that you have working endpoints that are making real-time inference, you can use them for your applications or website. However, your SageMaker endpoints are still not public facing; you need to build an API Gateway to allow external traffic to reach them. API Gateway is a fully managed service that makes it easy for developers to create, publish, maintain, monitor, and secure APIs at any scale. You can use API Gateway to present an external-facing, single point of entry for SageMaker endpoints, and provide security, throttling, authentication, firewall protection through AWS WAF, and much more. With API Gateway mapping templates, you can invoke your SageMaker endpoint with a REST API request and receive an API response back. Mapping templates enable you to integrate API Gateway directly with SageMaker endpoints without the need for any intermediate AWS Lambda function, making your online applications faster and cheaper. To create an API Gateway and use it to make real-time inference with your SageMaker endpoints (as in the following architecture), see Creating a machine learning-powered REST API with Amazon API Gateway mapping templates and Amazon SageMaker.
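For illustration, once the mapping template is configured, a client can call the REST API directly; the invoke URL below is a hypothetical example.

import base64

import requests

# Hypothetical API Gateway invoke URL mapped to the word segmentation endpoint
API_URL = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/handwriting"

with open("handwritten_page.jpg", "rb") as f:
    payload = {"image": base64.b64encode(f.read()).decode("utf-8")}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
print(response.json())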

Conclusion

In this post, we explained an end-to-end solution for recognizing handwritten text using SageMaker custom models. The solution included labeling training data with Ground Truth, training models with PyTorch estimator classes and custom scripts, and creating SageMaker endpoints for real-time inference. We also explained how you can create a public API Gateway that can be securely used with your mobile applications or website.

For more SageMaker examples, visit the GitHub repository. In addition, you can find more PyTorch bring-your-own-script examples.

For more SageMaker Python examples for MXNet, TensorFlow, and PyTorch, visit Amazon SageMaker Pre-Built Framework Containers and the Python SDK.

For more Ground Truth examples, visit Introduction to Ground Truth Labeling Jobs. Additional information about SageMaker can be found in the technical documentation.


About the Authors

Jonathan Chung is an Applied Scientist in Halo Health tech. He works on applying classical signal processing and deep learning techniques to time series and biometrics data. Previously, he was an Applied Scientist at AWS. He enjoys cooking and visiting historical cities around the world.

Dr. Nick Minaie is the Manager of Data Science and Business Intelligence at Amazon, leading innovative machine learning product development for Amazon’s Time and Attendance team. Previously, he served as a Senior AI/ML Solutions Architect at AWS, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Nick enjoys family time, abstract painting, and exploring nature.

Dr. Li Zhang is a Principal Product Manager-Technical for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms, a service that helps data scientists and machine learning practitioners get started with training and deploying their models, and use reinforcement learning with Amazon SageMaker. His past work as a principal research staff member and master inventor at IBM Research won the Test of Time Paper Award at IEEE INFOCOM.

Shenghua Yue is a Software Development Engineer at Amazon SageMaker. She focuses on building machine learning tools and products for customers. Outside of work, she enjoys the outdoors, yoga, and hiking.


A Scalable Approach for Partially Local Federated Learning

Federated learning enables users to train a model without sending raw data to a central server, thus avoiding the collection of privacy-sensitive data. Often this is done by learning a single global model for all users, even though the users may differ in their data distributions. For example, users of a mobile keyboard application may collaborate to train a suggestion model but have different preferences for the suggestions. This heterogeneity has motivated algorithms that can personalize a global model for each user.

However, in some settings privacy considerations may prohibit learning a fully global model. Consider models with user-specific embeddings, such as matrix factorization models for recommender systems. Training a fully global federated model would involve sending user embedding updates to a central server, which could potentially reveal the preferences encoded in the embeddings. Even for models without user-specific embeddings, having some parameters be completely local to user devices would reduce server-client communication and responsibly personalize those parameters to each user.

Left: A matrix factorization model with a user matrix P and items matrix Q. The user embedding for a user u (Pu) and item embedding for item i (Qi) are trained to predict the user’s rating for that item (Rui). Right: Applying federated learning approaches to learn a global model can involve sending updates for Pu to a central server, potentially leaking individual user preferences.
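As a quick illustration of the setup in the figure, here is a minimal NumPy sketch of the prediction, with made-up matrix sizes and values:

import numpy as np

rng = np.random.default_rng(0)

num_users, num_items, dim = 4, 6, 3      # made-up sizes
P = rng.normal(size=(num_users, dim))    # user embeddings, one row per user
Q = rng.normal(size=(num_items, dim))    # item embeddings, one row per item

u, i = 2, 5
predicted_rating = P[u] @ Q[i]           # R_ui is approximated by the dot product P_u · Q_i
print(predicted_rating)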

In “Federated Reconstruction: Partially Local Federated Learning”, presented at NeurIPS 2021, we introduce an approach that enables scalable partially local federated learning, where some model parameters are never aggregated on the server. For matrix factorization, this approach trains a recommender model while keeping user embeddings local to each user device. For other models, this approach trains a portion of the model to be completely personal for each user while avoiding communication of these parameters. We successfully deployed partially local federated learning to Gboard, resulting in better recommendations for hundreds of millions of keyboard users. We’re also releasing a TensorFlow Federated tutorial demonstrating how to use Federated Reconstruction.

Federated Reconstruction
Previous approaches for partially local federated learning used stateful algorithms, which require user devices to store a state across rounds of federated training. Specifically, these approaches required devices to store local parameters across rounds. However, these algorithms tend to degrade in large-scale federated learning settings. In these cases, the majority of users do not participate in training, and users who do participate likely only do so once, resulting in a state that is rarely available and can get stale across rounds. Also, all users who do not participate are left without trained local parameters, preventing practical applications.

Federated Reconstruction is stateless and avoids the need for user devices to store local parameters by reconstructing them whenever needed. When a user participates in training, before updating any globally aggregated model parameters, they randomly initialize and train their local parameters using gradient descent on local data with global parameters frozen. They can then calculate updates to global parameters with local parameters frozen. A round of Federated Reconstruction training is depicted below.

Models are partitioned into global and local parameters. For each round of Federated Reconstruction training: (1) The server sends the current global parameters g to each user i; (2) Each user i freezes g and reconstructs their local parameters li; (3) Each user i freezes li and updates g to produce gi; (4) Users’ gi are averaged to produce the global parameters for the next round. Steps (2) and (3) generally use distinct parts of the local data.
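The following is a minimal NumPy sketch of one such round for the matrix factorization case; the toy data, learning rate, and the reuse of a single rating set for steps (2) and (3) are simplifications for illustration only.

import numpy as np

def reconstruct_local(Q, ratings, dim, steps=1, lr=0.1):
    # Step (2): freeze the global item embeddings Q and train a fresh user embedding p
    p = np.zeros(dim)                        # local parameters are re-initialized every round
    for _ in range(steps):
        for i, r in ratings.items():
            err = p @ Q[i] - r               # prediction error for item i
            p -= lr * err * Q[i]             # gradient step on squared error w.r.t. p
    return p

def local_update_to_global(Q, p, ratings, lr=0.1):
    # Step (3): freeze the reconstructed p and compute this user's updated copy of Q
    Q_new = Q.copy()
    for i, r in ratings.items():
        err = p @ Q[i] - r
        Q_new[i] -= lr * err * p             # gradient step on squared error w.r.t. Q_i
    return Q_new

# One toy round: two users with made-up ratings (item index -> rating)
rng = np.random.default_rng(0)
dim, num_items = 3, 5
Q = rng.normal(scale=0.1, size=(num_items, dim))   # global parameters g
users = [{0: 4.0, 2: 1.0}, {1: 5.0, 4: 2.0}]

client_copies = []
for ratings in users:                               # step (1): server sends Q to each user
    p = reconstruct_local(Q, ratings, dim)          # step (2)
    client_copies.append(local_update_to_global(Q, p, ratings))  # step (3)

Q = np.mean(client_copies, axis=0)                  # step (4): average into next round's globals

As noted below, even a single reconstruction step is often enough in practice.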

This simple approach avoids the challenges of previous methods. It does not assume users have a state from previous rounds of training, enabling large-scale training, and local parameters are always freshly reconstructed, preventing staleness. Users unseen during training can still get trained models and perform inference by simply reconstructing local parameters using local data.

Federated Reconstruction trains better performing models for unseen users compared to other approaches. For a matrix factorization task with unseen users, the approach significantly outperforms both centralized training and baseline Federated Averaging.

                        RMSE ↓    Accuracy ↑
Centralized             1.36      40.8%
FedAvg                  0.934     40.0%
FedRecon (this work)    0.907     43.3%
Root-mean-square-error (lower is better) and accuracy for a matrix factorization task with unseen users. Centralized training and Federated Averaging (FedAvg) both reveal privacy-sensitive user embeddings to a central server, while Federated Reconstruction (FedRecon) avoids this.

These results can be explained via a connection to meta learning (i.e., learning to learn); Federated Reconstruction trains global parameters that lead to fast and accurate reconstruction of local parameters for unseen users. That is, Federated Reconstruction is learning to learn local parameters. In practice, we observe that just one gradient descent step can yield successful reconstruction, even for models with about one million local parameters.

Federated Reconstruction also provides a way to personalize models for heterogeneous users while reducing communication of model parameters — even for models without user-specific embeddings. To evaluate this, we apply Federated Reconstruction to personalize a next word prediction language model and observe a substantial increase in performance, attaining accuracy on par with other personalization methods despite reduced communication. Federated Reconstruction also outperforms other personalization methods when executed at a fixed communication level.

                        Accuracy ↑    Communication ↓
FedYogi                 24.3%         Whole Model
FedYogi + Finetuning    30.8%         Whole Model
FedRecon (this work)    30.7%         Partial Model
Accuracy and server-client communication for a next word prediction task without user-specific embeddings. FedYogi communicates all model parameters, while FedRecon avoids this.

Real-World Deployment in Gboard
To validate the practicality of Federated Reconstruction in large-scale settings, we deployed the algorithm to Gboard, a mobile keyboard application with hundreds of millions of users. Gboard users use expressions (e.g., GIFs, stickers) to communicate with others. Users have highly heterogeneous preferences for these expressions, making the setting a good fit for using matrix factorization to predict new expressions a user might want to share.

Gboard users can communicate with expressions, preferences for which are highly personal.

We trained a matrix factorization model over user-expression co-occurrences using Federated Reconstruction, keeping user embeddings local to each Gboard user. We then deployed the model to Gboard users, leading to a 29.3% increase in click-through-rate for expression recommendations. Since most Gboard users were unseen during federated training, Federated Reconstruction played a key role in this deployment.

Further Explorations
We’ve presented Federated Reconstruction, a method for partially local federated learning. Federated Reconstruction enables personalization to heterogeneous users while reducing communication of privacy-sensitive parameters. We scaled the approach to Gboard in alignment with our AI Principles, improving recommendations for hundreds of millions of users.

For a technical walkthrough of Federated Reconstruction for matrix factorization, check out the TensorFlow Federated tutorial. We’ve also released general-purpose TensorFlow Federated libraries and open-source code for running experiments.

Acknowledgements
Karan Singhal, Hakim Sidahmed, Zachary Garrett, Shanshan Wu, Keith Rush, and Sushant Prakash co-authored the paper. Thanks to Wei Li, Matt Newton, and Yang Lu for their partnership on Gboard deployment. We’d also like to thank Brendan McMahan, Lin Ning, Zachary Charles, Warren Morningstar, Daniel Ramage, Jakub Konecný, Alex Ingerman, Blaise Agüera y Arcas, Jay Yagnik, Bradley Green, and Ewa Dominowska for their helpful comments and support.


Improving the factual accuracy of language models through web browsing

We’ve fine-tuned GPT-3 to more accurately answer open-ended questions using a text-based web browser. Our prototype copies how humans research answers to questions online – it submits search queries, follows links, and scrolls up and down web pages. It is trained to cite its sources, which makes it easier to give feedback to improve factual accuracy. We’re excited about developing more truthful AI, but challenges remain, such as coping with unfamiliar types of questions.


Language models like GPT-3 are useful for many different tasks, but have a tendency to “hallucinate” information when performing tasks requiring obscure real-world knowledge. To address this, we taught GPT-3 to use a text-based web-browser. The model is provided with an open-ended question and a summary of the browser state, and must issue commands such as “Search …”, “Find in page: …” or “Quote: …”. In this way, the model collects passages from web pages, and then uses these to compose an answer.

The model is fine-tuned from GPT-3 using the same general methods we’ve used previously. We begin by training the model to copy human demonstrations, which gives it the ability to use the text-based browser to answer questions. Then we improve the helpfulness and accuracy of the model’s answers, by training a reward model to predict human preferences, and optimizing against it using either reinforcement learning or rejection sampling.
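As a rough sketch of the rejection-sampling (best-of-n) step described above, the following snippet uses hypothetical stand-ins for the fine-tuned policy and the reward model:

import random

def generate_answer(question):
    # Hypothetical stand-in for sampling from the fine-tuned, browsing-capable policy
    return f"candidate answer #{random.randint(0, 9999)} to: {question}"

def reward_model_score(question, answer):
    # Hypothetical stand-in for the trained reward model's scalar preference score
    return random.random()

def best_of_n(question, n=64):
    """Rejection sampling: draw n candidate answers and keep the one the reward model prefers."""
    candidates = [generate_answer(question) for _ in range(n)]
    scores = [reward_model_score(question, a) for a in candidates]
    return candidates[max(range(n), key=scores.__getitem__)]

print(best_of_n("Why is the sky blue?", n=8))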

Cherry-picked samples from our best-performing model (175B with best-of-64 against a reward model).


ELI5 results

Our system is trained to answer questions from ELI5, a dataset of open-ended questions scraped from the “Explain Like I’m Five” subreddit. We trained three different models, corresponding to three different inference-time compute budgets. Our best-performing model produces answers that are preferred 56% of the time to answers written by our human demonstrators, with a similar level of factual accuracy. Even though these were the same kind of demonstrations used to train the model, we were able to outperform them by using human feedback to improve the model’s answers.

Results of human evaluations on the ELI5 test set, comparing our model with human demonstrators. The amount of rejection sampling (the n in best-of-n) was chosen to be compute-efficient. Error bars show ±1 standard error.

TruthfulQA results

For questions taken from the training distribution, our best model’s answers are about as factually accurate as those written by our human demonstrators, on average. However, out-of-distribution robustness is a challenge. To probe this, we evaluated our models on TruthfulQA, an adversarially-constructed dataset of short-form questions designed to test whether models fall prey to things like common misconceptions. Answers are scored on both truthfulness and informativeness, which trade off against one another (for example, “I have no comment” is considered truthful but not informative).

Our models outperform GPT-3 on TruthfulQA and exhibit more favourable scaling properties. However, our models lag behind human performance, partly because they sometimes quote from unreliable sources (as shown in the question about ghosts above). We hope to reduce the frequency of these failures using techniques like adversarial training.

TruthfulQA results. For GPT-3, we used the prompts and automated metric from the TruthfulQA paper. For the web-browsing model, we truncated the long-form answers and used human evaluation, since the answers are out-of-distribution for the automated metric. Error bars show ±1 standard error.

Evaluating factual accuracy

In order to provide feedback to improve factual accuracy, humans must be able to evaluate the factual accuracy of claims produced by models. This can be extremely challenging, since claims can be technical, subjective or vague. For this reason, we require the model to cite its sources. This allows humans to evaluate factual accuracy by checking whether a claim is supported by a reliable source. As well as making the task more manageable, it also makes it less ambiguous, which is important for reducing label noise.

However, this approach raises a number of questions. What makes a source reliable? What claims are obvious enough to not require support? What trade-off should be made between evaluations of factual accuracy and other criteria such as coherence? All of these were difficult judgment calls. We do not think that our model picked up on much of this nuance, since it still makes basic errors. But we expect these kinds of decisions to become more important as AI systems improve, and cross-disciplinary research is needed to develop criteria that are both practical and epistemically sound. We also expect further considerations such as transparency to be important.

Eventually, having models cite their sources will not be enough to evaluate factual accuracy. A sufficiently capable model would cherry-pick sources it expects humans to find convincing, even if they do not reflect a fair assessment of the evidence. There are already signs of this happening (see the questions about boats above). We hope to mitigate this using methods like debate.

Risks of deployment and training

Although our model is generally more truthful than GPT-3 (in that it generates false statements less frequently), it still poses risks. Answers with citations are often perceived as having an air of authority, which can obscure the fact that our model still makes basic errors. The model also tends to reinforce the existing beliefs of users. We are researching how best to address these and other concerns.

In addition to these deployment risks, our approach introduces new risks at train time by giving the model access to the web. Our browsing environment does not allow full web access, but allows the model to send queries to the Microsoft Bing Web Search API and follow links that already exist on the web, which can have side-effects. From our experience with GPT-3, the model does not appear to be anywhere near capable enough to dangerously exploit these side-effects. However, these risks increase with model capability, and we are working on establishing internal safeguards against them.

Conclusion

Human feedback and tools such as web browsers offer a promising path towards robustly truthful, general-purpose AI systems. Our current system struggles with challenging or unfamiliar circumstances, but still represents significant progress in this direction.

If you’d like to help us build more helpful and truthful AI systems, we’re hiring!


References
  1. O. Evans, O. Cotton-Barratt, L. Finnveden, A. Bales, A. Balwit, P. Wills, L. Righetti, and W. Saunders. Truthful AI: Developing and governing AI that does not lie. arXiv preprint arXiv:2110.06674, 2021.
  2. J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.
  3. K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
  4. A. Fan, Y. Jernite, E. Perez, D. Grangier, J. Weston, and M. Auli. ELI5: Long form question answering. arXiv preprint arXiv:1907.09190, 2019.
  5. S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021.
  6. D. Metzler, Y. Tay, D. Bahri, and M. Najork. Rethinking search: Making experts out of dilettantes. arXiv preprint arXiv:2105.02274, 2021.


Acknowledgments

Thanks to our paper co-authors: Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Roger Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight and Benjamin Chess.

Thanks to those who helped with and provided feedback on this release: Steven Adler, Sam Altman, Beth Barnes, Miles Brundage, Kevin Button, Steve Dowling, Alper Ercetin, Matthew Knight, Gretchen Krueger, Ryan Lowe, Andrew Mayne, Bob McGrew, Mira Murati, Richard Ngo, Jared Salzano, Natalie Summers and Hannah Wong.

Thanks to the team at Surge AI for helping us with data collection, and to all of our contractors for providing demonstrations and comparisons, without which this project would not have been possible.



Omniverse Creator Uses AI to Make Scenes With Singing Digital Humans

The thing about inspiration is you never know where it might come from, or where it might lead.

Anderson Rohr, a 3D generalist and freelance video editor based in southern Brazil, has for more than a dozen years created content ranging from wedding videos to cinematic animation.

After seeing another creator animate a sci-fi character’s visage and voice using NVIDIA Omniverse and its AI-powered Audio2Face application, Rohr said he couldn’t help but play around with the technology.

The result is a grimly voiced, lip-synced cover of “Bad Moon Rising,” the 1960s anthem from Creedence Clearwater Revival, which Rohr created using his own voice.

To make the video, Rohr used an NVIDIA Studio system with a GeForce RTX 3090 GPU.

Rohr’s Artistic Workflow

For this personal project, Rohr first recorded himself singing and opened the file in Audio2Face.

The application, built on NVIDIA AI and Omniverse technology, instantly generates expressive facial animations for digital humans with only a voice-over track or any other audio source.

Rohr then manually animated the eyes, brows and neck of his character and tweaked the lighting for the scene — all of which was rendered in Omniverse via Epic Games Unreal Engine, using an Omniverse Connector and the NVIDIA RTX Global Illumination software development kit.

“NVIDIA Omniverse is helping me achieve more natural results for my digital humans and speeding up my workflow, so that I can spend more time on the creative process,” Rohr said.

Before using Omniverse, some of Rohr’s animations took as long as 300 hours to render. He also faced software incompatibilities, which he said further slowed his work.

Now, with Omniverse and its connectors for various software applications, Rohr’s renderings are achieved in real time.

“Omniverse goes beyond my expectations,” he said. “I see myself using it a lot, and I hope my artwork inspires people to seek real-time results for virtual productions, games, cinematic scenes or any other creative project.”

With Omniverse, NVIDIA Studio creators can supercharge their artistic workflows with optimized RTX-accelerated hardware and software drivers, and state-of-the-art AI and simulation features.

Watch Rohr talk more about his work with NVIDIA Omniverse:



Get the Best of Cloud Gaming With GeForce NOW RTX 3080 Memberships Available Instantly

The future of cloud gaming is available NOW, for everyone, with preorders closing and GeForce NOW RTX 3080 memberships moving to instant access. Gamers can sign up for a six-month GeForce NOW RTX 3080 membership and instantly stream the next generation of cloud gaming, starting today.

Snag the NVIDIA SHIELD TV or SHIELD TV Pro for $20 off and stream PC games to the biggest screen in the home at up to 4K HDR resolution.

Participate in a unique cloud-based DAF Drive, powered by GeForce NOW and Euro Truck Simulator 2.

And check out the four new titles joining the ever-expanding GeForce NOW library this week.

RTX 3080 Memberships Available Instantly

The next generation of cloud gaming is ready and waiting.

Make the leap to the newest generation of cloud gaming instantly. GeForce NOW RTX 3080 memberships are available today for instant access. Preorders poof, be gone!

The new tier of service transforms nearly any device into a gaming rig capable of streaming at up to 1440p resolution and 120 frames per second on PCs, native 1440p or 1600p at 120 FPS on Macs, and 4K HDR at 60 FPS on SHIELD TV, with ultra-low latency that rivals many local gaming experiences. On top of this, the membership comes with the longest gaming session length — clocking in at eight hours — as well as full control to customize in-game graphics settings, and RTX ON rendering environments in cinematic quality in supported games.

Level up your gaming experience to enjoy the GeForce NOW library of over 1,100 games with the boost of a six-month RTX 3080 membership streaming across your devices for $99.99. Founders receive 10 percent off the subscription price and can upgrade with no risk to their “Founders for Life” benefits.

For more information, check out our membership FAQ.

The Deal With SHIELD

The GeForce NOW experience goes legendary, playing in 4K HDR exclusively on the NVIDIA SHIELD — which is available with a sweet deal this holiday season.

Save $20 on SHIELD TV this holiday.
Grab a controller and stream PC gaming at up to 4K with GeForce NOW on SHIELD TV.

Just in time for the holidays, give the gift of great entertainment at a discounted price. Starting Dec. 13 in select regions, get $20 ($30 CAD, €25, £20) off SHIELD TV and SHIELD TV Pro. But hurry, this offer ends soon! And in the U.S., get six months of Peacock Premium as an added bonus, to enrich the entertainment experience.

With the new GeForce NOW RTX 3080 membership, PC gamers everywhere can stream with 4K resolution and HDR on the SHIELD TV, bringing PC gaming to the biggest screen in the house. Connect to Steam, Epic Games Store and more to play from your library, find new games or check out the 100+ free-to-play titles included with a GeForce NOW membership.

Customize play even further by connecting SHIELD TV to your preferred gaming controller, whether Xbox One or Series X, PlayStation DualSense or DualShock 4, or Scuf, and bring your gaming sessions to life with immersive 7.1 surround sound.

Roll On Into the Ride and Drive

Euro Truck Simulator 2 on GeForce NOW
Push the pedal to the metal driving the 2021 DAF XF, available in Euro Truck Simulator 2.

GeForce NOW is powering up new experiences with SCS Software by supporting a unique DAF Drive experience. It adds the New Generation DAF XF to the popular game Euro Truck Simulator 2 and gives everyone the opportunity to take a virtual test drive through a short and scenic route, streaming with GeForce NOW. Take the wheel of one of the DAF Truck vehicles, instantly, on the DAF virtual experience website.

Coming in tow is a free in-game content update to the full Euro Truck Simulator 2 game, which brings the 2021 DAF XF to players. Ride in style as you travel across Europe in the newest truck, test your skill and speed, deliver cargo and become king of the road, streaming on the cloud.

Moar Gamez Now & Later, Plz

GTFO on GeForce NOW
The only way to survive the Rundown is by working together.

Late last week, a pair of games got big GeForce NOW announcements: GTFO and ARC Raiders.

GTFO is now out of early access. Jump on into this extreme cooperative horror shooter that requires stealth, strategy and teamwork to survive a deadly, underground prison.

ARC Raiders, a free-to-play cooperative third-person shooter from Embark Studios, is coming to GeForce NOW in 2022. In the game, which will be available on Steam and Epic Games Store, you and your squad of Raiders will unite to resist the onslaught of ARC – a ruthless mechanized threat descending from space.

Plus, slide on into the weekend with a pack of four new titles ready to stream from the GeForce NOW library today:

We make every effort to launch games on GeForce NOW as close to their release as possible, but, in some instances, games may not be available immediately.

Grab a Gift for a Gamer

Looking to spoil a gamer or yourself this holiday season?

Digital gift cards for GeForce NOW Priority memberships are available in two-, six- or 12-month options. Make your favorite player happy by powering up their GeForce NOW compatible devices with the kick of a full gaming rig, priority access to gaming servers, extended session lengths and RTX ON for supported games.

Gift cards can be redeemed on an existing GeForce NOW account or added to a new one. Existing Founders and Priority members will have the number of months added to their accounts.

As your weekend gaming session kicks off, we’ve got a question for you:

Shout at us on Twitter or in the comments below.



Achieve 35% faster training with Hugging Face Deep Learning Containers on Amazon SageMaker

Natural language processing (NLP) has been a hot topic in the AI field for some time. As NLP models grow larger and larger, data scientists and developers struggle to set up the infrastructure needed to train them. Distributed training across multiple machines is a natural choice for reducing training time, but it comes with extra node-to-node communication overhead, which reduces the efficiency of model training.

This post shows how to pretrain an NLP model (ALBERT) on Amazon SageMaker by using the Hugging Face Deep Learning Container (DLC) and the transformers library. We also demonstrate how the SageMaker distributed data parallel (SMDDP) library can provide up to 35% faster training compared with PyTorch’s distributed data parallel (DDP) library.

SageMaker and Hugging Face

SageMaker is a cloud machine learning (ML) platform from AWS. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models by bringing together a broad set of capabilities purpose-built for ML.

Hugging Face’s transformers library is the most popular open-source library for state-of-the-art NLP and computer vision. It provides thousands of pretrained models for tasks such as text classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages.
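As a quick illustration of the library’s high-level API (a minimal sketch; the public albert-base-v2 checkpoint is just an example and is unrelated to the training run in this post):

```python
# Minimal sketch of the transformers high-level pipeline API.
# The checkpoint name is a public example model, not the one trained in this post.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="albert-base-v2")
print(fill_mask("The capital of France is [MASK]."))
```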

AWS and Hugging Face collaborated to create an Amazon SageMaker Hugging Face DLC for training and inference. With SageMaker, you can scale training from a small cluster to a large one without the need to manage the infrastructure on your own. With the help of the SageMaker enhancement libraries and AWS Deep Learning Containers, we can significantly speed up NLP model training.
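The SageMaker Python SDK exposes the Hugging Face DLC through a dedicated estimator. The sketch below shows the typical launch pattern; the script name, S3 path, hyperparameters, and framework version pins are placeholders rather than the exact configuration used for this post.

```python
# Sketch: launching a Hugging Face DLC training job with the SageMaker Python SDK.
# entry_point, S3 path, hyperparameters, and framework versions are placeholders.
import sagemaker
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="run_mlm.py",          # your training script (placeholder)
    source_dir="./scripts",            # directory containing the script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",   # 8x NVIDIA A100 GPUs per instance
    instance_count=2,                  # scale out by increasing this value
    transformers_version="4.6",        # placeholder versions; pick a supported DLC combination
    pytorch_version="1.7",
    py_version="py36",
    hyperparameters={"per_device_train_batch_size": 16, "max_steps": 2500},
)
estimator.fit({"train": "s3://my-bucket/albert-pretraining/"})
```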

Solution overview

In this section, we discuss the various components to set up our model training.

ALBERT model

ALBERT, released in 2019, is an optimized version of BERT. ALBERT-large has 18 times fewer parameters than BERT-large and trains about 1.7 times faster. For more details, refer to the original paper, ALBERT: A Lite BERT for Self-supervised Learning of Language Representations.

Compared with BERT, ALBERT applies two parameter reduction techniques (sketched in code below):

  • Factorized embedding parameterization – Decomposes the large vocabulary embedding matrix into two smaller matrices, which lets the hidden size grow without inflating the embedding parameters
  • Cross-layer parameter sharing – Shares all parameters across the transformer layers, which helps reduce the total parameter count by 18 times
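Both reductions show up directly in the model configuration. Here is a minimal sketch with the transformers library, using ALBERT-base-like sizes for illustration:

```python
# Sketch: ALBERT's factorized embedding (small embedding_size feeding a larger hidden_size)
# and cross-layer sharing (all 12 layers reuse one set of weights) keep the model small.
from transformers import AlbertConfig, AlbertModel

config = AlbertConfig(
    vocab_size=30000,
    embedding_size=128,      # small factorized vocabulary embedding
    hidden_size=768,         # larger hidden dimension used by the transformer layers
    num_hidden_layers=12,    # depth adds almost no parameters because layers share weights
    num_attention_heads=12,
    intermediate_size=3072,
)
model = AlbertModel(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # on the order of 11-12M
```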

Pretrain task

In this post, we train the ALBERT-base model (11 million parameters) using the most common task in NLP pretraining: masked language modeling (MLM). MLM randomly replaces input tokens with mask tokens and trains the model to predict the masked tokens. To simplify the training procedure, we removed the sentence order prediction task and kept only the MLM task, as in the following sketch.
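A minimal sketch of how the MLM objective is typically wired up with the transformers library; the tokenizer checkpoint and the 15 percent masking rate are the library’s common defaults, used here for illustration:

```python
# Sketch: random masking for the MLM objective using the transformers data collator.
from transformers import AlbertTokenizerFast, AlbertForMaskedLM, DataCollatorForLanguageModeling

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")  # or build from a fresh AlbertConfig for pretraining

# Randomly replaces 15% of input tokens with [MASK] (or random/original tokens) and
# sets labels so the model is trained to predict the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

batch = collator([tokenizer("SageMaker streams the tokenized Wikipedia and Book Corpus data.")])
outputs = model(**batch)
print(outputs.loss)
```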

Set up the number of training steps and global batch sizes at different scales

To make a fair comparison across different training scales (that is, different numbers of nodes), we train with different numbers of nodes but on the same total number of examples. For example, with a single-GPU batch size of 16 (see the quick check after this list):

  • Two nodes (16 GPUs) train for 2,500 steps with a global batch size of 256
  • Four nodes (32 GPUs) train for 1,250 steps with a global batch size of 512
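A quick check (plain arithmetic, using the per-GPU batch size of 16 from above) confirms that every scale sees the same total number of training examples:

```python
# Same total number of examples at every scale: steps halve as the global batch size doubles.
per_gpu_batch = 16
gpus_per_node = 8

for nodes, steps in [(2, 2500), (4, 1250)]:
    global_batch = per_gpu_batch * gpus_per_node * nodes
    print(nodes, "nodes:", global_batch, "global batch,", steps, "steps,",
          global_batch * steps, "examples total")
# 2 nodes: 256 global batch, 2500 steps, 640000 examples total
# 4 nodes: 512 global batch, 1250 steps, 640000 examples total
```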

Dataset

As in the original ALBERT paper, the dataset we used for ALBERT pretraining is the English Wikipedia dataset and Book Corpus, a collection drawn from English-language Wikipedia and more than 11,000 English-language books. After preprocessing and tokenization, the dataset is around 75 GB in total and is stored in an Amazon Simple Storage Service (Amazon S3) bucket.

In practice, we used the Amazon S3 plugin to stream the data. The S3 plugin is a high-performance PyTorch dataset library that can directly and efficiently access datasets stored in S3 buckets.

Metrics

In this post, we focus on two performance metrics:

  • Throughput – How many samples are processed per second
  • Scaling efficiency – Defined as T(N) / (N * T(1)), where T(1) is the baseline throughput and T(N) is the throughput at N times the baseline cluster size (see the sketch after this list)
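Expressed as code, with the 2-node SMDDP run reported later in this post serving as the baseline (a minimal sketch; the helper function name is ours):

```python
# Scaling efficiency: T(N) / (N * T(1)), where T(1) is the baseline throughput
# and N is the cluster size relative to that baseline.
def scaling_efficiency(baseline_throughput: float, throughput: float, scale_factor: float) -> float:
    return throughput / (scale_factor * baseline_throughput)

# 2-node vs. 4-node SMDDP throughputs reported later in this post:
print(scaling_efficiency(3234.6185, 5436.92949, 2))  # ~0.84
```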

Infrastructure

We use P4d instances in SageMaker to train our model. P4d instances are powered by the latest NVIDIA A100 Tensor Core GPUs and deliver exceptionally high throughput and low latency networking. These instances are the first in the cloud to support 400 Gbps instance networking.

For SageMaker training, we prepared a Docker container image based on AWS Deep Learning Containers, which has PyTorch 1.8.1 and Elastic Fabric Adapter (EFA) enabled.

Tuning the number of data loader workers

The num_workers parameter indicates how many subprocesses to use for data loading. This parameter is set to zero by default. Zero means that the data is loaded in the main process.

In the early stage of our experiments, we scaled the SageMaker distributed data parallel library trainings from 2 nodes to 16 nodes and kept the default num_workers parameter unchanged. We noticed that scaling efficiency kept reducing, as shown in the following table.

Nodes  Algorithm  Train time (s)  Throughput (samples/s)  num_workers  Scaling efficiency  Max steps  Global batch size
2      SMDDP      197.8595        3234.6185                0            1.00000             2500       256
4      SMDDP      117.7135        5436.92949               0            0.84043             1250       512
8      SMDDP      79.0686         8094.23716               0            0.62559             625        1024
16     SMDDP      59.2767         10814.09728              0            0.41724             313        2048

Increasing the data loader’s num_workers parameter lets more CPU cores handle data preparation for GPU computation, which helps the training run faster (see the sketch below). An AWS P4d instance has 96 vCPUs, which gives us plenty of room to tune the number of data loader workers.
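Whether you use a custom PyTorch loop or the transformers Trainer, this ultimately comes down to the DataLoader’s num_workers argument. A minimal sketch, with a random tensor dataset standing in for the tokenized corpus:

```python
# Sketch: moving data preparation off the main process with DataLoader workers.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randint(0, 30000, (1024, 128)))  # placeholder for the tokenized corpus

loader = DataLoader(
    dataset,
    batch_size=16,      # per-GPU batch size used in this post
    num_workers=12,     # 0 by default, i.e. data is loaded in the main process
    pin_memory=True,    # speeds up host-to-GPU copies
    drop_last=True,
)
```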

We designed our experiments to find the best value for num_workers. We gradually increased the data loader worker number under different training scales. These results are generated using the SageMaker distributed data parallel library.

We tuned the following parameters:

  • Data loader number of workers: 0, 4, 8, 12, 16
  • Node number: 2, 4, 8, 16
  • Single GPU batch size: 16

As we can see from the following results, throughput and scaling efficiency saturated when the number of data loader workers reached 12.

Next, we wanted to see whether this pattern would hold if the global batch size changed, which changes the upper bound of the achievable throughput. We set the single-GPU batch size to 32 and retrained the models, tuning the following parameters:

  • Data loader number of workers: 0, 4, 8, 12, 16
  • Node number: 2, 4, 8, 16
  • Single GPU batch size: 32

The following figure shows our results.

We got similar results from this second set of experiments, with 12 data loader workers still providing the best result.

From the two preceding sets of results, we can see that a good starting point is to set the number of data loader workers to the number of vCPUs available per training process. For example, a P4d instance has 96 vCPUs and runs 8 GPU processes, so each process has 12 vCPUs on average, and we can set the number of data loader workers to 12.

This is a good empirical rule (sketched below). However, different hardware and local batch sizes may introduce some variance, so we suggest tuning the data loader worker number for your use case.
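That rule of thumb is easy to compute at startup; a minimal sketch, assuming one data-parallel process per GPU:

```python
# Heuristic starting point: split the instance's vCPUs evenly across the GPU processes.
import os
import torch

processes_per_node = max(torch.cuda.device_count(), 1)     # one training process per GPU
num_workers = (os.cpu_count() or 1) // processes_per_node  # 96 vCPUs / 8 GPUs = 12 on p4d.24xlarge
print(num_workers)
```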

Finally, let’s look at how much improvement we got from tuning the number of data loader workers.

The following graphs show the throughput comparison with a batch size of 16 and 32. In both cases, we can observe consistent throughput gains by increasing the data loader worker number from zero to 12.

The following table summarizes our throughput increase when we compare throughput between data loader worker numbers equal to 0 and 12.

Nodes  Throughput increase (local batch size 16)  Throughput increase (local batch size 32)
2      15.53%                                      25.24%
4      23.98%                                      40.89%
8      41.14%                                      65.15%
16     60.37%                                      102.16%

Compared with the default data loader setup, setting the number of data loader workers to 12 results in up to a 102% throughput increase. In other words, we roughly doubled training speed simply by making full use of the CPU resources available on the P4d instance.

SMDDP vs. DDP

The SageMaker distributed data parallel library for PyTorch implements torch.distributed APIs, optimizing network communication by using the AWS network infrastructure and topology. In particular, SMDDP optimizes key collective communication primitives used throughout the training loop. SMDDP is available through the Amazon Elastic Container Registry (Amazon ECR) in the SageMaker training platform. You can start a SageMaker training job from the SageMaker Python SDK or the SageMaker APIs through the AWS SDK for Python (Boto3) or the AWS Command Line Interface (AWS CLI).

The distributed data parallel (DDP) library is PyTorch’s native data parallelism module. It implements data parallelism at the module level and can run across multiple machines.
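On the SageMaker side, enabling SMDDP is a single distribution argument on the estimator, while the DDP baseline relies on PyTorch’s native DistributedDataParallel inside the training script. A hedged sketch, with the same placeholder script and version pins as before:

```python
# Sketch: turning on the SageMaker distributed data parallel library for a training job.
# The training script can then initialize the smdistributed.dataparallel backend;
# the DDP baseline instead wraps the model with torch.nn.parallel.DistributedDataParallel.
import sagemaker
from sagemaker.huggingface import HuggingFace

smddp_estimator = HuggingFace(
    entry_point="run_mlm.py",            # placeholder training script
    role=sagemaker.get_execution_role(),
    instance_type="ml.p4d.24xlarge",
    instance_count=2,
    transformers_version="4.6",          # placeholder versions; use a supported DLC combination
    pytorch_version="1.7",
    py_version="py36",
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
```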

We compared SMDDP and DDP. As previous sections suggested, we set hyperparameters as follows:

  • Data loader worker number: 12
  • Single GPU batch size: 16

The following graphs compare throughput and scaling efficiency.

The following table summarizes our throughput speed increase.

Nodes  SMDDP throughput increase over DDP
2      13.96%
4      33.07%
8      34.94%
16     31.64%

From 2 nodes (16 A100 GPUs) to 16 nodes (128 A100 GPUs), SMDDP consistently outperformed DDP in our ALBERT trainings. The more nodes and GPUs we use, the more we benefit from SMDDP.

Summary

In this post, we demonstrated how to use SageMaker to scale an NLP training job from 16 GPUs to 128 GPUs by changing only a few lines of code. We also discussed why it’s important to tune the data loader worker number: doing so provides up to a 102.16% throughput increase in the 16-node case, and setting it to the number of vCPUs divided by the number of processes is a good starting point. In our tests, SMDDP performed considerably better (almost 35% better) than DDP as the training scale increased; the larger the scale, the more time and money SMDDP can save.

For detailed instructions on how to run the training in this post, we will provide the open-source training code in the AWS Samples GitHub repo soon. You can find more information through the following resources:


About the Authors

Yu Liu is a Software Developer with AWS Deep Learning. He focuses on optimizing distributed Deep Learning models and systems. In his spare time, he enjoys traveling, singing and exploring new technologies. He is also a metaverse believer.

Roshani Nagmote is a Software Developer for AWS Deep Learning. She focuses on building distributed Deep Learning systems and innovative tools to make Deep Learning accessible for all. In her spare time, she enjoys hiking, exploring new places and is a huge dog lover.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for Amazon Machine Learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Michele Monclova is a principal product manager at AWS on the SageMaker team. She is a native New Yorker and Silicon Valley veteran. She is passionate about innovations that improve our quality of life.

Philipp Schmid is a Machine Learning Engineer and Tech Lead at Hugging Face, where he leads the collaboration with the Amazon SageMaker team. He is passionate about democratizing, optimizing, and productionizing cutting-edge NLP models and improving the ease of use for Deep Learning.

Jeff Boudier builds products at Hugging Face, creator of Transformers, the leading open-source ML library. Previously Jeff was a co-founder of Stupeflix, acquired by GoPro, where he served as director of Product Management, Product Marketing, Business Development and Corporate Development.


Nonsense can make sense to machine-learning models

For all that neural networks can accomplish, we still don’t really understand how they operate. Sure, we can program them to learn, but making sense of a machine’s decision-making process remains much like a fancy puzzle with a dizzying, complex pattern where plenty of integral pieces have yet to be fitted. 

If a model was trying to classify an image of said puzzle, for example, it could encounter well-known, but annoying adversarial attacks, or even more run-of-the-mill data or processing issues. But a new, more subtle type of failure recently identified by MIT scientists is another cause for concern: “overinterpretation,” where algorithms make confident predictions based on details that don’t make sense to humans, like random patterns or image borders. 

This could be particularly worrisome for high-stakes environments, like split-second decisions for self-driving cars and medical diagnostics for diseases that need immediate attention. Autonomous vehicles in particular rely heavily on systems that can accurately understand their surroundings and then make quick, safe decisions. In the researchers’ experiments, networks used specific backgrounds, edges, or particular patterns of the sky to classify traffic lights and street signs, irrespective of what else was in the image.

The team found that neural networks trained on popular datasets like CIFAR-10 and ImageNet suffered from overinterpretation. Models trained on CIFAR-10, for example, made confident predictions even when 95 percent of an input image was missing and the remainder was meaningless to humans.

“Overinterpretation is a dataset problem that’s caused by these nonsensical signals in datasets. Not only are these high-confidence images unrecognizable, but they contain less than 10 percent of the original image in unimportant areas, such as borders. We found that these images were meaningless to humans, yet models can still classify them with high confidence,” says Brandon Carter, MIT Computer Science and Artificial Intelligence Laboratory PhD student and lead author on a paper about the research. 

Deep-image classifiers are widely used. In addition to medical diagnosis and boosting autonomous vehicle technology, there are use cases in security, gaming, and even an app that tells you if something is or isn’t a hot dog, because sometimes we need reassurance. The technology in question works by processing individual pixels from tons of pre-labeled images so that the network can “learn.”

Image classification is hard because machine-learning models can latch onto nonsensical, subtle signals. When image classifiers are trained on datasets such as ImageNet, they can then make seemingly reliable predictions based on those signals.

Although these nonsensical signals can lead to model fragility in the real world, the signals are actually valid in the datasets, meaning overinterpretation can’t be diagnosed using typical evaluation methods based on test accuracy.

To find the rationale for a model’s prediction on a particular input, the methods in the present study start with the full image and repeatedly ask: what can I remove from this image? Essentially, they keep covering up the image until the smallest piece that still produces a confident prediction remains (a simplified sketch follows).
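A simplified, greedy version of that idea can be sketched in a few lines of PyTorch. This is only an illustration of backward selection on pixels, not the paper’s exact procedure; the model and image are placeholders, and a practical implementation would batch these evaluations or use gradients.

```python
# Illustration: greedily cover up pixels while the classifier stays confident.
# `model` is any image classifier returning logits; `image` is a (C, H, W) tensor.
# Brute force for clarity; real methods batch these evaluations or use gradients.
import torch

def minimal_confident_subset(model, image, target_class, threshold=0.9):
    model.eval()
    masked = image.clone()
    visible = torch.ones(image.shape[-2:], dtype=torch.bool)  # pixels not yet covered
    while True:
        best_pixel, best_prob = None, -1.0
        for idx in visible.nonzero():                # try covering each remaining pixel
            y, x = idx.tolist()
            trial = masked.clone()
            trial[:, y, x] = 0.0                     # cover up this pixel
            with torch.no_grad():
                logits = model(trial.unsqueeze(0))
                prob = torch.softmax(logits, dim=-1)[0, target_class].item()
            if prob > best_prob:                     # keep the removal that hurts confidence least
                best_pixel, best_prob = (y, x), prob
        if best_pixel is None or best_prob < threshold:
            break                                    # stop before confidence drops below the threshold
        y, x = best_pixel
        masked[:, y, x] = 0.0
        visible[y, x] = False
    return masked, visible                           # the remaining pixels still yield a confident prediction
```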

To that end, these methods could also serve as a type of validation criterion. For example, if you have an autonomous car that uses a trained machine-learning method for recognizing stop signs, you could test that method by identifying the smallest input subset that constitutes a stop sign. If that subset consists of a tree branch, a particular time of day, or something else that’s not a stop sign, you could be concerned that the car might come to a stop at a place it’s not supposed to.

While it may seem that the model is the likely culprit here, the datasets are more likely to blame. “There’s the question of how we can modify the datasets in a way that would enable models to be trained to more closely mimic how a human would think about classifying images and therefore, hopefully, generalize better in these real-world scenarios, like autonomous driving and medical diagnosis, so that the models don’t have this nonsensical behavior,” says Carter. 

This may mean creating datasets in more controlled environments. Currently, the images are simply pulled from public sources and then classified. But if you want to do object identification, for example, it might be necessary to train models with objects photographed against uninformative backgrounds.

This work was supported by Schmidt Futures and the National Institutes of Health. Carter wrote the paper alongside Siddhartha Jain and Jonas Mueller, scientists at Amazon, and MIT Professor David Gifford. They are presenting the work at the 2021 Conference on Neural Information Processing Systems.
