Think Aggressively This GFN Thursday with Outriders Demo, 11 Additional Games

Here comes another GFN Thursday, dropping in to co-op with you as we explore the world of Square Enix’s new Outriders game. Before we get into the rest of this week’s new additions, let’s head to Enoch and take a closer look at what makes People Can Fly’s upcoming launch special.

Let’s Ride

From the studio that launched Bulletstorm, Gears of War: Judgment and Painkiller, Outriders takes gamers to the world of Enoch. Embark on a brand-new experience: a single-player or co-op RPG shooter with brutal powers, intense action, deep RPG mechanics, and a compelling story set in a dark sci-fi world. The game dynamically adjusts the balance to account for how many players are in a session to keep the challenge level just right.

Outriders is coming to GeForce NOW in April, but members can play the demo now.

Combining intense gunplay with violent powers and an arsenal of increasingly twisted weaponry and gear-sets, OUTRIDERS offers countless hours of gameplay from one of the finest shooter developers in the industry, People Can Fly.

The demo has a ton of content to explore. Beyond the main storyline, gamers can explore four quests. And all progress made in the demo carries over to the full game when it launches in April.

GeForce NOW members can play on any supported device — PC, Mac, Chromebook, iOS, Android or Android TV. And with crossplay, members can join friends in Enoch regardless of which platform their friends are playing on.

Like most day-and-date releases on GeForce NOW, we expect to have the Outriders demo streaming within a few hours of it going live on Steam.

Let’s Play Today

In addition to the Outriders demo, let’s take a look at the 11 other new additions to the GeForce NOW library this week.


Curse of the Dead Gods (Steam)

Now out of Early Access, this skill-based roguelike challenges you to explore endless dungeon rooms while turning curses to your advantage. IGN calls the game’s combat system “mechanically simple, but impressively deep.”


Old School RuneScape (Steam)

Old School RuneScape is RuneScape, but older! This is the open-world gamers know and love, but as it was in 2007. Saying that, it’s even better – Old School is shaped by you, the players, with regular new content, fixes and expansions voted for by the fans!


Rogue Heroes: Ruins of Tasos (Steam)

Classic adventure for you and three of your friends! Delve deep into procedural dungeons, explore an expansive overworld full of secrets and take down the Titans to save the once peaceful land of Tasos.

In addition, members can look for the following:

What are you planning to play this weekend? Let us know on Twitter or in the comments below.

The post Think Aggressively This GFN Thursday with Outriders Demo, 11 Additional Games appeared first on The Official NVIDIA Blog.


Self-Supervised Policy Adaptation during Deployment

Our method learns a task in a fixed, simulated environment and quickly adapts
to new environments (e.g. the real world) solely from online interaction during
deployment.

The ability of humans to generalize their knowledge and experiences to new
situations is remarkable, yet poorly understood. For example, imagine a human
driver who has only ever driven around their city in clear weather. Even
though they never encountered true diversity in driving conditions, they have
acquired the fundamental skill of driving, and can adapt reasonably fast to
driving in neighboring cities, in rainy or windy weather, or even driving a
different car, without much practice or additional driving lessons. While
humans excel at adaptation, building intelligent systems with common-sense
knowledge and the ability to quickly adapt to new situations is a long-standing
problem in artificial intelligence.

Neural Mechanics: Symmetry and Broken Conservation Laws In Deep Learning Dynamics

Just like the fundamental laws of classical and quantum mechanics taught us how to control and optimize the physical world for engineering purposes, a better understanding of the laws governing neural network learning dynamics can have a profound impact on the optimization of artificial neural networks. This raises a foundational question: what, if anything, can we quantitatively understand about the learning dynamics of state-of-the-art deep learning models driven by real-world datasets?

In order to make headway on this extremely difficult question, existing works have made major simplifying assumptions on the architecture, such as restricting to a single hidden layer [1], linear activation functions [2], or infinite width layers [3]. These works have also ignored the complexity introduced by the optimizer through stochastic and discrete updates. In the present work, rather than introducing unrealistic assumptions on the architecture or optimizer, we identify combinations of parameters with simpler dynamics (as shown in Fig. 1) that can be solved exactly!

Fig. 1. We plot the per-parameter dynamics (left) and per-filter squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate $\eta$, weight decay $\lambda$, and batch size $S$. Each color represents a different convolutional block. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.

Symmetries in the loss shape gradient and Hessian geometry

While we commonly initialize neural networks with random weights, their gradients and Hessians at all points in training, no matter the loss or dataset, obey certain geometric constraints. Some of these constraints have been noticed previously as a form of implicit regularization, while others have been leveraged algorithmically in applications from network pruning to interpretability. Remarkably, all these geometric constraints can be understood as consequences of numerous symmetries in the loss introduced by neural network architectures.

A set of parameters observes a symmetry in the loss if the loss doesn’t change under a certain transformation of these parameters. This invariance introduces associated geometric constraints on the gradient and Hessian. We consider three families of symmetries (translation, scale, and rescale) that commonly appear in modern neural network architectures.

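  • Translation symmetry is defined by the transformation $\theta \mapsto \theta + \alpha \mathbb{1}_{\mathcal{A}}$, where $\alpha \in \mathbb{R}$ and $\mathbb{1}_{\mathcal{A}}$ is the indicator vector for some subset $\mathcal{A}$ of the parameters $\{\theta_1, \dots, \theta_m\}$. Any network using the softmax function gives rise to translation symmetry for the parameters immediately preceding the function.
  • Scale symmetry is defined by the transformation $\theta \mapsto \alpha_{\mathcal{A}} \odot \theta$, where $\alpha_{\mathcal{A}} = \alpha \mathbb{1}_{\mathcal{A}} + \mathbb{1}_{\mathcal{A}^{c}}$ and $\alpha > 0$. Batch normalization leads to scale invariance for the parameters immediately preceding the function.
  • Rescale symmetry is defined by the transformation $\theta \mapsto \alpha_{\mathcal{A}_1} \odot \alpha^{-1}_{\mathcal{A}_2} \odot \theta$, where $\mathcal{A}_1$ and $\mathcal{A}_2$ are two disjoint sets of parameters. For networks with continuous, homogeneous activation functions (e.g. ReLU, Leaky ReLU, linear), this symmetry emerges at every hidden neuron by considering all incoming and outgoing parameters to the neuron.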

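These symmetries enforce geometric constraints on the gradient of a neural network, $\nabla \mathcal{L}$. Writing $\theta_{\mathcal{A}}$ for the parameters in a subset $\mathcal{A}$ (with zeros elsewhere) and differentiating the invariance of the loss with respect to the transformation parameter $\alpha$ at the identity gives:

  • Translation: $\langle \nabla \mathcal{L}, \mathbb{1}_{\mathcal{A}} \rangle = 0$
  • Scale: $\langle \nabla \mathcal{L}, \theta_{\mathcal{A}} \rangle = 0$
  • Rescale: $\langle \nabla \mathcal{L}, \theta_{\mathcal{A}_1} - \theta_{\mathcal{A}_2} \rangle = 0$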
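The perpendicularity between the gradient and each symmetry direction can be checked numerically. The following is a minimal sketch, not code from the paper: the three toy losses, the parameter values, and the helper names (`num_grad`, `softmax_loss`, `normalized_loss`, `two_layer_loss`) are all assumptions chosen for illustration, and gradients are estimated with finite differences.

```python
import math

def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of f at theta (a list of floats)."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + eps] + theta[i + 1:]
        down = theta[:i] + [theta[i] - eps] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax_loss(theta, target=0):
    # Cross-entropy of softmax(theta): invariant to adding a constant to every entry.
    log_norm = math.log(sum(math.exp(v) for v in theta))
    return log_norm - theta[target]

def normalized_loss(theta, goal=(0.6, 0.8)):
    # Depends only on theta / |theta|: invariant to scaling theta by alpha > 0.
    norm = math.hypot(theta[0], theta[1])
    return sum((v / norm - c) ** 2 for v, c in zip(theta, goal))

def two_layer_loss(theta, x=1.5, y=2.0):
    # w2 * relu(w1 * x): invariant to (w1, w2) -> (alpha * w1, w2 / alpha).
    w1, w2 = theta
    return (w2 * max(w1 * x, 0.0) - y) ** 2

theta_t = [0.5, -1.0, 2.0]   # hypothetical logits (translation symmetry)
theta_s = [1.0, -0.5]        # hypothetical normalized weights (scale symmetry)
theta_r = [0.7, 1.3]         # hypothetical (w1, w2) pair (rescale symmetry)

# Inner products of the gradient with the three symmetry directions.
translation = dot(num_grad(softmax_loss, theta_t), [1.0, 1.0, 1.0])
scale = dot(num_grad(normalized_loss, theta_s), theta_s)
grad_r = num_grad(two_layer_loss, theta_r)
rescale = dot(grad_r, [theta_r[0], -theta_r[1]])
```

All three inner products come out at the level of finite-difference noise, even though the gradients themselves are far from zero.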

Fig. 2. We visualize the vector fields associated with simple network components that have translation, scale, and rescale symmetry. On the right we consider the vector field associated with a neuron whose nonlinearity is the softmax function. In the middle we consider the vector field associated with a neuron whose nonlinearity is the batch normalization function. On the left we consider the vector field associated with a linear path.

Symmetry leads to conservation laws under gradient flow

We now consider how geometric constraints on gradients and Hessians, arising as a consequence of symmetry, impact the learning dynamics given by stochastic gradient descent (SGD). We will consider a model parameterized by $\theta$, a training dataset of size $N$, and a training loss $\mathcal{L}(\theta)$ with corresponding gradient $g(\theta) = \nabla \mathcal{L}(\theta)$. The gradient descent update with learning rate $\eta$ is $\theta^{(n+1)} = \theta^{(n)} - \eta\, g(\theta^{(n)})$, which is a forward Euler discretization with step size $\eta$ of the ordinary differential equation (ODE) $\frac{d\theta}{dt} = -g(\theta)$. In the limit as $\eta \to 0$, gradient descent exactly matches the dynamics of this ODE, which is commonly referred to as gradient flow. Equipped with a continuous model for the learning dynamics, we now ask: how do the dynamics interact with the geometric properties introduced by symmetries?
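To make the forward Euler reading concrete, here is a small sketch under simple assumptions: a one-dimensional quadratic loss $\frac{1}{2} a \theta^2$ (chosen because its gradient flow has the closed form $\theta(t) = \theta(0) e^{-a t}$), with the curvature `a`, the starting point, and the helper name `gradient_descent` all picked for illustration.

```python
import math

def gradient_descent(theta0, grad, lr, total_time):
    """Run gradient descent for total_time / lr steps (forward Euler with step lr)."""
    theta = theta0
    for _ in range(round(total_time / lr)):
        theta = theta - lr * grad(theta)
    return theta

a = 2.0                                   # curvature of the toy loss 0.5 * a * theta**2
grad = lambda theta: a * theta            # its gradient
exact = 1.0 * math.exp(-a * 1.0)          # gradient flow solution at t = 1 from theta(0) = 1

# Distance between gradient descent and gradient flow for shrinking learning rates.
errors = [abs(gradient_descent(1.0, grad, lr, 1.0) - exact) for lr in (0.1, 0.01, 0.001)]
```

The gap to the continuous solution shrinks roughly in proportion to the learning rate, which is the first-order accuracy expected of forward Euler.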

Strikingly similar to Noether’s theorem, which describes a fundamental relationship between symmetry and conservation for physical systems governed by Lagrangian dynamics, every symmetry of a network architecture has a corresponding “conserved quantity” through training under gradient flow. Just as the total kinetic and potential energy is conserved for an idealized spring in harmonic motion, certain combinations of parameters are constant under gradient flow dynamics.

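Consider some subset $\mathcal{A}$ of the parameters that respects either a translation, scale, or rescale symmetry, and write $\theta_{\mathcal{A}}$ for the parameters in that subset. As discussed earlier, the gradient of the loss is always perpendicular to the vector field that generates the symmetry, which we write as $v(\theta)$. Projecting the gradient flow learning dynamics onto the generator vector field yields a differential equation $\langle v(\theta), \frac{d\theta}{dt} \rangle = 0$. Integrating this equation through time results in the conservation laws,

  • Translation: $\langle \theta_{\mathcal{A}}(t), \mathbb{1} \rangle = \langle \theta_{\mathcal{A}}(0), \mathbb{1} \rangle$
  • Scale: $|\theta_{\mathcal{A}}(t)|^2 = |\theta_{\mathcal{A}}(0)|^2$
  • Rescale: $|\theta_{\mathcal{A}_1}(t)|^2 - |\theta_{\mathcal{A}_2}(t)|^2 = |\theta_{\mathcal{A}_1}(0)|^2 - |\theta_{\mathcal{A}_2}(0)|^2$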

Each of these equations defines a conserved constant through training, effectively restricting the possible trajectory the parameters take through learning. For parameters with translation symmetry, their sum is conserved, effectively constraining their dynamics to a hyperplane. For parameters with scale symmetry, their Euclidean norm is conserved, effectively constraining their dynamics to a sphere. For parameters with rescale symmetry, their difference in squared Euclidean norm is conserved, effectively constraining their dynamics to a hyperbola.
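As a rough illustration of the rescale case, the sketch below trains a two-parameter linear path $w_2 w_1 x$ with plain gradient descent and tracks $w_1^2 - w_2^2$. The toy loss, the initial values, and the function name `rescale_drift` are assumptions made for this example. Gradient flow itself is not simulated; a small learning rate stands in for it.

```python
def rescale_drift(lr, total_time=5.0, x=1.0, y=2.0):
    """Run gradient descent on 0.5 * (w2 * w1 * x - y)**2 and return the change
    in w1**2 - w2**2 between initialization and the end of training."""
    w1, w2 = 1.5, 0.5
    initial = w1 ** 2 - w2 ** 2
    for _ in range(round(total_time / lr)):
        residual = w2 * w1 * x - y
        g1, g2 = residual * w2 * x, residual * w1 * x   # dL/dw1, dL/dw2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return (w1 ** 2 - w2 ** 2) - initial

drift_coarse = rescale_drift(lr=0.1)     # far from the gradient flow limit
drift_fine = rescale_drift(lr=0.001)     # close to the gradient flow limit
```

With the small learning rate the quantity barely moves over the whole run, while the larger learning rate produces a visibly larger drift. This matches the claim that the conservation law holds for gradient flow and is broken by finite step sizes.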

Fig. 3. Associated with each symmetry is a conserved quantity constraining the gradient flow dynamics to a surface. For translation symmetry (right) the flow is constrained to a hyperplane where the intercept is conserved. For scale symmetry (middle) the flow is constrained to a sphere where the radius is conserved. For rescale symmetry (left) the flow is constrained to a hyperbola where the axes are conserved. The color represents the value of the conserved quantity, where blue is positive and red is negative, and the black lines are level sets.

A realistic continuous model for stochastic gradient descent

While the conservation laws derived with gradient flow are quite striking, empirically we know they are broken, as demonstrated in Fig. 1. Gradient flow is too simple a continuous model for realistic SGD training: it fails to account for the effect of hyperparameters such as weight decay and momentum, the effect of stochasticity introduced by random batches of data, and the effect of discrete updates due to a finite learning rate. Here, we consider how to address these effects individually to construct more realistic continuous models of SGD.

Modeling weight decay. Explicit regularization through the addition of an $L_2$ penalty on the parameters, with regularization constant $\lambda$, is a very common practice when training neural networks. Weight decay modifies the gradient flow trajectory, pulling the network towards the origin in parameter space.

Modeling momentum. Momentum is a common extension to SGD that uses an exponential moving average of gradients to update parameters rather than a single gradient evaluation. The method introduces an additional hyperparameter $\beta$, which controls how past gradients are used in future updates, resulting in a form of “inertia” that accelerates the learning dynamics by rescaling time, but leaves the gradient flow trajectory intact.

Modeling stochasticity. Stochastic gradients arise when we consider a batch $\mathcal{B}$ of size $S$ drawn uniformly from the indices $\{1, \dots, N\}$, forming the unbiased gradient estimate $\hat{g}_{\mathcal{B}}(\theta)$. We can model the batch gradient as a noisy version of the true gradient $g(\theta)$. However, because both the batch gradient and true gradient observe the same geometric properties introduced by symmetry, this noise has a special low-rank structure. In other words, stochasticity introduced by random batches does not affect the gradient flow dynamics in the directions associated with symmetry.

Modeling discretization. Gradient descent always moves in the direction of steepest descent on a loss function at each step; however, due to the finite nature of the learning rate, it fails to remain on the continuous steepest descent path given by gradient flow. In order to model this discrepancy, we borrow tools from the numerical analysis of partial differential equations. In particular, we use modified equation analysis [4], which determines how to model the numerical artifacts introduced by a discretization of a PDE. In our paper we present two methods based on modified equation analysis and recent works [5, 6], which modify gradient flow, with either higher order derivatives of the loss or higher order temporal derivatives of the parameters, to account for the effect of discretization on the learning dynamics.

Fig. 4. We visualize the trajectories of gradient descent with momentum (black dots), gradient flow (blue line), and the modified dynamics (red line) on a quadratic loss. The modified continuous dynamics visually track the discrete dynamics much better than the original gradient flow dynamics.
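The idea behind a modified flow can be sketched on a one-dimensional quadratic. The example below is an illustration under stated assumptions, not the construction from the paper: it uses plain gradient descent without momentum, the toy loss $\frac{1}{2} a \theta^2$, and the first-order correction $\frac{d\theta}{dt} = -g - \frac{\eta}{2} H g$ (with $H$ the Hessian), which for this loss simply changes the decay rate from $a$ to $a + \frac{\eta}{2} a^2$.

```python
import math

a, lr, theta0, steps = 1.0, 0.1, 1.0, 20   # illustrative values

# Discrete gradient descent on L(theta) = 0.5 * a * theta**2.
theta = theta0
for _ in range(steps):
    theta -= lr * a * theta

t = steps * lr                                              # elapsed continuous time
flow = theta0 * math.exp(-a * t)                            # gradient flow solution
modified = theta0 * math.exp(-(a + 0.5 * lr * a ** 2) * t)  # first-order modified flow

err_flow = abs(theta - flow)
err_modified = abs(theta - modified)
```

The modified solution lands much closer to the discrete iterate than plain gradient flow does, because the correction absorbs the leading-order error of the Euler step.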

Combining symmetry and modified gradient flow to derive exact learning dynamics

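We now study how weight decay, momentum, stochastic gradients, and finite learning rates all interact to break the conservation laws of gradient flow. Remarkably, even when using a more realistic continuous model for stochastic gradient descent, we can derive exact learning dynamics for the previously conserved quantities. To do this we (i) consider a realistic continuous model for SGD, (ii) project these learning dynamics onto the generator vector fields associated with each symmetry, (iii) harness the geometric constraints introduced by symmetry to derive simplified ODEs, and (iv) solve these ODEs to obtain exact dynamics for the previously conserved quantities. We first consider the continuous model of SGD without momentum, incorporating weight decay, stochasticity, and a finite learning rate. In this setting, with learning rate $\eta$, weight decay $\lambda$, and batch gradient $\hat{g}$, the exact dynamics for the parameter combinations tied to the symmetries are,

  • Translation: $\langle \theta_{\mathcal{A}}(t), \mathbb{1} \rangle = e^{-\lambda t} \langle \theta_{\mathcal{A}}(0), \mathbb{1} \rangle$
  • Scale: $|\theta_{\mathcal{A}}(t)|^2 = e^{-2\lambda t} |\theta_{\mathcal{A}}(0)|^2 + \eta \int_0^t e^{-2\lambda (t - \tau)} |\hat{g}_{\mathcal{A}}(\tau)|^2 \, d\tau$
  • Rescale: $|\theta_{\mathcal{A}_1}(t)|^2 - |\theta_{\mathcal{A}_2}(t)|^2 = e^{-2\lambda t} \left( |\theta_{\mathcal{A}_1}(0)|^2 - |\theta_{\mathcal{A}_2}(0)|^2 \right) + \eta \int_0^t e^{-2\lambda (t - \tau)} \left( |\hat{g}_{\mathcal{A}_1}(\tau)|^2 - |\hat{g}_{\mathcal{A}_2}(\tau)|^2 \right) d\tau$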

Notice how these equations are equivalent to the conservation laws when the learning rate and weight decay both vanish ($\eta = \lambda = 0$). Remarkably, even in typical hyperparameter settings (weight decay, stochastic batches, finite learning rates), these solutions match nearly perfectly with empirical results from modern neural networks (VGG-16) trained on real-world datasets (Tiny ImageNet), as shown in Fig. 5.

Fig. 5. We plot the column sum of the final linear layer (left) and the difference between squared channel norms of the fifth and fourth convolutional layer (right) of a VGG-16 model without batch normalization. We plot the squared channel norm of the second convolution layer (middle) of a VGG-16 model with batch normalization. Both models are trained on Tiny ImageNet with SGD with learning rate $\eta$, weight decay $\lambda$, and batch size $S$. Colored lines are empirical and black dashed lines are the theoretical predictions.

Translation dynamics. For parameters with translation symmetry, this equation implies that the sum of these parameters decays exponentially to zero at a rate proportional to the weight decay. In particular, the dynamics depend directly on neither the learning rate nor any information about the dataset, due to the lack of curvature in the gradient field for these parameters (as shown in Fig. 2).

Scale dynamics. For parameters with scale symmetry, this equation implies that the norm for these parameters is the sum of an exponentially decaying memory of the norm at initialization and an exponentially weighted integral of gradient norms accumulated through training. Compared to the translation dynamics, the scale dynamics do depend on the data through the gradient norms accumulated throughout training.

Rescale dynamics. For parameters with rescale symmetry, this equation is the sum of an exponentially decaying memory of the difference in norms at initialization and an exponentially weighted integral of difference in gradient norms accumulated through training. Similar to the scale dynamics, the rescale dynamics do depend on the data through the gradient norms; however, unlike the scale dynamics, we have no guarantee that the integral term is always positive.
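The translation and scale dynamics described above can be compared against discrete training on toy problems. The sketch below is only illustrative and makes several assumptions: full-batch gradient descent with weight decay (no stochasticity or momentum), a three-logit softmax loss for the translation case, a loss that depends only on the direction of a two-dimensional weight vector for the scale case, and hyperparameter values picked so the decay is visible. The scale prediction integrates $\frac{d|\theta|^2}{dt} = -2\lambda|\theta|^2 + \eta|g|^2$ with a simple Riemann sum, using gradient norms measured along the training run.

```python
import math

lr, wd, steps = 0.1, 0.05, 200   # illustrative learning rate, weight decay, step count

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Translation: the logits theta feed a softmax; the target class is 0.
theta = [0.5, -1.0, 2.0]
sum0 = sum(theta)
for _ in range(steps):
    p = softmax(theta)
    grad = [p[0] - 1.0, p[1], p[2]]   # cross-entropy gradient; its entries sum to zero
    theta = [th - lr * (g + wd * th) for th, g in zip(theta, grad)]
sum_empirical = sum(theta)
sum_theory = math.exp(-wd * lr * steps) * sum0   # e^{-lambda * t} times the initial sum

# Scale: the loss 0.5 * |w / |w| - target|^2 depends only on the direction of w.
def scale_grad(w, target=(0.6, 0.8)):
    n = math.hypot(w[0], w[1])
    u = (w[0] / n, w[1] / n)
    d = (u[0] - target[0], u[1] - target[1])
    radial = d[0] * u[0] + d[1] * u[1]
    # Project out the radial component, so the gradient is perpendicular to w.
    return [(d[0] - radial * u[0]) / n, (d[1] - radial * u[1]) / n]

w = [1.0, -0.5]
norm_sq_theory = w[0] ** 2 + w[1] ** 2
for _ in range(steps):
    g = scale_grad(w)
    # One Riemann-sum step (dt = lr) of the predicted squared-norm dynamics.
    norm_sq_theory = math.exp(-2 * wd * lr) * norm_sq_theory + lr * lr * (g[0] ** 2 + g[1] ** 2)
    w = [wi - lr * (gi + wd * wi) for wi, gi in zip(w, g)]
norm_sq_empirical = w[0] ** 2 + w[1] ** 2
```

In both cases the continuous prediction and the discrete run agree to within about one percent in this setup. The translation prediction needs no information from the loss at all, while the scale prediction needs the gradient norms, mirroring the data dependence described above.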


Despite being the central guiding principle in the exploration of the physical world, symmetry has been underutilized in understanding the mechanics of neural networks. In this paper, we constructed a unifying theoretical framework harnessing the geometric properties of symmetry and realistic continuous equations for SGD that model weight decay, momentum, stochasticity, and discretization. We use this framework to derive exact dynamics for meaningful combinations of parameters, which we experimentally verified on large scale neural networks and datasets. Overall, our work provides a first step towards understanding the mechanics of learning in neural networks without unrealistic simplifying assumptions.

For more details check out our ICLR paper or this seminar presentation!


We would like to thank our collaborator Javier Sagastuy-Brena and advisors Surya Ganguli and Daniel Yamins.
We would also like to thank Megha Srivastava for very helpful feedback on this post.

  1. David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in neural information processing systems, 8:302–308, 1995.

  2. Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A., May 2019.

  3. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.

  4. RF Warming and BJ Hyett. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of computational physics, 14(2):159–179, 1974.

  5. David GT Barrett and Benoit Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020.

  6. Nikola B Kovachki and Andrew M Stuart. Analysis of momentum methods. arXiv preprint arXiv:1906.04285, 2019.


Using container images to run TensorFlow models in AWS Lambda

TensorFlow is an open-source machine learning (ML) library widely used to develop neural networks and ML models. Those models are usually trained on multiple GPU instances to speed up training, resulting in expensive training time and model sizes up to a few gigabytes. After they’re trained, these models are deployed in production to produce inferences, which can run as synchronous, asynchronous, or batch-based workloads. The endpoints serving them need to be highly scalable and resilient in order to process from zero to millions of requests. This is where AWS Lambda can be a compelling compute service for scalable, cost-effective, and reliable synchronous and asynchronous ML inferencing. Lambda offers benefits such as automatic scaling, reduced operational overhead, and pay-per-inference billing.

This post shows you how to use any TensorFlow model with Lambda for scalable inferences in production with up to 10 GB of memory. This allows us to use ML models in Lambda functions up to a few gigabytes. For this post, we use TensorFlow-Keras pre-trained ResNet50 for image classification.

Overview of solution

Lambda is a serverless compute service that lets you run code without provisioning or managing servers. Lambda automatically scales your application by running code in response to every event, allowing event-driven architectures and solutions. The code runs in parallel and processes each event individually, scaling with the size of the workload, from a few requests per day to hundreds of thousands of workloads. The following diagram illustrates the architecture of our solution.


You can package your code and dependencies as a container image using tools such as the Docker CLI. The maximum container size is 10 GB. After the model for inference is Dockerized, you can upload the image to Amazon Elastic Container Registry (Amazon ECR). You can then create the Lambda function from the container image stored in Amazon ECR.


For this walkthrough, you should have the following prerequisites:

Implementing the solution

We use a pre-trained model from the TensorFlow Hub for image classification. When an image is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket, a Lambda function is invoked to classify the image and print the prediction to the Amazon CloudWatch logs. The following diagram illustrates this workflow.


To implement the solution, complete the following steps:

  1. On your local machine, create a folder with the name lambda-tensorflow-example.
  2. Create a requirements.txt file in that directory.
  3. Add all the needed libraries for your ML model. For this post, we use TensorFlow 2.4.
  4. Create an app.py script that contains the code for the Lambda function.
  5. Create a Dockerfile in the same directory.

The following text is an example of the requirements.txt file to run TensorFlow code for our use case:

# List all python libraries for the lambda

The Python code is placed in app.py. The inference function in app.py needs to follow a specific structure to be invoked by the Lambda runtime. For more information about handlers for Lambda, see AWS Lambda function handler in Python. See the following code:

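import io
import urllib.parse

import boto3
import numpy as np
import PIL.Image as Image

import tensorflow as tf
import tensorflow_hub as hub

# Input size expected by the model. 224x224 is the standard ResNet50 input
# (assumed here; adjust it if your model expects a different size).
IMAGE_WIDTH = 224
IMAGE_HEIGHT = 224
IMAGE_SHAPE = (IMAGE_WIDTH, IMAGE_HEIGHT)

# Load the model once per container so that warm invocations reuse it.
model = tf.keras.Sequential([hub.KerasLayer("model/")])
model.build([None, IMAGE_WIDTH, IMAGE_HEIGHT, 3])

with open('model/ImageNetLabels.txt') as f:
  imagenet_labels = np.array(f.read().splitlines())
s3 = boto3.resource('s3')

def lambda_handler(event, context):
  # The S3 event carries the bucket name and the URL-encoded object key.
  bucket_name = event['Records'][0]['s3']['bucket']['name']
  key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])

  # Resize to the model input size and scale pixel values to [0, 1].
  img = readImageFromBucket(key, bucket_name).resize(IMAGE_SHAPE)
  img = np.array(img)/255.0

  # Add a batch dimension, predict, and look up the most likely label.
  prediction = model.predict(img[np.newaxis, ...])
  predicted_class = imagenet_labels[np.argmax(prediction[0], axis=-1)]

  print('ImageName: {0}, Prediction: {1}'.format(key, predicted_class))

def readImageFromBucket(key, bucket_name):
  # Download the object and decode it as a three-channel RGB image.
  bucket = s3.Bucket(bucket_name)
  obj = bucket.Object(key)
  response = obj.get()
  return Image.open(io.BytesIO(response['Body'].read())).convert('RGB')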

The following Dockerfile for Python 3.8 uses the AWS provided open-source base images that can be used to create container images. The base images are preloaded with language runtimes and other components required to run a container image on Lambda.

# Pull the base image with python 3.8 as a runtime for your Lambda
FROM public.ecr.aws/lambda/python:3.8

# Install OS packages for Pillow-SIMD
RUN yum -y install tar gzip zlib freetype-devel \
    && yum clean all

# Copy the earlier created requirements.txt file to the container
COPY requirements.txt ./

# Install the python requirements from requirements.txt
RUN python3.8 -m pip install -r requirements.txt
# Replace Pillow with Pillow-SIMD to take advantage of AVX2
RUN pip uninstall -y pillow && CC="cc -mavx2" pip install -U --force-reinstall pillow-simd

# Copy the earlier created app.py file to the container
COPY app.py ./

# Download ResNet50 and store it in a directory
RUN mkdir model
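# Replace <RESNET50_MODEL_URL> with the download URL of the compressed model
RUN curl -L "<RESNET50_MODEL_URL>" -o ./model/resnet.tar.gz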
RUN tar -xf model/resnet.tar.gz -C model/
RUN rm -r model/resnet.tar.gz

# Download ImageNet labels
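# Replace <IMAGENET_LABELS_URL> with the download URL of ImageNetLabels.txt
RUN curl "<IMAGENET_LABELS_URL>" -o ./model/ImageNetLabels.txt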

# Set the CMD to your handler
CMD ["app.lambda_handler"]

Your folder structure should look like the following screenshot.


You can build and push the container image to Amazon ECR with the following bash commands. Replace the <AWS_ACCOUNT_ID> with your own AWS account ID and also specify a <REGION>.

# Build the docker image
docker build -t lambda-tensorflow-example .

# Create an ECR repository
aws ecr create-repository --repository-name lambda-tensorflow-example --image-scanning-configuration scanOnPush=true --region <REGION>

# Tag the image to match the repository name
docker tag lambda-tensorflow-example:latest <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/lambda-tensorflow-example:latest

# Register docker to ECR
aws ecr get-login-password --region <REGION> | docker login --username AWS --password-stdin <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com

# Push the image to ECR
docker push <AWS_ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/lambda-tensorflow-example:latest

If you want to test your model inference locally, the base images for Lambda include a Runtime Interface Emulator (RIE) that lets you test a Lambda function packaged as a container image on your own machine, which speeds up development cycles.
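As a rough sketch of such a local test, the helper below builds a minimal S3 event containing only the fields the handler reads and posts it to the emulator. It assumes the container was started with `docker run -p 9000:8080 lambda-tensorflow-example:latest`, so that the RIE invocation endpoint is reachable on port 9000. The function names `make_s3_event` and `invoke_local` are hypothetical, and the container still needs AWS credentials and a real bucket and key to download the image.

```python
import json
import urllib.request

# Invocation URL exposed by the Runtime Interface Emulator when host port 9000
# is mapped to container port 8080 (the port mapping is an assumption).
RIE_URL = "http://localhost:9000/2015-03-31/functions/function/invocations"

def make_s3_event(bucket_name, key):
    """Build a minimal S3 event with only the fields the handler reads."""
    return {
        "Records": [
            {"s3": {"bucket": {"name": bucket_name}, "object": {"key": key}}}
        ]
    }

def invoke_local(event, url=RIE_URL):
    """POST the event to the locally running container and return the raw response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# Example payload; the bucket and key are placeholders, not real resources.
sample_event = make_s3_event("tensorflow-images-for-inference-example", "parrot.jpg")
```

Calling `invoke_local(sample_event)` while the container is running sends the event to the handler. The prediction appears in the container's standard output, because the handler prints its result rather than returning it.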

Creating an S3 bucket

As a next step, we create an S3 bucket to store the images used to predict the image class.

  1. On the Amazon S3 console, choose Create bucket.
  2. Give the S3 bucket a name, such as tensorflow-images-for-inference-<Random_String> and replace the <Random_String> with a random value.
  3. Choose Create bucket.

Creating the Lambda function with the TensorFlow code

To create your Lambda function, complete the following steps:

  1. On the Lambda console, choose Functions.
  2. Choose Create function.
  3. Select Container image.
  4. For Function name, enter a name, such as tensorflow-endpoint.
  5. For Container image URI, enter the earlier created lambda-tensorflow-example repository.
  6. Choose Browse images to choose the latest image.
  7. Choose Create function to start creating the function.
  8. To improve performance, increase the function memory to at least 6 GB and the timeout to 5 minutes in the Basic settings.

For more information about function memory and timeout settings, see New for AWS Lambda – Functions with Up to 10 GB of Memory and 6 vCPUs.

Connecting the S3 bucket to your Lambda function

After the successful creation of the Lambda function, we need to add a trigger to it so that whenever a file is uploaded to the S3 bucket, the function is invoked.

  1. On the Lambda console, choose your function.
  2. Choose Add trigger.
  3. Choose S3.
  4. For Bucket, choose the bucket you created earlier.

After the trigger is added, you need to allow the Lambda function to connect to the S3 bucket by setting the appropriate AWS Identity and Access Management (IAM) rights for its execution role.

  1. On the Permissions tab for your function, choose the IAM role.
  2. Choose Attach policies.
  3. Search for AmazonS3ReadOnlyAccess and attach it to the IAM role.

Now you have configured all the necessary services to test your function. Upload a JPG image to the created S3 bucket by opening the bucket in the AWS management console and clicking Upload. After a few seconds, you can see the result of the prediction in the CloudWatch logs. As a follow-up step, you could store the predictions in an Amazon DynamoDB table.

After you upload a JPG picture to the S3 bucket, the S3 trigger invokes the Lambda function, which pulls the image from the bucket and prints the predicted image class to CloudWatch. As an example, we use a picture of a parrot as input for our inference endpoint.

In the CloudWatch logs the predicted class is printed. Indeed, the model predicts the correct class for the picture (macaw):


To achieve optimal performance, you can try various memory settings (the assigned vCPU scales linearly with memory; to learn more, read this AWS News Blog). For our deployed model, we realize most of the performance gains at a setting of about 3 GB to 4 GB (~2 vCPUs); gains beyond that are relatively small. Different models see different levels of performance improvement from additional CPU, so it is best to determine this experimentally for your own model. Additionally, we highly recommend compiling your source code to take advantage of Advanced Vector Extensions 2 (AVX2) on Lambda, which further increases performance by allowing vCPUs to run a higher number of integer and floating-point operations per clock cycle.


Container image support for Lambda allows you to customize your function even more, opening up a lot of new use cases for serverless ML. You can bring your custom models and deploy them on Lambda using up to 10 GB for the container image size. For smaller models that don’t need much computing power, you can perform online training and inference purely in Lambda. When the model size increases, cold start issues become more and more important and need to be mitigated. There is also no restriction on the framework or language with container images; other ML frameworks such as PyTorch, Apache MXNet, XGBoost, or Scikit-learn can be used as well!

If you do require a GPU for your inference, you can consider using container services such as Amazon Elastic Container Service (Amazon ECS) or Kubernetes, or deploying the model to an Amazon SageMaker endpoint.

About the Author

Jan Bauer is a Cloud Application Developer at AWS Professional Services. His interests are serverless computing, machine learning, and everything that involves cloud computing.


Process documents containing handwritten tabular content using Amazon Textract and Amazon A2I

Even in this digital age where more and more companies are moving to the cloud and using machine learning (ML) or technology to improve business processes, we still see a vast number of companies reach out and ask about processing documents, especially documents with handwriting. We see employment forms, time cards, and financial applications with tables and forms that contain handwriting in addition to printed information. To complicate things, each document can be in various formats, and each institution within any given industry may have several different formats. Organizations are looking for a simple solution that can process complex documents with varying formats, including tables, forms, and tabular data.

Extracting data from these documents, especially when you have a combination of printed and handwritten text, is error-prone, time-consuming, expensive, and not scalable. Text embedded in tables and forms adds to the extraction and processing complexity. Amazon Textract is an AWS AI service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables.

After the data is extracted, the postprocessing step in a document management workflow involves reviewing the entries and making changes as required by downstream processing applications. Amazon Augmented AI (Amazon A2I) makes it easy to configure a human review into your ML workflow. You can automatically route predictions to a human reviewer when the results fall below a specified confidence threshold, set up review and auditing workflows, and modify the prediction results as needed.

In this post, we show how you can use the Amazon Textract Handwritten feature to extract tabular data from documents and have a human review loop using the Amazon A2I custom task type to make sure that the predictions are highly accurate. We store the results in Amazon DynamoDB, which is a key-value and document database that delivers single-digit millisecond performance at any scale, making the data available for downstream processing.

We walk you through the following steps using a Jupyter notebook:

  1. Use Amazon Textract to retrieve tabular data from the document and inspect the response.
  2. Set up an Amazon A2I human loop to review and modify the Amazon Textract response.
  3. Evaluate the Amazon A2I response and store it in DynamoDB for downstream processing.


Before getting started, let’s configure the walkthrough Jupyter notebook using an AWS CloudFormation template and then create an Amazon A2I private workforce, which is needed in the notebook to set up the custom Amazon A2I workflow.

Setting up the Jupyter notebook

We deploy a CloudFormation template that performs much of the initial setup work for you, such as creating an AWS Identity and Access Management (IAM) role for Amazon SageMaker, creating a SageMaker notebook instance, and cloning the GitHub repo into the notebook instance.

  1. Choose Launch Stack to configure the notebook in the US East (N. Virginia) Region:
  2. Don’t make any changes to stack name or parameters.
  3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

The following screenshot of the stack details page shows the status of the stack as CREATE_IN_PROGRESS. It can take up to 20 minutes for the status to change to CREATE_COMPLETE.

  1. On the SageMaker console, choose Notebook Instances.
  2. Choose Open Jupyter for the TextractA2INotebook notebook you created.
  3. Open textract-hand-written-a2i-forms.ipynb and follow along there.

Setting up an Amazon A2I private workforce

For this post, you create a private work team and add only one user (you) to it. For instructions, see Create a Private Workforce (Amazon SageMaker Console). When the user (you) accepts the invitation, you have to add yourself to the workforce. For instructions, see the Add a Worker to a Work Team section in Manage a Workforce (Amazon SageMaker Console).

After you create a labeling workforce, copy the workforce ARN and enter it in the notebook cell to set up a private review workforce:

WORKTEAM_ARN= "<your workteam ARN>"

In the following sections, we walk you through the steps to use this notebook.

Retrieving tabular data from the document and inspecting the response

In this section, we go through the following steps using the walkthrough notebook:

  1. Review the sample data, which has both printed and handwritten content.
  2. Set up the helper functions to parse the Amazon Textract response.
  3. Inspect and analyze the Amazon Textract response.

Reviewing the sample data

Review the sample data by running the following notebook cell:

# Document
documentName = "test_handwritten_document.png"


We use the following sample document, which has both printed and handwritten content in tables.

Use the Amazon Textract Parser Library to process the response

We will now import the Amazon Textract Response Parser library to parse and extract what we need from Amazon Textract’s response. There are two main functions here. One, we will extract the form data (key-value pairs) part of the header section of the document. Two, we will parse the table and cells to create a csv file containing the tabular data. In this notebook, we will use Amazon Textract’s Sync API for document extraction, AnalyzeDocument. This accepts image files (png or jpeg) as an input.

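import boto3

# Create a Textract client in the same Region as the notebook
client = boto3.client(
         service_name='textract',
         region_name= 'us-east-1',
)

# Read the sample image as raw bytes
with open(documentName, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', documentName)

# process using image bytes
response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES','FORMS'])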

You can use the Amazon Textract Response Parser library to easily parse the JSON returned by Amazon Textract. The library parses the JSON and provides programming-language-specific constructs to work with different parts of the document. For more details, see the Amazon Textract Parser Library.

from trp import Document
# Parse JSON response from Textract
doc = Document(response)

# Iterate over elements in the document
for page in doc.pages:
    # Print lines and words
    for line in page.lines:
        print("Line: {}".format(line.text))
        for word in line.words:
            print("Word: {}".format(word.text))

    # Print tables
    for table in page.tables:
        for r, row in enumerate(table.rows):
            for c, cell in enumerate(row.cells):
                print("Table[{}][{}] = {}".format(r, c, cell.text))

    # Print fields
    for field in page.form.fields:
        print("Field: Key: {}, Value: {}".format(field.key.text, field.value.text))

Now that we have the contents we need from the document image, let’s create a csv file to store it and also use it for setting up the Amazon A2I human loop for review and modification as needed.

import csv

# Lets get the form data into a csv file
# (page is the last page visited by the loop in the previous cell)
with open('test_handwritten_form.csv', 'w', newline='') as csvfile:
    formwriter = csv.writer(csvfile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    for field in page.form.fields:
        formwriter.writerow([field.key.text+" "+field.value.text])

# Lets get the table data into a csv file
# (table is the last table visited by the loop in the previous cell)
with open('test_handwritten_tab.csv', 'w', newline='') as csvfile:
    tabwriter = csv.writer(csvfile, delimiter=',')
    for r, row in enumerate(table.rows):
        csvrow = []
        for c, cell in enumerate(row.cells):
            if cell.text:
                csvrow.append(cell.text)
        # write one csv line per table row
        tabwriter.writerow(csvrow)

Alternatively, if you would like to modify this notebook to use a PDF file or for batch processing of documents, use the StartDocumentAnalysis API. StartDocumentAnalysis returns a job identifier (JobId) that you use to get the results of the operation. When text analysis is finished, Amazon Textract publishes a completion status to the Amazon Simple Notification Service (Amazon SNS) topic that you specify in NotificationChannel. To get the results of the text analysis operation, first check that the status value published to the Amazon SNS topic is SUCCEEDED. If so, call GetDocumentAnalysis, and pass the job identifier (JobId) from the initial call to StartDocumentAnalysis.
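The asynchronous flow described above can be sketched as follows. For brevity, this sketch polls `GetDocumentAnalysis` instead of subscribing to the Amazon SNS topic, which is a simplification of what the paragraph describes. The function name, the polling defaults, and the idea of passing in the client are assumptions for illustration; `client` is expected to be a `boto3.client('textract')` instance and the document must already be in Amazon S3.

```python
import time
from typing import Any, Dict, List


def analyze_document_async(client: Any, bucket: str, key: str,
                           poll_seconds: float = 5.0,
                           max_polls: int = 60) -> List[Dict[str, Any]]:
    """Start an asynchronous analysis job and return all result blocks."""
    job = client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
    )
    job_id = job["JobId"]

    # Poll until the job reaches a terminal state (simplification: no SNS).
    for _ in range(max_polls):
        result = client.get_document_analysis(JobId=job_id)
        status = result["JobStatus"]
        if status in ("SUCCEEDED", "PARTIAL_SUCCESS"):
            break
        if status == "FAILED":
            raise RuntimeError(f"Textract job {job_id} failed")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    # Results are paginated; follow NextToken until it is absent.
    blocks = list(result["Blocks"])
    while "NextToken" in result:
        result = client.get_document_analysis(JobId=job_id,
                                              NextToken=result["NextToken"])
        blocks.extend(result["Blocks"])
    return blocks
```

Polling is convenient in a notebook; for production workloads, the SNS notification described above avoids repeated status calls.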

Inspecting and analyzing the Amazon Textract response

We now load the form line items into a Pandas DataFrame and clean it up to ensure we have the relevant columns and rows that downstream applications need. We then send it to Amazon A2I for human review.

Run the following notebook cell to inspect and analyze the key-value data from the Amazon Textract response:

import pandas as pd

# Load the csv file contents into a dataframe, strip out extra spaces, use comma as delimiter
df_form = pd.read_csv('test_handwritten_form.csv', header=None, quoting=csv.QUOTE_MINIMAL, sep=',')
# Rename column
df_form = df_form.rename(columns={df_form.columns[0]: 'FormHeader'})
# display the dataframe
df_form

The following screenshot shows our output.

Run the following notebook cell to inspect and analyze the tabular data from the Amazon Textract response:

# Load the csv file contents into a dataframe, strip out extra spaces, use comma as delimiter
df_tab = pd.read_csv('test_handwritten_tab.csv', header=1, quoting=csv.QUOTE_MINIMAL, sep=',')
# display the dataframe
df_tab

The following screenshot shows our output.

We can see that Amazon Textract detected both printed and handwritten content from the tabular data.

Setting up an Amazon A2I human loop

Amazon A2I supports two built-in task types: Amazon Textract key-value pair extraction and Amazon Rekognition image moderation, and a custom task type that you can use to integrate a human review loop into any ML workflow. You can use a custom task type to integrate Amazon A2I with other AWS services like Amazon Comprehend, Amazon Transcribe, and Amazon Translate, as well as your own custom ML workflows. To learn more, see Use Cases and Examples using Amazon A2I.

In this section, we show how to use the Amazon A2I custom task type to integrate with Amazon Textract tables and key-value pairs through the walkthrough notebook for low-confidence detection scores from Amazon Textract responses. It includes the following steps:

  1. Create a human task UI.
  2. Create a workflow definition.
  3. Send predictions to Amazon A2I human loops.
  4. Sign in to the worker portal and annotate or verify the Amazon Textract results.

Creating a human task UI

You can create a task UI for your workers by creating a worker task template. A worker task template is an HTML file that you use to display your input data and instructions to help workers complete your task. If you’re creating a human review workflow for a custom task type, you must create a custom worker task template using HTML code. For more information, see Create Custom Worker Task Template.

For this post, we created a custom UI HTML template to render Amazon Textract tables and key-value pairs in the notebook. You can find the template tables-keyvalue-sample.liquid.html in our GitHub repo and customize it for your specific document use case.

This template is used whenever a human loop is required. We have over 70 pre-built UIs available on GitHub. Optionally, you can create this workflow definition on the Amazon A2I console. For instructions, see Create a Human Review Workflow.

After you create this custom template using HTML, you must use this template to generate an Amazon A2I human task UI Amazon Resource Name (ARN). This ARN has the following format: arn:aws:sagemaker:<aws-region>:<aws-account-number>:human-task-ui/<template-name>. This ARN is associated with a worker task template resource that you can use in one or more human review workflows (flow definitions). Generate a human task UI ARN using a worker task template by using the CreateHumanTaskUi API operation by running the following notebook cell:

def create_task_ui():
    '''
    Creates a Human Task UI resource.

    Returns:
    struct: HumanTaskUiArn
    '''
    # taskUIName (assumed to be defined earlier in the notebook) is a unique name you choose
    response = sagemaker_client.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response

# Create task UI
humanTaskUiResponse = create_task_ui()
humanTaskUiArn = humanTaskUiResponse['HumanTaskUiArn']

The preceding code gives you an ARN as output, which we use in setting up flow definitions in the next step.
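As a quick sanity check on the ARN format given earlier, the following sketch parses a human task UI ARN into its parts. The helper name and the example ARN (including the template name `ui-hw-forms`) are hypothetical, and the pattern only covers the `arn:aws` form shown in this post.

```python
import re
from typing import Dict

# Matches arn:aws:sagemaker:<aws-region>:<aws-account-number>:human-task-ui/<template-name>
_HUMAN_TASK_UI_ARN = re.compile(
    r"^arn:aws:sagemaker:(?P<region>[a-z0-9-]+):(?P<account>\d{12})"
    r":human-task-ui/(?P<name>.+)$"
)


def parse_human_task_ui_arn(arn: str) -> Dict[str, str]:
    """Split a human task UI ARN into region, account, and template name."""
    match = _HUMAN_TASK_UI_ARN.match(arn)
    if match is None:
        raise ValueError(f"Not a human task UI ARN: {arn!r}")
    return match.groupdict()


# Hypothetical ARN for illustration only.
parts = parse_human_task_ui_arn(
    "arn:aws:sagemaker:us-east-1:123456789012:human-task-ui/ui-hw-forms"
)
print(parts)
```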


Creating the workflow definition

In this section, we create a flow definition. Flow definitions allow us to specify the following:

  • The workforce that your tasks are sent to
  • The instructions that your workforce receives (worker task template)
  • Where your output data is stored

For this post, we use the API in the following code:

create_workflow_definition_response = sagemaker_client.create_flow_definition(
        FlowDefinitionName= flowDefinitionName,
        RoleArn= role,
        HumanLoopConfig= {
            "WorkteamArn": WORKTEAM_ARN,
            "HumanTaskUiArn": humanTaskUiArn,
            "TaskCount": 1,
            "TaskDescription": "Review the table contents and correct values as indicated",
            "TaskTitle": "Employment History Review"
        },
        OutputConfig={
            "S3OutputPath" : OUTPUT_PATH
        }
    )
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn'] # let's save this ARN for future use

Optionally, you can create this workflow definition on the Amazon A2I console. For instructions, see Create a Human Review Workflow.

Sending predictions to Amazon A2I human loops

We create an item list from the Pandas DataFrame where we have the Amazon Textract output saved. Run the following notebook cell to create a list of items to be sent for review:

NUM_TO_REVIEW = len(df_tab) # number of line items to review
dfstart = df_tab['Start Date'].to_list()
dfend = df_tab['End Date'].to_list()
dfemp = df_tab['Employer Name'].to_list()
dfpos = df_tab['Position Held'].to_list()
dfres = df_tab['Reason for leaving'].to_list()
item_list = [{'row': "{}".format(x), 'startdate': dfstart[x], 'enddate': dfend[x], 'empname': dfemp[x], 'posheld': dfpos[x], 'resleave': dfres[x]} for x in range(NUM_TO_REVIEW)]

You get an output of all the rows and columns received from Amazon Textract:

[{'row': '0',
  'startdate': '1/15/2009 ',
  'enddate': '6/30/2011 ',
  'empname': 'Any Company ',
  'posheld': 'Assistant baker ',
  'resleave': 'relocated '},
 {'row': '1',
  'startdate': '7/1/2011 ',
  'enddate': '8/10/2013 ',
  'empname': 'Example Corp. ',
  'posheld': 'Baker ',
  'resleave': 'better opp. '},
 {'row': '2',
  'startdate': '8/15/2013 ',
  'enddate': 'Present ',
  'empname': 'AnyCompany ',
  'posheld': 'head baker ',
  'resleave': 'N/A current '}]

Run the following notebook cell to get a list of key-value pairs:

dforighdr = df_form['FormHeader'].to_list()
hdr_list = [{'hdrrow': "{}".format(x), 'orighdr': dforighdr[x]} for x in range(len(df_form))]

Run the following code to create a JSON response for the Amazon A2I loop by combining the key-value and table list from the preceding cells:

ip_content = {"Header": hdr_list,
              'Pairs': item_list,
              'image1': s3_img_url
             }

Start the human loop by running the following notebook cell:

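# Activate human loops
import json
import uuid
humanLoopName = str(uuid.uuid4())

# a2i is the sagemaker-a2i-runtime client created earlier in the notebook
start_loop_response = a2i.start_human_loop(
            HumanLoopName=humanLoopName,
            FlowDefinitionArn=flowDefinitionArn,
            HumanLoopInput={
                "InputContent": json.dumps(ip_content)
            }
        )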

Check the status of human loop with the following code:

completed_human_loops = []
resp = a2i.describe_human_loop(HumanLoopName=humanLoopName)
print(f'HumanLoop Name: {humanLoopName}')
print(f'HumanLoop Status: {resp["HumanLoopStatus"]}')
print(f'HumanLoop Output Destination: {resp["HumanLoopOutput"]}')
# Keep track of finished loops so we can fetch their results later
if resp["HumanLoopStatus"] == "Completed":
    completed_human_loops.append(resp)

You get the following output, which shows the status of the human loop and the output destination S3 bucket:

HumanLoop Name: f69bb14e-3acd-4301-81c0-e272b3c77df0
HumanLoop Status: InProgress
HumanLoop Output Destination: {'OutputS3Uri': 's3://sagemaker-us-east-1-<aws-account-nr>/textract-a2i-handwritten/a2i-results/fd-hw-forms-2021-01-11-16-54-31/2021/01/11/16/58/13/f69bb14e-3acd-4301-81c0-e272b3c77df0/output.json'}

Annotating the results via the worker portal

Run the steps in the notebook to check the status of the human loop. You can use the accompanying SageMaker Jupyter notebook to follow the steps in this post.

  1. Run the following notebook cell to get a login link to navigate to the private workforce portal:
    workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
    print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
    print('https://' + sagemaker_client.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])

  2. Choose the login link to the private worker portal.
  3. Select the human review job.
  4. Choose Start working.

You’re redirected to the Amazon A2I console, where you find the original document displayed, your key-value pair, the text responses detected from Amazon Textract, and your table’s responses.

Scroll down to find the correction form for key-value pairs and text, where you can verify the results and compare the Amazon Textract response to the original document. You will also find the UI to modify the tabular handwritten and printed content.

You can modify each cell based on the original document image, re-enter the correct values, and submit your response. The labeling workflow is complete when you submit your responses.

Evaluating the results

When the labeling work is complete, your results should be available in the S3 output path specified in the human review workflow definition. The human answers are returned and saved in the JSON file. Run the notebook cell to get the results from Amazon S3:

import re
import pprint

pp = pprint.PrettyPrinter(indent=4)

# Note: 'a2i-experiments' must match the bucket in your S3 output path
for resp in completed_human_loops:
    splitted_string = re.split('s3://' +  'a2i-experiments' + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    output_bucket_key = splitted_string[1]

    response = s3.get_object(Bucket='a2i-experiments', Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    # Print the annotation output for inspection
    pp.pprint(json_output)
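The cell above depends on a hard-coded bucket name. As an alternative sketch, the bucket and key can be read directly from the `OutputS3Uri` value, so the same code works for any output bucket. The helper name and the example URI are hypothetical.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical URI for illustration only.
bucket, key = split_s3_uri("s3://example-bucket/a2i-results/2021/01/11/output.json")
print(bucket, key)
```

The resulting pair can be passed to `s3.get_object(Bucket=bucket, Key=key)` in place of the literal bucket name.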

The following code shows a snippet of the Amazon A2I annotation output JSON file:

{   'flowDefinitionArn': 'arn:aws:sagemaker:us-east-1:<aws-account-nr>:flow-definition/fd-hw-invoice-2021-02-22-23-07-53',
    'humanAnswers': [   {   'acceptanceTime': '2021-02-22T23:08:38.875Z',
                            'answerContent': {   'TrueHdr3': 'Full Name: Jane '
                                                 'predicted1': 'relocated',
                                                 'predicted2': 'better opp.',
                                                 'predicted3': 'N/A, current',
                                                 'predictedhdr1': 'Phone '
                                                                  'Number: '
                                                 'predictedhdr2': 'Mailing '
                                                                  'Address: '
                                                                  'same as '
                                                 'predictedhdr3': 'Full Name: '
                                                                  'Jane Doe',
                                                 'predictedhdr4': 'Home '
                                                                  'Address: '
                                                                  '123 Any '
                                                                  'Street, Any '
                                                                  'Town. USA',
                                                 'rating1': {   'agree': True,
                                                                'disagree': False},
                                                 'rating2': {   'agree': True,
                                                                'disagree': False},
                                                 'rating3': {   'agree': False,
                                                                'disagree': True},
                                                 'rating4': {   'agree': True,
                                                                'disagree': False},
                                                 'ratingline1': {   'agree': True,
                                                                    'disagree': False},
                                                 'ratingline2': {   'agree': True,
                                                                    'disagree': False},
                                                 'ratingline3': {   'agree': True,
                                                                    'disagree': False}}
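One way to work with this output is to list only the header rows where the reviewer disagreed with the prediction. The sketch below assumes, based on the snippet above, that `ratingN`, `predictedhdrN`, and `TrueHdrN` refer to the same header row N; the function name and the sample values are made up for illustration.

```python
from typing import Any, Dict, List


def header_corrections(answer_content: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect header rows that the reviewer marked as 'disagree'."""
    corrections = []
    index = 1
    # Walk rating1, rating2, ... until the next rating key is missing.
    while f"rating{index}" in answer_content:
        rating = answer_content[f"rating{index}"]
        if rating.get("disagree"):
            corrections.append({
                "row": index,
                "predicted": answer_content.get(f"predictedhdr{index}"),
                "corrected": answer_content.get(f"TrueHdr{index}"),
            })
        index += 1
    return corrections


# Made-up answerContent shaped like the snippet above (not real output).
sample_answers = {
    "predictedhdr1": "Phone Number: ",
    "predictedhdr2": "Full Name: Jane Doe",
    "rating1": {"agree": True, "disagree": False},
    "rating2": {"agree": False, "disagree": True},
    "TrueHdr2": "Full Name: Jane Roe",
}

print(header_corrections(sample_answers))
```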

Storing the Amazon A2I annotated results in DynamoDB

We now store the form with the updated contents in a DynamoDB table so downstream applications can use it. To automate the process, simply set up an AWS Lambda trigger with DynamoDB to automatically extract and send information to your API endpoints or applications. For more information, see DynamoDB Streams and AWS Lambda Triggers.
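A Lambda function triggered by a DynamoDB stream might look like the following minimal sketch. It assumes the stream is configured to include new item images, and it simply returns the flattened items instead of calling a real endpoint; the forwarding step and the sample attribute names are left as assumptions.

```python
from typing import Any, Dict, List


def lambda_handler(event: Dict[str, Any], context: Any) -> List[Dict[str, Any]]:
    """Flatten newly inserted items from a DynamoDB stream event."""
    forwarded = []
    for record in event.get("Records", []):
        # Only react to new rows; skip MODIFY and REMOVE events.
        if record.get("eventName") != "INSERT":
            continue
        image = record["dynamodb"].get("NewImage", {})
        # Stream images use typed values, e.g. {"S": "Baker"} or {"N": "2"}.
        item = {name: next(iter(typed.values())) for name, typed in image.items()}
        forwarded.append(item)
    # A real function would send these items to your API or application here.
    return forwarded
```

Note that numeric attributes arrive as strings in the stream record, so downstream code may need to convert them.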

To store your results, complete the following steps:

  1. Get the human answers for the key-values and text into a DataFrame by running the following notebook cell:
    #updated array values to be strings for dataframe assignment
    for i in json_output['humanAnswers']:
        x = i['answerContent']
    for j in range(0, len(df_form)):
        df_form.at[j, 'TrueHeader'] = str(x.get('TrueHdr'+str(j+1)))
        df_form.at[j, 'Comments'] = str(x.get('Comments'+str(j+1)))
    df_form = df_form.where(df_form.notnull(), None)

  2. Get the human-reviewed answers for tabular data into a DataFrame by running the following cell:
    #updated array values to be strings for dataframe assignment
    for i in json_output['humanAnswers']:
        x = i['answerContent']
    for j in range(0, len(df_tab)):
        df_tab.at[j, 'TrueStartDate'] = str(x.get('TrueStartDate'+str(j+1)))
        df_tab.at[j, 'TrueEndDate'] = str(x.get('TrueEndDate'+str(j+1)))
        df_tab.at[j, 'TrueEmpName'] = str(x.get('TrueEmpName'+str(j+1)))
        df_tab.at[j, 'TruePosHeld'] = str(x.get('TruePosHeld'+str(j+1)))
        df_tab.at[j, 'TrueResLeave'] = str(x.get('TrueResLeave'+str(j+1)))
        df_tab.at[j, 'ChangeComments'] = str(x.get('Change Reason'+str(j+1)))
    df_tab = df_tab.where(df_tab.notnull(), None)

  3. Combine the DataFrames into one DataFrame to save in the DynamoDB table:
    # Join both the dataframes to prep for insert into DynamoDB
    df_doc = df_form.join(df_tab, how='outer')
    df_doc = df_doc.where(df_doc.notnull(), None)

Creating the DynamoDB table

Create your DynamoDB table with the following code:

# Get the service resource.
dynamodb = boto3.resource('dynamodb')
tablename = "emp_history-"+str(uuid.uuid4())

# Create the DynamoDB table.
table = dynamodb.create_table(
    TableName=tablename,
    KeySchema=[
        {'AttributeName': 'line_nr', 'KeyType': 'HASH'}
    ],
    AttributeDefinitions=[
        {'AttributeName': 'line_nr', 'AttributeType': 'N'}
    ],
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5
    }
)
# Wait until the table exists.
table.meta.client.get_waiter('table_exists').wait(TableName=tablename)
# Print out some data about the table.
print("Table successfully created. Item count is: " + str(table.item_count))

You get the following output:

Table successfully created. Item count is: 0

Uploading the contents of the DataFrame to a DynamoDB table

Upload the contents of your DataFrame to your DynamoDB table with the following code:

Note: When adding contents from multiple documents in your DynamoDB table, please ensure you add a document number as an attribute to differentiate between documents. In the example below we just use the index as the line_nr because we are working with a single document.

for idx, row in df_doc.iterrows():
    table.put_item(
        Item={
        'line_nr': idx,
        'orig_hdr': str(row['FormHeader']) ,
        'true_hdr': str(row['TrueHeader']),
        'comments': str(row['Comments']),
        'start_date': str(row['Start Date ']),
        'end_date': str(row['End Date ']),
        'emp_name': str(row['Employer Name ']),
        'position_held': str(row['Position Held ']),
        'reason_for_leaving': str(row['Reason for leaving']),
        'true_start_date': str(row['TrueStartDate']),
        'true_end_date': str(row['TrueEndDate']),
        'true_emp_name': str(row['TrueEmpName']),
        'true_position_held': str(row['TruePosHeld']),
        'true_reason_for_leaving': str(row['TrueResLeave']),
        'change_comments': str(row['ChangeComments'])
        }
    )
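For the multi-document case mentioned in the note, one possible approach is to tag each item with a document identifier before writing it. The sketch below only builds the item dictionaries; the `doc_id` attribute name and the helper are hypothetical, and the table would need a key schema that includes the document identifier (for example, `doc_id` as the partition key and `line_nr` as the sort key).

```python
from typing import Any, Dict, Iterable, List


def build_items(doc_id: str, rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Attach a document id and line number to each row before upload."""
    items = []
    for line_nr, row in enumerate(rows):
        item: Dict[str, Any] = {"doc_id": doc_id, "line_nr": line_nr}
        # Store the remaining values as strings, as the upload cell does.
        item.update({name: str(value) for name, value in row.items()})
        items.append(item)
    return items


# Made-up rows for illustration only.
example_items = build_items("doc-001", [
    {"emp_name": "Any Company", "position_held": "Assistant baker"},
    {"emp_name": "Example Corp.", "position_held": "Baker"},
])
print(example_items)
```

Each dictionary could then be passed to `table.put_item(Item=...)`, so that rows from different documents no longer collide on `line_nr`.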

To check if the items were updated, run the following code to retrieve the DynamoDB table value:

response = table.get_item(
    Key={
        'line_nr': 2
    }
)
item = response['Item']
print(item)

Alternatively, you can check the table on the DynamoDB console, as in the following screenshot.


This post demonstrated how easy it is to use services in the AI layer of the AWS AI/ML stack, such as Amazon Textract and Amazon A2I, to read and process tabular data from handwritten forms, and store them in a DynamoDB table for downstream applications to use. You can also send the augmented form data from Amazon A2I to an S3 bucket to be consumed by your AWS analytics applications.

For video presentations, sample Jupyter notebooks, or more information about use cases like document processing, content moderation, sentiment analysis, text translation, and more, see Amazon Augmented AI Resources. If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!

About the Authors

Prem Ranga is an Enterprise Solutions Architect based out of Atlanta, GA. He is part of the Machine Learning Technical Field Community and loves working with customers on their ML and AI journey. Prem is passionate about robotics, is an autonomous vehicles researcher, and also built the Alexa-controlled Beer Pours in Houston and other locations.


Mona Mona is an AI/ML Specialist Solutions Architect based out of Arlington, VA. She works with the World Wide Public Sector team and helps customers adopt machine learning on a large scale. She is passionate about NLP and ML explainability areas in AI/ML.



Sriharsha M S is an AI/ML specialist solution architect in the Strategic Specialist team at Amazon Web Services. He works with strategic AWS customers who are taking advantage of AI/ML to solve complex business problems. He provides technical guidance and design advice to implement AI/ML applications at scale. His expertise spans application architecture, big data, analytics, and machine learning.

Feelin’ Like a Million MBUX: AI Cockpit Featured in Popular Mercedes-Benz C-Class

It’s hard not to feel your best when your car makes every commute a VIP experience.

This week, Mercedes-Benz launched the redesigned C-Class sedan and C-Class wagon, packed with new features for the next generation of driving. Both models prominently feature the latest MBUX AI cockpit, powered by NVIDIA, delivering an intelligent user interface for daily driving.

The newest MBUX system debuted with the flagship S-Class sedan in September. With the C-Class, the system is now in Mercedes-Benz’ most popular model in the mid-size sedan segment — the automaker has sold 10.5 million C-Class vehicles since it was first introduced and one in every seven Mercedes-Benz sold is a member of that model line.

NVIDIA and Mercedes-Benz have been working together to drive the future of automotive innovation, starting with the first generation MBUX to the upcoming fleet of software-defined vehicles.

This extension of MBUX to such an appealing model is accelerating the adoption of AI into everyday commutes, ushering in a new generation where the car adapts to the driver, not the other way around.

Uncommon Intelligence

With MBUX, the new C-Class sedan and wagon share many of the innovations that have made the S-Class a standout in its segment.

AI cockpits orchestrate crucial safety and convenience features, constantly learning to continuously deliver joy to the customer. Similarly, the MBUX system serves as the central nervous system of the vehicle, intelligently networking all its functions.

“MBUX combines so many features into one intelligent user interface,” said Georges Massing, vice president of Digital Vehicle and Mobility at Mercedes-Benz. “It makes life much easier for our customers.”

The new MBUX system makes the cutting edge in graphics, passenger detection and natural language processing seem effortless. Like in the S-Class, the C-Class system features a driver and media display with crisp graphics that are easily understandable at a glance. The “Hey Mercedes” voice assistant has become even sharper, can activate online services, and continuously improves over time.

MBUX can even recognize biometric identification to ensure the car is always safe and secure. A fingerprint scanner located beneath the central display allows users to quickly and securely access personalized features.

And with over-the-air updates, MBUX ensures the latest technology will always be at the user’s fingertips, long after they leave the dealership.

A Modern Sedan for the Modern World

With AI at the helm, the C-Class embraces modern and forward-looking technology as the industry enters a new era of mobility.

The redesigned vehicle maintains the Mercedes-Benz heritage of unparalleled driving dynamics while incorporating intelligent features such as headlights that automatically adapt to the surrounding environment for optimal visibility.

Both the sedan and wagon variants come with plug-in hybrid options that offer more than 60 miles of electric range for a luxurious driving experience that’s also sustainable.

These features, combined with the only AI cockpit available today, will have C-Class drivers feeling like a million bucks.

The post Feelin’ Like a Million MBUX: AI Cockpit Featured in Popular Mercedes-Benz C-Class appeared first on The Official NVIDIA Blog.

The Technology Behind Cinematic Photos

Posted by Per Karlsson and Lucy Yu, Software Engineers, Google Research

Looking at photos from the past can help people relive some of their most treasured moments. Last December we launched Cinematic photos, a new feature in Google Photos that aims to recapture the sense of immersion felt the moment a photo was taken, simulating camera motion and parallax by inferring 3D representations in an image. In this post, we take a look at the technology behind this process, and demonstrate how Cinematic photos can turn a single 2D photo from the past into a more immersive 3D animation.

Camera 3D model courtesy of Rick Reitano.

Depth Estimation
Like many recent computational photography features such as Portrait Mode and Augmented Reality (AR), Cinematic photos requires a depth map to provide information about the 3D structure of a scene. Typical techniques for computing depth on a smartphone rely on multi-view stereo, a geometric method that solves for the depth of objects in a scene by simultaneously capturing multiple photos from different viewpoints, where the distances between the cameras are known. In the Pixel phones, the views come from two cameras or dual-pixel sensors.

To enable Cinematic photos on existing pictures that were not taken in multi-view stereo, we trained a convolutional neural network with encoder-decoder architecture to predict a depth map from just a single RGB image. Using only one view, the model learned to estimate depth using monocular cues, such as the relative sizes of objects, linear perspective, defocus blur, etc.

Because monocular depth estimation datasets are typically designed for domains such as AR, robotics, and self-driving, they tend to emphasize street scenes or indoor room scenes instead of features more common in casual photography, like people, pets, and objects, which have different composition and framing. So, we created our own dataset for training the monocular depth model using photos captured on a custom 5-camera rig as well as another dataset of Portrait photos captured on Pixel 4. Both datasets included ground-truth depth from multi-view stereo that is critical for training a model.

Mixing several datasets in this way exposes the model to a larger variety of scenes and camera hardware, improving its predictions on photos in the wild. However, it also introduces new challenges, because the ground-truth depth from different datasets may differ from each other by an unknown scaling factor and shift. Fortunately, the Cinematic photo effect only needs the relative depths of objects in the scene, not the absolute depths. Thus we can combine datasets by using a scale-and-shift-invariant loss during training and then normalize the output of the model at inference.
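To illustrate what scale-and-shift invariance means in practice, here is a minimal sketch of one common formulation: fit the best scale and shift by least squares, then measure the remaining squared error. The post does not state the exact loss used, so the function below is an assumption for illustration rather than the authors' implementation.

```python
from typing import Sequence


def scale_shift_invariant_mse(pred: Sequence[float],
                              target: Sequence[float]) -> float:
    """MSE after aligning pred to target with the best scale and shift."""
    n = len(pred)
    if n == 0 or n != len(target):
        raise ValueError("pred and target must be non-empty and equal length")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    # Closed-form least-squares solution for target ~ scale * pred + shift.
    scale = cov / var_p if var_p > 0 else 0.0
    shift = mean_t - scale * mean_p
    return sum((scale * p + shift - t) ** 2 for p, t in zip(pred, target)) / n


# Depths that differ only by an unknown scale and shift give a near-zero loss.
predicted = [0.1, 0.4, 0.9]
ground_truth = [2.0 * p + 3.0 for p in predicted]
print(scale_shift_invariant_mse(predicted, ground_truth))
```

Because the alignment absorbs any global scale and shift, two datasets whose ground-truth depths use different units or offsets can contribute to the same training objective.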

The Cinematic photo effect is particularly sensitive to the depth map’s accuracy at person boundaries. An error in the depth map can result in jarring artifacts in the final rendered effect. To mitigate this, we apply median filtering to improve the edges, and also infer segmentation masks of any people in the photo using a DeepLab segmentation model trained on the Open Images dataset. The masks are used to pull forward pixels of the depth map that were incorrectly predicted to be in the background.

Camera Trajectory
There can be many degrees of freedom when animating a camera in a 3D scene, and our virtual camera setup is inspired by professional video camera rigs to create cinematic motion. Part of this is identifying the optimal pivot point for the virtual camera’s rotation in order to yield the best results by drawing one’s eye to the subject.

The first step in 3D scene reconstruction is to create a mesh by extruding the RGB image onto the depth map. By doing so, neighboring points in the mesh can have large depth differences. While this is not noticeable from the “face-on” view, the more the virtual camera is moved, the more likely it is to see polygons spanning large changes in depth. In the rendered output video, this will look like the input texture is stretched. The biggest challenge when animating the virtual camera is to find a trajectory that introduces parallax while minimizing these “stretchy” artifacts.

The parts of the mesh with large depth differences become more visible (red visualization) once the camera is away from the “face-on” view. In these areas, the photo appears to be stretched, which we call “stretchy artifacts”.
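As a rough illustration of where such artifacts originate, the sketch below flags pixels whose depth differs from a horizontal or vertical neighbor by more than a threshold. These are the mesh edges that would span a large depth change once the camera moves. The function name and the threshold are assumptions for illustration, not values from the production system.

```python
def stretch_candidates(depth, threshold):
    """Mark pixels that share an edge with a neighbor at a very different depth."""
    height, width = len(depth), len(depth[0])
    flagged = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Compare with the right-hand neighbor.
            if x + 1 < width and abs(depth[y][x] - depth[y][x + 1]) > threshold:
                flagged[y][x] = flagged[y][x + 1] = True
            # Compare with the neighbor below.
            if y + 1 < height and abs(depth[y][x] - depth[y + 1][x]) > threshold:
                flagged[y][x] = flagged[y + 1][x] = True
    return flagged
```

Whether a flagged edge actually becomes visible depends on the camera pose, which is why the trajectory itself has to be optimized.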

Because of the wide spectrum in user photos and their corresponding 3D reconstructions, it is not possible to share one trajectory across all animations. Instead, we define a loss function that captures how much of the stretchiness can be seen in the final animation, which allows us to optimize the camera parameters for each unique photo. Rather than counting the total number of pixels identified as artifacts, the loss function triggers more heavily in areas with a greater number of connected artifact pixels, which reflects a viewer’s tendency to more easily notice artifacts in these connected areas.
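One simple way to penalize connected artifacts more than scattered ones is to group artifact pixels into connected components and sum a superlinear function of each component's size. The sketch below does this with a breadth-first search; the exponent and the function names are hypothetical, and the actual loss may be defined differently.

```python
from collections import deque


def component_sizes(mask):
    """Return the size of each 4-connected group of truthy pixels in a 2D mask."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    sizes = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(y, x)])
            size = 0
            while queue:
                cy, cx = queue.popleft()
                size += 1
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            sizes.append(size)
    return sizes


def connected_artifact_loss(mask, power=2.0):
    """Sum size**power over components, so large connected regions dominate (power is assumed)."""
    return sum(size ** power for size in component_sizes(mask))
```

With `power=2.0`, four artifact pixels in one block cost 16, while the same four pixels scattered across the image cost only 4.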

We use padded segmentation masks from a human pose network to divide the image into three regions: head, body and background. The loss function is normalized inside each region before the final loss is computed as a weighted sum of the normalized losses. Ideally, the generated output video is free from artifacts, but in practice this is rare. Weighting the regions differently biases the optimization toward trajectories that place artifacts in the background regions rather than near the image subject.

During the camera trajectory optimization, the goal is to select a path for the camera with the fewest noticeable artifacts. In these preview images, artifacts in the output are colored red, while the green and blue overlays visualize the different body regions.
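The region weighting can be sketched as below. This example assumes the per-region loss is simply the fraction of that region's pixels marked as artifacts, and the weights shown are hypothetical; the real system may normalize and weight differently.

```python
def weighted_region_loss(artifacts, regions, weights):
    """Weighted sum of per-region artifact fractions.

    artifacts: 2D list of truthy/falsy artifact flags.
    regions: 2D list of region names, e.g. "head", "body", "background".
    weights: dict mapping region name to its weight (values here are assumptions).
    """
    totals = {name: 0 for name in weights}
    hits = {name: 0 for name in weights}
    for artifact_row, region_row in zip(artifacts, regions):
        for is_artifact, name in zip(artifact_row, region_row):
            totals[name] += 1
            if is_artifact:
                hits[name] += 1
    loss = 0.0
    for name, weight in weights.items():
        if totals[name]:
            # Normalize inside the region so large regions do not dominate by size alone.
            loss += weight * hits[name] / totals[name]
    return loss


# Hypothetical weights: artifacts on the head are penalized most heavily.
EXAMPLE_WEIGHTS = {"head": 5.0, "body": 3.0, "background": 1.0}
```

Under these weights, a trajectory that pushes artifacts into the background scores lower than one that leaves the same proportion of artifacts on the head, so the optimizer would prefer it.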

Framing the Scene
Generally, the reprojected 3D scene does not neatly fit into a rectangle with portrait orientation, so it was also necessary to frame the output with the correct aspect ratio while still retaining the key parts of the input image. To accomplish this, we use a deep neural network that predicts per-pixel saliency of the full image. When framing the virtual camera in 3D, the model identifies and captures as many salient regions as possible while ensuring that the rendered mesh fully occupies every output video frame. This sometimes requires the model to shrink the camera’s field of view.

Heatmap of the predicted per-pixel saliency. We want the creation to include as much of the salient regions as possible when framing the virtual camera.
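A much simplified 2D analogue of this framing step is to search for the fixed-size crop window that covers the most saliency. The real system frames a virtual camera in 3D, so the sketch below only illustrates the objective; the function name and the brute-force search are assumptions.

```python
def best_crop(saliency, crop_height, crop_width):
    """Return (top, left) of the crop window with the highest summed saliency."""
    height, width = len(saliency), len(saliency[0])
    best = None
    for top in range(height - crop_height + 1):
        for left in range(width - crop_width + 1):
            total = sum(
                saliency[y][x]
                for y in range(top, top + crop_height)
                for x in range(left, left + crop_width)
            )
            if best is None or total > best[0]:
                best = (total, top, left)
    if best is None:
        raise ValueError("crop window is larger than the saliency map")
    return best[1], best[2]
```

Because every candidate window lies entirely inside the image, the crop never includes empty regions, which mirrors the requirement that the rendered mesh fill every output frame.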

Through Cinematic photos, we implemented a system of algorithms – with each ML model evaluated for fairness – that work together to allow users to relive their memories in a new way, and we are excited about future research and feature improvements. Now that you know how they are created, keep an eye open for automatically created Cinematic photos that may appear in your recent memories within the Google Photos app!

Cinematic Photos is the result of a collaboration between Google Research and Google Photos teams. Key contributors also include: Andre Le, Brian Curless, Cassidy Curtis, Ce Liu, Chun-po Wang, Daniel Jenstad, David Salesin, Dominik Kaeser, Gina Reynolds, Hao Xu, Huiwen Chang, Huizhong Chen, Jamie Aspinall, Janne Kontkanen, Matthew DuVall, Michael Kucera, Michael Milne, Mike Krainin, Mike Liu, Navin Sarma, Orly Liba, Peter Hedman, Rocky Cai, Ruirui Jiang, Steven Hickson, Tracy Gu, Tyler Zhu, Varun Jampani, Yuan Hao, Zhongli Ding.
