Facebook Fellow Spotlight: Empowering women in rural communities through research in HCI

Each year, PhD students from around the world apply for the Facebook Fellowship, a program designed to encourage and support doctoral students engaged in innovative and relevant research in areas related to computer science and engineering at an accredited university.

As a continuation of our Fellowship spotlight series, we’re highlighting 2020 Facebook Fellow Sharifa Sultana.

Sharifa is a PhD candidate in Information Science at Cornell University. Her work focuses on human-computer interaction (HCI) and information and communication technologies for development (ICTD) from a critical computing and feminist HCI perspective.

Raised in Jessore, Bangladesh, Sharifa noticed that women were underrepresented in STEM education and other professions around the world, particularly in Bangladesh. Because of this underrepresentation, many women in rural communities have difficulties accessing, trusting, and using technology. This inspired Sharifa to work towards creating a more inclusive environment in which women would feel empowered to use technology, and where technology could, in turn, help fight the oppression of women in her home country.

“My research asks the questions, ‘Why is tech not working for rural Bangladeshi women? How can we fight against oppression using tech?’” she says. Sharifa’s approach explores how women interact with technology in rural communities in an effort to develop and implement solutions that address their critical needs.

One of these needs is combating gender harassment. “In Bangladesh, women are often harassed by colleagues, friends, family members — people who they want to trust,” she says. “Yet it is often difficult for them to seek legal help for many reasons.”

In order to empower women to counter harassment, Sharifa designed a digital tool – ‘Unmochon’ – to collect evidence of tech-based harassment through Facebook Messenger. Users can install and run it to collect image evidence of harassing messages and the harassers’ Facebook handles. This tool allows users to report the incident to the appropriate authorities and confirm the authenticity of the evidence.

Sharifa’s most recent research focuses on alternative rationalities in computing – namely, exploring how rural communities determine what information is true and how misinformation can prevent women from seeking healthcare. “The aim is to design tech that would actually help [women], that they would actually use,” Sharifa says.

Healthcare misinformation is a serious issue for rural communities in Bangladesh, especially during the COVID-19 pandemic. She hopes to develop technology that will give people access to reliable information and connect them with the healthcare they need.

Sharifa’s research has opened up a new discussion on how HCI design can be used to address online gender harassment and on how studying HCI can help close the gap between women and life-saving healthcare. Currently, Sharifa is in Bangladesh, collaborating on a local research project to determine what kind of technology and healthcare practices could benefit rural communities.

To learn more about Sharifa Sultana, visit her Fellowship profile.

The post Facebook Fellow Spotlight: Empowering women in rural communities through research in HCI appeared first on Facebook Research.


Orchestrate XGBoost ML Pipelines with Amazon Managed Workflows for Apache Airflow

The ability to scale machine learning operations (MLOps) across an enterprise is quickly becoming a competitive advantage in the modern economy. When firms started dabbling in ML, only the highest priority use cases were the focus. Businesses are now demanding more from ML practitioners: more intelligent features, delivered faster, and continually maintained over time. An effective MLOps strategy requires a unified platform that can orchestrate and automate complex data processing and ML tasks, and that integrates with the latest tooling to best complete those tasks.

This post demonstrates the value of using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate an ML pipeline using the popular XGBoost (eXtreme Gradient Boosting) algorithm. For more advanced and comprehensive MLOps capabilities, including a purpose-built model orchestration framework and a continuous integration and continuous delivery (CI/CD) service for ML, readers are encouraged to check out Amazon SageMaker Pipelines.

Why Airflow for orchestration

Customers choose Apache Airflow and specifically Amazon MWAA for several reasons, but three stand out:

  • Airflow is Python-based – Airflow, as a Python-based tool, enjoys the benefits of an imperative programming paradigm. This enables developers to programmatically define how tasks are to be done. Tools that are declarative, such as AWS Step Functions, only allow you to define what is to be done. When orchestrating ML pipelines, the ability to directly define the control flow is often required to navigate complex workflows.
  • Directed Acyclic Graph (DAG) workflow management – Airflow provides a DAG interface as a simple mechanism for defining and running complex workflows with dependencies. These DAG workflows are visualized through a GUI for operations management.
  • Extensibility – Airflow operators provide a structured way to perform common tasks using reusable modules. This capability is extensible and providers are free to develop custom Airflow operators that integrate with their tools and services. Many cloud-based services are supported. These operators provide useful abstraction, repeatability, and an API. In the context of big data and ML, these operators are especially valuable because they provide a way to orchestrate sometimes very long-running data pipelines or asynchronous ML processes such as model training.
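
To make the DAG idea concrete, here is a minimal, framework-free sketch of how a scheduler can derive a valid run order from task dependencies. It does not use Airflow at all; the `run_order` helper and the task names are hypothetical, chosen only to mirror the three-step pipeline built later in this post.

```python
from collections import deque


def run_order(dependencies):
    """Return one valid execution order for a DAG using Kahn's algorithm.

    `dependencies` maps each task to the list of tasks that must finish first.
    """
    # Count unmet upstream dependencies and record downstream tasks.
    pending = {task: len(upstream) for task, upstream in dependencies.items()}
    downstream = {task: [] for task in dependencies}
    for task, upstream in dependencies.items():
        for parent in upstream:
            downstream[parent].append(task)

    # Tasks with no unmet dependencies are ready to run.
    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream[task]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    # Any task left unscheduled means the graph contains a cycle.
    if len(order) != len(dependencies):
        raise ValueError("cycle detected: not a DAG")
    return order


pipeline = {
    "preprocess": [],
    "train": ["preprocess"],
    "endpoint-deploy": ["train"],
}
print(run_order(pipeline))  # ['preprocess', 'train', 'endpoint-deploy']
```

The "acyclic" requirement is what guarantees that such an order exists; a cycle leaves some task permanently waiting, which the sketch reports as an error.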

Set up an Amazon MWAA environment

To create your Amazon MWAA environment, complete the following steps:

  1. On the Amazon MWAA console, choose Create environment.
  2. For Name, enter a unique name.
  3. For Airflow version, choose the version to use. For this post, we use Airflow v2.0.2. We also include code for Airflow v1.10.12.

  4. In the DAG code in Amazon S3 section, specify the Amazon Simple Storage Service (Amazon S3) bucket where Amazon MWAA can find the DAGs, plugins.zip file, and requirements.txt file.

Airflow configuration for XGBoost

An XGBoost model requires a specific configuration in the Managed Airflow environment. The core.enable_xcom_pickling parameter must be set to True. The reason for this is the trained XGBoost model needs to be serialized in order to save it as a file in Amazon S3. Certain Python objects (like datetime) can’t be serialized without converting the Python object hierarchy into a byte stream through a process called pickling.
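
As a rough illustration of what pickling does, the sketch below round-trips a small object through a byte stream using only the Python standard library. The `model_like` dictionary is a made-up stand-in for an object passed between tasks, not a real XGBoost model.

```python
import json
import pickle
from datetime import datetime

# Hypothetical stand-in for an object passed between tasks via XCom.
model_like = {
    "name": "xgboost-demo",
    "trained_at": datetime(2021, 7, 1, 12, 0),
    "weights": [0.1, 0.2],
}

# JSON cannot represent datetime objects directly.
try:
    json.dumps(model_like)
    json_ok = True
except TypeError:
    json_ok = False

# Pickling turns the whole object hierarchy into a byte stream and back.
payload = pickle.dumps(model_like)
restored = pickle.loads(payload)

print(json_ok, type(payload).__name__, restored == model_like)  # False bytes True
```

Because unpickling can execute arbitrary code, pickled data should only be loaded from sources you trust.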

Requirements.txt file

Upload a requirements.txt file to the Amazon S3 location you specified in the Amazon MWAA setup. To support this demonstration, the requirements.txt file should have the following entries:


Orchestrate an XGBoost ML pipeline

Our ML pipeline is a simplified three-step pipeline:

  1. Data preprocessing using AWS Glue. Real pipelines could require numerous processing steps for data cleaning and feature engineering. Although Amazon SageMaker Pipelines provides similar functionality, we use AWS Glue to illustrate how different AWS services or third-party tools and services are orchestrated in a single pipeline.
  2. Train an XGBoost model using a SageMaker training job.
  3. Deploy the trained model as a real-time inference endpoint.

Solution architecture

Our ML pipeline is pictured in the following diagram. We use AWS Lambda to invoke DAGs with a Lambda function. We also use Amazon EventBridge to trigger Lambda functions. For more information, see Tutorial: Schedule AWS Lambda functions using EventBridge.

Stage the AWS Glue script

In our demo, we create the AWS Glue job dynamically using a PySpark script saved in Amazon S3. Copy the glue_etl.py file provided in the source code repo to an Amazon S3 location.

Set DAG configuration values

To keep things simple, we use a config.py file to import any environment-specific configurations rather than define it in the main DAG script. You can view the config.py file in its entirety on GitHub. A best practice is to use AWS Secrets Manager to store configuration and secrets information (as of this writing, AWS Systems Manager Parameter Store isn’t a supported backend on Amazon MWAA). Detailed documentation on how to securely store secrets in AWS Secrets Manager for Amazon MWAA is available here.

Upload the updated config.py file to the DAG directory.
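
For orientation only, a config.py for this pipeline might look roughly like the following. The variable names are the ones referenced by the DAG code in this post; every value is a placeholder assumption and not the contents of the file on GitHub.

```python
# config.py (illustrative sketch; all values below are placeholders)

REGION_NAME = "us-east-1"

# Location of the PySpark script used to create the AWS Glue job
GLUE_JOB_SCRIPT_S3_BUCKET = "my-example-bucket"
GLUE_JOB_SCRIPT_S3_KEY = "scripts/glue_etl.py"

# Raw input and processed output locations for the Glue job
DATA_S3_SOURCE = "s3://my-example-bucket/raw/customer-churn.csv"
DATA_S3_DEST = "s3://my-example-bucket/processed/"

# SageMaker settings
SAGEMAKER_ROLE_NAME = "my-example-sagemaker-role"
SAGEMAKER_TRAINING_DATA_S3_SOURCE = DATA_S3_DEST + "train/"
SAGEMAKER_VALIDATION_DATA_S3_SOURCE = DATA_S3_DEST + "validation/"
SAGEMAKER_CONTENT_TYPE = "text/csv"
```

Keeping these values in one module means the DAG file itself does not need to change between environments.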

Stage the customer churn training data

The customer churn dataset is mentioned in the book Discovering Knowledge in Data by Daniel T. Larose. It’s attributed by the author to the University of California Irvine Repository of Machine Learning Datasets. The dataset is publicly available and provided in the GitHub repo.

Upload the customer-churn.csv file to the Amazon S3 location you specified in the config.py file.

Construct the DAG

For our demonstration, the DAG consists of four primary sections:

  • Import statements
  • DAG operator configuration
  • DAG task definitions
  • DAG task dependency definition

Import statements

Because Airflow is Python-based, the DAG file is a simple Python file and the modules for Airflow are imported just as they would be for any Python application.

Some services have native Airflow operators available that manage asynchronous API calls and polling to determine success or failure of orchestrated tasks. We recommend using native operators wherever possible. AWS services that don’t have native Airflow operators, like AWS Glue, can still be orchestrated in Airflow using AWS SDKs called from the general PythonOperator.

For nearly all AWS services, the AWS SDK for Python (Boto3) provides service-level access to the APIs. This SDK provides a high degree of control, but also a lower level of abstraction. For ML pipelines using SageMaker, you can use the SageMaker Python SDK. This is a streamlined SDK abstracted specifically for ML experimentation.

The following import statements include general Airflow modules and operators, native Airflow operators for SageMaker, and the Boto3 and SageMaker SDKs:

# Airflow Operators
import airflow
from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.python_operator import PythonOperator

# Airflow Sagemaker Operators
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator
from airflow.providers.amazon.aws.operators.sagemaker_endpoint import SageMakerEndpointOperator
from airflow.providers.amazon.aws.hooks.base_aws import AwsBaseHook

# AWS SDK for Python
import boto3

# Amazon SageMaker SDK
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator
from sagemaker.session import s3_input

# Airflow SageMaker Configuration
from sagemaker.workflow.airflow import training_config
from sagemaker.workflow.airflow import model_config_from_estimator
from sagemaker.workflow.airflow import deploy_config_from_estimator

# Configuration variables
import config

Other import statements are needed to support this demonstration; refer to the GitHub repo for the full code.

DAG operator configuration

The DAG and DAG tasks are defined based on the operators invoked to run each task.

For the AWS Glue task, we invoke the PythonOperator using the SDK for Python to create a client for AWS Glue. To keep the DAG code tidy, we abstract the AWS Glue client code in a helper function called preprocess_glue. We stage the glue_etl.py (referenced in the GitHub repo) in Amazon S3 so it can be loaded when the AWS Glue job is created. See the following code:

def preprocess_glue():
  """preprocess data using glue for etl"""

  # not best practice to hard code location
  glue_script_location = 's3://{}/{}'.format(config.GLUE_JOB_SCRIPT_S3_BUCKET, config.GLUE_JOB_SCRIPT_S3_KEY)
  glue_client = boto3.client('glue')

  # instantiate the Glue ETL job
  # NOTE: create_job requires Name and Role; the config attribute names used
  # for them here are assumptions (see the GitHub repo for the actual values)
  response = glue_client.create_job(
    Name=config.GLUE_JOB_NAME,
    Description='PySpark job to extract the data and split into training and validation data sets',
    Role=config.GLUE_ROLE_NAME,
    ExecutionProperty={
      'MaxConcurrentRuns': 2
    },
    Command={
      'Name': 'glueetl',
      'ScriptLocation': glue_script_location,
      'PythonVersion': '3'
    },
    DefaultArguments={
      '--job-language': 'python'
    }
  )

  # execute the previously instantiated Glue ETL job
  response = glue_client.start_job_run(
    JobName=response['Name'],
    Arguments={
      '--S3_SOURCE': config.DATA_S3_SOURCE,
      '--S3_DEST': config.DATA_S3_DEST,
      '--TRAIN_KEY': 'train/',
      '--VAL_KEY': 'validation/'
    }
  )

We create a helper function that returns the ARN of the SageMaker role:

def get_sagemaker_role_arn(role_name, region_name):
    iam = boto3.client("iam", region_name=region_name)
    response = iam.get_role(RoleName=role_name)
    return response["Role"]["Arn"]

The XGBoost estimator requires the SageMaker role, container image, and hyperparameters, which we collect using a hook into SageMaker:

hook = AwsBaseHook(aws_conn_id="airflow-sagemaker", client_type="sagemaker")
sess = hook.get_session(region_name=config.REGION_NAME)
sagemaker_role = get_sagemaker_role_arn(config.SAGEMAKER_ROLE_NAME, config.REGION_NAME)
container = get_image_uri(sess.region_name, "xgboost")
hyperparameters = {

With the parameters defined, we can create the estimator object:

xgb_estimator = Estimator(

This estimator object is an input parameter into the training configuration. We need to define other training parameters:

# create unique name with guid

# define S3 locations for training & validation data processed using Glue
sagemaker_training_data = s3_input(config.SAGEMAKER_TRAINING_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)
sagemaker_validation_data = s3_input(config.SAGEMAKER_VALIDATION_DATA_S3_SOURCE, content_type=config.SAGEMAKER_CONTENT_TYPE)

sagemaker_training_inputs = {
  'train': sagemaker_training_data,
  'validation': sagemaker_validation_data
}

Let’s take a closer look at the arguments for sagemaker_training_inputs. The XGBoost algorithm supports both LIBSVM and CSV text formats for training and validation datasets. However, LIBSVM is supported by default. This means that we must specify CSV explicitly so XGBoost interprets our data correctly. The content type is set as text/csv in our custom DAG configuration file. We use CSV because it’s the most common data file format familiar to all ML practitioners.
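
For CSV training data, the built-in SageMaker XGBoost algorithm expects files without a header row and with the label in the first column. The sketch below shows one way to produce that layout with the standard library; the `to_training_csv` helper and the column names are hypothetical.

```python
import csv
import io


def to_training_csv(records, label_key):
    """Serialize dict records as header-less CSV with the label in column 0."""
    feature_keys = [k for k in records[0] if k != label_key]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Label first, then the features in a fixed order.
        writer.writerow([record[label_key]] + [record[k] for k in feature_keys])
    return buffer.getvalue()


# Made-up records with hypothetical column names.
records = [
    {"day_mins": 265.1, "churn": 0, "cust_serv_calls": 1},
    {"day_mins": 129.1, "churn": 1, "cust_serv_calls": 4},
]
print(to_training_csv(records, label_key="churn"))
```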

With these parameters defined, we can create the training config object:

training_config = training_config(

For native Airflow SageMaker operators, you can construct and reference well-defined configuration objects when invoking the operators.

The next configuration definition is for the SageMaker endpoint:

# create unique name using guid

For this simple pipeline, we use the deploy_config_from_estimator API option in the SageMaker SDK to export an Airflow deploy config directly from the SageMaker XGBoost estimator (the endpoint_name parameter must be 63 characters or less):

endpoint_config = deploy_config_from_estimator(

For more information about how we set up the model training and deployment configuration, including how we used the SageMaker SDK sagemaker.workflow.airflow APIs, see the GitHub repo.
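
One simple way to build a unique endpoint name that respects the 63-character limit is sketched below. The `unique_name` helper and the example prefix are assumptions for illustration, not the code used in the GitHub repo.

```python
import uuid

MAX_ENDPOINT_NAME_LENGTH = 63  # limit noted in the text above


def unique_name(prefix, max_length=MAX_ENDPOINT_NAME_LENGTH):
    """Append a random GUID fragment to `prefix`, staying within `max_length`."""
    suffix = uuid.uuid4().hex[:8]
    # Trim the prefix rather than the suffix so that names stay unique.
    trimmed = prefix[: max_length - len(suffix) - 1].rstrip("-")
    return f"{trimmed}-{suffix}"


name = unique_name("xgboost-customer-churn-endpoint")
print(name, len(name))
```

Trimming the prefix instead of the random suffix keeps repeated pipeline runs from colliding on the same name.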

With the operator configuration complete, we’re ready to put it all together to define our DAG.

DAG task definitions

For the XGBoost model training task, we invoke the SageMakerTrainingOperator. For the endpoint deployment task, we invoke the SageMakerEndpointOperator. It’s important to note the separation of concerns: we create a model using the SageMakerModelOperator but configure the SageMaker endpoint using the SageMakerEndpointConfigOperator. This provides added granular control over the creation and deployment of the model. See the following code:

args = {"owner": "airflow", "start_date": airflow.utils.dates.days_ago(2), 'depends_on_past': False}

with DAG(
) as dag:
    process_task = PythonOperator(

    train_task = SageMakerTrainingOperator(
      task_id = "train",
      config = training_config,
      aws_conn_id = "airflow-sagemaker",
      wait_for_completion = True,
      check_interval = 60, #check status of the job every minute
      max_ingestion_time = None, #allow training job to run as long as it needs, change for early stop
    )

    endpoint_deploy_task = SageMakerEndpointOperator(
      task_id = "endpoint-deploy",
      config = endpoint_config,
      aws_conn_id = "airflow-sagemaker", #same connection ID as the training task
      wait_for_completion = True,
      check_interval = 60, #check status of endpoint deployment every minute
      max_ingestion_time = None,
      operation = 'create', #change to update if you are updating rather than creating an endpoint
    )

DAG task dependency definition

After we define the tasks, we set the dependencies of the tasks. Airflow overloads the Python right bitshift operator (>>) to define downstream dependencies and the left bitshift operator (<<) to define upstream dependencies. In our example, we only define downstream dependencies:

# set the dependencies between tasks
process_task >> train_task >> endpoint_deploy_task
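
The chaining syntax works through ordinary Python operator overloading. The toy class below is a simplified sketch of the idea and not Airflow's actual implementation; the `Task` class and its attributes are assumptions for illustration.

```python
class Task:
    """Toy task that records dependencies the way >> and << read in a DAG file."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []
        self.downstream = []

    def __rshift__(self, other):
        # self >> other: `other` runs after `self`
        self.downstream.append(other)
        other.upstream.append(self)
        return other  # returning `other` lets a >> b >> c chain left to right

    def __lshift__(self, other):
        # self << other: `other` runs before `self`
        other >> self
        return other


process, train, deploy = Task("process"), Task("train"), Task("endpoint-deploy")
process >> train >> deploy
print([t.task_id for t in deploy.upstream])  # ['train']
```

Because each `>>` returns its right-hand operand, a chain such as `process >> train >> deploy` links each task only to its immediate neighbor.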

When the completed DAG is uploaded to the designated Amazon S3 location, Amazon MWAA automatically ingests the DAG. The graph view visually shows the task dependencies. You can trigger the DAG manually from the console during iterative testing, or as we described earlier, from an external source such as EventBridge and a Lambda function. Each task is highlighted depending on the stage of completion, as shown in the following screenshot. Dark green indicates successful completion of the task.

Test the deployed endpoint

After the endpoint-deploy task is complete, we can view the endpoint on the SageMaker console. The SageMaker endpoint is a real-time inference endpoint. SageMaker takes care of deploying, hosting, and exposing the HTTPS endpoint.

We can test the deployed endpoint with a SageMaker notebook.

Follow these steps to set up a SageMaker notebook environment:

  1. Launch a SageMaker notebook instance.
  2. On the Notebook instances page, open your notebook instance by choosing either Open JupyterLab for the JupyterLab interface or Open Jupyter for the classic Jupyter view.
  3. Choose Upload to import the test notebook available in the GitHub repo.

Prepare a test sample

We use Pandas DataFrames to create a test dataset out of the customer churn dataset that was used for training. For the test dataset, we must drop the label column, which is the first column. We also take a random sample of the dataset using the Pandas DataFrame sample method.

Request inferences

Now that we have our sampled test data, we use the Boto3 library to create a SageMaker runtime client. We use the client when we invoke our endpoint, pass it test data, and receive an inference value.
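
A minimal sketch of this step follows. The `to_csv_payload` and `invoke` helpers and the sample rows are assumptions for illustration; `invoke_endpoint` is the Boto3 SageMaker runtime call, and it needs AWS credentials and a deployed endpoint, so it is only defined here and not called.

```python
def to_csv_payload(rows, label_index=0):
    """Drop the label column from each row and join the rest as CSV lines."""
    lines = []
    for row in rows:
        features = [str(v) for i, v in enumerate(row) if i != label_index]
        lines.append(",".join(features))
    return "\n".join(lines)


def invoke(endpoint_name, payload):
    """Send a CSV payload to a real-time endpoint and return the response text."""
    import boto3  # imported here so the helper above runs without the AWS SDK

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
    )
    return response["Body"].read().decode("utf-8")


# Made-up rows: label first, then features.
sample_rows = [[0, 265.1, 1], [1, 129.1, 4]]
print(to_csv_payload(sample_rows))
```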


You can use Amazon MWAA to orchestrate and automate complex ML pipelines from the data processing stage through model training and endpoint deployment. You can set special configuration options in the Amazon MWAA environment to support popular ML frameworks like XGBoost.

In this post, we demonstrated how to dynamically create and run an AWS Glue job to preprocess training and validation data. We showed how to construct the DAG to support this ML pipeline, including the import statements, the DAG operator configuration, the DAG task definitions, and the DAG dependency definition. We demonstrated the difference between using native Airflow operators vs. invoking AWS SDK API calls from a generic PythonOperator.

Amazon MWAA is a highly versatile orchestration tool that enterprises can use to operationalize and scale their ML capabilities.

About the authors

Justin Leto is a Sr. Solutions Architect at Amazon Web Services with specialization in big data analytics and machine learning. His passion is helping customers achieve better cloud adoption. In his spare time, he enjoys offshore sailing and playing jazz piano. He lives in Manhattan with his wife Veera.



David Ehrlich is a Machine Learning Specialist at Amazon Web Services. He is passionate about helping customers unlock the true potential of their data. In his spare time, he enjoys exploring the different neighborhoods in New York City, going to comedy clubs, and traveling.




Shreyas Subramanian is an AI/ML Specialist Solutions Architect who helps customers use machine learning to solve their business challenges with AWS services.


Introducing Triton: Open-Source GPU Programming for Neural Networks


We’re releasing Triton 1.0, an open-source Python-like programming language which enables researchers with no CUDA experience to write highly efficient GPU code—most of the time on par with what an expert would be able to produce. Triton makes it possible to reach peak hardware performance with relatively little effort; for example, it can be used to write FP16 matrix multiplication kernels that match the performance of cuBLAS—something that many GPU programmers can’t do—in under 25 lines of code. Our researchers have already used it to produce kernels that are up to 2x more efficient than equivalent Torch implementations, and we’re excited to work with the community to make GPU programming more accessible to everyone.

Novel research ideas in the field of Deep Learning are generally implemented using a combination of native framework operators. While convenient, this approach often requires the creation (and/or movement) of many temporary tensors, which can hurt the performance of neural networks at scale. These issues can be mitigated by writing specialized GPU kernels, but doing so can be surprisingly difficult due to the many intricacies of GPU programming. And, although a variety of systems have recently emerged to make this process easier, we have found them to be too verbose, to lack flexibility, or to generate code noticeably slower than our hand-tuned baselines. This has led us to extend and improve Triton, a recent language and compiler whose original creator now works at OpenAI.

The Challenges of GPU Programming

The architecture of modern GPUs can be roughly divided into three major components—DRAM, SRAM and ALUs—each of which must be considered when optimizing CUDA code:

  • Memory transfers from DRAM must be coalesced into large transactions to leverage the large bus width of modern memory interfaces.
  • Data must be manually stashed to SRAM prior to being re-used, and managed so as to minimize shared memory bank conflicts upon retrieval.
  • Computations must be partitioned and scheduled carefully, both across and within Streaming Multiprocessors (SMs), so as to promote instruction/thread-level parallelism and leverage special-purpose ALUs (e.g., tensor cores).
Basic architecture of a GPU.

Reasoning about all these factors can be challenging, even for seasoned CUDA programmers with many years of experience. The purpose of Triton is to fully automate these optimizations, so that developers can better focus on the high-level logic of their parallel code. Triton aims to be broadly applicable, and therefore does not automatically schedule work across SMs — leaving some important algorithmic considerations (e.g. tiling, inter-SM synchronization) to the discretion of developers.

                            CUDA      Triton
Memory Coalescing           Manual    Automatic
Shared Memory Management    Manual    Automatic
Scheduling (Within SMs)     Manual    Automatic
Scheduling (Across SMs)     Manual    Manual
Compiler optimizations in CUDA vs Triton.

Programming Model

Out of all the Domain Specific Languages and JIT-compilers available, Triton is perhaps most similar to Numba: kernels are defined as decorated Python functions, and launched concurrently with different program_id’s on a grid of so-called instances. However, as shown in the code snippet below, the resemblance stops there: Triton exposes intra-instance parallelism via operations on blocks—small arrays whose dimensions are powers of two—rather than a Single Instruction, Multiple Thread (SIMT) execution model. In doing so, Triton effectively abstracts away all the issues related to concurrency within CUDA thread blocks (e.g., memory coalescing, shared memory synchronization/conflicts, tensor core scheduling).

BLOCK = 512

# This is a GPU kernel in Numba.
# Different instances of this
# function may run in parallel.
@jit
def add(X, Y, Z, N):
   # In Numba/CUDA, each kernel
   # instance itself uses an SIMT execution
   # model, where instructions are executed in
   # parallel for different values of threadIdx
   tid = threadIdx.x
   bid = blockIdx.x
   # scalar index
   idx = bid * BLOCK + tid
   if idx < N:
     # There is no pointer in Numba.
     # Z,X,Y are dense tensors
     Z[idx] = X[idx] + Y[idx]

grid = (ceil_div(N, BLOCK),)
block = (BLOCK,)
add[grid, block](x, y, z, x.shape[0])

BLOCK = 512

# This is a GPU kernel in Triton.
# Different instances of this
# function may run in parallel.
@jit
def add(X, Y, Z, N):
   # In Triton, each kernel instance
   # executes block operations on a
   # single thread: there is no construct
   # analogous to threadIdx
   pid = program_id(0)
   # block of indices
   idx = pid * BLOCK + arange(BLOCK)
   mask = idx < N
   # Triton uses pointer arithmetic
   # rather than indexing operators
   x = load(X + idx, mask=mask)
   y = load(Y + idx, mask=mask)
   store(Z + idx, x + y, mask=mask)

grid = (ceil_div(N, BLOCK),)
# no thread-block
add[grid](x, y, z, x.shape[0])
Vector addition in Triton.

While this may not be particularly helpful for embarrassingly parallel (i.e., element-wise) computations, it can greatly simplify the development of more complex GPU programs.
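
To see the block-based model without a GPU, the following plain-Python sketch emulates what each program instance of the vector addition does: it builds a block of indices, masks the out-of-range ones, and performs masked loads and stores. It is an emulation under simplifying assumptions (lists instead of device pointers, instances run one after another), not Triton code.

```python
def ceil_div(a, b):
    return (a + b - 1) // b


def add_instance(pid, X, Y, Z, N, BLOCK):
    """Emulate one program instance operating on a whole block of indices."""
    idx = [pid * BLOCK + i for i in range(BLOCK)]       # block of indices
    mask = [i < N for i in idx]                         # guards the ragged last block
    x = [X[i] if m else 0 for i, m in zip(idx, mask)]   # masked load
    y = [Y[i] if m else 0 for i, m in zip(idx, mask)]   # masked load
    for i, m, value in zip(idx, mask, (a + b for a, b in zip(x, y))):
        if m:
            Z[i] = value                                # masked store


N, BLOCK = 10, 4
X = list(range(N))
Y = [10 * v for v in X]
Z = [None] * N
for pid in range(ceil_div(N, BLOCK)):  # on a GPU these instances may run in parallel
    add_instance(pid, X, Y, Z, N, BLOCK)
print(Z)
```

With N = 10 and BLOCK = 4, three instances are launched and the last one has two of its four lanes masked off.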

Consider for example the case of a fused softmax kernel (below) in which each instance normalizes a different row of the given input tensor $X \in \mathbb{R}^{M \times N}$. Standard CUDA implementations of this parallelization strategy can be challenging to write, requiring explicit synchronization between threads as they concurrently reduce the same row of $X$. Most of this complexity goes away with Triton, where each kernel instance loads the row of interest and normalizes it sequentially using NumPy-like primitives.

import triton
import triton.language as tl

@triton.jit
def softmax(Y, stride_ym, stride_yn, X, stride_xm, stride_xn, M, N):
    # row index
    m = tl.program_id(0)
    # col indices
    # this specific kernel only works for matrices that 
    # have less than BLOCK_SIZE columns
    BLOCK_SIZE = 1024
    n = tl.arange(0, BLOCK_SIZE)
    # the memory address of all the elements
    # that we want to load can be computed as follows
    X = X + m * stride_xm + n * stride_xn
    # load input data; pad out-of-bounds elements with -inf
    x = tl.load(X, mask=n < N, other=-float('inf'))
    # compute numerically-stable softmax
    z = x - tl.max(x, axis=0)
    num = tl.exp(z)
    denom = tl.sum(num, axis=0)
    y = num / denom
    # write back to Y
    Y = Y + m * stride_ym + n * stride_yn
    tl.store(Y, y, mask=n < N)

import torch
# Allocate input/output tensors
X = torch.normal(0, 1, size=(583, 931), device='cuda')
Y = torch.empty_like(X)
# SPMD launch grid
grid = (X.shape[0], )
# enqueue GPU kernel
softmax[grid](Y, Y.stride(0), Y.stride(1), 
              X, X.stride(0), X.stride(1),
              X.shape[0]    , X.shape[1])
Fused softmax in Triton.

Note that the Triton JIT treats X and Y as pointers rather than tensors; we felt like retaining low-level control of memory accesses was important to address more complex data structures (e.g., block-sparse tensors).
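
The softmax kernel above pads out-of-bounds columns with negative infinity before reducing. The plain-Python sketch below, which uses a hypothetical `softmax_row` helper, illustrates why that padding is harmless: exp(-inf) is exactly zero, so padded lanes contribute nothing to the denominator.

```python
import math


def softmax_row(row, block_size):
    """Numerically stable softmax over one row padded to `block_size` with -inf."""
    padded = row + [-math.inf] * (block_size - len(row))  # mirrors other=-inf
    row_max = max(padded)
    num = [math.exp(v - row_max) for v in padded]          # exp(-inf) == 0.0
    denom = sum(num)
    return [n / denom for n in num][: len(row)]            # mirrors the masked store


row = [1000.0, 1001.0, 1002.0]  # math.exp(1000.0) alone would overflow
out = softmax_row(row, block_size=8)
print(out, sum(out))
```

Subtracting the row maximum before exponentiating is what keeps the computation stable for large inputs.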

Importantly, this particular implementation of softmax keeps the rows of $X$ in SRAM throughout the entire normalization process, which maximizes data reuse when applicable (roughly fewer than 32K columns). This differs from PyTorch’s internal CUDA code, whose use of temporary memory makes it more general but significantly slower (below). The bottom line here is not that Triton is inherently better, but that it simplifies the development of specialized kernels that can be much faster than those found in general-purpose libraries.

A100 performance of fused softmax for M=4096.

The lower performance of the Torch (v1.9) JIT highlights the difficulty of automatic CUDA code generation from sequences of high-level tensor operations.

import torch

@torch.jit.script
def softmax(x):
    x_max = x.max(dim=1)[0]
    z = x - x_max[:, None]
    # exponentiate the shifted values for numerical stability
    numerator = torch.exp(z)
    denominator = numerator.sum(dim=1)
    return numerator / denominator[:, None]
Fused softmax with the Torch JIT.

Matrix Multiplication

Being able to write fused kernels for element-wise operations and reductions is important, but not sufficient given the prominence of matrix multiplication tasks in neural networks. As it turns out, Triton also works very well for those, achieving peak performance with just ~25 lines of Python code. On the other hand, implementing something similar in CUDA would take a lot more effort and would even be likely to achieve lower performance.

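@triton.jit
def matmul(A, B, C, M, N, K, stride_am, stride_ak, 
            stride_bk, stride_bn, stride_cm, stride_cn,
            **META):
    # extract metaparameters
    BLOCK_M, GROUP_M = META['BLOCK_M'], META['GROUP_M']
    BLOCK_N = META['BLOCK_N']
    BLOCK_K = META['BLOCK_K']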
    # programs are grouped together to improve L2 hit rate
    _pid_m = tl.program_id(0)
    _pid_n = tl.program_id(1)
    pid_m = _pid_m // GROUP_M
    pid_n = (_pid_n * GROUP_M) + (_pid_m % GROUP_M)
    # rm (resp. rn) denotes a range of indices
    # for rows (resp. col) of C
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    # rk denotes a range of indices for columns 
    # (resp. rows) of A (resp. B)
    rk = tl.arange(0, BLOCK_K)
    # the memory addresses of elements in the first block of
    # A and B can be computed using numpy-style broadcasting
    A = A + (rm[:, None] * stride_am + rk[None, :] * stride_ak)
    B = B + (rk [:, None] * stride_bk  + rn[None, :] * stride_bn)
    # initialize and iteratively update accumulator
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(K, 0, -BLOCK_K):
        a = tl.load(A)
        b = tl.load(B)
        # block level matrix multiplication
        acc += tl.dot(a, b)
        # increment pointers so that the next blocks of A and B
        # are loaded during the next iteration
        A += BLOCK_K * stride_ak
        B += BLOCK_K * stride_bk
    # fuse leaky ReLU if desired
    # acc = tl.where(acc >= 0, acc, alpha * acc)
    # write back result
    C = C + (rm[:, None] * stride_cm + rn[None, :] * stride_cn)
    mask = (rm[:, None] < M) & (rn[None, :] < N)
    tl.store(C, acc, mask=mask)
Matrix multiplication in Triton.

One important advantage of handwritten matrix multiplication kernels is that they can be customized as desired to accommodate fused transformations of their inputs (e.g., slicing) and outputs (e.g., Leaky ReLU). Without a system like Triton, non-trivial modifications of matrix multiplication kernels would be out-of-reach for developers without exceptional GPU programming expertise.
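
The accumulation pattern in the kernel above, where each step multiplies one K-block of A by one K-block of B and adds the result to an accumulator, can be sketched in plain Python as follows. This is a CPU illustration under simplifying assumptions (nested lists, K divisible by BLOCK_K) and not a translation of the Triton kernel; `blocked_matmul` is a hypothetical helper.

```python
def blocked_matmul(A, B, BLOCK_K):
    """Multiply A (M x K) by B (K x N), accumulating over K in BLOCK_K-wide chunks."""
    M, K, N = len(A), len(B), len(B[0])
    assert K % BLOCK_K == 0, "sketch assumes K is a multiple of BLOCK_K"
    acc = [[0.0] * N for _ in range(M)]      # accumulator, like tl.zeros
    for k0 in range(0, K, BLOCK_K):          # one iteration per K-block
        for i in range(M):
            for j in range(N):
                # partial dot product over the current K-block
                acc[i][j] += sum(A[i][k] * B[k][j] for k in range(k0, k0 + BLOCK_K))
    # a fused epilogue (for example, leaky ReLU) would be applied to acc here
    return acc


A = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(blocked_matmul(A, B, BLOCK_K=2))  # [[4.0, 6.0], [12.0, 14.0]]
```

Because the accumulator is only written out once at the end, any element-wise transformation of the output can be applied before the store at no extra memory cost, which is the fusion opportunity described above.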

V100 tensor-core performance of matrix multiplication with appropriately tuned values for BLOCK$_M$, BLOCK$_N$, BLOCK$_K$, GROUP$_M$.

High-Level System Architecture

The good performance of Triton comes from a modular system architecture centered around Triton-IR, an LLVM-based intermediate representation in which multi-dimensional blocks of values are first-class citizens.

def add(X, Y, Z, N):
   pid = program_id(0)
   idx= pid * 512 + arange(512)
   mask = idx < N
   x = load(X + idx, mask=mask)
   y = load(Y + idx, mask=mask)
   store(Z + idx, x + y, mask=mask)
Introducing Triton: Open-Source GPU Programming for Neural Networks
def void add(i32* X .aligned(16) , i32* Y .aligned(16) , i32* Z .aligned(16) , i32 N .multipleof(2) )
  %0 = get_program_id[0] i32;
  %1 = mul i32 %0, 512;
  %3 = make_range[0 : 512] i32;
  %4 = splat i32 %1;
  %6 = add i32 %4, %3;
  %9 = splat i32 N;
  %11 = icmp_slt i1 %6, %9;
  %14 = splat i32* X;
  %16 = getelementptr i32* %14, %6;
  %19 = broadcast i1 %11;
  %21 = splat i32 undef;
  %22 = masked_load i32 %16, %19, %21;
  %26 = splat i32* Y;
  %28 = getelementptr i32* %26, %6;
  %31 = broadcast i1 %11;
  %33 = splat i32 undef;
  %34 = masked_load i32 %28, %31, %33;
  %38 = splat i32* Z;
  %40 = getelementptr i32* %38, %6;
  %43 = add i32 %22, %34;
  %46 = broadcast i32 %43;
  %48 = broadcast i1 %11;
  masked_store void %40, %46, %48;
  ret void;

PTX:
.visible .entry add(
    .param .u64 add_param_0, .param .u64 add_param_1,
    .param .u64 add_param_2, .param .u32 add_param_3
.maxntid 128, 1, 1
    .reg .pred     %p;
    .reg .b32     %r;
    .reg .b64     %rd;
    ld.param.u64     %rd4, [add_param_0];
    ld.param.u64     %rd5, [add_param_1];
    mov.u32     %r13, %tid.x;
    ld.param.u32     %r14, [add_param_3];
    shl.b32     %r15, %r13, 2;
    mov.u32     %r16, %ctaid.x;
    mad.lo.s32     %r17, %r16, 512, %r15;
    setp.ge.s32     %p3, %r17, %r14;
    setp.lt.s32     %p1, %r17, %r14;
    mul.wide.s32     %rd7, %r17, 4;
    add.s64     %rd2, %rd4, %rd7;
    @%p1 ld.global.cg.v4.b32 {%r5,%r6,%r7,%r8}, [ %rd2 + 0];
    add.s64     %rd3, %rd5, %rd7;
    @%p1 ld.global.cg.v4.b32 {%r9,%r10,%r11,%r12}, [ %rd3 + 0];
    @%p3 bra     LBB0_2;
    ld.param.u64     %rd6, [add_param_2];
    add.s64     %rd1, %rd6, %rd7;
    add.s32     %r1, %r5, %r9;
    add.s32     %r2, %r6, %r10;
    add.s32     %r3, %r7, %r11;
    add.s32     %r4, %r8, %r12;
    st.global.v4.u32     [%rd1], {%r1, %r2, %r3, %r4};

High-level architecture of Triton.

The @triton.jit decorator works by walking the Abstract Syntax Tree (AST) of the provided Python function so as to generate Triton-IR on-the-fly using a common SSA construction algorithm. The resulting IR code is then simplified, optimized and automatically parallelized by our compiler backend, before being converted into high-quality LLVM-IR—and eventually PTX—for execution on recent NVIDIA GPUs. CPUs and AMD GPUs are not supported at the moment, but we welcome community contributions aimed at addressing this limitation.
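The AST-walking step can be illustrated with Python's standard `ast` module. The sketch below is a toy, not Triton's actual frontend: it handles only names, constants, and three binary operators, and the helper name `lower_to_ssa` and the textual instruction format are assumptions made for illustration.

```python
import ast


def lower_to_ssa(source):
    """Turn the return expression of a one-function source string into SSA-style lines."""
    func = ast.parse(source).body[0]
    lines = []
    ops = {ast.Add: "add", ast.Mult: "mul", ast.Sub: "sub"}

    def emit(node):
        # Names and constants are already values, so no instruction is emitted.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp):
            lhs, rhs = emit(node.left), emit(node.right)
            name = f"%{len(lines)}"  # each result gets a fresh name, assigned once
            lines.append(f"{name} = {ops[type(node.op)]} {lhs}, {rhs}")
            return name
        raise NotImplementedError(type(node).__name__)

    result = emit(func.body[-1].value)
    lines.append(f"ret {result}")
    return lines


ir = lower_to_ssa("def f(x, y):\n    return x * 512 + y")
```

Each intermediate value is defined exactly once, which is the property that makes later analyses such as liveness straightforward.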

Compiler Backend

We have found that the use of blocked program representations via Triton-IR allows our compiler to automatically perform a wide variety of important program optimizations. For example, data can be automatically stashed to shared memory by looking at the operands of computationally intensive block-level operations (e.g., tl.dot)—and allocated/synchronized using standard liveness analysis techniques.


The Triton compiler allocates shared memory by analyzing the live range of block variables used in computationally intensive operations.
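One way to picture allocation from live ranges is a first-fit placement in which two buffers may share the same bytes only if their live ranges do not overlap. The sketch below assumes that scheme; the tuple layout, the buffer names, and the function `allocate_shared` are hypothetical, and the real compiler pass may work differently.

```python
def allocate_shared(buffers):
    """buffers: list of (name, size, first_use, last_use). Returns ({name: offset}, total)."""
    placed = []  # (offset, size, first_use, last_use) for buffers already placed
    offsets = {}
    for name, size, start, end in sorted(buffers, key=lambda b: b[2]):
        # Only buffers whose live ranges overlap this one constrain its placement.
        busy = sorted((o, s) for o, s, st, en in placed if st <= end and start <= en)
        offset = 0
        for o, s in busy:
            if offset + size <= o:
                break  # the buffer fits in the gap before this busy region
            offset = max(offset, o + s)
        placed.append((offset, size, start, end))
        offsets[name] = offset
    total = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, total


# "a" and "b" are live at the same time; "c" is only used after both are dead.
buffers = [("a", 1024, 0, 3), ("b", 1024, 1, 3), ("c", 2048, 4, 6)]
offsets, total = allocate_shared(buffers)
```

Here `c` reuses the space of `a` and `b`, so the total footprint is 2048 bytes rather than 4096.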

On the other hand, Triton programs can be efficiently and automatically parallelized both (1) across SMs by executing different kernel instances concurrently, and (2) within SMs by analyzing the iteration space of each block-level operation and partitioning it adequately across different SIMD units, as shown below.

Element-wise addition:
S1 float A[4,4] = ...
S2 float B[4,4] = ...
S3 float C[4,4] = A + B

FP16 matrix multiplication:
S1 half A[4,2] = ...
S2 half B[2,2] = ...
S3 float C[4,2] = dot(A,B)

Figure panels:
  1. Definition of a Triton program P composed of three statements S1, S2, S3
  2. Iteration space of S3
  3. Mapping of S3 onto a Streaming Multiprocessor (SM)
  4. Mapping of P onto the GPU

Automatic parallelization in Triton. Each block-level operation defines a blocked iteration space that is automatically parallelized to make use of the resources available on a Streaming Multiprocessor (SM).


We intend for Triton to become a community-driven project. Feel free to fork our repository on GitHub!

If you’re interested in joining our team and working on Triton & GPU kernels, we’re hiring!


Using AI to map Africa’s buildings

Between 2020 and 2050, Africa’s population is expected to double, adding 950 million more people to its urban areas alone. However, according to 2018 figures, a scarcity of affordable housing in many African cities has forced over half of the city dwellers in Sub-Saharan Africa to live in informal settlements. And in rural areas, many also occupy makeshift structures due to widespread poverty.

These shelters have remained largely undetectable using traditional monitoring tools. Machine learning, computer vision and remote sensing have come some way in recognizing buildings and roads, but when it comes to denser neighborhoods, it becomes much harder to distinguish small and makeshift buildings. 

Why is this an issue? Because when preparing a humanitarian response, forecasting transportation needs, or planning basic services, being able to accurately map the built environment – which allows us to ascertain population density – is absolutely key.

Enter Google’s Open Buildings

Google’s Open Buildings is a new open access dataset containing the locations and geometry of buildings across most of Africa. From Lagos’ Makoko settlement to Dadaab’s refugee camps, millions of previously invisible buildings have popped up in our dataset. This improved building data helps refine the understanding of where people and communities live, providing actionable information for state and non-state actors looking to provide services from sanitation to education and vaccination.

  • Aerial photograph: refugee settlement in Yumbe district, Uganda

  • Aerial photograph: rural dwellings in Sholtaka, Ethiopia

  • Aerial photograph: high-density urban area in Kinshasa, DRC

Open Buildings uses AI to provide a digital footprint of buildings. This includes producing polygons with the outlines of at least 500 million buildings across the African continent, the majority of which are less than 20 square meters. The full dataset encompasses 50 countries.

The data provides the exact location and polygon outline of each building, its size, a confidence score for it being detected as a valid building and a Plus Code. There is, however, no information about the type of building, its street address, or any identifying data. We have also excluded sensitive areas such as conflict zones to protect vulnerable populations.

Satellite mapping using AI 

The Open Buildings dataset was generated by using a model trained to detect buildings using satellite imagery from the African continent. The information for the buildings detected is then saved in CSV files which are available to download. The technical details of the Open Buildings dataset, including usage and tutorials, are available on the dataset website and the Google AI blog.
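As a rough sketch of working with such a CSV file, the snippet below reads rows with Python's `csv` module and keeps only detections above a confidence threshold. The column names are assumptions about the file layout, and the three sample rows (including the placeholder Plus Codes) are invented purely for illustration; consult the dataset website for the actual schema.

```python
import csv
import io

# Invented sample rows; column names are assumed, not taken from the dataset docs.
SAMPLE = """latitude,longitude,area_in_meters,confidence,full_plus_code
5.6037,-0.1870,18.5,0.91,EXAMPLE+01
5.6040,-0.1865,42.0,0.62,EXAMPLE+02
5.6051,-0.1859,12.3,0.78,EXAMPLE+03
"""


def load_buildings(text, min_confidence):
    """Parse CSV text and keep buildings whose confidence meets the threshold."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        if float(row["confidence"]) >= min_confidence:
            rows.append({
                "lat": float(row["latitude"]),
                "lng": float(row["longitude"]),
                "area": float(row["area_in_meters"]),
                "code": row["full_plus_code"],
            })
    return rows


kept = load_buildings(SAMPLE, min_confidence=0.75)
small = [b for b in kept if b["area"] < 20.0]  # footprints under 20 square meters
```

Raising `min_confidence` trades recall for precision, so the right threshold depends on the application.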

Animation showing landscape in Africa being mapped

How will this improve planning?

There are many important ways in which this data can be used, including — but not limited to — the following:

Population mapping: Building footprints are a key ingredient for estimating population density. This information is vital to planning for services for communities. 

Humanitarian response: Building footprints help responders plan for a flood, drought, or other natural disaster.

Environmental science: Knowledge of settlement density is useful for understanding the human impact on the natural environment. 

Addressing systems: In many areas, buildings do not have formal addresses. This can make it difficult for people to access social benefits and economic opportunities. Building footprint data can help with the rollout of digital addressing systems such as Plus Codes.

Vaccination planning: Knowing the density of population and settlements helps to anticipate demand for vaccines and the best locations for facilities. This data is also useful for precision epidemiology, as well as prevention efforts such as mosquito net distribution.

Statistical indicators: Buildings data can be used to help calculate statistical indicators for national planning, such as the numbers of houses in the catchment areas of schools and health centers, mean travel distances to the nearest hospital or demand forecast for transportation systems.

Google’s AI Center in Accra

This project was led by our team at the AI Research Center in Accra, Ghana. The center was launched in 2019 to bring together top machine learning researchers and engineers dedicated to AI research and its applications. The research team has already been improving Google Maps with AI, adding 120 million buildings and 228,000 km of roads across Africa to Maps in the last year. This work is part of our broader AI for Social Good efforts.


Mapping Africa’s Buildings with Satellite Imagery

Posted by John Quinn, Software Engineer, Google Research, Ghana

An accurate record of building footprints is important for a range of applications, from population estimation and urban planning to humanitarian response and environmental science. After a disaster, such as a flood or an earthquake, authorities need to estimate how many households have been affected. Ideally there would be up-to-date census information for this, but in practice such records may be out of date or unavailable. Instead, data on the locations and density of buildings can be a valuable alternative source of information.

A good way to collect such data is through satellite imagery, which can map the distribution of buildings across the world, particularly in areas that are isolated or difficult to access. However, detecting buildings with computer vision methods in some environments can be a challenging task. Because satellite imaging involves photographing the earth from several hundred kilometres above the ground, even at high resolution (30–50 cm per pixel), a small building or tent shelter occupies only a few pixels. The task is even more difficult for informal settlements, or rural areas where buildings constructed with natural materials can visually blend into the surroundings. There are also many types of natural and artificial features that can be easily confused with buildings in overhead imagery.

Objects that can confuse computer vision models for building identification (clockwise from top left): pools, rocks, enclosure walls and shipping containers.

In “Continental-Scale Building Detection from High-Resolution Satellite Imagery”, we address these challenges, using new methods for detecting buildings that work in rural and urban settings across different terrains, such as savannah, desert, and forest, as well as informal settlements and refugee facilities. We use this building detection model to create the Open Buildings dataset, a new open-access data resource containing the locations and footprints of 516 million buildings with coverage across most of the African continent. The dataset will support several practical, scientific and humanitarian applications, ranging from disaster response or population mapping to planning services such as new medical facilities or studying human impact on the natural environment.

Model Development
We built a training dataset for the building detection model by manually labelling 1.75 million buildings in 100k images. The figure below shows some examples of how we labelled images in the training data, taking into account confounding characteristics of different areas across the African continent. In rural areas, for example, it was necessary to identify different types of dwelling places and to disambiguate them from natural features, while in urban areas we needed to develop labelling policies for dense and contiguous structures.

(1) Example of a compound containing both dwelling places as well as smaller outbuildings such as grain stores. (2) Example of a round, thatched-roof structure that can be difficult for a model to distinguish from trees, and where it is necessary to use cues from pathways, clearings and shadows to disambiguate. (3) Example of several contiguous buildings for which the boundaries cannot be easily distinguished.

We trained the model to detect buildings in a bottom-up way, first by classifying each pixel as building or non-building, and then grouping these pixels together into individual instances. The detection pipeline was based on the U-Net model, which is commonly used in satellite image analysis. One advantage of U-Net is that it is a relatively compact architecture, and so can be applied to large quantities of imaging data without a heavy compute burden. This is critical, because the final task of applying this to continental-scale satellite imagery means running the model on many billions of image tiles.

Example of segmenting buildings in satellite imagery. Left: Source image; Center: Semantic segmentation, with each pixel assigned a confidence score that it is a building vs. non-building; Right: Instance segmentation, obtained by thresholding and grouping together connected components.
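The thresholding and grouping step in the caption above can be sketched in a few lines of plain Python. This version assumes 4-connectivity and a fixed 0.5 threshold, neither of which is specified here; `label_instances` is an illustrative name.

```python
from collections import deque


def label_instances(scores, threshold=0.5):
    """Threshold a per-pixel score map and label 4-connected components from 1 upward."""
    h, w = len(scores), len(scores[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if scores[r][c] < threshold or labels[r][c]:
                continue  # background pixel, or already part of an instance
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one instance
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                            and scores[ny][nx] >= threshold):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.2, 0.1, 0.6],
    [0.0, 0.1, 0.3, 0.9],
]
labels, count = label_instances(scores)
```

Two buildings whose pixels touch would be merged into one instance by this step, which is why clean gaps in the semantic output matter.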

Initial experiments with the basic model had low precision and recall, for example due to the variety of natural and artificial features with building-like appearance. We found a number of methods that improved performance. One was the use of mixup as a regularisation method, where random training images are blended together by taking a weighted average. Though mixup was originally proposed for image classification, we modified it to be used for semantic segmentation. Regularisation is important in general for this building segmentation task, because even with 100k training images, the training data do not capture the full variation of terrain, atmospheric and lighting conditions that the model is presented with at test time, and hence, there is a tendency to overfit. This is mitigated by mixup as well as random augmentation of training images.
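A minimal sketch of mixup applied to a segmentation example follows. Blending the label masks with the same weight as the images is an assumption about how the method carries over to segmentation, and drawing the weight from a Beta distribution follows the usual mixup recipe rather than anything stated here. The function names are illustrative.

```python
import random


def sample_lambda(alpha=0.4):
    """Draw a mixing weight in [0, 1]; Beta(alpha, alpha) is the usual mixup choice."""
    return random.betavariate(alpha, alpha)


def mixup_pair(image_a, mask_a, image_b, mask_b, lam):
    """Blend two images and their per-pixel masks with the same weight lam."""
    def blend(p, q):
        return [[lam * x + (1 - lam) * y for x, y in zip(row_p, row_q)]
                for row_p, row_q in zip(p, q)]
    return blend(image_a, image_b), blend(mask_a, mask_b)


image, mask = mixup_pair([[0.0, 1.0]], [[1.0, 0.0]],
                         [[1.0, 1.0]], [[0.0, 0.0]], lam=0.25)
```

The blended mask holds soft targets between 0 and 1, so the loss must accept non-binary labels.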

Another method that we found to be effective was the use of unsupervised self-training. We prepared a set of 100 million satellite images from across Africa, and filtered these to a subset of 8.7 million images that mostly contained buildings. This dataset was used for self-training using the Noisy Student method, in which the output of the best building detection model from the previous stage is used as a ‘teacher’ to then train a ‘student’ model that makes similar predictions from augmented images. In practice, we found that this reduced false positives and sharpened the detection output. The student model gave higher confidence to buildings and lower confidence to background.

Difference in model output between the student and teacher models for a typical image. In panel (d), red areas are those that the student model finds more likely to be buildings than the teacher model, and blue areas more likely to be background.

One problem that we faced initially was that our model had a tendency to create “blobby” detections, without clearly delineated edges and with a tendency for neighbouring buildings to be merged together. To address this, we applied another idea from the original U-Net paper, which is to use distance weighting to adapt the loss function to emphasise the importance of making correct predictions near boundaries. During training, distance weighting places greater emphasis at the edges by adding weight to the loss — particularly where there are instances that nearly touch. For building detection, this encourages the model to correctly identify the gaps in between buildings, which is important so that many close structures are not merged together. We found that the original U-Net distance weighting formulation was helpful but slow to compute. So, we developed an alternative based on Gaussian convolution of edges, which was both faster and more effective.

Distance weighting schemes to emphasise nearby edges: U-Net (left) and Gaussian convolution of edges (right).
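The idea of weighting by Gaussian convolution of edges can be sketched as follows: mark building pixels that border background, blur that edge map with a Gaussian, and add the result to a base weight of 1. The exact formulation is in the technical report; the `sigma`, `radius`, and `scale` values and the function names below are assumptions chosen for illustration.

```python
import math


def edge_map(mask):
    """Mark building pixels that have at least one 4-connected background neighbour."""
    h, w = len(mask), len(mask[0])
    edges = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx]:
                    edges[y][x] = 1.0
    return edges


def gaussian_edge_weights(mask, sigma=1.0, radius=2, scale=5.0):
    """Per-pixel loss weights: 1 plus a Gaussian-blurred edge map times scale."""
    edges = edge_map(mask)
    h, w = len(mask), len(mask[0])
    weights = [[1.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += edges[ny][nx] * math.exp(
                            -(dy * dy + dx * dx) / (2 * sigma * sigma))
            weights[y][x] += scale * total
    return weights


# Two buildings separated by a one-pixel gap at x=2, then open background.
mask = [[1, 1, 0, 1, 1, 0, 0, 0, 0]]
weights = gaussian_edge_weights(mask)
```

The gap pixel sits next to edges on both sides, so it receives a larger weight than background far from any building.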

Our technical report has more details on each of these methods.

We evaluated the performance of the model on several different regions across the continent, in different categories: urban, rural, and medium-density. In addition, with the goal of preparing for potential humanitarian applications, we tested the model on regions with displaced persons and refugee settlements. Precision and recall did vary between regions, so achieving consistent performance across the continent is an ongoing challenge.

Precision-recall curves, measured at 0.5 intersection-over-union threshold.
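To show what a 0.5 intersection-over-union threshold means in practice, the sketch below matches detections to ground-truth buildings and computes one precision and recall point. It uses axis-aligned boxes and greedy matching in confidence order for brevity; both are simplifying assumptions, and the actual evaluation protocol may differ.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def precision_recall(detections, truths, iou_threshold=0.5):
    """detections: list of (confidence, box). Each truth can be matched at most once."""
    unmatched = list(truths)
    true_positives = 0
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best = max(unmatched, key=lambda t: iou(box, t), default=None)
        if best is not None and iou(box, best) >= iou_threshold:
            unmatched.remove(best)
            true_positives += 1
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(truths) if truths else 0.0
    return precision, recall


truths = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60)), (0.4, (20, 20, 30, 26))]
precision, recall = precision_recall(detections, truths)
```

Sweeping a confidence cutoff over the detections and repeating this computation traces out a precision-recall curve.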

When visually inspecting the detections for low-scoring regions, we noted various causes. In rural areas, label errors were problematic. For example, single buildings within a mostly-empty area can be difficult for labellers to spot. In urban areas, the model had a tendency to split large buildings into separate instances. The model also underperformed in desert terrain, where buildings were hard to distinguish against the background.

We carried out an ablation study to understand which methods contributed most to the final performance, measured in mean average precision (mAP). Distance weighting, mixup and the use of ImageNet pre-training were the biggest factors for the performance of the supervised learning baseline. The ablated models that did not use these methods had a mAP difference of -0.33, -0.12 and -0.07 respectively. Unsupervised self-training gave a further significant boost of +0.06 mAP.

Ablation study of training methods. The first row shows the mAP performance of the best model combined with self-training, and the second row shows the best model with supervised learning only (the baseline). By disabling each training optimization from the baseline in turn, we observe the impact on mAP test performance. Distance weighting has the most significant effect.

Generating the Open Buildings Dataset
To create the final dataset, we applied our best building detection model to satellite imagery across the African continent (8.6 billion image tiles covering 19.4 million km², 64% of the continent), which resulted in the detection of 516M distinct structures.

Each building’s outline was simplified as a polygon and associated with a Plus Code, which is a geographic identifier made up of numbers and letters, akin to a street address, and useful for identifying buildings in areas that don’t have formal addressing systems. We also include confidence scores and guidance on suggested thresholds to achieve particular precision levels.
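For readers curious how a Plus Code encodes a location, here is a simplified sketch of the 10-digit encoding based on the publicly documented Open Location Code scheme. It skips edge cases such as the poles and longitude wrap-around, and it relies on floating-point arithmetic, so treat it as an illustration rather than a reference implementation. `encode_plus_code` is an illustrative name.

```python
import math

ALPHABET = "23456789CFGHJMPQRVWX"  # the 20 symbols used by Open Location Code


def encode_plus_code(latitude, longitude):
    """Simplified 10-digit Plus Code; no handling of the poles or longitude wrap-around."""
    # Shift to non-negative values, in units of 1/8000 degree (10-digit resolution).
    lat = math.floor((latitude + 90) * 8000)
    lng = math.floor((longitude + 180) * 8000)
    digits = []
    for _ in range(5):
        # Peel off base-20 digits, least significant pair first.
        digits.append(ALPHABET[lng % 20])
        digits.append(ALPHABET[lat % 20])
        lat //= 20
        lng //= 20
    code = "".join(reversed(digits))  # most significant pair first, latitude before longitude
    return code[:8] + "+" + code[8:]


origin_code = encode_plus_code(0.0, 0.0)
```

For latitude 0 and longitude 0 this sketch yields `6FG22222+22`. Each successive pair of characters narrows the cell by a factor of 20 in both directions.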

The sizes of the structures vary as shown below, tending towards small footprints. The inclusion of small structures is important, for example, to support analyses of informal settlements or refugee facilities.

Distribution of building footprint sizes.

The data is freely available and we look forward to hearing how it is used. In the future, we may add new features and regions, depending on usage and feedback.

This work is part of our AI for Social Good efforts and was led by Google Research, Ghana. Thanks to the co-authors of this work: Wojciech Sirko, Sergii Kashubin, Marvin Ritter, Abigail Annkah, Yasser Salah Eddine Bouchareb, Yann Dauphin, Daniel Keysers, Maxim Neumann and Moustapha Cisse. We are grateful to Abdoulaye Diack, Sean Askay, Ruth Alcantara and Francisco Moneo for help with coordination. Rob Litzke, Brian Shucker, Yan Mayster and Michelina Pallone provided valuable assistance with geo infrastructure.


An AI a Day Keeps Dr.Fill at Play: Matt Ginsberg on Building GPU-Powered Crossword Solver

9 Down, 14 letters: Someone skilled in creating and solving crossword puzzles.

This April, the fastest “cruciverbalist” at the American Crossword Puzzle Tournament was Dr.Fill, a crossword puzzle-solving AI program created by Matt Ginsberg.

Dr.Fill perfectly solved the championship puzzle in 49 seconds. The human champion, Tyler Hinman, filled the 15×15 crossword in exactly three minutes.

Though Ginsberg has published crossword puzzles for the New York Times, he has trouble solving puzzles, even his own. After attending a crossword tournament over a decade ago, Ginsberg decided to create a crossword-solving program to compete against top-tier word nerds.

Ginsberg spoke with NVIDIA AI Podcast host Noah Kravitz about his decade-long journey creating Dr.Fill and where he envisions it going in the future.

Key Points From This Episode:

  • Ginsberg partnered with UC Berkeley’s natural language processing team. By combining their code bases, Dr.Fill outperformed the human competitors at this year’s tournament.
  • Dr.Fill performs better on unthemed puzzles than on themed ones. This is where the natural language group comes into play, helping Dr.Fill interpret clues and difficult crossword themes.


“I’m looking forward to the crossword tournament next year because I know we’re going to be working hard to make the program better.” — Matt Ginsberg [11:02]

You Might Also Like:

Otter.ai CEO Sam Liang on Bringing Live Captions to a Meeting Near You

Sam Liang is making things easier for the creators of the NVIDIA AI Podcast — and just about every remote worker. He’s the CEO and co-founder of Otter.ai, which uses AI to produce speech-to-text transcriptions in real time or from recording uploads.

How Deep Learning Can Translate American Sign Language

Rochester Institute of Technology computer engineering major Syed Ahmed, a research assistant at the National Technical Institute for the Deaf, uses AI to translate between American Sign Language and English. Ahmed trained his algorithm on 1,700 sign language videos.

Pod Squad: Descript Uses AI to Make Managing Podcasts Quicker, Easier

Serial entrepreneur Andrew Mason is making podcast editing easier and more collaborative with his company, Descript Podcast Studio, which uses AI, natural language processing and automatic speech synthesis.

Subscribe to the AI Podcast

Get the AI Podcast through iTunes, Google Podcasts, Google Play, Castbox, DoggCatcher, Overcast, PlayerFM, Pocket Casts, Podbay, PodBean, PodCruncher, PodKicker, Soundcloud, Spotify, Stitcher and TuneIn. If your favorite isn’t listed here, drop us a note.

Make the AI Podcast Better

Have a few minutes to spare? Fill out this listener survey. Your answers will help us make a better podcast.


The post An AI a Day Keeps Dr.Fill at Play: Matt Ginsberg on Building GPU-Powered Crossword Solver appeared first on The Official NVIDIA Blog.


Our quantum processor at the Deutsches Museum

In 2019, our Quantum AI team achieved a beyond-classical computation by outperforming the world’s fastest classical computer. Today, a quantum processor from the Sycamore generation that accomplished this important computing milestone will be donated to the Deutsches Museum of Masterpieces of Science and Technology in Munich, Germany. 

The Deutsches Museum has one of the largest collections of science and technology artifacts in the world. This means that the Sycamore will share the same exhibition space as some of the world’s most important technological achievements, such as the roundest object in the world, a silicon sphere that gives the kilogram a new definition; the Z3, one of the earliest computers; the Wright Flyer, considered the first serial motor plane; and automotive history from the first diesel engine to the Waymo Firefly. The museum has a long history of preserving artifacts that mark the start of new eras in science and technology, which is why we’re honored to have the Sycamore processor among these exhibits. The beyond-classical experiment ushered in a new era for exploring near-term quantum algorithms that could have tangible benefits to society, such as designing more efficient batteries, creating fertilizer using less energy, and figuring out which molecules might make effective medicines.


Handover of the Sycamore processor in front of Zuse’s Z3 computer. Pictured: Luise Allendorf-Hoefer, Curator Electronics, Deutsches Museum; Wolfgang M. Heckl, Director General, Deutsches Museum; Markus Hoffmann, Google Quantum AI Partnerships; and Hartmut Neven, Director, Google Quantum AI.

This also marks an important milestone in the collaboration between Google Research and Germany’s burgeoning quantum community. Google’s research presence in Munich and Berlin has given us the opportunity to partner with several German organisations to explore the future of quantum computing. For example, the Sycamore processor has already been used by some of our industrial research partners, like Volkswagen and Mercedes-Benz, and will be the foundation for experiments designed with Boehringer Ingelheim, Covestro and BASF.

If you can’t travel to Munich to visit the Deutsches Museum in person, don’t forget that you can take a virtual trip through Google Arts & Culture.


How Was NVIDIA’s 2021 GTC Keynote Made? Step Inside Our Kitchen Aug. 11 to Find Out

Ever see a virtual kitchen materialize in real-time? If you caught NVIDIA CEO Jensen Huang’s keynote for our March 2021 GPU Technology Conference you’re no doubt wondering about more than a few of the presentation’s magic tricks.

With the premiere of “Connecting in the Metaverse: The Making of the GTC Keynote,” Wednesday, Aug. 11, at 11 a.m. Pacific time, NVIDIA team members will reveal the story behind the story told at GTC.

Designed to entertain and inform, GTC keynotes are always filled with cutting-edge demos highlighting NVIDIA’s advancements in supercomputing, deep learning and graphics.

This year, however, with NVIDIA’s team working remotely to create a presentation for attendees flocking to an entirely virtual gathering, those technologies were crucial for making the keynote itself.

The highlight: the reveal of Huang’s virtual kitchen, complete with a digital clone of the man himself.

The half-hour documentary film debuting Wednesday tells the tale of the small team of remote artists who were able to blur the line between real and rendered on a tight deadline with NVIDIA Omniverse, a platform for connecting 3D worlds into a shared virtual space.

Go to our SIGGRAPH 2021 landing page to watch the video, and browse all our events at this year’s conference.

Immediately after the video’s debut, those who are registered for SIGGRAPH can join the team at NVIDIA who will dig into the story told in the video for a live Q&A at 11:30 a.m. Pacific time.

To participate in the panel, register for SIGGRAPH using the code “NVIDIA21” for a free basic pass or $25 off an enhanced or ultimate pass.

You’ll be able to find all the details you’ll need to join the panel once they’re posted on SIGGRAPH’s website.

The post How Was NVIDIA’s 2021 GTC Keynote Made? Step Inside Our Kitchen Aug. 11 to Find Out appeared first on The Official NVIDIA Blog.


Advances in TF-Ranking

Posted by Michael Bendersky and Xuanhui Wang, Software Engineers, Google Research

In December 2018, we introduced TF-Ranking, an open-source TensorFlow-based library for developing scalable neural learning-to-rank (LTR) models, which are useful in settings where users expect to receive an ordered list of items in response to their query. LTR models — unlike standard classification models that classify one item at a time — receive an entire list of items as an input, and learn an ordering that maximizes the utility of the entire list. While search and recommendation systems are the most common applications of LTR models, since its release, we have seen TF-Ranking being applied in diverse domains beyond search, including e-commerce, SAT solvers, and smart city planning.

The goal of learning-to-rank (LTR) is to learn a function f() that takes as an input a list of items (documents, products, movies, etc.) and outputs the list of items in the optimal order (descending order of relevance). Here, green shade indicates item relevance level, and the red item marked with ‘x’ is non-relevant.
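The ordering function and one common way to measure the utility of a ranked list can be sketched in plain Python. NDCG is used here as an assumed example of a list-level utility; it is a standard ranking metric, but nothing above says it is the one intended. The function names are illustrative.

```python
import math


def rank_items(items, score_fn):
    """Order items by descending score, as the function f() in the caption does."""
    return sorted(items, key=score_fn, reverse=True)


def dcg(relevances):
    """Discounted cumulative gain: gains lower in the list count for less."""
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


items = [("a", 0.2), ("b", 0.9), ("c", 0.5)]  # (item, model score), invented values
ranked = [name for name, _ in rank_items(items, score_fn=lambda item: item[1])]
perfect = ndcg([3, 2, 0])   # relevance labels already in the best order
reversed_order = ndcg([0, 2, 3])
```

Because the metric depends on positions across the whole list, it cannot be computed by scoring one item in isolation.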

In May 2021, we published a major release of TF-Ranking that enables full support for natively building LTR models using Keras, a high-level API of TensorFlow 2. Our native Keras ranking model has a brand-new workflow design, including a flexible ModelBuilder, a DatasetBuilder to set up training data, and a Pipeline to train the model with the provided dataset. These components make building a customized LTR model easier than ever, and facilitate rapid exploration of new model structures for production and research. If RaggedTensors are your tool of choice, TF-Ranking now works with them as well. In addition, our most recent release, which incorporates the Orbit training library, contains a long list of advances, the culmination of two and a half years of neural LTR research. Below we share a few of the key improvements available in the latest TF-Ranking version.

Workflow to build and train a native Keras ranking model. Blue modules are provided by TF-Ranking, and green modules are customizable.

Learning-to-Rank with TFR-BERT
Recently, pretrained language models like BERT have achieved state-of-the-art performance on various language understanding tasks. To capture the expressiveness of these models, TF-Ranking implements a novel TFR-BERT architecture that couples BERT with the power of LTR to optimize the ordering of list inputs. As an example, consider a query and a list of n documents that one might like to rank in response to this query. Instead of learning an independent BERT representation for each <query, document> pair, LTR models apply a ranking loss to jointly learn a BERT representation that maximizes the utility of the entire ranked list with respect to the ground-truth labels.

The figure below illustrates this process. First, we flatten a list of n documents to rank in response to a query into a list of <query, document> tuples. These tuples are fed into a pre-trained language model (e.g., BERT). The pooled BERT outputs for the entire document list are then jointly fine-tuned with one of the specialized ranking losses available in TF-Ranking. Our experience shows that this TFR-BERT architecture delivers significant improvements in pretrained language model performance, leading to state-of-the-art performance for several popular ranking tasks, especially when multiple pretrained language models are ensembled. Our users can now get started with TFR-BERT using this simple example.

An illustration of the TFR-BERT architecture, in which a joint LTR model over a list of n documents is constructed using BERT representations of individual <query, document> pairs.
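The two steps described above, flattening a query and its documents into pairs and then applying one loss over the whole list, can be sketched in plain Python. This is not the TF-Ranking implementation: toy_pair_score is a hypothetical stand-in for the pooled BERT output, and the loss is a generic softmax cross-entropy listwise loss.

```python
import math
from typing import List, Sequence, Tuple


def flatten_query_documents(query: str, documents: Sequence[str]) -> List[Tuple[str, str]]:
    """Turn one query and n documents into n <query, document> pairs."""
    return [(query, doc) for doc in documents]


def toy_pair_score(pair: Tuple[str, str]) -> float:
    """Hypothetical scorer: counts shared words. A real model would use BERT here."""
    query, document = pair
    return float(len(set(query.lower().split()) & set(document.lower().split())))


def softmax_listwise_loss(scores: Sequence[float], labels: Sequence[float]) -> float:
    """Softmax cross-entropy over one list; labels are normalized to sum to 1."""
    total = sum(labels)
    if total == 0:
        return 0.0
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return -sum((y / total) * (s - log_norm) for s, y in zip(scores, labels))


query = "red banana price"
documents = ["price of a red banana", "blue car for sale", "banana bread recipe"]
labels = [1.0, 0.0, 0.0]  # only the first document is relevant

pairs = flatten_query_documents(query, documents)
scores = [toy_pair_score(pair) for pair in pairs]
loss = softmax_listwise_loss(scores, labels)
print(pairs, scores, loss)
```

Because the loss is computed from all scores in the list at once, raising the score of an irrelevant document increases the loss even if the relevant document's score is unchanged.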

Interpretable Learning-to-Rank
Transparency and interpretability are important factors in deploying LTR models in ranking systems that can be involved in determining the outcomes of processes such as loan eligibility assessment, advertisement targeting, or guiding medical treatment decisions. In such cases, the contribution of each individual feature to the final ranking should be examinable and understandable to ensure transparency, accountability and fairness of the outcomes.

One possible way to achieve this is using generalized additive models (GAMs) — intrinsically interpretable machine learning models that are linearly composed of smooth functions of individual features. However, while GAMs have been extensively studied on regression and classification tasks, it is less clear how to apply them in a ranking setting. For instance, while GAMs can be straightforwardly applied to model each individual item in the list, modeling both item interactions and the context in which these items are ranked is a more challenging research problem. To this end, we have developed a neural ranking GAM — an extension of generalized additive models to ranking problems.

Unlike standard GAMs, a neural ranking GAM can take into account both the features of the ranked items and the context features (e.g., query or user profile) to derive an interpretable, compact model. This ensures that not only the contribution of each item-level feature is interpretable, but also the contribution of the context features. For example, in the figure below, using a neural ranking GAM makes visible how distance, price, and relevance, in the context of a given user device, contribute to the final ranking of the hotel. Neural ranking GAMs are now available as a part of TF-Ranking.

An example of applying neural ranking GAM for local search. For each input feature (e.g., price, distance), a sub-model produces a sub-score that can be examined, providing transparency. Context features (e.g., user device type) can be utilized to derive importance weights of submodels.
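The additive structure in the caption can be sketched as follows. In a real neural ranking GAM each sub-model and the context weighting would be small learned networks; the hand-written functions, feature names, and logits below are assumptions chosen only to show how per-feature sub-scores stay inspectable.

```python
import math
from typing import Dict, Tuple

# One sub-model per item feature (hypothetical, hand-written instead of learned).
SUB_MODELS = {
    "distance_km": lambda x: -0.5 * x,
    "price_usd": lambda x: -0.01 * x,
    "relevance": lambda x: 2.0 * x,
}

# Context feature (device type) mapped to per-feature logits (hypothetical values).
CONTEXT_LOGITS = {
    "mobile": {"distance_km": 2.0, "price_usd": 0.0, "relevance": 1.0},
    "desktop": {"distance_km": 0.0, "price_usd": 1.0, "relevance": 1.0},
}


def context_weights(device: str) -> Dict[str, float]:
    """Softmax over the context logits gives one importance weight per sub-model."""
    logits = CONTEXT_LOGITS[device]
    norm = sum(math.exp(v) for v in logits.values())
    return {name: math.exp(v) / norm for name, v in logits.items()}


def gam_score(item: Dict[str, float], device: str) -> Tuple[float, Dict[str, float]]:
    """Return the final score and each feature's weighted contribution to it."""
    weights = context_weights(device)
    contributions = {
        name: weights[name] * sub_model(item[name])
        for name, sub_model in SUB_MODELS.items()
    }
    return sum(contributions.values()), contributions


hotel = {"distance_km": 1.2, "price_usd": 180.0, "relevance": 0.9}
score, parts = gam_score(hotel, "mobile")
for name, value in parts.items():
    print(f"{name:12s} {value:+.3f}")
print(f"{'total':12s} {score:+.3f}")
```

The final score is exactly the sum of the printed contributions, so each feature's effect on the ranking can be read off directly.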

Neural Ranking or Gradient Boosting?
While neural models have achieved state-of-the-art performance in multiple domains, specialized gradient boosted decision trees (GBDTs) like LambdaMART remained the baseline to beat in a variety of open LTR datasets. The success of GBDTs in open datasets is due to several reasons. First, because these datasets are relatively small, neural models are prone to overfitting on them. Second, since GBDTs partition their input feature space using decision trees, they are naturally more resilient to variations in numerical scales in ranking data, which often contain features with Zipfian or otherwise skewed distributions. However, GBDTs do have their limitations in more realistic ranking scenarios, which often combine both textual and numerical features. For instance, GBDTs cannot be directly applied to large discrete feature spaces, such as raw document text. They are also, in general, less scalable than neural ranking models.

Therefore, since the TF-Ranking release, our team has significantly deepened the understanding of how best to leverage neural models in ranking with numerical features. This culminated in a Data Augmented Self-Attentive Latent Cross (DASALC) model, described in an ICLR 2021 paper, which is the first to establish parity, and in some cases statistically significant improvements, of neural ranking models over strong LambdaMART baselines on open LTR datasets. This achievement is made possible through a combination of techniques, which include data augmentation, neural feature transformation, self-attention for modeling document interactions, listwise ranking loss, and model ensembling similar to boosting in GBDTs. The architecture of the DASALC model was entirely implemented using the TF-Ranking library.
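The paragraph above names neural feature transformation among the techniques without giving a formula. As an illustration only, and an assumption rather than a confirmed detail of DASALC, one simple transform for skewed numerical features is a symmetric logarithm, which compresses large magnitudes while preserving sign and order.

```python
import math
from typing import List


def symmetric_log1p(x: float) -> float:
    """sign(x) * log(1 + |x|): shrinks large values, keeps sign and ordering."""
    return math.copysign(math.log1p(abs(x)), x)


# A skewed toy feature (values invented for illustration).
raw_feature: List[float] = [0.0, 3.0, 250.0, 1_000_000.0, -40.0]
transformed = [symmetric_log1p(value) for value in raw_feature]

for raw, new in zip(raw_feature, transformed):
    print(f"{raw:>12.1f} -> {new:+.3f}")
```

Because the transform is monotonic, it does not change where a decision tree could split the feature, but it keeps the inputs to a neural network in a much narrower range.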

All in all, we believe that the new Keras-based TF-Ranking version will make it easier to conduct neural LTR research and deploy production-grade ranking systems. We encourage everyone to try out the latest version and follow this introductory example for a hands-on experience. While we are very excited about this new release, our research and development journey is far from over, so we will continue to advance our understanding of learning-to-rank problems and share these advances with our users.

This project was only possible thanks to the current and past members of the TF-Ranking team: Honglei Zhuang, Le Yan, Rama Pasumarthi, Rolf Jagerman, Zhen Qin, Shuguang Han, Sebastian Bruch, Nathan Cordeiro, Marc Najork and Patrick McGregor. We also extend special thanks to our collaborators from the TensorFlow team: Zhenyu Tan, Goldie Gadde, Rick Chao, Yuefeng Zhou, Hongkun Yu, and Jing Li.


Announcing specialized support for extracting data from invoices and receipts using Amazon Textract

Receipts and invoices are documents that are critical to small and medium businesses (SMBs), startups, and enterprises for managing their accounts payable processes. These types of documents are difficult to process at scale because they follow no set design rules, yet any individual customer encounters thousands of distinct types of these documents.

In this post, we show how you can use Amazon Textract’s new Analyze Expense API to extract line item details in addition to key-value pairs from invoices and receipts, which is a frequent request we hear from customers. Amazon Textract uses machine learning (ML) to understand the context of invoices and receipts, and automatically extracts specific information like vendor name, price, and payment terms. We walk you through processing an invoice or receipt using Amazon Textract and extracting a set of fields and line-item details. While AWS takes care of building, training, and deploying advanced ML models in a highly available and scalable environment, you take advantage of these models with simple-to-use API actions.

We cover the following topics in this post:

  • How Amazon Textract processes invoices and receipts
  • A walkthrough of the Amazon Textract console
  • Anatomy of the Amazon Textract AnalyzeExpense API response
  • How to process the response with the Amazon Textract parser library
  • Sample solution architecture to automate invoice and receipts processing
  • How to deploy and use the solution

Invoice and receipt processing using Amazon Textract

SMBs, startups, and enterprises process paper-based invoices and receipts as part of their accounts payable process to reconcile their goods received and for auditing purposes. Employees who submit expense reports also submit scans or images of the associated receipts. Companies try to standardize electronic invoicing, but some vendors only offer paper invoices, and some countries legally require paper invoices.

The peculiarities of invoices and receipts also make this a difficult problem to solve at scale: invoices and receipts all look different, because each vendor designs its own documents independently. The labels are imperfect and inconsistent. Vendor name is often not explicitly labeled and has to be interpreted based on context. Other important information such as customer number, customer ID, or account ID is labeled differently from document to document.

To solve this problem, you can use Amazon Textract to process invoices and receipts at scale. Amazon Textract works with any style of invoice or receipt, no templates or configuration required, and extracts relevant data that can be tricky to extract such as contact information, items purchased, and vendor name from those documents. That includes the line-item details, not just the headline amounts.

Amazon Textract also identifies vendor names that are critical for your workflows but may not be explicitly labeled. For example, Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo at the top of the page without an explicit key-value pair combination.

Amazon Textract also makes it easy to consolidate input from diverse receipts and invoices. Different documents use different words for the same concept. For example, Amazon Textract maps relationships between field names in different documents such as customer no., customer number, and account ID, and outputs a standard taxonomy value (in this case, INVOICE_RECEIPT_ID), thereby representing data consistently across document types.

Amazon Textract console walkthrough

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following images show examples of both an invoice and a receipt document on the Analyze Expense output tab of the Amazon Textract console.

Amazon Textract automatically detects the vendor name, invoice number, ship to address, and more from the sample invoice and displays them on the Summary Fields tab. It also represents the standard taxonomy of fields in brackets next to the actual value on the document. For example, it identifies “INVOICE #” as the standard field INVOICE_RECEIPT_ID.

Additionally, Amazon Textract detects the items purchased and displays them on the Line Item Fields tab.

The following is a similar example of a receipt. Amazon Textract detects “Whole Foods Market” as VENDOR_NAME even though the receipt doesn’t explicitly mention it as the vendor name.

The Amazon Textract AnalyzeExpense API response

In this section, we explain the AnalyzeExpense API response structure using sample images. The following is a sample receipt and the corresponding AnalyzeExpense response JSON.

Sample store receipt image

AnalyzeExpense JSON response for SummaryFields:

    {
        "DocumentMetadata": {
            "Pages": 1
        },
        "ExpenseDocuments": [
            {
                "ExpenseIndex": 1,
                "SummaryFields": [
                    {
                        "Type": {
                            "Text": "VENDOR_NAME",
                            "Confidence": 97.0633544921875
                        },
                        "ValueDetection": {
                            "Text": "New Store X1",
                            "Geometry": { ... },
                            "Confidence": 96.65239715576172
                        },
                        "PageNumber": 1
                    },
                    {
                        "Type": {
                            "Text": "OTHER",
                            "Confidence": 81.0
                        },
                        "LabelDetection": {
                            "Text": "Order type:",
                            "Geometry": { ... },
                            "Confidence": 80.8591079711914
                        },
                        "ValueDetection": {
                            "Text": "Quick Sale",
                            "Geometry": { ... },
                            "Confidence": 80.82302856445312
                        },
                        "PageNumber": 1
                    },
                    ...

AnalyzeExpense JSON response for LineItemGroups:

    "LineItemGroups": [
        {
            "LineItemGroupIndex": 1,
            "LineItems": [
                {
                    "LineItemExpenseFields": [
                        {
                            "Type": {
                                "Text": "ITEM",
                                "Confidence": 99.95216369628906
                            },
                            "ValueDetection": {
                                "Text": "Red Banana is in\nbusiness ",
                                "Geometry": { ... },
                                "Confidence": 99.81525421142578
                            },
                            "PageNumber": 1
                        },
                        {
                            "Type": {
                                "Text": "PRICE",
                                "Confidence": 99.95216369628906
                            },
                            "ValueDetection": {
                                "Text": "$66.96",
                                "Geometry": { ... },
                                ...

The AnalyzeExpense JSON output contains ExpenseDocuments, and each ExpenseDocument contains SummaryFields and LineItemGroups. The ExpenseIndex field uniquely identifies the expense, and associates the appropriate SummaryFields or LineItemGroups detected to that expense.

The most granular level of data in the AnalyzeExpense response consists of Type, ValueDetection, and LabelDetection (optional). Let’s call this set of data an AnalyzeExpense element. The preceding example illustrates an AnalyzeExpense element that contains Type, ValueDetection, and LabelDetection.

In the preceding example, Amazon Textract detected 16 SummaryField key-value pairs, including VENDOR_NAME: New Store X1 and Order type: Quick Sale. AnalyzeExpense detects each key-value pair and displays it as shown in the preceding example. The individual entities are as follows:

  • LabelDetection – The optional key of the key-value pair. In the Order type: Quick Sale example, it’s Order type:. For implied values such as Vendor Name, where the key isn’t explicitly shown in the receipt, LabelDetection isn’t included in the AnalyzeExpense element. In the preceding example, “New Store X1” at the top of the receipt is the vendor name without an explicit key. The AnalyzeExpense element for “New Store X1” has a type of VENDOR_NAME and ValueDetection of New Store X1, but doesn’t have a LabelDetection.
  • Type – This is the normalized type of the key-value pair. Because Order type isn’t a normalized taxonomy value, it’s classified as OTHER. Examples of normalized values are Vendor Name, Receiver Address, and Payment Terms. For a full list of normalized taxonomy values, see the Amazon Textract Developer Guide.
  • ValueDetection – The value of the key-value pair. In the example of Order type: Quick Sale, it’s Quick Sale.
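The three entities above can be read from the response with ordinary dictionary access. The following sketch uses a trimmed-down copy of the sample response (Geometry omitted) and a hypothetical helper, summary_fields, that returns one row per AnalyzeExpense element; LabelDetection is treated as optional, as described above.

```python
from typing import Any, Dict, List

# Trimmed copy of the sample response shown earlier (Geometry omitted).
sample_response: Dict[str, Any] = {
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {
                    "Type": {"Text": "VENDOR_NAME", "Confidence": 97.0633544921875},
                    "ValueDetection": {"Text": "New Store X1", "Confidence": 96.65239715576172},
                    "PageNumber": 1,
                },
                {
                    "Type": {"Text": "OTHER", "Confidence": 81.0},
                    "LabelDetection": {"Text": "Order type:", "Confidence": 80.8591079711914},
                    "ValueDetection": {"Text": "Quick Sale", "Confidence": 80.82302856445312},
                    "PageNumber": 1,
                },
            ],
        }
    ]
}


def summary_fields(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten SummaryFields into rows of (expense index, type, label, value)."""
    rows = []
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            rows.append(
                {
                    "expense_index": document.get("ExpenseIndex"),
                    "type": field.get("Type", {}).get("Text"),
                    # Implied values such as the vendor name have no LabelDetection.
                    "label": field.get("LabelDetection", {}).get("Text"),
                    "value": field.get("ValueDetection", {}).get("Text"),
                }
            )
    return rows


rows = summary_fields(sample_response)
for row in rows:
    print(row)
```

The same helper works on the dictionary returned by an AWS SDK call, because it only relies on the keys shown in the sample response.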

The AnalyzeExpense API also detects ITEM, QUANTITY, and PRICE within line items as normalized fields. If other text is in a line item on the receipt image, such as SKU or a detailed description, it’s included in the JSON as EXPENSE_ROW, as shown in the following example:

    {
        "Type": {
            "Text": "EXPENSE_ROW",
            "Confidence": 99.95216369628906
        },
        "ValueDetection": {
            "Text": "Red Banana is in x3 $66.96\nbusiness ",
            "Geometry": { ... },
            "Confidence": 98.11214447021484
        }
    }

In addition to the detected content, the AnalyzeExpense API provides information like confidence scores and bounding boxes for detected elements. It gives you control of how you consume extracted content and integrate it into your applications. For example, you can flag any elements that have a confidence score under a certain threshold for manual review.
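A minimal sketch of such a review filter follows. It walks both SummaryFields and the line-item fields of a response and flags any element whose lowest confidence score is under a threshold. The helper name low_confidence_elements and the threshold of 90 are assumptions for illustration; the sample data is a trimmed copy of the earlier response.

```python
from typing import Any, Dict, Iterator, List, Tuple

sample_response: Dict[str, Any] = {
    "ExpenseDocuments": [
        {
            "ExpenseIndex": 1,
            "SummaryFields": [
                {
                    "Type": {"Text": "VENDOR_NAME", "Confidence": 97.0633544921875},
                    "ValueDetection": {"Text": "New Store X1", "Confidence": 96.65239715576172},
                },
                {
                    "Type": {"Text": "OTHER", "Confidence": 81.0},
                    "LabelDetection": {"Text": "Order type:", "Confidence": 80.8591079711914},
                    "ValueDetection": {"Text": "Quick Sale", "Confidence": 80.82302856445312},
                },
            ],
            "LineItemGroups": [
                {
                    "LineItemGroupIndex": 1,
                    "LineItems": [
                        {
                            "LineItemExpenseFields": [
                                {
                                    "Type": {"Text": "ITEM", "Confidence": 99.95216369628906},
                                    "ValueDetection": {
                                        "Text": "Red Banana is in\nbusiness ",
                                        "Confidence": 99.81525421142578,
                                    },
                                }
                            ]
                        }
                    ],
                }
            ],
        }
    ]
}


def iter_elements(response: Dict[str, Any]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (section, element) for every summary field and line-item field."""
    for document in response.get("ExpenseDocuments", []):
        for field in document.get("SummaryFields", []):
            yield "summary", field
        for group in document.get("LineItemGroups", []):
            for line_item in group.get("LineItems", []):
                for field in line_item.get("LineItemExpenseFields", []):
                    yield "line_item", field


def low_confidence_elements(response: Dict[str, Any], threshold: float = 90.0) -> List[Dict[str, Any]]:
    """Return elements whose lowest confidence score is below the threshold."""
    flagged = []
    for section, field in iter_elements(response):
        scores = [
            field[part]["Confidence"]
            for part in ("Type", "LabelDetection", "ValueDetection")
            if part in field and "Confidence" in field[part]
        ]
        if scores and min(scores) < threshold:
            flagged.append(
                {
                    "section": section,
                    "type": field.get("Type", {}).get("Text"),
                    "value": field.get("ValueDetection", {}).get("Text"),
                    "min_confidence": min(scores),
                }
            )
    return flagged


needs_review = low_confidence_elements(sample_response, threshold=90.0)
print(needs_review)
```

The right threshold depends on how costly a wrong value is in your workflow, so it is worth tuning against a sample of your own documents.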

The input document is either bytes or an Amazon Simple Storage Service (Amazon S3) object. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you use the Bytes property to pass a document loaded from a local file system.

Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
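The two input options can be wrapped in one small helper that builds the Document argument. The function name build_document and its parameters are hypothetical; only the Bytes and S3Object shapes come from the API description above.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def build_document(
    local_path: Optional[str] = None,
    bucket: Optional[str] = None,
    key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the Document argument from either a local file or an S3 object."""
    if local_path is not None:
        # Raw bytes are passed as-is here; this assumes an AWS SDK is used for the call.
        return {"Bytes": Path(local_path).read_bytes()}
    if bucket and key:
        return {"S3Object": {"Bucket": bucket, "Name": key}}
    raise ValueError("Provide either local_path, or both bucket and key.")


s3_document = build_document(bucket="my-expense-bucket", key="receipts/receipt-001.png")
print(s3_document)
```

With Boto3, the result could then be passed as client.analyze_expense(Document=s3_document); the bucket and key shown here are placeholders.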

You can call the AnalyzeExpense API using the AWS Command Line Interface (AWS CLI), as shown in the following code. Make sure you have the latest AWS CLI version installed.

aws textract analyze-expense --document '{"S3Object": {"Bucket": "<Your Bucket>","Name": "Invoice/Receipts S3 Objects"}}'

Process the response with the Amazon Textract parser library

Apart from working with the JSON output as-is, you can use the Amazon Textract response parser library to parse the JSON returned by the AnalyzeExpense API. The library parses the JSON and provides programming language-specific constructs for working with different parts of the document, which makes it easier to deserialize the response and use it in your application. Similarly, the Amazon Textract PrettyPrinter library lets you print the parsed response in different formats. For more details and examples of parsing Amazon Textract responses, refer to the GitHub repo. You can parse SummaryFields and LineItemGroups for every ExpenseDocument in the AnalyzeExpense response JSON using the AnalyzeExpense parser. First, install the libraries and call the API, as shown in the following code:

Install the latest Boto3 Python SDK:
python3 -m pip install boto3 --upgrade

Install the latest version of the Amazon Textract response parser:
python3 -m pip install amazon-textract-response-parser --upgrade

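import boto3

# Amazon Textract client (adjust the Region to your own)
client = boto3.client(
         service_name='textract',
         region_name='us-east-1',
)

# Path to a local receipt or invoice image (placeholder; set to your own file)
documentName = 'receipt.png'

with open(documentName, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', documentName)

# process using image bytes
response = client.analyze_expense(Document={'Bytes': bytes_test})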

You can further use the serializer and deserializer to validate the response JSON and convert it into the Python object representation, and vice versa.

The following code deserializes the response JSON:

# j holds the Textract JSON as a string
import json
from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().load(json.loads(j))

The following code serializes the Python object back into its JSON-compatible representation:

from trp.trp2_expense import TAnalyzeExpenseDocument, TAnalyzeExpenseDocumentSchema
t_doc = TAnalyzeExpenseDocumentSchema().dump(t_doc)

You can also convert the output to formats like CSV, Presto, TSV, HTML, LaTeX, and more by using the Amazon Textract PrettyPrinter library.

Install the PrettyPrinter library with the following code:

python3 -m pip install amazon-textract-prettyprinter --upgrade

Call the get_string method of textractprettyprinter.t_pretty_print_expense with the output_type as SUMMARY or LINEITEMGROUPS and table_format set to whichever format you want to output. The following example code outputs both SUMMARY and LINEITEMGROUPS in the fancy grid format:

import os
import boto3
from textractprettyprinter.t_pretty_print_expense import get_string
from textractprettyprinter.t_pretty_print_expense import Textract_Expense_Pretty_Print, Pretty_Print_Table_Format

# boto3 client for Amazon Textract
textract = boto3.client(service_name='textract')

# Set the S3 bucket name and file name
# Please set the below variables to your S3 location
s3_source_bucket_name = "YOUR S3 BUCKET NAME"
s3_request_file_name = "YOUR S3 EXPENSE IMAGE FILENAME"

try:
    # Call the Textract AnalyzeExpense API with the input expense image in Amazon S3
    response = textract.analyze_expense(
        Document={
            'S3Object': {
                'Bucket': s3_source_bucket_name,
                'Name': s3_request_file_name
            }
        })
    # Call the PrettyPrinter get_string method to parse the response and print it in fancy_grid format.
    # You can set the pretty print format to other types as well, like csv, latex, etc.
    pretty_printed_string = get_string(textract_json=response, output_type=[Textract_Expense_Pretty_Print.SUMMARY, Textract_Expense_Pretty_Print.LINEITEMGROUPS], table_format=Pretty_Print_Table_Format.fancy_grid)
    # Use the pretty printed string to save the response in storage of your choice.
    # Below is just printing it on stdout.
    print(pretty_printed_string)

except Exception as e_raise:
    print(e_raise)
    raise e_raise

The following is the PrettyPrinter output for a sample receipt.

The following is another example of detecting structured data from an invoice.

AnalyzeExpense detects the various normalized summary fields like PAYMENT_TERMS, INVOICE_RECEIPT_ID, TOTAL, TAX, and RECEIVER_ADDRESS.

It also detects one LineItemGroup with one LineItem having DESCRIPTION, QUANTITY, and PRICE, as shown in the following PrettyPrinter output.

Solution architecture overview

The following diagram is a common solution architecture pattern you can use to process documents using Amazon Textract. The solution uses the new AnalyzeExpense API to process receipts and invoices on Amazon S3 and stores the results back in Amazon S3.

The solution architecture includes the following steps:

  1. The input and output S3 buckets store the input expense documents (images) in PNG and JPEG formats and the AnalyzeExpense PrettyPrinter outputs, respectively.
  2. An event rule based on an event pattern in Amazon EventBridge matches incoming S3 PutObject events in the input S3 bucket containing the raw expense document images.
  3. The configured EventBridge rule sends the event to an AWS Lambda function for further processing.
  4. The Lambda function reads the images from the input S3 bucket, calls the AnalyzeExpense API, uses the Amazon Textract response parser to deserialize the JSON response, uses Amazon Textract PrettyPrinter to easily print the parsed response, and stores the results back to the S3 bucket in different formats.
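Steps 2 through 4 can be sketched as a small Lambda handler. This is an illustration under stated assumptions, not the code deployed by the post's CloudFormation template: the event shape assumes an S3 PutObject event delivered through EventBridge with the bucket and key under detail.requestParameters, OUTPUT_BUCKET is a hypothetical environment variable, and the output key naming is invented. The AWS clients are passed in so the core logic can be exercised without AWS access.

```python
import os
from typing import Any, Callable, Dict, Tuple


def extract_s3_location(event: Dict[str, Any]) -> Tuple[str, str]:
    """Read bucket and key from the event (assumed CloudTrail-style PutObject shape)."""
    params = event["detail"]["requestParameters"]
    return params["bucketName"], params["key"]


def process_expense(
    event: Dict[str, Any],
    textract_client: Any,
    s3_client: Any,
    output_bucket: str,
    render: Callable[[Dict[str, Any]], str],
) -> str:
    """Analyze the uploaded image and store the rendered result; return the output key."""
    bucket, key = extract_s3_location(event)
    response = textract_client.analyze_expense(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    output_key = f"{key}.txt"  # hypothetical naming scheme
    s3_client.put_object(
        Bucket=output_bucket,
        Key=output_key,
        Body=render(response).encode("utf-8"),
    )
    return output_key


def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, str]:
    """Lambda entry point; imports are deferred so the helpers above stay testable."""
    import boto3
    from textractprettyprinter.t_pretty_print_expense import (
        Pretty_Print_Table_Format,
        Textract_Expense_Pretty_Print,
        get_string,
    )

    def render(response: Dict[str, Any]) -> str:
        return get_string(
            textract_json=response,
            output_type=[
                Textract_Expense_Pretty_Print.SUMMARY,
                Textract_Expense_Pretty_Print.LINEITEMGROUPS,
            ],
            table_format=Pretty_Print_Table_Format.fancy_grid,
        )

    output_key = process_expense(
        event,
        boto3.client("textract"),
        boto3.client("s3"),
        os.environ["OUTPUT_BUCKET"],  # hypothetical environment variable
        render,
    )
    return {"output_key": output_key}
```

Keeping the AWS clients as parameters of process_expense is a design choice for testability; in a deployed function, lambda_handler supplies the real clients.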

Deploy and use the solution

You can deploy the solution architecture using an AWS CloudFormation template that performs much of the setup work for you.

  1. Choose Launch Stack to deploy the solution architecture in the US East (N. Virginia) Region.
  2. Don’t make any changes to stack name or parameters.
  3. In the Capabilities section, select I acknowledge that AWS CloudFormation might create IAM resources.
  4. Choose Create stack.

To use the solution, upload the receipt and invoice images to the S3 bucket referred to by SourceBucket in the CloudFormation template. This triggers an event that invokes the Lambda function, which calls the AnalyzeExpense API, parses the response JSON, converts the parsed response into CSV or fancy_grid format, and stores it in another S3 bucket (referred to by OutputBucket in the CloudFormation template).

You can extend the provided Lambda function further based on your requirements and also change the output format to other types like TSV, grid, LaTeX, and many more by setting the appropriate value of table_format when calling the get_string method of textractprettyprinter.t_pretty_print_expense in Amazon Textract PrettyPrinter.

The sample Lambda function deployment package in this CloudFormation template also bundles the Boto3 SDK. If you want to upgrade the Boto3 SDK in the future, you either need to create a new deployment package with the upgraded Boto3 SDK or use the latest Boto3 SDK provided by the Lambda Python runtime.

Clean up resources

To delete the resources that the CloudFormation template created, complete the following steps:

  1. Delete the Input, Output and Logging Amazon S3 Buckets created by the CloudFormation template.
  2. On the AWS CloudFormation console, select the stack that you created.
  3. On the Actions menu, choose Delete.


In this post, we provided an overview of the new Amazon Textract AnalyzeExpense API to quickly and easily retrieve structured data from receipts and invoices. We also described how you can parse the AnalyzeExpense response JSON using the Amazon Textract parser library and save the output in different formats using Amazon Textract PrettyPrinter. Finally, we provided a solution architecture for processing invoices and receipts using Amazon S3, EventBridge, and a Lambda function.

For more information, see the Amazon Textract Developer Guide.

About the Authors

Dhawalkumar Patel is a Sr. Startups Machine Learning Solutions Architect at AWS with expertise in Machine Learning and Serverless domains. He has worked with organizations ranging from large enterprises to startups on problems related to distributed computing and artificial intelligence.



Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.




Manish Chugh is a Sr. Solutions Architect at AWS based in San Francisco, CA. He has worked with organizations ranging from large enterprises to early-stage startups. He is responsible for helping customers architect scalable, secure, and cost-effective workloads on AWS. In his free time, he enjoys hiking East Bay trails, road biking, and watching (and playing) cricket.
