RLiable: Towards Reliable Evaluation & Reporting in Reinforcement Learning

Posted by Rishabh Agarwal, Research Scientist and Pablo Samuel Castro, Staff Software Engineer, Google Research, Brain Team

Reinforcement learning (RL) is an area of machine learning that focuses on learning from experiences to solve decision making tasks. While the field of RL has made great progress, resulting in impressive empirical results on complex tasks, such as playing video games, flying stratospheric balloons and designing hardware chips, it is becoming increasingly apparent that the current standards for empirical evaluation might give a false sense of fast scientific progress while slowing it down.

To that end, in “Deep RL at the Edge of the Statistical Precipice”, accepted as an oral presentation at NeurIPS 2021, we discuss how statistical uncertainty of results needs to be considered, especially when using only a few training runs, in order for evaluation in deep RL to be reliable. Specifically, the predominant practice of reporting point estimates ignores this uncertainty and hinders reproducibility of results. Related to this, tables with per-task scores, as are commonly reported, can be overwhelming beyond a few tasks and often omit standard deviations. Furthermore, simple performance metrics like the mean can be dominated by a few outlier tasks, while the median score would remain unaffected even if up to half of the tasks had performance scores of zero. Thus, to increase the field’s confidence in reported results with a handful of runs, we propose various statistical tools, including stratified bootstrap confidence intervals, performance profiles, and better metrics, such as interquartile mean and probability of improvement. To help researchers incorporate these tools, we also release an easy-to-use Python library RLiable with a quickstart colab.

Statistical Uncertainty in RL Evaluation
Empirical research in RL relies on evaluating performance on a diverse suite of tasks, such as Atari 2600 video games, to assess progress. Published results on deep RL benchmarks typically compare point estimates of the mean and median scores aggregated across tasks. These scores are typically relative to some defined baseline and optimal performance (e.g., random agent and “average” human performance on Atari games, respectively) so as to make scores comparable across different tasks.

In most RL experiments, there is randomness in the scores obtained from different training runs, so reporting only point estimates does not reveal whether similar results would be obtained with new independent runs. A small number of training runs, coupled with the high variability in performance of deep RL algorithms, often leads to large statistical uncertainty in such point estimates.

The distribution of median human normalized scores on the Atari 100k benchmark, which contains 26 games, for five recently published algorithms, DER, OTR, CURL, two variants of DrQ, and SPR. The reported point estimates of median scores based on a few runs in publications, as shown by dashed lines, do not provide information about the variability in median scores and typically overestimate (e.g., CURL, SPR, DrQ) or underestimate (e.g., DER) the expected median, which can result in erroneous conclusions.

As benchmarks become increasingly complex, evaluating more than a few runs will be increasingly demanding due to the increased compute and data needed to solve such tasks. For example, five runs on 50 Atari games for 200 million frames takes 1000+ GPU days. Thus, evaluating more runs is not a feasible solution for reducing statistical uncertainty on computationally demanding benchmarks. While prior work has recommended statistical significance tests as a solution, such tests are dichotomous in nature (either “significant” or “not significant”), so they often lack the granularity needed to yield meaningful insights and are widely misinterpreted.

Number of runs in RL papers over the years. Beginning with the Arcade Learning Environment (ALE), the shift toward computationally-demanding benchmarks has led to the practice of evaluating only a handful of runs per task, increasing the statistical uncertainty in point estimates.

Tools for Reliable Evaluation
Any aggregate metric based on a finite number of runs is a random variable, so to take this into account, we advocate for reporting stratified bootstrap confidence intervals (CIs), which predict the likely values of aggregate metrics if the same experiment were repeated with different runs. These CIs allow us to understand the statistical uncertainty and reproducibility of results. Such CIs use the scores on combined runs across tasks. For example, evaluating 3 runs each on Atari 100k, which contains 26 tasks, results in 78 sample scores for uncertainty estimation.

In each task, colored balls denote scores on different runs. To compute stratified bootstrap CIs using the percentile method, bootstrap samples are created by randomly sampling scores with replacement proportionately from each task. Then, the distribution of aggregate scores on these samples is the bootstrapping distribution, whose spread around the center gives us the confidence interval.
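To make the procedure concrete, the following minimal sketch (plain NumPy, not the RLiable API; the function and variable names here are illustrative) computes a stratified percentile-bootstrap CI from a (tasks x runs) matrix of normalized scores:

import numpy as np

def stratified_bootstrap_ci(scores, aggregate_fn, reps=2000, alpha=0.05, seed=0):
    # Percentile-bootstrap CI for an aggregate metric over a (tasks x runs) score matrix.
    rng = np.random.default_rng(seed)
    num_tasks, num_runs = scores.shape
    stats = []
    for _ in range(reps):
        # Resample run indices with replacement independently within each task (stratified resampling).
        idx = rng.integers(num_runs, size=(num_tasks, num_runs))
        stats.append(aggregate_fn(np.take_along_axis(scores, idx, axis=1)))
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Example: 95% CI for the mean normalized score over 26 tasks x 3 runs of made-up data.
scores = np.random.rand(26, 3)
print(stratified_bootstrap_ci(scores, lambda s: s.mean()))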

Deep RL algorithms often perform better on some tasks and training runs than on others, but aggregate performance metrics can conceal this variability, as shown below.

Data with varied appearance but identical aggregate statistics. Source: Same Stats, Different Graphs.

Instead, we recommend performance profiles, which are typically used for comparing solve times of optimization software. These profiles plot the score distribution across all runs and tasks with uncertainty estimates based on stratified bootstrap confidence bands. The plots show the fraction of runs across all tasks that obtain a score above a threshold (𝝉) as a function of the threshold.

Performance profiles correspond to the empirical tail distribution of scores on runs combined across all tasks. Shaded regions show 95% stratified bootstrap confidence bands.

Such profiles allow for qualitative comparisons at a glance. For example, if the curve for one algorithm lies above another’s, that algorithm is better. We can also read off any score percentile; e.g., the profiles intersect y = 0.5 (dotted line above) at the median score. Furthermore, the area under the profile corresponds to the mean score.
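The profile itself is easy to compute from raw scores. The sketch below (again plain NumPy rather than the RLiable plotting utilities) evaluates the empirical tail distribution on a grid of thresholds; confidence bands would then come from the stratified bootstrap shown earlier:

import numpy as np

def performance_profile(scores, thresholds):
    # Fraction of (task, run) scores above each threshold; `scores` is a (tasks x runs) array.
    flat = scores.reshape(-1)
    return np.array([(flat > tau).mean() for tau in thresholds])

taus = np.linspace(0.0, 2.0, 101)
profile = performance_profile(np.random.rand(26, 3), taus)  # placeholder normalized scores
# Plotting `profile` against `taus` gives the curve; it crosses y = 0.5 at the median score.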

While performance profiles are useful for qualitative comparisons, algorithms rarely outperform other algorithms on all tasks and thus their profiles often intersect, so finer quantitative comparisons require aggregate performance metrics. However, existing metrics have limitations: (1) a single high performing task may dominate the task mean score, while (2) the task median is unaffected by zero scores on nearly half of the tasks and requires a large number of training runs for small statistical uncertainty. To address the above limitations, we recommend two alternatives based on robust statistics: the interquartile mean (IQM) and the optimality gap, both of which can be read as areas under the performance profile, below.

IQM (red) corresponds to the area under the performance profile, shown in blue, between the 25th and 75th percentile scores on the x-axis. Optimality gap (yellow) corresponds to the area between the profile and the horizontal line at y = 1 (human performance), for scores less than 1.

As an alternative to the median and mean, IQM corresponds to the mean score of the middle 50% of the runs combined across all tasks. It is more robust to outliers than the mean, a better indicator of overall performance than the median, and results in smaller CIs, so it requires fewer runs to claim improvements. The optimality gap, another alternative to the mean, measures how far an algorithm is from optimal performance.

IQM discards the lowest 25% and highest 25% of the combined scores (colored balls) and computes the mean of the remaining 50% scores.
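As a rough illustration (the library’s exact quartile handling may differ slightly), IQM and the optimality gap can be computed from the pooled run scores as follows:

import numpy as np

def interquartile_mean(scores):
    # Mean of the middle 50% of run scores pooled across all tasks.
    flat = np.sort(scores.reshape(-1))
    n = len(flat)
    return flat[n // 4 : (3 * n) // 4].mean()

def optimality_gap(scores, threshold=1.0):
    # Average amount by which scores fall short of the threshold (1.0 = human-level performance).
    return (threshold - np.minimum(scores, threshold)).mean()

scores = np.random.rand(26, 3)  # placeholder human-normalized scores
print(interquartile_mean(scores), optimality_gap(scores))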

For directly comparing two algorithms, another metric to consider is the average probability of improvement, which describes how likely an improvement over baseline is, regardless of its size. This metric is computed using the Mann-Whitney U-statistic, averaged across tasks.
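A hedged sketch of this metric using SciPy (the tie handling and normalization used in the library may differ):

import numpy as np
from scipy import stats

def average_probability_of_improvement(scores_x, scores_y):
    # scores_x and scores_y are (tasks x runs) arrays for algorithms X and Y.
    probs = []
    for x, y in zip(scores_x, scores_y):
        u, _ = stats.mannwhitneyu(x, y, alternative="two-sided")
        probs.append(u / (len(x) * len(y)))  # U / (n * m) estimates P(X > Y)
    return float(np.mean(probs))

print(average_probability_of_improvement(np.random.rand(16, 3), np.random.rand(16, 3)))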

Re-evaluating Evaluation
Using the above tools for evaluation, we revisit performance evaluations of existing algorithms on widely used RL benchmarks, revealing inconsistencies in prior evaluation. For example, in the Arcade Learning Environment (ALE), a widely recognized RL benchmark, the performance ranking of algorithms changes depending on the choice of aggregate metric. Since performance profiles capture the full picture, they often illustrate why such inconsistencies exist.

Median (left) and IQM (right) human normalized scores on the ALE as a function of the number of environment frames seen during training. IQM results in significantly smaller CIs than median scores.

On DM Control, a popular continuous control benchmark, there are large overlaps in 95% CIs of mean normalized scores for most algorithms.

DM Control Suite results, averaged across six tasks, on the 100k and 500k step benchmark. Since scores are normalized using maximum performance, mean scores correspond to one minus the optimality gap. The ordering of the algorithms is based on their claimed relative performance — all algorithms except Dreamer claimed improvement over at least one algorithm placed below them. Shaded regions show 95% CIs.

Finally, on Procgen, a benchmark for evaluating generalization in RL, the average probability of improvement shows that some claimed improvements are only 50-70% likely, suggesting that some reported improvements could be spurious.

Each row shows the probability that the algorithm X on the left outperforms algorithm Y on the right, given that X was claimed to be better than Y. Shaded region denotes 95% stratified bootstrap CIs.

Conclusion
Our findings on widely-used deep RL benchmarks show that statistical issues can have a large influence on previously reported results. In this work, we take a fresh look at evaluation to improve the interpretation of reported results and standardize experimental reporting. We’d like to emphasize the importance of published papers providing results for all runs to allow for future statistical analyses. To build confidence in your results, please check out our open-source library RLiable and the quickstart colab.

Acknowledgments
This work was done in collaboration with Max Schwarzer, Aaron Courville and Marc G. Bellemare. We’d like to thank Tom Small for an animated figure used in this post. We are also grateful for feedback by several members of the Google Research, Brain Team and DeepMind.

Read More

MIT Lincoln Laboratory wins nine R&D 100 Awards for 2021

Nine technologies developed at MIT Lincoln Laboratory have been selected as R&D 100 Award winners for 2021. Since 1963, this awards program has recognized the 100 most significant technologies transitioned to use or introduced into the marketplace over the past year. The winners are selected by an independent panel of expert judges. R&D World, an online publication that serves research scientists and engineers worldwide, announces the awards.

The winning technologies are diverse in their applications. One technology empowers medics to initiate life-saving interventions at the site of an emergency; another could help first responders find survivors buried under rubble. Others present new approaches to building motors at the microscale, combining arrays of optical fibers, and reducing electromagnetic interference in circuit boards. A handful of the awardees leverage machine learning to enable novel capabilities.

Field-programmable imaging array

Advanced imagers, such as lidars and high-resolution wide-field-of-view sensors, need the ability to process huge amounts of data directly in the system, or “on chip.” However, developing this capability for novel or niche applications is prohibitively expensive. To help designers overcome this barrier, Lincoln Laboratory developed a field-programmable imaging array to make high-performance on-chip digital processing available to a broad spectrum of new imaging applications.

The technology serves as a universal digital back end, adaptable to any type of optical detector. Once a front end for a specific detector type is integrated, the design cycle for new applications of that detector type can be greatly shortened.

Free-space Quantum Network Link Architecture

The Free-space Quantum Network Link Architecture enables the generation, distribution, and interaction of entangled photons across free-space links. These capabilities are crucial for the development of emerging quantum network applications, such as networked computing and distributed sensing.

Three primary technologies make up this system: a gigahertz clock-rate, three-stage pump laser system; a source of spectrally pure and long-duration entangled photons; and a pump-forwarding architecture that synchronizes quantum systems across free-space links with high precision. This architecture was successfully demonstrated over a 3.2-kilometer free-space atmospheric link between two buildings on Hanscom Air Force Base.

Global Synthetic Weather Radar

The Global Synthetic Weather Radar (GSWR) provides radar-like weather imagery and radar-forward forecasts for regions where actual weather radars are not deployed or are limited in range. The technology generates these synthetic images by using advanced machine learning techniques that combine satellite, lightning, numerical weather model, and radar truth data to produce its predictions.

The laboratory collaborated with the U.S. Air Force on this technology, which will help mission planners schedule operations in remote regions of the world. GSWR’s reliable imagery and forecasts can also provide decision-making guidance for emergency responders and for the transportation, agriculture, and tourism industries.

Guided Ultrasound Intervention Device

The Guided Ultrasound Intervention Device (GUIDE) is the first technology to enable a medic or emergency medical technician to catheterize a major blood vessel in a pre-hospital environment. This procedure can save lives from hemorrhage after traumatic injury.

To use GUIDE, a medic scans a target area of a patient with an ultrasound probe integrated with the device. The device then uses artificial intelligence software to locate a femoral vessel in real time and direct the medic to it via a gamified display. Once in position, the device inserts a needle and guide wire into the vessel, after which the medic can easily complete the process of catheterization. Similar to the impact of automated external defibrillators, GUIDE can empower non-experts to take life-saving measures at the scene of an emergency.

Microhydraulic motors

Microhydraulic motors provide a new way of making things move on a microscale. These tiny actuators are constructed by layering thin, disc-shaped polymer sheets on top of microfabricated electrodes and inserting droplets of water and oil in between the layers. A voltage applied to the electrodes distorts the surface tension of the droplets, causing them to move and rotate the entire disk with them.

These precise, powerful, and efficient motors could enable shape-changing materials, self-folding displays, or microrobots for medical procedures.

Monolithic fiber array launcher

A fiber array launcher is a subsystem that holds an array of optical fibers in place and shapes the laser beams emanating from the fibers. Traditional launchers are composed of many small components, which can become misaligned with vibration and are made of inefficient materials that absorb light. To address these problems, the laboratory developed a monolithic fiber array launcher.

Built out of a single piece of glass, this launcher is one-tenth the volume of traditional arrays and less susceptible to thermo-optic effects to allow for scaling to much higher laser powers and channel counts.

Motion Under Rubble Measured Using Radar

The Motion Under Rubble Measured Using Radar (MURMUR) technology was created to help rescue teams save lives in complex disaster environments. This remote-controlled system is mounted on a robotic ground vehicle for rapid deployment and uses radar to transmit low-frequency signals that penetrate walls, rubble, and debris. 

Signals that reflect back to the radar are digitized and then processed using both classical signal processing techniques and novel machine learning algorithms to determine the range in depth at which there is life-indicating motion, such as breathing, from someone buried under the rubble. Search-and-rescue personnel monitor these detections in real time on a mobile device, reducing time-consuming search efforts and enabling timely recovery of survivors.

Spectrally Efficient Digital Logic

Spectrally Efficient Digital Logic (SEDL) is a set of digital logic building blocks that operate with intrinsically low electromagnetic interference (EMI) emissions.

EMI emissions cause interference between electrical components and present security risks. These emission levels are often discovered late in the electronics development process, once all the pieces are put together, and are thus costly to fix. SEDL is designed to reduce EMI problems while being compatible with traditional logic, giving designers the freedom to construct systems composed entirely of SEDL components or a hybrid of traditional logic and SEDL. It also offers comparable size, cost, and clock speed to traditional logic.

Traffic Flow Impact Tool

Developed in collaboration with the Federal Aviation Administration, the Traffic Flow Impact Tool helps air traffic control managers handle disruptions to air traffic caused by dangerous weather, such as thunderstorms.

The tool uses a novel machine learning technique to fuse multiple convective weather forecast models and compute a metric called permeability, a measure of the amount of usable airspace in a given area. These permeability predictions are displayed on a user interface and allow managers to plan ahead for weather impacts to air traffic.

Since 2010, Lincoln Laboratory has received 75 R&D 100 Awards. The awards are a recognition of the laboratory’s transfer of unclassified technologies to industry and government. Each year, many technology transitions also occur for classified projects. This transfer of technology is central to the laboratory’s role as a federally funded research and development center.

“Our R&D 100 Awards recognize the significant, ongoing technology development and transition success at the laboratory. We have had much success with our classified work as well,” says Eric Evans, the director of Lincoln Laboratory. “We are very proud of everyone involved in these programs.”

Editors of R&D World announced the 2021 R&D 100 Award winners at virtual ceremonies broadcast on October 19, 20, and 21.

Read More

ML Community Day 2021 Recap

Posted by the TensorFlow Team

Thanks to everyone who joined our inaugural virtual ML Community Day! It was so great to get the community together and hear incredible talks like how JAX and TPUs make AlphaFold possible from the DeepMind team, and how Edge Impulse makes it easy for developers to work with TinyML using TensorFlow.

We also celebrated TensorFlow’s 6th birthday! The TensorFlow ecosystem has come a long way in 6 years, and we love seeing what you all achieve with our tools. From using machine learning to help advance access to human rights information, to creating a custom, TensorFlow-powered drumming arm.

In this article are a few of the updates and topics we shared during the event. You can watch the keynote below, and you can find recordings of every talk on the TensorFlow YouTube channel.


Model building

TensorFlow 2.7 is here! This release offers performance and usability improvements, including TFLite’s use of XNNPACK for mobile inference performance boosts, training improvements on GPUs, and a dramatic improvement in debugging efficiency in Keras and TF.

Keras has been modularized as a separate pip package on top of TensorFlow (installed by default) and now lives in a separate GitHub repository. This will make it much easier for the community to contribute to the development of Keras. We welcome your PRs!

Responsible AI

The Responsible AI team also announced v0.4 of our Language Interpretability Tool (LIT). LIT is an open-source platform for visualization and understanding of NLP models. This new release includes new interpretability techniques like Testing with Concept Activation Vectors (TCAV). TCAV is an interpretability method for ML models that shows the importance of high-level concepts for a predicted class.

Mobile

We recently launched on-device training in TensorFlow Lite. When deploying a TensorFlow Lite machine learning model to a mobile app, you may want to enable the model to be improved or personalized based on input from the device or end user. Using on-device training techniques allows you to update a model without data leaving your users’ devices, improving user privacy, and without requiring users to update the device software. It’s currently available on Android.

And we continue to work on making performance better on TensorFlow Lite. As mentioned above, XNNPACK, a library for faster floating-point ops, is now turned on by default in TensorFlow Lite. This allows your models to run on average 2.3x faster on the CPU.

Find all the talks here

You can find all of the content in this playlist, and for your convenience, here are direct links to each of the sessions:

Read More

MLPerf HPC Benchmarks Show the Power of HPC+AI 

NVIDIA-powered systems won four of five tests in MLPerf HPC 1.0, an industry benchmark for AI performance on scientific applications in high performance computing.

They’re the latest results from MLPerf, a set of industry benchmarks for deep learning first released in May 2018. MLPerf HPC addresses a style of computing that speeds and augments simulations on supercomputers with AI.

Recent advances in molecular dynamics, astronomy and climate simulation all used HPC+AI to make scientific breakthroughs. It’s a trend driving the adoption of exascale AI for users in both science and industry.

What the Benchmarks Measure

MLPerf HPC 1.0 measured training of AI models in three typical workloads for HPC centers.

  • CosmoFlow estimates details of objects in images from telescopes.
  • DeepCAM tests detection of hurricanes and atmospheric rivers in climate data.
  • OpenCatalyst tracks how well systems predict forces among atoms in molecules.

Each test has two parts. A measure of how fast a system trains a model is called strong scaling. Its counterpart, weak scaling, is a measure of maximum system throughput, that is, how many models a system can train in a given time.

Compared to the best results in strong scaling from last year’s MLPerf 0.7 round, NVIDIA delivered 5x better results for CosmoFlow. In DeepCAM, we delivered nearly 7x more performance.

The Perlmutter Phase 1 system at Lawrence Berkeley National Lab led in strong scaling in the OpenCatalyst benchmark using 512 of its 6,144 NVIDIA A100 Tensor Core GPUs.

In the weak-scaling category, we led DeepCAM using 16 nodes per job and 256 simultaneous jobs. All our tests ran on NVIDIA Selene (pictured above), our in-house system and the world’s largest industrial supercomputer.

NVIDIA wins MLPerf HPC, Nov 2021
NVIDIA delivered leadership results in both the speed of training a model and per-chip efficiency.

The latest results demonstrate another dimension of the NVIDIA AI platform and its performance leadership. It marks the eighth straight time NVIDIA delivered top scores in MLPerf benchmarks that span AI training and inference in the data center, the cloud and the network’s edge.

A Broad Ecosystem

Seven of the eight participants in this round submitted results using NVIDIA GPUs.

They include the Jülich Supercomputing Centre in Germany, the Swiss National Supercomputing Centre and, in the U.S., the Argonne and Lawrence Berkeley National Laboratories, the National Center for Supercomputing Applications and the Texas Advanced Computing Center.

“With the benchmark test, we have shown that our machine can unfold its potential in practice and contribute to keeping Europe on the ball when it comes to AI,” said Thomas Lippert, director of the Jülich Supercomputing Centre in a blog.

The MLPerf benchmarks are backed by MLCommons, an industry group led by Alibaba, Google, Intel, Meta, NVIDIA and others.

How We Did It

The strong showing is the result of a mature NVIDIA AI platform that includes a full stack of software.

In this round, we tuned our code with tools available to everyone, such as NVIDIA DALI to accelerate data processing and CUDA Graphs to reduce small-batch latency for efficiently scaling up to 1,024 or more GPUs.

We also applied NVIDIA SHARP, a key component within NVIDIA MagnumIO. It provides in-network computing to accelerate communications and offload data operations to the NVIDIA Quantum InfiniBand switch.

For a deeper dive into how we used these tools see our developer blog.

All the software we used for our submissions is available from the MLPerf repository. We regularly add such code to the NGC catalog, our software hub for pretrained AI models, industry application frameworks, GPU applications and other software resources.

The post MLPerf HPC Benchmarks Show the Power of HPC+AI  appeared first on The Official NVIDIA Blog.

Read More

Postprocessing with Amazon Textract: Multi-page table handling

Amazon Textract is a machine learning (ML) service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond simple optical character recognition (OCR) to identify and extract data from forms and tables.

Currently, thousands of customers are using Amazon Textract to process different types of documents. Many include tables across one or multiple pages, such as bank statements and financial reports.

Many developers expressed interest in merging Amazon Textract responses where tables exist across multiple pages. This post demonstrates how you can use the amazon-textract-response-parser utility to accomplish this and highlights a few tricks to optimize the process.

Solution overview

When tables span multiple pages, a series of steps and validations are required to determine the linkage across pages correctly.

These include analyzing the table structure similarities across pages (columns, headers, margins) and determining if any additional contents like headers or footers exist that may logically break the tables. These logical steps are separated into two major groups (page context and table structure), and you can adjust and optimize each logical step according to your use case.

This solution runs these tasks in series and only merges the results when all checks are completed and passed. The following diagram shows the solution workflow.

Implement the solution

To get started, you must install the amazon-textract-response-parser and amazon-textract-helper libraries. The Amazon Textract response parser library enables us to easily parse the Amazon Textract JSON response and provides constructs to work with different parts of the document effectively. This post focuses on the merge/link tables feature. The amazon-textract-helper library provides a collection of ready-to-use functions and sample implementations to speed up the evaluation and development of any project using Amazon Textract.

  1. Install the libraries with the following code:
!pip install amazon-textract-response-parser
!pip install amazon-textract-helper
  2. The postprocessing step to identify related tables and merge them is part of the trp.trp2 library, which you must import into your notebook:
import trp.trp2 as t2
from trp.t_pipeline import pipeline_merge_tables
from textractcaller.t_call import call_textract, Textract_Features
from trp.trp2 import TDocument, TDocumentSchema
from trp.t_tables import MergeOptions, HeaderFooterType
  3. Next, call Amazon Textract to process the document:
textract_json = call_textract(input_document=s3_uri_of_documents, features=[Textract_Features.TABLES], boto3_textract_client = textract_client)
  4. Finally, load the response JSON into a document and run the pipeline. The footer and header heights are configurable by the user. There are three default values that can be used for HeaderFooterType: None, Narrow, and Normal.
t_document: t2.TDocument = t2.TDocumentSchema().load(textract_json)
t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, None, HeaderFooterType.NONE)

pipeline_merge_tables takes a merge option parameter that can be either MergeOptions.MERGE or MergeOptions.LINK.

MergeOptions.MERGE combines the tables and makes them appear as one for postprocessing, with the drawback that the geometry information is no longer in the correct location because you now have cells and tables from subsequent pages moved to the page with the first part of the table.

MergeOptions.LINK maintains the geometric structure and enriches the table information with links between the table elements. custom['previous_table'] and custom['next_table'] attributes are added to the TABLE blocks in the Amazon Textract JSON schema.
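For example, after running the pipeline with MergeOptions.LINK, you could walk these links to stitch logical tables back together. The sketch below is illustrative only: it assumes the parsed TDocument exposes its blocks and that linked table IDs appear under each TABLE block’s custom field; check the response parser documentation for the exact attribute names.

# Illustrative sketch only; attribute names are assumptions, not verbatim library usage.
linked_doc = pipeline_merge_tables(t_document, MergeOptions.LINK, None, HeaderFooterType.NONE)

for block in linked_doc.blocks:
    if block.block_type == "TABLE" and block.custom:
        if "next_table" in block.custom:
            print(f"Table {block.id} continues in: {block.custom['next_table']}")
        if "previous_table" in block.custom:
            print(f"Table {block.id} continues from: {block.custom['previous_table']}")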

The following image represents a sample PDF file with a table that spans over two pages.

The following shows the Amazon Textract response without table merge postprocessing (left) and the response with table merge postprocessing (right).

Define a custom table merge validation function

The provided postprocessing API works for the majority of use cases; however, based on your specific use case, you can define a custom merge function to improve its accuracy.

This custom function is passed to the CustomTableDetectionFunction parameter of the pipeline_merge_tables function to override the existing logic for identifying the tables to merge. The following steps represent the existing logic.

  1. Validate context between tables. Check if there are any line items between the first and second table except in the footer and header area. If there are any line items, tables are considered separate tables.
  2. Compare the column numbers. If the two tables don’t have the same number of columns, this is an indicator of separate logical tables.
  3. Compare the headers. If the two tables have the exact same columns (same cell number and cell labels), this is a very strong indication of the same logical table.
  4. Compare table dimensions. Verify that the two tables have the same left and right margins. An accuracy percentage parameter can be passed to allow for some degree of error (for example, if the pages are scanned from paper, consecutive tables on different pages may have slightly different widths).

If you have a different requirement, you can pass your own custom table detection function to the pipeline_merge_tables API as follows:

# Additional imports beyond those shown earlier.
from typing import List
from trp import Document
from trp.trp2 import TDocumentSchema
from trp.t_pipeline import order_blocks_by_geo

def CustomTableDetectionFunction(t_document) -> List[List[str]]:
    table_ids_merge_list = []
    ordered_doc = order_blocks_by_geo(t_document)
    trp_doc = Document(TDocumentSchema().dump(ordered_doc))
    for current_page in trp_doc.pages:
        for table in current_page.tables:
            # Provide your custom logic here to determine which table IDs should merge into one table:
            # if <custom logic>:
            #     table_ids_merge_list.append([table_id_1, table_id_2, table_id_3, ...])
            pass
    return table_ids_merge_list

t_document = pipeline_merge_tables(t_document, MergeOptions.MERGE, CustomTableDetectionFunction, HeaderFooterType.NORMAL)

Our current implementation of the table detection function and the pipeline_merge_tables function in the Amazon Textract response parser library is available on GitHub. The custom table detection function returns a list of lists of strings, which is required by the merge_table or link_table functions (selected based on the MergeOptions parameter) called internally by the pipeline_merge_tables API.

Run sample code

The Amazon Textract multi-page tables processing repository provides sample code on how to use the merge tables feature and covers common scenarios that you may encounter in your documents. To try the sample code, you first launch an Amazon SageMaker notebook instance with the code repository, then you can access the notebook to review the code samples.

Launch a SageMaker notebook instance with the code repository

To launch a SageMaker notebook instance, complete the following steps:

  1. Choose the following link to launch an AWS CloudFormation template that deploys a SageMaker notebook instance along with the sample code repository:

  2. Sign in to the AWS Management Console with your AWS Identity and Access Management (IAM) user name and password.

You arrive at the Create Stack page on the Specify Template step.

  3. Choose Next.
  4. For Specify Stack Name, enter a stack name.
  5. Choose Next.
  6. Choose Next.
  7. On the review page, acknowledge the IAM resource creation and choose Create stack.

Access the SageMaker notebook and review the code samples

When the stack creation is complete, you can access the notebook and review the code samples.

  1. On the Outputs tab of the stack, choose the link corresponding to the value of the NotebookInstanceName key.
  2. Choose Open Jupyter.
  3. Go to the home page of your Jupyter notebook and browse to the amazon-textract-multipage-tables-processing directory.
  4. Open the Jupyter notebook inside this directory and review the sample code provided.

Conclusion

This post demonstrated how to use the Amazon Textract response parser component to identify and merge tables that span multiple pages. You walked through generic checks that you can use to identify a multi-page table, learned how to build your own custom function, and reviewed the two options to merge tables in the Amazon Textract response JSON.

If this post helps you or inspires you to solve a problem, we would love to hear about it! The code for this solution is available on the GitHub repo for you to use and extend. Contributions are always welcome!


About the Authors

 Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML solutions and architectures at scale.

Keith Mascarenhas is a Solutions Architect and works with our small and medium sized customers in central Canada to help them grow and achieve outcomes faster with AWS. He is also passionate about machine learning and is a member of the Amazon Computer Vision Hero program.

Yuan Jiang is a Sr Solutions Architect with a focus in machine learning. He’s a member of the Amazon Computer Vision Hero program and the Amazon Machine Learning Technical Field Community.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions, and joined AWS in 2014. He has guided some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focused on AI/ML with a focus on computer vision. He is currently obsessed with extracting information from documents.

Read More

Machine learning inference at scale using AWS serverless

With the growing adoption of machine learning (ML) across industries, there is an increasing demand for faster and easier ways to run ML inference at scale. ML use cases, such as manufacturing defect detection, demand forecasting, fraud surveillance, and many others, involve tens of thousands of datasets, including images, videos, files, documents, and other artifacts. These inference use cases typically require the workloads to scale to tens of thousands of parallel processing units. The simplicity and automated scaling offered by AWS serverless solutions make them a great choice for running ML inference at scale. Using serverless, inferences can be run without provisioning or managing servers, while paying only for the time it takes to run. ML practitioners can easily bring their own ML models and inference code to AWS by using containers.

This post shows you how to run and scale ML inference using AWS serverless solutions: AWS Lambda and AWS Fargate.

Solution overview

The following diagram illustrates the solutions architecture for both batch and real-time inference options. The solution is demonstrated using a sample image classification use case. Source code for this sample is available on GitHub.

The diagram illustrates the solutions architecture for batch and real-time inferences. Batch inference uses AWS Fargate and AWS Batch, along with Amazon S3 and Amazon ECR. Real-time inference uses AWS Lambda and Amazon API Gateway.

AWS Fargate: Lets you run batch inference at scale using serverless containers. Fargate task loads the container image with the inference code for image classification.

AWS Batch: Provides job orchestration for batch inference by dynamically provisioning Fargate containers as per job requirements.

AWS Lambda: Lets you run real-time ML inference at scale. The Lambda function loads the inference code for image classification. A Lambda function is also used to submit batch inference jobs.

Amazon API Gateway: Provides a REST API endpoint for the inference Lambda function.

Amazon Simple Storage Service (S3): Stores input images and inference results for batch inference.

Amazon Elastic Container Registry (ECR): Stores the container image with inference code for Fargate containers.

Deploying the solution

We have created an AWS Cloud Development Kit (CDK) template to define and configure the resources for the sample solution. CDK lets you provision the infrastructure and build deployment packages for both the Lambda function and the Fargate container. The packages include commonly used ML libraries, such as Apache MXNet and Python, along with their dependencies. The solution runs the inference code using a ResNet-50 model trained on the ImageNet dataset to recognize objects in an image. The model can classify images into 1,000 object categories, such as keyboard, pointer, pencil, and many animals. The inference code downloads the input image and returns the five classes the image most closely matches, along with their respective probabilities.
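As a rough illustration of what such a classifier does (a hedged sketch using the MXNet Gluon model zoo, not the code shipped in the repository), the inference handler loads a pretrained ResNet-50, preprocesses the image, and returns the top-5 predictions:

import mxnet as mx
from mxnet.gluon.model_zoo import vision

# Pretrained ImageNet ResNet-50; weights are downloaded on first use.
net = vision.resnet50_v2(pretrained=True)

def predict_top5(image_path):
    # Read and preprocess the image the way ImageNet-trained models expect.
    img = mx.image.imread(image_path)                        # HWC, uint8
    img = mx.image.imresize(img, 224, 224).astype('float32') / 255.0
    mean = mx.nd.array([0.485, 0.456, 0.406]).reshape((1, 1, 3))
    std = mx.nd.array([0.229, 0.224, 0.225]).reshape((1, 1, 3))
    img = ((img - mean) / std).transpose((2, 0, 1)).expand_dims(axis=0)  # NCHW batch of 1
    probs = mx.nd.softmax(net(img))[0]
    top5 = mx.nd.topk(probs, k=5).asnumpy().astype(int)
    return [(int(i), float(probs[int(i)].asscalar())) for i in top5]

print(predict_top5('cat.jpg'))  # hypothetical local image path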

To follow along and run the solution, you need access to:

To deploy the solution, open your terminal window and complete the following steps.

  1. Clone the GitHub repo
    $ git clone https://github.com/aws-samples/aws-serverless-for-machine-learning-inference

  2. Navigate to the project directory and deploy the CDK application.
$ ./install.sh
or
$ ./cloud9_install.sh #If you are using AWS Cloud9

Enter Y to proceed with the deployment.

This performs the following steps to deploy and configure the required resources in your AWS account. It may take around 30 minutes for the initial deployment, as it builds the Docker image and other artifacts. Subsequent deployments typically complete within a few minutes.

  • Creates a CloudFormation stack (“MLServerlessStack”).
  • Creates a container image from the Dockerfile and the inference code for batch inference.
  • Creates an ECR repository and publishes the container image to this repo.
  • Creates a Lambda function with the inference code for real-time inference.
  • Creates a batch job configuration with Fargate compute environment in AWS Batch.
  • Creates an S3 bucket to store inference images and results.
  • Creates a Lambda function to submit batch jobs in response to image uploads to S3 bucket.

Running inference

The sample solution lets you get predictions either for a set of images using batch inference or for a single image at a time using the real-time API endpoint. Complete the following steps to run inferences for each scenario.

Batch inference

Get batch predictions by uploading image files to Amazon S3.

  1. Using the Amazon S3 console or the AWS CLI, upload one or more image files to the S3 bucket path ml-serverless-bucket-<acct-id>-<aws-region>/input.
    $ aws s3 cp <path to jpeg files> s3://ml-serverless-bucket-<acct-id>-<aws-region>/input/ --recursive

  2. This triggers the batch job, which spins off Fargate tasks to run the inference. You can monitor the job status in the AWS Batch console.
  3. Once the job is complete (this may take a few minutes), inference results can be accessed from the ml-serverless-bucket-<acct-id>-<aws-region>/output path.

Real-time inference

Get real-time predictions by invoking the REST API endpoint with an image payload.

  1. Navigate to the CloudFormation console and find the API endpoint URL (httpAPIUrl) from the stack output.
  2. Use an API client, such as Postman or the curl command, to send a POST request to the /predict API endpoint with an image file payload.
    $ curl --request POST -H "Content-Type: application/jpeg" --data-binary @<your jpg file name> <your-api-endpoint-url>/predict

  3. Inference results are returned in the API response.
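If you prefer Python over curl, the same request can be made with the requests library (the endpoint URL below is a placeholder; use the httpAPIUrl value from your stack outputs):

import requests

api_url = "https://<your-api-id>.execute-api.<aws-region>.amazonaws.com"  # placeholder

with open("cat.jpg", "rb") as f:  # any local JPEG image
    response = requests.post(
        f"{api_url}/predict",
        data=f.read(),
        headers={"Content-Type": "application/jpeg"},
    )

print(response.status_code)
print(response.text)  # inference results returned by the API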

Additional recommendations and tips

Here are some additional recommendations and options to consider for fine-tuning the sample to meet your specific requirements:

  • Scaling – Update AWS Service Quotas in your account and Region as per your scaling and concurrency needs to run the solution at scale. For example, if your use case requires scaling beyond the default Lambda concurrent executions limit, then you must increase this limit to reach the desired concurrency. You also need to size your VPC and subnets with a wide enough IP address range to allow the required concurrency for Fargate tasks.
  • Performance – Perform load tests and fine tune performance across each layer to meet your needs.
  • Use container images with Lambda – This lets you use containers with both AWS Lambda and AWS Fargate, and you can simplify source code management and packaging.
  • Use AWS Lambda for batch inferences – You can use Lambda functions for batch inferences as well if the inference storage and processing times are within Lambda limits.
  • Use Fargate Spot – This lets you run interruption tolerant tasks at a discounted rate compared to the Fargate price, and reduce the cost for compute resources.
  • Use Amazon ECS container instances with Amazon EC2 – For use cases that need a specific type of compute, you can make use of EC2 instances instead of Fargate.

Cleaning up

Navigate to the project directory from the terminal window and run the following command to destroy all resources and avoid incurring future charges.

$ cdk destroy

Conclusion

This post demonstrated how to bring your own ML models and inference code and run them at scale using serverless solutions in AWS. The solution deploys your inference code to AWS Fargate and AWS Lambda, exposes an API endpoint using Amazon API Gateway for real-time inference, and uses AWS Batch for batch job orchestration. Effectively, this solution lets you focus on building ML models by providing an efficient and cost-effective way to serve predictions at scale.

Try it out today, and we look forward to seeing the exciting machine learning applications that you bring to AWS Serverless!

Additional Reading:


About the Authors

Poornima Chand is a Senior Solutions Architect in the Strategic Accounts Solutions Architecture team at AWS. She works with customers to help solve their unique challenges using AWS technology solutions. She focuses on Serverless technologies and enjoys architecting and building scalable solutions.

Greg Medard is a Solutions Architect with AWS Business Development and Strategic Industries. He helps customers with the architecture, design, and development of cloud-optimized infrastructure solutions. His passion is to influence cultural perceptions by adopting DevOps concepts that withstand organizational challenges along the way. Outside of work, you may find him spending time with his family, playing with a new gadget, or traveling to explore new places and flavors.

Mani Khanuja is an Artificial Intelligence and Machine Learning Specialist SA at Amazon Web Services (AWS). She helps customers use machine learning to solve their business challenges with AWS. She spends most of her time diving deep and teaching customers about AI/ML projects related to computer vision, natural language processing, forecasting, ML at the edge, and more. She is passionate about ML at the edge and has created her own lab with a self-driving kit and a prototype manufacturing production line, where she spends a lot of her free time.

Vasu Sankhavaram is a Senior Manager of Solutions Architecture in Amazon Web Services (AWS). He leads Solutions Architects dedicated to Hitech accounts. Vasu holds an MBA from U.C. Berkeley, and a Bachelor’s degree in Engineering from University of Mysore, India. Vasu and his wife have their hands full with a son who’s a sophomore at Purdue, twin daughters in third grade, and a golden doodle with boundless energy.

Chitresh Saxena is a Senior Technical Account Manager at Amazon Web Services. He has a strong background in ML, Data Analytics and Web technologies. His passion is solving customer problems, building efficient and effective solutions on the cloud with AI, Data Science and Machine Learning.

Read More

A Revolution in the Making: How AI and Science Can Mitigate Climate Change

A partial differential equation is “the most powerful tool humanity has ever created,” Cornell University mathematician Steven Strogatz wrote in a 2009 New York Times opinion piece.

This quote opened last week’s GTC talk AI4Science: The Convergence of AI and Scientific Computing, presented by Anima Anandkumar, director of machine learning research at NVIDIA and professor of computing at the California Institute of Technology.

Anandkumar explained that partial differential equations are the foundation for most scientific simulations. And she showcased how this historic tool is now being made all the more powerful with AI.

“The convergence of AI and scientific computing is a revolution in the making,” she said.

Using new neural operator-based frameworks to learn and solve partial differential equations, AI can help us model weather forecasting 100,000x quicker — and carbon dioxide sequestration 60,000x quicker — than traditional models.

Speeding Up the Calculations

Anandkumar and her team developed the Fourier Neural Operator (FNO), a framework that allows AI to learn and solve an entire family of partial differential equations, rather than a single instance.

It’s the first machine learning method to successfully model turbulent flows with zero-shot super-resolution — which means that FNOs enable AI to make high-resolution inferences without high-resolution training data, which would be necessary for standard neural networks.

FNO-based machine learning greatly reduces the costs of obtaining information for AI models, improves their accuracy and speeds up inference by three orders of magnitude compared with traditional methods.
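The core building block of an FNO is a spectral convolution: transform the input to Fourier space, apply a learned linear map to a truncated set of low-frequency modes, and transform back. Below is a minimal PyTorch sketch of that layer for 1D inputs; it illustrates the idea only and is not the authors' implementation.

import torch
import torch.nn as nn

class SpectralConv1d(nn.Module):
    # One Fourier layer: FFT -> learned mixing of the lowest `modes` frequencies -> inverse FFT.
    def __init__(self, in_channels, out_channels, modes):
        super().__init__()
        self.modes = modes
        scale = 1.0 / (in_channels * out_channels)
        self.weights = nn.Parameter(
            scale * torch.randn(in_channels, out_channels, modes, dtype=torch.cfloat))

    def forward(self, x):
        # x: (batch, in_channels, n_points) sampled on a regular grid.
        x_ft = torch.fft.rfft(x)                                    # to Fourier space
        out_ft = torch.zeros(x.shape[0], self.weights.shape[1],
                             x_ft.shape[-1], dtype=torch.cfloat, device=x.device)
        out_ft[:, :, :self.modes] = torch.einsum(
            "bim,iom->bom", x_ft[:, :, :self.modes], self.weights)  # mix retained modes
        return torch.fft.irfft(out_ft, n=x.shape[-1])               # back to physical space

layer = SpectralConv1d(in_channels=1, out_channels=16, modes=12)
print(layer(torch.randn(4, 1, 256)).shape)  # torch.Size([4, 16, 256])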

Mitigating Climate Change

FNOs can be applied to make real-world impact in countless ways.

For one, they offer a 100,000x speedup over numerical methods and unprecedented fine-scale resolution for weather prediction models. By accurately simulating and predicting extreme weather events, the AI models can allow planning to mitigate the effects of such disasters.

The FNO model, for example, was able to accurately predict the trajectory and magnitude of Hurricane Matthew from 2016.

In the video below, the red line represents the observed track of the hurricane. The white cones show the National Oceanic and Atmospheric Administration’s hurricane forecasts based on traditional models. The purple contours mark the FNO-based AI forecasts.

As shown, the FNO model follows the trajectory of the hurricane with improved accuracy compared with the traditional method — and the high-resolution simulation of this weather event took just a quarter of a second to process on NVIDIA GPUs.

In addition, Anandkumar’s talk covered how FNO-based AI can be used to model carbon dioxide sequestration — capturing carbon dioxide from the atmosphere and storing it underground, which scientists have said can help mitigate climate change.

Researchers can model and study how carbon dioxide would interact with materials underground using FNOs 60,000x faster than with traditional methods.

Anandkumar said the FNO model is also a significant step toward building a digital twin of Earth.

The new NVIDIA Modulus framework for training physics-informed machine learning models and NVIDIA Quantum-2 InfiniBand networking platform equip researchers and developers with the tools to combine the powers of AI, physics and supercomputing — to help solve the world’s toughest problems.

“I strongly believe this is the future of science,” Anandkumar said.

She’ll delve into these topics further at a SC21 plenary talk, taking place on Nov. 18 at 10:30 a.m. Central time.

Watch her full GTC session on demand, here.

Watch NVIDIA founder and CEO Jensen Huang’s GTC keynote below.

The post A Revolution in the Making: How AI and Science Can Mitigate Climate Change appeared first on The Official NVIDIA Blog.

Read More

Electrochemistry, from batteries to brains

Bilge Yildiz’s research impacts a wide range of technologies. The members of her lab study fuel cells, which convert hydrogen and oxygen into electricity (and water). They study electrolyzers, which go the other way, using electricity to convert water into hydrogen and oxygen. They study batteries. They study corrosion. They even study computers that attempt to mimic the way the brain processes information in learning. What brings all this together in her lab is the electrochemistry of ionic-electronic oxides and their interfaces.

“It may seem like we’ve been contributing to different technologies,” says Yildiz, MIT’s Breene M. Kerr (1951) Professor in the Department of Nuclear Science and Engineering (NSE) and the Department of Materials Science and Engineering, who was recently named a fellow of the American Physical Society. “It’s true. But fundamentally, it’s the same phenomena that we’re after in all these.” That is, the behavior of ions — charged atoms — in materials, particularly on surfaces and interfaces.

Yildiz’s comfort crossing scientific borders may come from her trek to where she is — or vice versa. She grew up in the seaside city of Izmir, Turkey, the daughter of two math teachers. She spent a lot of fun time by the sea, and also tinkered with her dad on repair and construction projects at home. She enjoyed studying and attended a science-focused high school, where she vividly recalls a particular two-year project. The city sat on a polluted bay, and her biology teacher connected her and a friend with a university professor who got them working on ways to clean the water using algae. “We had a lot of fun in the lab with limited supplies, collecting samples from the bay, and oxygenating them in the lab with algae,” she says. They wrote a report for the municipality. She’s no longer in biology, but “it made me aware of the research process and the importance of the environment,” she says, “that still stays.”

Before entering university, Yildiz decided to study nuclear energy engineering, because it sounded interesting, although she didn’t yet know the field’s importance for mitigating global warming. She ended up enjoying the combination of math, physics, and engineering. Turkey didn’t have much of a nuclear energy program, so she ventured to MIT for her PhD in nuclear engineering, studying artificial intelligence for the safe operation of nuclear power plants. She liked applying computer science to nuclear systems, but came to realize she preferred the physical sciences over algorithms.

Yildiz stayed at MIT for a postdoctoral fellowship, between the nuclear engineering and mechanical engineering departments, studying electrochemistry in fuel cells. “My postdoc advisors at the time were, I think, taking a risk by hiring me, because I really didn’t know anything” about electrochemistry, she says. “It was an extremely helpful and defining experience for me — eye-opening — and allowed me to move in the direction of electrochemistry and materials.” She then headed in another new direction, at Argonne National Laboratory in Illinois, learning about X-ray spectroscopy, blasting materials with powerful synchrotron X-rays to probe their structure and chemistry.

At MIT, to which Yildiz returned in 2007, she still uses Argonne’s instruments, as well as other synchrotrons in the United States and abroad. In a typical experiment, she and her group might first create a material that could be used, for example, in a fuel cell. They’ll then use X-rays in her lab or at synchrotrons to characterize its surface under various operational conditions. They’ll build computational models on the atomic or electron level to help interpret the results, and to guide the next experiment. In fuel cells, this work made it possible to identify and circumvent a surface degradation problem. Connecting the dots between surface chemistry and performance allows her to predict better material surfaces to increase the efficiency and durability of fuel cells and batteries. “These are findings that we have built over many years,” she says, “from having identified the problem to identifying the reasons for that problem, then to proposing some solutions for that problem.”

Solid oxide fuel cells use materials called perovskite oxides to catalyze reactions with oxygen. Substitutions — for instance, strontium atoms — added to the crystal enhance its ability to transport electrons and oxygen ions. But these atoms, also called dopants, often precipitate at the surface of the material, reducing both its stability and its performance. Yildiz’s group uncovered the reason: The negatively charged dopants migrate toward positively charged oxygen vacancies near the crystal’s surface. They then engineered a solution. Removing some of the excess oxygen vacancies by oxidizing the surface with another element, hafnium, prevented the movement of strontium to the surface, keeping the fuel cell functioning longer and more efficiently.

“The coupling of mechanics to chemistry has also been a very exciting theme in our research,” she says. She has investigated the effects of strain on materials’ ion transport and surface catalytic activity properties. She’s found that certain types of elastic strain can facilitate diffusion of ions as well as surface reactivity. Accelerating ion transport and surface reactions improves the performance of solid oxide fuel cells and batteries.

In her recent work, she considers analog, brain-guided computing. Most computers we use daily are digital, flipping electrical switches on and off, but the brain operates with many orders of magnitude more energy efficiency, in part because it stores and processes information in the same location, and does so by varying the local electrical properties on a continuum. Yildiz is using small ions to vary the resistance of a given material continuously, as ions enter or exit the material. She controls the ions electrochemically, similar to a process in the brain. In effect, she’s replicating some functionality of biological synapses, in particular the strengthening and weakening of synapses, by creating tiny, energy-efficient batteries.

She is collaborating with colleagues across the Institute — Ju Li from NSE, Jesus del Alamo from the Department of Electrical Engineering and Computer Science, and Michale Fee and Ila Fiete from the Department of Brain and Cognitive Sciences. Their team is investigating different ions, materials, and device geometries, and is working with the MIT Quest for Intelligence to translate learning rules from brain studies to the design of brain-guided machine intelligence hardware.

In retrospect, Yildiz says, the leap from her formal training on nuclear engineering into electrochemistry and materials was a big one. “I work on a research problem, because it sparks my curiosity, I am very motivated and excited to work on it and it makes me happy. I never think whether this problem is easy or difficult when I am working on it. I really just want to do it, no matter what. When I look back now, I notice this leap was not trivial.” She adds, “But now I also see that we do this in our faculty work all the time. We identify new questions that are not necessarily in our direct expertise. And we learn, contribute, and evolve.”

Describing her return to MIT, after an “exciting and gratifying” time at Argonne, Yildiz says she preferred the intellectual flexibility of having her own academic lab — as well as the chance to teach and mentor her students and postdocs. “We get to work with young students who are energetic, motivated, smart, hardworking,” she says. “Luckily, they don’t know what’s difficult. Like I didn’t.”

Read More