Unearthing Data: Vision AI Startup Digs Into Digital Twins for Mining and Construction

Skycatch, a San Francisco-based startup, has been helping companies mine both data and minerals for nearly a decade.

The software-maker is now digging into the creation of digital twins, with an initial focus on the mining and construction industry, using the NVIDIA Omniverse platform for connecting and building custom 3D pipelines.

SkyVerse, which is a part of Skycatch’s vision AI platform, is a combination of computer vision software and custom Omniverse extensions that enables users to enrich and animate virtual worlds of mines and other sites with near-real-time geospatial data.

“With Omniverse, we can turn massive amounts of non-visual data into dynamic visual information that’s easy to contextualize and consume,” said Christian Sanz, founder and CEO of Skycatch. “We can truly recreate the physical world.”

SkyVerse can help industrial sites simulate variables such as weather, broken machines and more up to five years into the future — while learning from happenings up to five years in the past, Sanz said.

The platform automates the entire visualization pipeline for mining and construction environments.

First, it processes data from drones, lidar and other sensors across the environment, whether at the edge using the NVIDIA Jetson platform or in the cloud.

It then creates 3D meshes from 2D images, using neural networks built from NVIDIA’s pretrained models to remove unneeded objects like dump trucks and other equipment from the visualizations.

Next, SkyVerse stitches this into a single 3D model that’s converted to the Universal Scene Description (USD) framework. The master model is then brought into Omniverse Enterprise for the creation of a digital twin that’s live-synced with real-world telemetry data.
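
Skycatch hasn’t published its conversion code, but as a rough sketch of what assembling a reconstructed terrain mesh into USD for Omniverse can look like, the snippet below uses the open-source pxr Python bindings; the stage path and geometry are placeholders.

from pxr import Usd, UsdGeom

# Create a new USD stage for the site model (file name is a placeholder).
stage = Usd.Stage.CreateNew("mine_site.usda")
UsdGeom.SetStageUpAxis(stage, UsdGeom.Tokens.z)

# Define a mesh prim for the reconstructed terrain.
terrain = UsdGeom.Mesh.Define(stage, "/World/Terrain")

# Toy geometry standing in for the photogrammetry output:
# a single quad described by points, face vertex counts and indices.
terrain.CreatePointsAttr([(0, 0, 0), (100, 0, 0), (100, 100, 0), (0, 100, 2)])
terrain.CreateFaceVertexCountsAttr([4])
terrain.CreateFaceVertexIndicesAttr([0, 1, 2, 3])

stage.GetRootLayer().Save()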

“The simulation of machines in the environment, different weather conditions, traffic jams — no other platform has enabled this, but all of it is possible in Omniverse with hyperreal physics and object mass, which is really groundbreaking,” Sanz said.

Skycatch is a Premier partner in NVIDIA Inception, a free, global program that nurtures startups revolutionizing industries with cutting-edge technologies. Premier partners receive additional go-to-market support, exposure to venture capital firms and technical expertise to help them scale faster.

Processing and Visualizing Data

Companies have deployed Skycatch’s fully automated technologies to gather insights from aerial data across tens of thousands of sites at several top mining companies.

The Skycatch team first determines optimal positioning of the data-collection sensors across mine vehicles using the NVIDIA Isaac Sim platform, a robotics simulation and synthetic data generation (SDG) tool for developing, testing and training AI-based robots.

“Isaac Sim has saved us a year’s worth of testing time — going into the field, placing a sensor, testing how it functions and repeating the process,” Sanz said.

The team also plans to integrate the Omniverse Replicator software development kit into SkyVerse to generate physically accurate 3D synthetic data and build SDG tools to accelerate the training of perception networks beyond the robotics domain.

Once data from a site is collected, SkyVerse uses edge devices powered by the NVIDIA Jetson Nano and Jetson AGX Xavier modules to automatically process up to terabytes of it per day and turn it into kilobyte-size analytics that can be easily transferred to frontline users.

This data processing was sped up 3x by the NVIDIA CUDA parallel computing platform, according to Sanz. The team is also looking to deploy the new Jetson Orin modules for next-level performance.

“It’s not humanly possible to go through tens of thousands of images a day and extract critical analytics from them,” Sanz said. “So we’re helping to expand human eyesight with neural networks.”

Using pretrained models from the NVIDIA TAO Toolkit, Skycatch also built neural networks that can remove extraneous objects and vehicles from the visualizations, and texturize over these spots in the 3D mesh.
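
Skycatch’s TAO-trained networks aren’t public, so purely as an illustration of the “remove and texturize over” idea, here is a minimal sketch that fills a masked vehicle region with surrounding texture using classical OpenCV inpainting as a stand-in; the file names and mask are placeholders.

import cv2

# Placeholder inputs: an orthophoto tile and a binary mask marking a detected vehicle
# (255 where the vehicle is, 0 elsewhere).
image = cv2.imread("ortho_tile.png")
mask = cv2.imread("vehicle_mask.png", cv2.IMREAD_GRAYSCALE)

# Fill the masked region using the surrounding terrain texture.
cleaned = cv2.inpaint(image, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("ortho_tile_cleaned.png", cleaned)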

The digital terrain model, which has sub-five-centimeter precision, can then be brought into Omniverse for the creation of a digital twin using the easily extensible USD framework, custom SkyVerse Omniverse extensions and NVIDIA RTX GPUs.

“It took just around three months to build the Omniverse extensions, despite the complexity of our extensions’ capabilities, thanks to access to technical experts through NVIDIA Inception,” Sanz said.

Skycatch is working with one of Canada’s leading mining companies, Teck Resources, to implement the use of Omniverse-based digital twins for its project sites.

“Teck Resources has been using Skycatch’s compute engine across all of our mine sites globally and is now expanding visualization and simulation capabilities with SkyVerse and our own digital twin strategy,” said Preston Miller, lead of technology and innovation at Teck Resources. “Delivering near-real-time visual data will allow Teck teams to quickly contextualize mine sites and make faster operational decisions on mission-critical, time-sensitive projects.”

The Omniverse extensions built by Skycatch will be available soon.

Safety and Sustainability

AI-powered data analysis and digital twins can make operational processes for mining and construction companies safer, more sustainable and more efficient.

For example, according to Sanz, mining companies need the ability to quickly locate the toe and crest (or bottom and top) of “benches,” narrow strips of land beside an open-pit mine. When a machine is automated to go in and out of a mine, it must be programmed to stay 10 meters away from the crest at all times to avoid the risk of sliding, Sanz said.

Previously, surveying and analyzing landforms to determine precise toes and crests typically took up to five days. With the help of NVIDIA AI, SkyVerse can now generate this information within minutes, Sanz said.

In addition, SkyVerse eliminates 10,000 open-pit interactions for customers per year, per site, Sanz said. These are situations in which humans and vehicles can intersect within a mine, posing a safety threat.

“At its core, Skycatch’s goal is to provide context and full awareness for what’s going on at a mining or construction site in near-real time — and better environmental context leads to enhanced safety for workers,” Sanz said.

Skycatch aims to boost sustainability efforts for the mining industry, too.

“In addition to mining companies, governmental organizations want visibility into how mines are operating — whether their surrounding environments are properly taken care of — and our platform offers these insights,” Sanz said.

Plus, minerals like cobalt, nickel and lithium are required for electrification and the energy transition. These all come from mine sites, Sanz said, which can become safer and more efficient with the help of SkyVerse’s digital twins and vision AI.

Dive deeper into technology for a sustainable future with Skycatch and other Inception partners in the on-demand webinar, Powering Energy Startup Success With NVIDIA Inception.

Creators and developers across the world can download NVIDIA Omniverse for free, and enterprise teams can use the platform for their 3D projects.

Learn more about and apply to join NVIDIA Inception.


Check Out 26 New Games Streaming on GeForce NOW in November

It’s a brand new month, which means this GFN Thursday is all about the new games streaming from the cloud.

In November, 26 titles will join the GeForce NOW library. Kick off with 11 additions this week, like Total War: THREE KINGDOMS and new content updates for Genshin Impact and Apex Legends.

Plus, leading 5G provider Rain has announced it will be introducing “GeForce NOW powered by Rain” to South Africa early next year. Look forward to more updates to come.

And don’t miss out on the 40% discount for GeForce NOW 6-month Priority memberships. This offer is only available for a limited time.

Build Your Empire

Lead the charge this week with Creative Assembly and Sega’s Total War: THREE KINGDOMS, a turn-based, empire-building strategy game and the 13th entry in the award-winning Total War franchise. Become one of many great leaders from history and conquer enemies to build a formidable empire.

Total War Three Kingdoms
Resolve an epic conflict in ancient China to unify the country and rebuild the empire.

The game is set in ancient China, and gamers must save the country from the oppressive rule of a warlord. Choose from a cast of a dozen legendary heroic characters to unify the nation and dominate enemies. Each has their own agenda, and there are plenty of different tactics for players to employ.

Extend your campaign with up to six-hour gaming sessions at 1080p 60 frames per second for Priority members. With an RTX 3080 membership, gain support for 1440p 120 FPS streaming and up to 8-hour sessions, with performance that will bring foes to their knees.

Sega Row on GeForce NOW
Members can find ‘Total War: THREE KINGDOMS’ and other Sega games in a dedicated row in the GeForce NOW app.

More to Explore

Alongside the 11 new games streaming this week, members can jump into updates for the hottest free-to-play titles on GeForce NOW.

Genshin Impact Version 3.2, “Akasha Pulses, the Kalpa Flame Rises,” is available to stream on the cloud. This latest update introduces the last chapter of the Sumeru Archon Quest, two new playable characters — Nahida and Layla — as well as new events and game play. Stream it now from devices, whether PC, Mac, Chromebook or on mobile with enhanced touch controls.

Genshin Impact 3.2 on GeForce NOW
Catch the conclusion of the main storyline for Sumeru, the newest region added to ‘Genshin Impact.’

Or squad up in Apex Legends: Eclipse, available to stream now on the cloud. Season 15 brings with it the new Broken Moon map, the newest defensive Legend — Catalyst — and much more.

Apex Legends on GeForce NOW
Don’t mess-a with Tressa.

Also, after working closely with Square Enix, we’re happy to share that members can stream STAR OCEAN THE DIVINE FORCE on GeForce NOW beginning this week.

Here’s the full list of games joining this week:

  • Against the Storm (Epic Games, and new release on Steam)
  • Horse Tales: Emerald Valley Ranch (New release on Steam, Nov. 3)
  • Space Tail: Every Journey Leads Home (New release on Steam, Nov. 3)
  • The Chant (New release on Steam, Nov. 3)
  • The Entropy Centre (New release on Steam, Nov. 3)
  • WRC Generations — The FIA WRC Official Game (New release on Steam, Nov. 3)
  • Filament (Free on Epic Games, Nov. 3-10)
  • STAR OCEAN THE DIVINE FORCE (Steam)
  • PAGUI (Steam)
  • RISK: Global Domination (Steam)
  • Total War: THREE KINGDOMS (Steam)

Arriving in November

But wait, there’s more! Among the total 26 games joining GeForce NOW in November is the highly anticipated Warhammer 40,000: Darktide, with support for NVIDIA RTX and DLSS.

Here’s a sneak peek:

  • The Unliving (New release on Steam, Nov. 7)
  • TERRACOTTA (New release on Steam and Epic Games, Nov. 7)
  • A Little to the Left (New release on Steam, Nov. 8)
  • Yum Yum Cookstar (New release on Steam, Nov. 11)
  • Nobody — The Turnaround (New release on Steam, Nov. 17)
  • Goat Simulator 3 (New release on Epic Games, Nov. 17)
  • Evil West (New release on Steam, Nov. 22)
  • Colortone: Remixed (New release on Steam, Nov. 30)
  • Warhammer 40,000: Darktide (New release on Steam, Nov. 30)
  • Heads Will Roll: Downfall (Steam)
  • Guns Gore and Cannoli 2 (Steam)
  • Hidden Through Time (Steam)
  • Cave Blazers (Steam)
  • Railgrade (Epic Games)
  • The Legend of Tianding (Steam)

While The Unliving was originally announced in October, the release date of the game shifted to Monday, Nov. 7.

Howlin’ for More

October brought more treats for members. Don’t miss the 14 extra titles added last month. 

With all of these sweet new titles coming to the cloud, getting your game on is as easy as pie.


Extending TorchVision’s Transforms to Object Detection, Segmentation & Video tasks

TorchVision is extending its Transforms API! Here is what’s new:

  • You can use them not only for Image Classification but also for Object Detection, Instance & Semantic Segmentation and Video Classification.
  • You can import directly from TorchVision several SoTA data-augmentations such as MixUp, CutMix, Large Scale Jitter and SimpleCopyPaste.
  • You can use new functional transforms for transforming Videos, Bounding Boxes and Segmentation Masks.

The interface remains the same to assist the migration and adoption. The new API is currently in Prototype and we would love to get early feedback from you to improve its functionality. Please reach out to us if you have any questions or suggestions.

Limitations of current Transforms

The stable Transforms API of TorchVision (aka V1) only supports single images. As a result it can only be used for classification tasks:

from torchvision import transforms
trans = transforms.Compose([
    transforms.ColorJitter(contrast=0.5),
    transforms.RandomRotation(30),
    transforms.CenterCrop(480),
])
imgs = trans(imgs)

The above approach doesn’t support Object Detection, Segmentation or Classification transforms that require the use of Labels (such as MixUp & CutMix). This limitation made any non-classification Computer Vision tasks second-class citizens as one couldn’t use the Transforms API to perform the necessary augmentations. Historically this made it difficult to train high-accuracy models using TorchVision’s primitives and thus our Model Zoo lagged by several points from SoTA.

To circumvent this limitation, TorchVision offered custom implementations in its reference scripts that showcased how one could perform augmentations in each task. Though this practice enabled us to train high-accuracy classification, object detection & segmentation models, it was a hacky approach that made those transforms impossible to import from the TorchVision binary.

The new Transforms API

The Transforms V2 API supports videos, bounding boxes, labels and segmentation masks meaning that it offers native support for many Computer Vision tasks. The new solution is a drop-in replacement:

from torchvision.prototype import transforms
# Exactly the same interface as V1:
trans = transforms.Compose([
    transforms.ColorJitter(contrast=0.5),
    transforms.RandomRotation(30),
    transforms.CenterCrop(480),
])
imgs, bboxes, labels = trans(imgs, bboxes, labels)

The new Transform Classes can receive an arbitrary number of inputs without enforcing a specific order or structure:

# Already supported:
trans(imgs)  # Image Classification
trans(videos)  # Video Tasks
trans(imgs_or_videos, labels)  # MixUp/CutMix-style Transforms
trans(imgs, bboxes, labels)  # Object Detection
trans(imgs, bboxes, masks, labels)  # Instance Segmentation
trans(imgs, masks)  # Semantic Segmentation
trans({"image": imgs, "box": bboxes, "tag": labels})  # Arbitrary Structure
# Future support:
trans(imgs, bboxes, labels, keypoints)  # Keypoint Detection
trans(stereo_images, disparities, masks)  # Depth Perception
trans(image1, image2, optical_flows, masks)  # Optical Flow

The Transform Classes make sure that they apply the same random transforms to all the inputs to ensure consistent results.

The functional API has been updated to support all necessary signal processing kernels (resizing, cropping, affine transforms, padding etc) for all inputs:

from torchvision.prototype.transforms import functional as F
from torchvision.transforms import InterpolationMode

# High-level dispatcher, accepts any supported input type, fully BC
F.resize(inpt, size=[224, 224])
# Image tensor kernel
F.resize_image_tensor(img_tensor, size=[224, 224], antialias=True)
# PIL image kernel
F.resize_image_pil(img_pil, size=[224, 224], interpolation=InterpolationMode.BILINEAR)
# Video kernel
F.resize_video(video, size=[224, 224], antialias=True)
# Mask kernel
F.resize_mask(mask, size=[224, 224])
# Bounding box kernel
F.resize_bounding_box(bbox, size=[224, 224], spatial_size=[256, 256])

The API uses Tensor subclassing to wrap input, attach useful meta-data and dispatch to the right kernel. Once the Datasets V2 work is complete, which makes use of TorchData’s Data Pipes, the manual wrapping of input won’t be necessary. For now, users can manually wrap the input by:

from torchvision.prototype import features
imgs = features.Image(images, color_space=features.ColorSpace.RGB)
vids = features.Video(videos, color_space=features.ColorSpace.RGB)
masks = features.Mask(target["masks"])
bboxes = features.BoundingBox(target["boxes"], format=features.BoundingBoxFormat.XYXY, spatial_size=imgs.spatial_size)
labels = features.Label(target["labels"], categories=["dog", "cat"])

In addition to the new API, we now provide importable implementations for several data augmentations that are used in SoTA research such as MixUp, CutMix, Large Scale Jitter, SimpleCopyPaste, AutoAugmentation methods and several new Geometric, Colour and Type Conversion transforms.

The API continues to support both PIL and Tensor backends for Images, single or batched input and maintains JIT-scriptability on the functional API. It allows deferring the casting of images from uint8 to float which can lead to performance benefits. It is currently available in the prototype area of TorchVision and can be imported from the nightly builds. The new API has been verified to achieve the same accuracy as the previous implementation.
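
As a minimal sketch of that deferred casting (written against the stable V1 API, since the prototype namespace may still change; the file name is a placeholder), the augmentations below run on uint8 tensors and the float conversion happens once, at the end of the pipeline:

import torch
from torchvision import io, transforms

# Read the image as a uint8 tensor and keep it uint8 through the augmentations.
img = io.read_image("example.jpg")  # placeholder path
trans = transforms.Compose([
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ConvertImageDtype(torch.float32),  # cast to float only at the end
])
out = trans(img)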

Current Limitations

Though the functional API (kernels) remains JIT-scriptable and fully BC, the Transform Classes, while offering the same interface, can’t be scripted. This is because they use Tensor subclassing and receive an arbitrary number of inputs, which JIT doesn’t support. We are currently working to reduce the dispatching overhead of the new API and to improve the speed of existing kernels.

An end-to-end example

Here is an example of the new API using a sample COCO image (the file name appears in the code below). It works with both PIL images and Tensors:

import PIL
from torchvision import io, utils
from torchvision.prototype import features, transforms as T
from torchvision.prototype.transforms import functional as F
# Defining and wrapping input to appropriate Tensor Subclasses
path = "COCO_val2014_000000418825.jpg"
img = features.Image(io.read_image(path), color_space=features.ColorSpace.RGB)
# img = PIL.Image.open(path)
bboxes = features.BoundingBox(
    [[2, 0, 206, 253], [396, 92, 479, 241], [328, 253, 417, 332],
     [148, 68, 256, 182], [93, 158, 170, 260], [432, 0, 438, 26],
     [422, 0, 480, 25], [419, 39, 424, 52], [448, 37, 456, 62],
     [435, 43, 437, 50], [461, 36, 469, 63], [461, 75, 469, 94],
     [469, 36, 480, 64], [440, 37, 446, 56], [398, 233, 480, 304],
     [452, 39, 463, 63], [424, 38, 429, 50]],
    format=features.BoundingBoxFormat.XYXY,
    spatial_size=F.get_spatial_size(img),
)
labels = features.Label([59, 58, 50, 64, 76, 74, 74, 74, 74, 74, 74, 74, 74, 74, 50, 74, 74])
# Defining and applying Transforms V2
trans = T.Compose(
    [
        T.ColorJitter(contrast=0.5),
        T.RandomRotation(30),
        T.CenterCrop(480),
    ]
)
img, bboxes, labels = trans(img, bboxes, labels)
# Visualizing results
viz = utils.draw_bounding_boxes(F.to_image_tensor(img), boxes=bboxes)
F.to_pil_image(viz).show()

Development milestones and future work

Here is where we are in development:

  • Design API
  • Write Kernels for transforming Videos, Bounding Boxes, Masks and Labels
  • Rewrite all existing Transform Classes (stable + references) on the new API:
    • Image Classification
    • Video Classification
    • Object Detection
    • Instance Segmentation
    • Semantic Segmentation
  • Verify the accuracy of the new API for all supported Tasks and Backends
  • Speed Benchmarks and Performance Optimizations (in progress – planned for Dec)
  • Graduate from Prototype (planned for Q1)
  • Add support of Depth Perception, Keypoint Detection, Optical Flow and more (future)

We are currently in the process of Benchmarking each Transform Class and Functional Kernel in order to measure and improve their performance. The scope includes optimizing existing kernels which will be adopted from V1. Early findings indicate that some improvements might need to be upstreamed on the C++ kernels of PyTorch Core. Our plan is to continue iterating throughout Q4 to improve the speed performance of the new API and enhance it with additional SoTA transforms with the help of the community.

We would love to get early feedback from you to improve its functionality. Please reach out to us if you have any questions or suggestions.


In machine learning, synthetic data can offer real performance improvements

Teaching a machine to recognize human actions has many potential applications, such as automatically detecting workers who fall at a construction site or enabling a smart home robot to interpret a user’s gestures.

To do this, researchers train machine-learning models using vast datasets of video clips that show humans performing actions. However, not only is it expensive and laborious to gather and label millions or billions of videos, but the clips often contain sensitive information, like people’s faces or license plate numbers. Using these videos might also violate copyright or data protection laws. And this assumes the video data are publicly available in the first place — many datasets are owned by companies and aren’t free to use.

So, researchers are turning to synthetic datasets. These are made by a computer that uses 3D models of scenes, objects, and humans to quickly produce many varying clips of specific actions — without the potential copyright issues or ethical concerns that come with real data.

But are synthetic data as “good” as real data? How well does a model trained with these data perform when it’s asked to classify real human actions? A team of researchers at MIT, the MIT-IBM Watson AI Lab, and Boston University sought to answer this question. They built a synthetic dataset of 150,000 video clips that captured a wide range of human actions, which they used to train machine-learning models. Then they showed these models six datasets of real-world videos to see how well they could learn to recognize actions in those clips.

The researchers found that the synthetically trained models performed even better than models trained on real data for videos that have fewer background objects.

This work could help researchers use synthetic datasets in such a way that models achieve higher accuracy on real-world tasks. It could also help scientists identify which machine-learning applications could be best-suited for training with synthetic data, in an effort to mitigate some of the ethical, privacy, and copyright concerns of using real datasets.

“The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining. There is a cost in creating an action in synthetic data, but once that is done, then you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That is the beauty of synthetic data,” says Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab, and co-author of a paper detailing this research.

The paper is authored by lead author Yo-whan “John” Kim ’22; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and seven others. The research will be presented at the Conference on Neural Information Processing Systems.   

Building a synthetic dataset

The researchers began by compiling a new dataset using three publicly available datasets of synthetic video clips that captured human actions. Their dataset, called Synthetic Action Pre-training and Transfer (SynAPT), contained 150 action categories, with 1,000 video clips per category.

They selected as many action categories as possible, such as people waving or falling on the floor, depending on the availability of clips that contained clean video data.

Once the dataset was prepared, they used it to pretrain three machine-learning models to recognize the actions. Pretraining involves training a model for one task to give it a head-start for learning other tasks. Inspired by the way people learn — we reuse old knowledge when we learn something new — the pretrained model can use the parameters it has already learned to help it learn a new task with a new dataset faster and more effectively.
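
The paper’s training code isn’t reproduced here; as a generic illustration of the pretrain-then-transfer recipe, the sketch below loads a video-classification backbone with publicly available Kinetics weights (standing in for the synthetic pretraining) and swaps in a new classification head for a downstream action dataset. The class count is a placeholder.

import torch
from torch import nn
from torchvision.models.video import r3d_18, R3D_18_Weights

# Start from a backbone pretrained on a large dataset (Kinetics-400 here) ...
model = r3d_18(weights=R3D_18_Weights.KINETICS400_V1)

# ... then replace the classification head and fine-tune on the new task.
num_target_classes = 6  # placeholder: number of action classes in the downstream dataset
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# The training loop over the real-video dataset would go here.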

They tested the pretrained models using six datasets of real video clips, each capturing classes of actions that were different from those in the training data.

The researchers were surprised to see that all three synthetic models outperformed models trained with real video clips on four of the six datasets. Their accuracy was highest for datasets that contained video clips with “low scene-object bias.”

Low scene-object bias means that the model cannot recognize the action by looking at the background or other objects in the scene — it must focus on the action itself. For example, if the model is tasked with classifying diving poses in video clips of people diving into a swimming pool, it cannot identify a pose by looking at the water or the tiles on the wall. It must focus on the person’s motion and position to classify the action.

“In videos with low scene-object bias, the temporal dynamics of the actions is more important than the appearance of the objects or the background, and that seems to be well-captured with synthetic data,” Feris says.

“High scene-object bias can actually act as an obstacle. The model might misclassify an action by looking at an object, not the action itself. It can confuse the model,” Kim explains.

Boosting performance

Building off these results, the researchers want to include more action classes and additional synthetic video platforms in future work, eventually creating a catalog of models that have been pretrained using synthetic data, says co-author Rameswar Panda, a research staff member at the MIT-IBM Watson AI Lab.

“We want to build models which have very similar performance or even better performance than the existing models in the literature, but without being bound by any of those biases or security concerns,” he adds.

They also want to combine their work with research that seeks to generate more accurate and realistic synthetic videos, which could boost the performance of the models, says SouYoung Jin, a co-author and CSAIL postdoc. She is also interested in exploring how models might learn differently when they are trained with synthetic data.

“We use synthetic datasets to prevent privacy issues or contextual or social bias, but what does the model actually learn? Does it learn something that is unbiased?” she says.

Now that they have demonstrated this use potential for synthetic videos, they hope other researchers will build upon their work.

“Despite there being a lower cost to obtaining well-annotated synthetic data, currently we do not have a dataset with the scale to rival the biggest annotated datasets with real videos. By discussing the different costs and concerns with real videos, and showing the efficacy of synthetic data, we hope to motivate efforts in this direction,” adds co-author Samarth Mishra, a graduate student at Boston University (BU).

Additional co-authors include Hilde Kuehne, professor of computer science at Goethe University in Germany and an affiliated professor at the MIT-IBM Watson AI Lab; Leonid Karlinsky, research staff member at the MIT-IBM Watson AI Lab; Venkatesh Saligrama, professor in the Department of Electrical and Computer Engineering at BU; and Kate Saenko, associate professor in the Department of Computer Science at BU and a consulting professor at the MIT-IBM Watson AI Lab.

This research was supported by the Defense Advanced Research Projects Agency LwLL, as well as the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside.


Improve data extraction and document processing with Amazon Textract

Intelligent document processing (IDP) has seen widespread adoption across enterprise and government organizations. Gartner estimates the IDP market will grow more than 100% year over year, and is projected to reach $4.8 billion in 2022.

IDP helps transform structured, semi-structured, and unstructured data from a variety of document formats into actionable information. Processing unstructured data has become much easier with the advancements in optical character recognition (OCR), machine learning (ML), and natural language processing (NLP).

IDP techniques have grown tremendously, allowing us to extract, classify, identify, and process unstructured data. With AI/ML powered services such as Amazon Textract, Amazon Transcribe, and Amazon Comprehend, building an IDP solution has become much easier and doesn’t require specialized AI/ML skills.

In this post, we demonstrate how to use Amazon Textract to extract meaningful, actionable data from a wide range of complex multi-format PDF files. PDF files are challenging; they can have a variety of data elements like headers, footers, tables with data in multiple columns, images, graphs, and sentences and paragraphs in different formats. We explore the data extraction phase of IDP, and how it connects to the steps involved in a document process, such as ingestion, extraction, and postprocessing.

Solution overview

Amazon Textract provides various options for data extraction, based on your use case. You can use forms, tables, query-based extractions, handwriting recognition, invoices and receipts, identity documents, and more. All the extracted data is returned with bounding box coordinates. This solution uses Amazon Textract IDP CDK constructs to build the document processing workflow that handles Amazon Textract asynchronous invocation, raw response extraction, and persistence in Amazon Simple Storage Service (Amazon S3). This solution adds an Amazon Textract postprocessing component to the base workflow to handle paragraph-based text extraction.
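
For reference (separate from the asynchronous workflow this solution uses, described below), a synchronous extraction call with boto3 looks roughly like this; the bucket and object key are placeholders, and every block in the response carries bounding box geometry:

import boto3

textract = boto3.client("textract")

# Synchronous analysis of a single-page document stored in S3 (bucket and key are placeholders).
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "my-bucket", "Name": "uploads/sample_page.png"}},
    FeatureTypes=["FORMS", "TABLES"],
)

# Each block includes its bounding box, which downstream postprocessing can use.
for block in response["Blocks"]:
    if block["BlockType"] == "LINE":
        print(block["Text"], block["Geometry"]["BoundingBox"])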

The following diagram shows the document processing flow.

The document processing flow contains the following steps:

  1. The document extraction flow is initiated when a user uploads a PDF document to Amazon S3.
  2. An S3 object notification event, triggered by the new S3 object with an uploads/ prefix, starts the AWS Step Functions asynchronous workflow.
  3. The AWS Lambda function SimpleAsyncWorkflow Decider validates the PDF document. This step prevents processing invalid documents.
  4. TextractAsync is an IDP CDK construct that abstracts the invocation of the Amazon Textract Async API, handling Amazon Simple Notification Service (Amazon SNS) messages and workflow processing. The following are some high-level steps:
    1. The construct invokes the asynchronous Amazon Textract StartDocumentTextDetection API (see the sketch after this list).
    2. Amazon Textract processes the PDF file and publishes a completion status event to an Amazon SNS topic.
    3. Amazon Textract stores the paginated results in Amazon S3.
    4. The construct handles the Amazon Textract completion event and returns the paginated results output prefix to the main workflow.
  5. The Textract Postprocessor Lambda function uses the extracted content in the results Amazon S3 bucket to retrieve the document data. This function iterates through all the files, and extracts data using bounding boxes and other metadata. It performs various postprocessing optimizations to aggregate paragraph data, identify and ignore headers and footers, combine sentences spread across pages, process data in multiple columns, and more.
  6. The Textract Postprocessor Lambda function persists the aggregated paragraph data as a CSV file in Amazon S3.
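
The asynchronous invocation referenced in step 4 boils down to a boto3 call roughly like the one below (shown outside the CDK construct; the bucket, topic, and role values are placeholders):

import boto3

textract = boto3.client("textract")

# Start asynchronous text detection on the uploaded PDF (all values are placeholders).
response = textract.start_document_text_detection(
    DocumentLocation={"S3Object": {"Bucket": "my-bucket", "Name": "uploads/sample_climate_change.pdf"}},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-completed",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSPublishRole",
    },
    OutputConfig={"S3Bucket": "my-bucket", "S3Prefix": "textract-output/"},
)
job_id = response["JobId"]  # Amazon SNS later reports completion for this JobId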

Deploy the solution with the AWS CDK

To deploy the solution, launch the AWS Cloud Development Kit (AWS CDK) using AWS Cloud9 or from your local system. If you’re launching from your local system, you need to have the AWS CDK and Docker installed. Follow the instructions in the GitHub repo for deployment.

The stack creates the key components depicted in the architecture diagram.

Test the solution

The GitHub repo contains the following sample files:

  • sample_climate_change.pdf – Contains headers, footers, and sentences flowing across pages
  • sample_multicolumn.pdf – Contains data in two columns, headers, footers, and sentences flowing across pages

To test the solution, complete the following steps:

  1. Upload the sample PDF files to the S3 bucket created by the stack: The file upload triggers the Step Functions workflow via S3 event notification.
    aws s3 cp sample_climate_change.pdf s3://{bucketname}/uploads/sample_climate_change.pdf
    
    aws s3 cp sample_multicolumn.pdf s3://{bucketname}/uploads/sample_multicolumn.pdf

  2. Open the Step Functions console to view the workflow status. You should find one workflow instance per document.
  3. Wait for all three steps to complete.
  4. On the Amazon S3 console, browse to the S3 prefix mentioned in the JSON path TextractTempOutputJsonPath. This prefix contains the Amazon Textract paginated results (in this case objects 1 and 2). The postprocessing task stores the extracted paragraphs from the sample PDF as extracted-text.csv.
  5. Download the extracted-text.csv file to view the extracted content.

The sample_climate_change.pdf file has sentences flowing across pages.

The postprocessor identifies and ignores the header and footer, and combines the text across pages into one paragraph. The extracted text for the combined paragraph should look like:

“Impacts on this scale could spill over national borders, exacerbating the damage further. Rising sea levels and other climate-driven changes could drive millions of people to migrate: more than a fifth of Bangladesh could be under water with a 1m rise in sea levels, which is a possibility by the end of the century. Climate-related shocks have sparked violent conflict in the past, and conflict is a serious risk in areas such as West Africa, the Nile Basin and Central Asia.”

The sample_multicolumn.pdf file has two columns of text with headers and footers.

The postprocessor identifies and ignores the header and footer, processes the text in the columns from left to right, and combines incomplete sentences across pages. The extracted text should construct paragraphs from text in the left column and separate paragraphs from text in the right column. The last line in the right column is incomplete on that page and continues in the left column of the next page; the postprocessor should combine them as one paragraph.
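
The postprocessor’s exact heuristics aren’t published in this post; the sketch below illustrates the general idea of using Textract block geometry to drop header and footer lines and split a page into columns (the thresholds are arbitrary placeholders):

# Illustrative geometry-based filtering of Textract LINE blocks (thresholds are placeholders).
HEADER_MAX_TOP = 0.08   # lines starting above this are treated as headers
FOOTER_MIN_TOP = 0.92   # lines starting below this are treated as footers

def split_columns(blocks):
    left, right = [], []
    for block in blocks:
        if block["BlockType"] != "LINE":
            continue
        box = block["Geometry"]["BoundingBox"]
        if box["Top"] < HEADER_MAX_TOP or box["Top"] > FOOTER_MIN_TOP:
            continue  # skip header and footer lines
        # Assign the line to a column based on where it starts on the page.
        (left if box["Left"] < 0.5 else right).append(block["Text"])
    # Reading the left column first, then the right, lets sentences be re-joined in order.
    return left, right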

Cost

With Amazon Textract, you pay as you go based on the number of pages in the document. Refer to Amazon Textract pricing for actual costs.

Clean up

When you’re finished experimenting with this solution, clean up your resources by using the AWS CloudFormation console to delete all the resources deployed in this example. This helps you avoid continuing costs in your account.

Conclusion

You can use the solution presented in this post to build an efficient document extraction workflow and process the extracted document according to your needs. If you’re building an intelligent document processing system, you can further process the extracted document using Amazon Comprehend to get more insights about the document.

For more information about Amazon Textract, visit Amazon Textract resources to find video resources and blog posts, and refer to Amazon Textract FAQs. For more information about the IDP reference architecture, refer to Intelligent Document Processing. Please share your thoughts with us in the comments section, or in the issues section of the project’s GitHub repository.


About the Author

Sathya Balakrishnan is a Sr. Customer Delivery Architect in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.


3 ways AI is scaling helpful technologies worldwide

I was first introduced to neural networks as an undergraduate in 1990. Back then, many people in the AI community were excited about the potential of neural networks, which were impressive, but couldn’t yet accomplish important, real-world tasks. I was excited, too! I did my senior thesis on using parallel computation to train neural networks, thinking we only needed 32X more compute power to get there. I was way off. At that time, we needed 1 million times as much computational power.

A short 21 years later, with exponentially more computational power, it was time to take another crack at neural networks. In 2011, I and a few others at Google started training very large neural networks using millions of randomly selected frames from videos online. The results were remarkable. Without explicit training, the system automatically learned to recognize different objects (especially cats, the Internet is full of cats). This was one transformational discovery in AI among a long string of successes that is still ongoing — at Google and elsewhere.

I share my own history of neural networks to illustrate that, while progress in AI might feel especially fast right now, it’s come from a long arc of progress. In fact, prior to 2012, computers had a really difficult time seeing, hearing, or understanding spoken or written language. Over the past 10 years we’ve made especially rapid progress in AI.

Today, we’re excited about many recent advances in AI that Google is leading — not just on the technical side, but in responsibly deploying it in ways that help people around the world. That means deploying AI in Google Cloud, in our products from Pixel phones to Google Search, and in many fields of science and other human endeavors.

We’re aware of the challenges and risks that AI poses as an emerging technology. We were the first major company to release and operationalize a set of AI Principles, and following them has actually (and some might think counterintuitively) allowed us to focus on making rapid progress on technologies that can be helpful to everyone. Getting AI right needs to be a collective effort — involving not just researchers, but domain experts, developers, community members, businesses, governments and citizens.

I’m happy to make announcements in three transformative areas of AI today: first, using AI to make technology accessible in many more languages. Second, exploring how AI might bolster creativity. And third, in AI for Social Good, including climate adaptation.

1. Supporting 1,000 languages with AI

Language is fundamental to how people communicate and make sense of the world. So it’s no surprise it’s also the most natural way people engage with technology. But more than 7,000 languages are spoken around the world, and only a few are well represented online today. That means traditional approaches to training language models on text from the web fail to capture the diversity of how we communicate globally. This has historically been an obstacle in the pursuit of our mission to make the world’s information universally accessible and useful.

That’s why today we’re announcing the 1,000 Languages Initiative, an ambitious commitment to build an AI model that will support the 1,000 most spoken languages, bringing greater inclusion to billions of people in marginalized communities all around the world. This will be a multi-year undertaking – some may even call it a moonshot – but we are already making meaningful strides here and see the path clearly. Technology has been changing at a rapid clip – from the way people use it to what it’s capable of. Increasingly, we see people finding and sharing information via new modalities like images, videos, and speech. And our most advanced language models are multimodal – meaning they’re capable of unlocking information across these many different formats. With these seismic shifts come new opportunities.

spinning globe with languages

As part of this initiative and our focus on multimodality, we’ve developed a Universal Speech Model — or USM — that’s trained on over 400 languages, giving it the largest language coverage of any speech model to date. As we expand on this work, we’re partnering with communities across the world to source representative speech data. We recently announced voice typing for 9 more African languages on Gboard by working closely with researchers and organizations in Africa to create and publish data. And in South Asia, we are actively working with local governments, NGOs, and academic institutions to eventually collect representative audio samples from across all the regions’ dialects and languages.

2. Empowering creators and artists with AI

AI-powered generative models have the potential to unlock creativity, helping people across cultures express themselves using video, imagery, and design in ways that they previously could not.

Our researchers have been hard at work developing models that lead the field in terms of quality, generating images that human raters prefer over other models. We recently shared important breakthroughs, applying our diffusion model to video sequences and generating long coherent videos for a sequence of text prompts. We can combine these techniques to produce video — for the first time, today we’re sharing AI-generated super-resolution video:

We’ll soon be bringing our text-to-image generation technologies to AI Test Kitchen, which provides a way for people to learn about, experience, and give feedback on emerging AI technology. We look forward to hearing feedback from users on these demos in AI Test Kitchen Season 2. You’ll be able to build themed cities with “City Dreamer” and design friendly monster characters that can move, dance, and jump with “Wobble” — all by using text prompts.

In addition to 2D images, text-to-3D is now a reality with DreamFusion, which produces a three-dimensional model that can be viewed from any angle and can be composited into any 3D environment. Researchers are also making significant progress in the audio generation space with AudioLM, a model that learns to generate realistic speech and piano music by listening to audio only. In the same way a language model might predict the words and sentences that follow a text prompt, AudioLM can predict which sounds should follow after a few seconds of an audio prompt.

We’re collaborating with creative communities globally as we develop these tools. For example, we’re working with writers using Wordcraft, which is built on our state-of-the-art dialog system LaMDA, to experiment with AI-powered text generation. You can read the first volume of these stories at the Wordcraft Writers Workshop.

3. Addressing climate change and health challenges with AI

AI also has great potential to address the effects of climate change, including helping people adapt to new challenges. One of the worst is wildfires, which affect hundreds of thousands of people today, and are increasing in frequency and scale.

Today, I’m excited to share that we’ve advanced our use of satellite imagery to train AI models to identify and track wildfires in real time, helping predict how they will evolve and spread. We’ve launched this wildfire tracking system in the U.S., Canada, and Mexico, and are rolling it out in parts of Australia. Since July, we’ve covered more than 30 big wildfire events in the U.S. and Canada, helping inform our users and firefighting teams with over 7 million views in Google Search and Maps.

wildfire alert on phone

We’re also using AI to forecast floods, another extreme weather pattern exacerbated by climate change. We’ve already helped communities to predict when floods will hit and how deep the waters will get — in 2021, we sent 115 million flood alert notifications to 23 million people over Google Search and Maps, helping save countless lives. Today, we’re sharing that we’re now expanding our coverage to more countries in South America (Brazil and Colombia), Sub-Saharan Africa (Burkina Faso, Cameroon, Chad, Democratic Republic of Congo, Ivory Coast, Ghana, Guinea, Malawi, Nigeria, Sierra Leone, Angola, South Sudan, Namibia, Liberia, and South Africa), and South Asia (Sri Lanka). We’ve used an AI technique called transfer learning to make it work in areas where there’s less data available. We’re also announcing the global launch of Google FloodHub, a new platform that displays when and where floods may occur. We’ll also be bringing this information to Google Search and Maps in the future to help more people to reach safety in flooding situations.

flood alert on a phone

Finally, AI is helping provide ever more access to healthcare in under-resourced regions. For example, we’re researching ways AI can help read and analyze outputs from low-cost ultrasound devices, giving parents the information they need to identify issues earlier in a pregnancy. We also plan to continue to partner with caregivers and public health agencies to expand access to diabetic retinopathy screening through our Automated Retinal Disease Assessment tool (ARDA). Through ARDA, we’ve successfully screened more than 150,000 patients in countries like India, Thailand, Germany, the United States, and the United Kingdom across deployed use and prospective studies — more than half of those in 2022 alone. Further, we’re exploring how AI can help your phone detect respiratory and heart rates. This work is part of Google Health’s broader vision, which includes making healthcare more accessible for anyone with a smartphone.

AI in the years ahead

Our advancements in neural network architectures, machine learning algorithms and new approaches to hardware for machine learning have helped AI solve important, real-world problems for billions of people. Much more is to come. What we’re sharing today is a hopeful vision for the future — AI is letting us reimagine how technology can be helpful. We hope you’ll join us as we explore these new capabilities and use this technology to improve people’s lives around the world.


Robots That Write Their Own Code

A common approach used to control robots is to program them with code to detect objects, sequencing commands to move actuators, and feedback loops to specify how the robot should perform a task. While these programs can be expressive, re-programming policies for each new task can be time consuming, and requires domain expertise.

What if when given instructions from people, robots could autonomously write their own code to interact with the world? It turns out that the latest generation of language models, such as PaLM, are capable of complex reasoning and have also been trained on millions of lines of code. Given natural language instructions, current language models are highly proficient at writing not only generic code but, as we’ve discovered, code that can control robot actions as well. When provided with several example instructions (formatted as comments) paired with corresponding code (via in-context learning), language models can take in new instructions and autonomously generate new code that re-composes API calls, synthesizes new functions, and expresses feedback loops to assemble new behaviors at runtime. More broadly, this suggests an alternative approach to using machine learning for robots that (i) pursues generalization through modularity and (ii) leverages the abundance of open-source code and data available on the Internet.

Given code for an example task (left), language models can re-compose API calls to assemble new robot behaviors for new tasks (right) that use the same functions but in different ways.

To explore this possibility, we developed Code as Policies (CaP), a robot-centric formulation of language model-generated programs executed on physical systems. CaP extends our prior work, PaLM-SayCan, by enabling language models to complete even more complex robotic tasks with the full expression of general-purpose Python code. With CaP, we propose using language models to directly write robot code through few-shot prompting. Our experiments demonstrate that outputting code led to improved generalization and task performance over directly learning robot tasks and outputting natural language actions. CaP allows a single system to perform a variety of complex and varied robotic tasks without task-specific training.
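
The released prompts and robot APIs live on the project website; as a toy illustration of the few-shot prompting format (instructions as comments paired with code), the sketch below uses a placeholder `complete` callable standing in for whatever code-writing language model is available, and hypothetical `detect_objects`/`pick_place` robot APIs:

# Few-shot prompt: instructions as comments paired with code, plus a hint about available APIs.
PROMPT = '''
# Available APIs: detect_objects(name) -> list of poses, pick_place(obj_pose, target_pose)

# instruction: put the red block on the blue bowl.
red = detect_objects("red block")[0]
bowl = detect_objects("blue bowl")[0]
pick_place(red, bowl)

# instruction: stack all the blocks on the green bowl.
bowl = detect_objects("green bowl")[0]
for block in detect_objects("block"):
    pick_place(block, bowl)

# instruction: {instruction}
'''

def generate_policy(instruction: str, complete) -> str:
    """Ask a code-writing LLM (the `complete` callable is a placeholder) for new policy code."""
    return complete(PROMPT.format(instruction=instruction))

# The returned string is Python policy code that re-composes the same APIs for the new task;
# the system then executes it against the real perception and control APIs.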

We demonstrate, across several robot systems, including a robot from Everyday Robots, that language models can autonomously interpret language instructions to generate and execute CaPs that represent reactive low-level policies (e.g., proportional-derivative or impedance controllers) and waypoint-based policies (e.g., vision-based pick and place, trajectory-based control).

A Different Way to Think about Robot Generalization

To generate code for a new task given natural language instructions, CaP uses a code-writing language model that, when prompted with hints (i.e., import statements that inform which APIs are available) and examples (instruction-to-code pairs that present few-shot “demonstrations” of how instructions should be converted into code), writes new code for new instructions. Central to this approach is hierarchical code generation, which prompts language models to recursively define new functions, accumulate their own libraries over time, and self-architect a dynamic codebase. Hierarchical code generation improves state-of-the-art on both robotics as well as standard code-gen benchmarks in natural language processing (NLP) subfields, with 39.8% pass@1 on HumanEval, a benchmark of hand-written coding problems used to measure the functional correctness of synthesized programs.

Code-writing language models can express a variety of arithmetic operations and feedback loops grounded in language. Pythonic language model programs can use classic logic structures, e.g., sequences, selection (if/else), and loops (for/while), to assemble new behaviors at runtime. They can also use third-party libraries to interpolate points (NumPy), analyze and generate shapes (Shapely) for spatial-geometric reasoning, etc. These models not only generalize to new instructions, but they can also translate precise values (e.g., velocities) to ambiguous descriptions (“faster” and “to the left”) depending on the context to elicit behavioral commonsense.

Code as Policies uses code-writing language models to map natural language instructions to robot code to complete tasks. Generated code can call existing perception action APIs, third party libraries, or write new functions at runtime.

CaP generalizes at a specific layer in the robot: interpreting natural language instructions, processing perception outputs (e.g., from off-the-shelf object detectors), and then parameterizing control primitives. This fits into systems with factorized perception and control, and imparts a degree of generalization (acquired from pre-trained language models) without the magnitude of data collection needed for end-to-end robot learning. CaP also inherits language model capabilities that are unrelated to code writing, such as supporting instructions with non-English languages and emojis.

CaP inherits the capabilities of language models, such as multilingual and emoji support.

By characterizing the types of generalization encountered in code generation problems, we can also study how hierarchical code generation improves generalization. For example, “systematicity” evaluates the ability to recombine known parts to form new sequences, “substitutivity” evaluates robustness to synonymous code snippets, while “productivity” evaluates the ability to write policy code longer than those seen in the examples (e.g., for new long horizon tasks that may require defining and nesting new functions). Our paper presents a new open-source benchmark to evaluate language models on a set of robotics-related code generation problems. Using this benchmark, we find that, in general, bigger models perform better across most metrics, and that hierarchical code generation improves “productivity” generalization the most.

Performance on our RoboCodeGen Benchmark across different generalization types. The larger model (Davinci) performs better than the smaller model (Cushman), with hierarchical code generation improving productivity the most.

We’re also excited about the potential for code-writing models to express cross-embodied plans for robots with different morphologies that perform the same task differently depending on the available APIs (perception action spaces), which is an important aspect of any robotics foundation model.

Language model code-generation exhibits cross-embodiment capabilities, completing the same task in different ways depending on the available APIs (that define perception action spaces).

Limitations

Code as policies today are restricted by the scope of (i) what the perception APIs can describe (e.g., few visual-language models to date can describe whether a trajectory is “bumpy” or “more C-shaped”), and (ii) which control primitives are available. Only a handful of named primitive parameters can be adjusted without over-saturating the prompts. Our approach also assumes all given instructions are feasible, and we cannot tell if generated code will be useful a priori. CaPs also struggle to interpret instructions that are significantly more complex or operate at a different abstraction level than the few-shot examples provided to the language model prompts. Thus, for example, in the tabletop domain, it would be difficult for our specific instantiation of CaPs to “build a house with the blocks” since there are no examples of building complex 3D structures. These limitations point to avenues for future work, including extending visual language models to describe low-level robot behaviors (e.g., trajectories) or combining CaPs with exploration algorithms that can autonomously add to the set of control primitives.

Open-Source Release

We have released the code needed to reproduce our experiments and an interactive simulated robot demo on the project website, which also contains additional real-world demos with videos and generated code.

Conclusion

Code as policies is a step towards robots that can modify their behaviors and expand their capabilities accordingly. This can be enabling, but the flexibility also raises potential risks since synthesized programs (unless manually checked per runtime) may result in unintended behaviors with physical hardware. We can mitigate these risks with built-in safety checks that bound the control primitives that the system can access, but more work is needed to ensure new combinations of known primitives are equally safe. We welcome broad discussion on how to minimize these risks while maximizing the potential positive impacts towards more general-purpose robots.

Acknowledgements

This research was done by Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng. Special thanks to Vikas Sindhwani, Vincent Vanhoucke for helpful feedback on writing, Chad Boodoo for operations and hardware support. An early preprint is available on arXiv.
