Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views, trading off between persisting derived copies and transforming data at the consumer level. Neither option is ideal: each introduces operational complexity, causes data consistency challenges, and wastes more expensive compute resources.

These trade-offs broadly apply to many machine learning (ML) pipelines that train on unstructured data, such as audio, video, and free-form text, among other sources. In each case, the training job must download data from S3 buckets, prepare an application-specific view, and then run an AI algorithm on it. This post demonstrates a design pattern that reduces the cost and complexity of this second step and manages it centrally. It uses the concrete example of image processing, though the approach broadly applies to any workload. The economic benefits are most pronounced when the transformation step doesn’t require a GPU but the AI algorithm does.

The proposed solution also centralizes data transformation code and enables just-in-time (JIT) transformation. Furthermore, the approach uses a serverless infrastructure to reduce operational overhead and undifferentiated heavy lifting.

Solution overview

When ML algorithms process unstructured data like images and video, they require various normalization tasks (such as grayscaling and resizing). This step exists to accelerate model convergence, avoid overfitting, and improve prediction accuracy. You often perform these preprocessing steps on the instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations to economical, horizontally scalable Amazon S3 Object Lambda functions.

This design pattern has three critical benefits. First, it centralizes the shared data transformation steps, such as image normalization, and removes code duplication across ML pipelines. Second, S3 Object Lambda functions avoid the data consistency issues of derived copies by performing JIT conversions. Third, the serverless infrastructure reduces operational overhead and limits costs to the per-millisecond time your code runs.

An elegant solution exists in which you can centralize these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an AWS Lambda function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even custom containers. For more information, see Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3.

The following diagram illustrates the solution architecture.

Normalize datasets used to train machine learning model

In this solution, you have an S3 bucket that contains the raw images to be processed. Next, you create an S3 Access Point for these images. If you build multiple ML models, you can create a separate S3 Access Point for each model. Alternatively, AWS Identity and Access Management (IAM) policies for access points support sharing reusable functions across ML pipelines. Then you attach a Lambda function that contains your preprocessing business logic to the S3 Access Point. When you retrieve the data, you call the S3 Object Lambda Access Point, which performs the JIT data transformations. Finally, you update your ML model to use the new S3 Object Lambda Access Point to retrieve data from Amazon S3.

Create the normalization access point

This section walks through the steps to create the S3 Object Lambda access point.

Raw data is stored in an S3 bucket. To give users the right set of permissions to access this data, while avoiding complex bucket policies that can have unexpected impacts on other applications, you create S3 Access Points. S3 Access Points are unique host names that you can use to reach S3 buckets. With S3 Access Points, you can create individual access control policies for each access point to control access to shared datasets easily and securely.
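If you prefer to script the first step below, the following is a minimal boto3 sketch for creating the supporting access point; the account ID, bucket, and access point name are placeholders you would replace with your own values:

import boto3

s3control = boto3.client('s3control', region_name='us-west-2')

# Placeholder values: substitute your own account ID, bucket, and access point name
response = s3control.create_access_point(
    AccountId='111122223333',
    Name='raw-images-ap',
    Bucket='my-raw-images-bucket')
print(response['AccessPointArn'])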

  1. Create your access point.
  2. Create a Lambda function that performs the image resizing and conversion. See the following Python code:
import boto3
import cv2
import numpy as np
import requests
import io

def lambda_handler(event, context):
    print(event)
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    nparr = np.frombuffer(response.content, np.uint8)  # np.fromstring is deprecated
    img = cv2.imdecode(nparr, flags=1)

    # Transform object
    new_shape=(256,256)
    resized = cv2.resize(img, new_shape, interpolation= cv2.INTER_AREA)
    gray_scaled = cv2.cvtColor(resized,cv2.COLOR_BGR2GRAY)

    # Encode the transformed image back to JPEG bytes
    is_success, buffer = cv2.imencode(".jpg", gray_scaled)
    if not is_success:
        raise ValueError('Unable to imencode()')

    transformed_object = io.BytesIO(buffer).getvalue()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
  3. Create an Object Lambda access point using the supporting access point from Step 1.

The Lambda function uses the supporting access point to download the original objects.
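For reference, this step can also be scripted. The following boto3 sketch uses placeholder ARNs and names that would match the supporting access point and Lambda function created earlier:

import boto3

s3control = boto3.client('s3control', region_name='us-west-2')

# Placeholder ARNs: point these at your supporting access point and Lambda function
response = s3control.create_access_point_for_object_lambda(
    AccountId='111122223333',
    Name='image-normalizer',
    Configuration={
        'SupportingAccessPoint': 'arn:aws:s3:us-west-2:111122223333:accesspoint/raw-images-ap',
        'TransformationConfigurations': [{
            'Actions': ['GetObject'],
            'ContentTransformation': {
                'AwsLambda': {
                    'FunctionArn': 'arn:aws:lambda:us-west-2:111122223333:function:image-normalizer'
                }
            }
        }]
    })
print(response['ObjectLambdaAccessPointArn'])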

  4. Update Amazon SageMaker to use the new S3 Object Lambda access point to retrieve data from Amazon S3. See the following bash code (note that get-object requires a local output file name as its final argument):
aws s3api get-object --bucket arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer --key images/test.png test.png
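Application code, including SageMaker training scripts, reads through the Object Lambda access point the same way it reads from a bucket: pass the access point ARN where the bucket name normally goes. A minimal boto3 sketch, reusing the example ARN above:

import boto3

s3 = boto3.client('s3', region_name='us-west-2')

# The Object Lambda access point ARN is passed in place of a bucket name
obj = s3.get_object(
    Bucket='arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer',
    Key='images/test.png')
normalized_image_bytes = obj['Body'].read()  # already resized and grayscaled by the Lambda function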

Cost savings analysis

Traditionally, ML pipelines copy images and other files from Amazon S3 to SageMaker instances and then perform normalization there. However, running these transformations on training instances is inefficient. In contrast, Lambda functions horizontally scale to handle bursts and then elastically shrink, charging per millisecond only while the code is running. Many preprocessing steps don’t require GPUs and can even run on ARM64, which creates an incentive to move that processing to more economical compute such as Lambda functions powered by AWS Graviton2 processors.

Using an example from the Lambda pricing calculator, you can configure the function with 256 MB of memory and compare the costs for both x86 and Graviton (ARM64). We chose this size because it’s sufficient for many single-image data preparation tasks. Next, use the SageMaker pricing calculator to compute expenses for an ml.p2.xlarge instance. This is the smallest supported SageMaker training instance with GPU support. These results show more than 90% compute savings for operations that don’t use GPUs and can shift to Lambda. The following table summarizes these findings.

 

                 Lambda with x86    Lambda with Graviton2 (ARM)    SageMaker ml.p2.xlarge
Memory (GB)      0.25               0.25                           61
CPU              —                  —                              4
GPU              —                  —                              1
Cost/hour        $0.061             $0.049                         $0.90
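As a quick check on the table, the relative hourly savings can be computed directly from the listed rates (a rough comparison that ignores per-request charges and throughput differences):

sagemaker_hourly = 0.90    # ml.p2.xlarge
lambda_x86_hourly = 0.061  # 256 MB Lambda, x86
lambda_arm_hourly = 0.049  # 256 MB Lambda, Graviton2

print(f"x86 savings: {1 - lambda_x86_hourly / sagemaker_hourly:.0%}")  # ~93%
print(f"ARM savings: {1 - lambda_arm_hourly / sagemaker_hourly:.0%}")  # ~95%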

Conclusion

You can build modern applications to unlock insights into your data. These applications often have unique data view requirements, such as specific formatting and preprocessing actions. Addressing each of these use cases with derived copies can result in data duplication, higher costs, and more complexity to maintain consistency. This post offers a solution for efficiently handling these situations using S3 Object Lambda functions.

Not only does this approach remove the need for duplication, but it also provides a path to scale these actions horizontally across less expensive compute. Even optimizing the transformation code for the ml.p2.xlarge instance would still be significantly more costly because of the idle GPUs.

For more ideas on using serverless and ML, see Machine learning inference at scale using AWS serverless and Deploying machine learning models with serverless templates.


About the Authors

Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.

Marvin Fernandes is a Solutions Architect at AWS, based in the New York City area. He has over 20 years of experience building and running financial services applications. He is currently working with large enterprise customers to solve complex business problems by crafting scalable, flexible, and resilient cloud architectures.

Read More

Play PC Games on Your Phone With GeForce NOW This GFN Thursday

Who says you have to put your play on pause just because you’re not at your PC?

This GFN Thursday takes a look at how GeForce NOW makes PC gaming possible on Android and iOS mobile devices to support gamers on the go.

This week also comes with sweet in-game rewards for members playing Eternal Return, with two unique skins, including Military Doctor Cathy and an NVIDIA exclusive GeForce NOW Silvia.

Plus, there’s the newest content on the cloud with season 12 of Apex Legends, the ‘Joseph: Collapse’ Far Cry 6 DLC and 10 games joining the GeForce NOW library this week.

GTG: Good-to-Go to Game on the Go

Thanks to the power of the cloud, GeForce NOW transforms nearly any Android or iOS mobile device into a powerful gaming rig. With gamers looking for more ways to play from their phones, GeForce NOW has full or partial controller support when connected to a phone for the real PC versions of over 800 games, including 35 free-to-play titles.

Mobile gaming on GeForce NOW
Take your games with you when you stream from the cloud to mobile devices.

GeForce NOW levels up gameplay on phones by processing all of your gaming in the cloud and streaming it straight to your device.

Members playing on iOS can get all of the benefits of PC gaming without leaving the Apple ecosystem. They can enjoy top PC picks like Apex Legends or Rocket League for competitive fun or AAA titles like Far Cry 6 or Assassin’s Creed: Valhalla. They can also party up with friends in popular online multiplayer games like Destiny 2 and ARK: Survival Evolved or take on a classic arcade challenge with Overcooked! or Cuphead.

Gamers playing on Android devices can also play these titles from the GeForce NOW library and take their mobile gameplay even further with the perks of the GeForce NOW RTX 3080 membership. RTX 3080 memberships enable PC gaming at 120 frames per second on select 120Hz Android phones, such as the Samsung S21.

RTX 3080 members also stream with the maximized eight-hour session lengths on the service and can turn RTX ON for cinematic graphics and real-time ray tracing in supported games like Ghostrunner and Dying Light 2 Stay Human.

And, mobile gamers on AT&T’s network can take advantage of a special promotion. New or existing customers with a 5G device on a 5G unlimited plan — or another qualifying unlimited plan — can score a six-month GeForce NOW Priority membership at no charge (a $49.99 value). Priority members can stream at up to 60 FPS with priority access to GeForce NOW servers, extended session lengths and RTX ON in supported games. For more info on this special offer, read here.

Gamepads to Support Play

Enjoy gaming on the go and stay comfortable during long play sessions by using one of GeForce NOW’s recommended gamepads to improve your cloud gaming experience, with no extra software needed.

Destiny 2 Witch Queen Mobile on GeForce NOW
You’re in control. Play your way on mobile devices with these top picks from GeForce NOW.

Gamers playing on an iPhone can consider the Backbone One as well as the Razer Kishi. Android users can also enjoy the Razer Kishi, or the Razer Raiju and the Razer Junglecat.

Other great options for Android gamers include the Steelseries Stratus Duo and the GLAP controller to provide high-performance gameplay.

For more details and to pick out the best option for your GeForce NOW mobile experience, check out these recommended devices.

Eternal Rewards

GeForce NOW isn’t just about enabling great gaming experiences. It’s also about enriching those experiences for members playing on the cloud.

Eternal Return Rewards on GeForce NOW
Look your best as you battle against the rest in Eternal Return with these two skins.

To celebrate the second anniversary of GeForce NOW, GFN Thursdays in February are full of rewards.

Starting today, members can get rewarded in the free-to-play, multiplayer arena game Eternal Return. Redeem your rewards in-game to show off your style with a Military Doctor Cathy skin and represent your love for the cloud with a custom GeForce NOW Silvia skin.

Getting membership rewards for streaming games on the cloud is easy. Log in to your NVIDIA account and select “GEFORCE NOW” from the header, then scroll down to “REWARDS” and click the “UPDATE REWARDS SETTINGS” button. Check the box in the dialogue window that shows up and start receiving special offers and in-game goodies.

Sign up for the GeForce NOW newsletter, including notifications when rewards are available, by logging into your NVIDIA account and selecting “PREFERENCES” from the header. Check the “Gaming & Entertainment” box, and “GeForce NOW” under topic preferences to receive the latest updates.

For more updates on rewards coming in February, stay tuned to upcoming GFN Thursdays.

‘Fortnite’ Mobile Closed Beta Update

Over the past few weeks, multiple waves of invites have been sent to members to join the Fortnite limited time closed beta on iOS Safari and Android — and the feedback gathered thus far has been incredibly helpful.

Some gamers have come across occasional, unexpected in-game framerate dips below 30 FPS when using touch controls on iPhone, iPad and Android devices. It’s most commonly encountered when using the ADS/First Person shooting mode, but has also been observed while spectating, during rapid camera movements and other scenarios.

Throughout the beta we’ll continue to improve the experience — including working on a solution to improve frame rates — and will have more information when an update is available.

In the meantime, we’re looking forward to inviting more gamers into the Fortnite mobile closed beta on GeForce NOW and will continue to optimize and improve the experience.

More Gaming Goodness, Anywhere

This week also brings new content to the cloud, expanding two titles.

Rise to the challenge in Apex Legends Defiance, the newest season of the popular free-to-play battle royale hero shooter. Season 12 comes with a new limited-time mode and battle pass, rewards, map updates to Olympus and the newest Legend, Mad Maggie.

Take things to the next level in Far Cry 6 with the release of the third DLC, ‘Joseph: Collapse,’ telling a new story about the villainous Joseph Seed.

Sifu on GeForce NOW
You know kung fu. Play as a young kung fu student on a path of revenge, hunting for the murderers of his family in Sifu.

Plus, GFN Thursday comes with a new batch of games joining the GeForce NOW library. Check out these 10 titles ready to stream this week:

We make every effort to launch games on GeForce NOW as close to their release as possible, but, in some instances, games may not be available immediately.

Finally, due to extended maintenance, Myth of Empires will be removed from the GeForce NOW library.

What device are you playing on this weekend? Let us know on Twitter and check out our setup while you’re at it.

The post Play PC Games on Your Phone With GeForce NOW This GFN Thursday appeared first on The Official NVIDIA Blog.

Read More

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient, and Interpretable Visual Understanding

In visual understanding, the Vision Transformer (ViT) and its variants have received significant attention recently due to their superior performance on many core visual applications, such as image classification, object detection, and video understanding. The core idea of ViT is to utilize the power of self-attention layers to learn global relationships between small patches of images. However, the number of connections between patches increases quadratically with image size. Such a design has been observed to be data inefficient — although the original ViT can perform better than convolutional networks with hundreds of millions of images for pre-training, such a data requirement is not always practical, and it still underperforms compared to convolutional networks when given less data. Many researchers are exploring more suitable architectural re-designs that can learn visual representations effectively, such as adding convolutional layers and building hierarchical structures with local self-attention.

The principle of hierarchical structure is one of the core ideas in vision models, where bottom layers learn more local object structures in the high-dimensional pixel space and top layers learn more abstracted, high-level knowledge in a low-dimensional feature space. Existing ViT-based methods focus on designing a variety of modifications inside self-attention layers to achieve such a hierarchy, but while these offer promising performance improvements, they often require substantial architectural re-designs. Moreover, these approaches lack an interpretable design, so it is difficult to explain the inner workings of trained models.

To address these challenges, in “Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding”, we present a rethinking of existing hierarchical structure–driven designs, and provide a novel and orthogonal approach to significantly simplify them. The central idea of this work is to decouple feature learning and feature abstraction (pooling) components: nested transformer layers encode visual knowledge of image patches separately, and then the processed information is aggregated. This process is repeated in a hierarchical manner, resulting in a pyramid network structure. The resulting architecture achieves competitive results on ImageNet and outperforms prior results on data-efficient benchmarks. We have shown such a design can meaningfully improve data efficiency with faster convergence and provide valuable interpretability benefits. Moreover, we introduce GradCAT, a new technique for interpreting the decision process of a trained model at inference time.

Architecture Design
The overall architecture is simple to implement by adding just a few lines of Python code to the source code of the original ViT. The original ViT architecture divides an input image into small patches, projects pixels of each patch to a vector with predefined dimension, and then feeds the sequences of all vectors to the overall ViT architecture containing multiple stacked identical transformer layers. While every layer in ViT processes information of the whole image, with this new method, stacked transformer layers are used to process only a region (i.e., block) of the image containing a few spatially adjacent image patches. This step is independent for each block and is also where feature learning occurs. Finally, a new computational layer called block aggregation then combines all of the spatially adjacent blocks. After each block aggregation, the features corresponding to four spatially adjacent blocks are fed to another module with a stack of transformer layers, which then process those four blocks jointly. This design naturally builds a pyramid hierarchical structure of the network, where bottom layers can focus on local features (such as textures) and upper layers focus on global features (such as object shape) at reduced dimensionality thanks to the block aggregation.

A visualization of the network processing an image: Given an input image, the network first partitions images into blocks, where each block contains 4 image patches. Image patches in every block are linearly projected as vectors and processed by a stack of identical transformer layers. Then the proposed block aggregation layer aggregates information from each block and reduces its spatial size by 4 times. The number of blocks is reduced to 1 at the top hierarchy and classification is conducted after the output of it.
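To make the data flow concrete, here is a toy NumPy sketch of the two non-transformer pieces described above: partitioning a grid of patch embeddings into blocks, and a simplified block aggregation in which plain 2x2 average pooling stands in for the paper's learned aggregation layer. The shapes and the pooling choice are illustrative assumptions, not the actual implementation.

import numpy as np

def partition_into_blocks(x, block_size):
    """Split an (H, W, C) grid of patch embeddings into non-overlapping blocks
    of block_size x block_size patches -> (num_blocks, block_size**2, C)."""
    H, W, C = x.shape
    gh, gw = H // block_size, W // block_size
    x = x.reshape(gh, block_size, gw, block_size, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, block_size * block_size, C), (gh, gw)

def aggregate_blocks(blocks, grid, block_size):
    """Toy stand-in for block aggregation: un-blockify to a spatial map, then
    downsample with 2x2 average pooling so the next level sees 4x fewer blocks."""
    gh, gw = grid
    _, _, c = blocks.shape
    x = blocks.reshape(gh, gw, block_size, block_size, c).transpose(0, 2, 1, 3, 4)
    x = x.reshape(gh * block_size, gw * block_size, c)
    H, W, _ = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, c).mean(axis=(1, 3))

# Toy usage: an 8x8 grid of 16-dim patch embeddings, blocks of 4x4 patches
patches = np.random.rand(8, 8, 16)
blocks, grid = partition_into_blocks(patches, block_size=4)  # 4 blocks of 16 patches each
# ... shared transformer layers would process each block independently here ...
next_level = aggregate_blocks(blocks, grid, block_size=4)    # (4, 4, 16) map for the next hierarchy level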

Interpretability
This architecture has a non-overlapping information processing mechanism, independent at every node. This design resembles a decision tree-like structure, which manifests unique interpretability capabilities because every tree node contains independent information of an image block that is being received by its parent nodes. We can trace the information flow through the nodes to understand the importance of each feature. In addition, our hierarchical structure retains the spatial structure of images throughout the network, leading to learned spatial feature maps that are effective for interpretation. Below we showcase two kinds of visual interpretability.

First, we present a method to interpret the trained model on test images, called gradient-based class-aware tree-traversal (GradCAT). GradCAT traces the feature importance of each block (a tree node) from top to bottom of the hierarchy structure. The main idea is to find the most valuable traversal from the root node at the top layer to a child node at the bottom layer that contributes the most to the classification outcomes. Since each node processes information from a certain region of the image, such traversal can be easily mapped to the image space for interpretation (as shown by the overlaid dots and lines in the image below).
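As a conceptual sketch only (not the paper's implementation), the traversal can be pictured as a greedy walk down the tree, assuming each node already carries a class-aware contribution score (e.g., derived from gradients and activations) along with the image region it covers:

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Node:
    score: float                       # hypothetical class-aware contribution of this block
    region: Tuple[int, int, int, int]  # (x0, y0, x1, y1) image region the block covers
    children: List["Node"] = field(default_factory=list)

def greedy_traversal(root: Node) -> List[Node]:
    """Walk from the root toward the leaves, always descending into the child that
    contributes most to the predicted class; each visited node maps to an image region."""
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda c: c.score)
        path.append(node)
    return path

children = [Node(0.2, (0, 0, 16, 16)), Node(0.7, (16, 0, 32, 16))]
print([n.region for n in greedy_traversal(Node(1.0, (0, 0, 32, 16), children))])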

The following is an example of the model’s top-4 predictions and corresponding interpretability results on the left input image (containing 4 animals). As shown below, GradCAT highlights the decision path along the hierarchical structure as well as the corresponding visual cues in local image regions on the images.

Given the left input image (containing four objects), the figure visualizes the interpretability results of the top-4 prediction classes. The traversal locates the model decision path along the tree and simultaneously locates the corresponding image patch (shown by the dotted line on images) that has the highest impact to the predicted target class.

Moreover, the following figures visualize results on the ImageNet validation set and show how this approach enables some intuitive observations. For instance, the example of the lighter below (upper left panel) is particularly interesting because the ground truth class — lighter/matchstick — actually defines the bottom-right matchstick object, while the most salient visual features (with the highest node values) are actually from the upper-left red light, which conceptually shares visual cues with a lighter. This can also be seen from the overlaid red lines, which indicate the image patches with the highest impact on the prediction. Thus, although the visual cue is a mistake, the output prediction is correct. In addition, the four child nodes of the wooden spoon below have similar feature importance values (see numbers visualized in the nodes; higher indicates more importance), which is because the wooden texture of the table is similar to that of the spoon.

Visualization of the results obtained by the proposed GradCAT. Images are from the ImageNet validation dataset.

Second, different from the original ViT, our hierarchical architecture retains spatial relationships in learned representations. The top layers output low-resolution features maps of input images, enabling the model to easily perform attention-based interpretation by applying Class Attention Map (CAM) on the learned representations at the top hierarchical level. This enables high-quality weakly-supervised object localization with just image-level labels. See the following figure for examples.

Visualization of CAM-based attention results. Warmer colors indicate higher attention. Images are from the ImageNet validation dataset.
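A minimal sketch of this CAM step, assuming the top level outputs a low-resolution spatial feature map and a linear classifier provides per-class weights (the shapes below are illustrative):

import numpy as np

def class_attention_map(features, class_weights):
    """Weight an (h, w, d) feature map by the (d,) classifier weights of the
    predicted class and normalize to [0, 1] for overlay on the input image."""
    cam = features @ class_weights
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)

features = np.random.rand(14, 14, 192)  # illustrative top-level feature map
weights = np.random.rand(192)           # illustrative classifier weights for one class
heatmap = class_attention_map(features, weights)  # (14, 14); upsample to image size to visualize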

Convergence Advantages
With this design, feature learning only happens at local regions independently, and feature abstraction happens inside the aggregation function. This design and simple implementation is general enough for other types of visual understanding tasks beyond classification. It also improves the model convergence speed greatly, significantly reducing the training time to reach the desired maximum accuracy.

We validate this advantage in two ways. First, we compare against the original ViT architecture on ImageNet accuracy for different numbers of total training epochs. The results are shown on the left side of the figure below, demonstrating much faster convergence than the original ViT, e.g., around 20% improvement in accuracy over ViT with 30 total training epochs.

Second, we modify the architecture to conduct unconditional image generation tasks, since training ViT-based models for image generation tasks is challenging due to convergence and speed issues. Creating such a generator is straightforward by transposing the proposed architecture: the input is an embedding vector, the output is a full image in RGB channels, and the block aggregation is replaced by a block de-aggregation component supported by Pixel Shuffling. Surprisingly, we find our generator is easy to train and demonstrates faster convergence speed, as well as a better FID score (which measures how similar generated images are to real ones), than the capacity-comparable SAGAN.

Left: ImageNet accuracy given different number of total training epochs compared with standard ViT architecture. Right: ImageNet 64×64 image generation FID scores (lower is better) with single 1000-epoch training. On both tasks, our method shows better convergence speed.

Conclusion
In this work we demonstrate the simple idea that decoupling feature learning and feature abstraction in this nested hierarchy design leads to better feature interpretability through a new gradient-based class-aware tree-traversal method. Moreover, the architecture improves convergence on not only classification tasks but also image generation tasks. The proposed idea focuses on the aggregation function and is thereby orthogonal to advanced architecture designs for self-attention. We hope this new research encourages future architecture designers to explore more interpretable and data-efficient ViT-based models for visual understanding, like the adoption of this work for high-resolution image generation. We have also released the source code for the image classification portion of this work.

Acknowledgements
We gratefully acknowledge the contributions of our co-authors, including Han Zhang, Long Zhao, Ting Chen, Sercan Arik, and Tomas Pfister. We also thank Xiaohua Zhai, Jeremy Kubica, Kihyuk Sohn, and Madeleine Udell for their valuable feedback on the work.

Read More

Extract entities from insurance documents using Amazon Comprehend named entity recognition

Intelligent document processing (IDP) is a common use case for customers on AWS. You can use Amazon Comprehend and Amazon Textract for a variety of use cases, ranging from document extraction to data classification and entity extraction. One specific industry that uses IDP is insurance. Insurers use IDP to automate data extraction for common use cases such as claims intake, policy servicing, quoting, payments, and next best actions. However, in some cases, an office receives a document with complex, label-less information. This is normally difficult for optical character recognition (OCR) software to capture, and identifying relationships and key entities becomes a challenge. The solution often requires manual human entry to ensure high accuracy.

In this post, we demonstrate how you can use named entity recognition (NER) for documents in their native formats in Amazon Comprehend to address these challenges.

Solution overview

In an insurance scenario, an insurer might receive a demand letter from an attorney’s office. The demand letter includes information such as what law office is sending the letter, who their client is, and what actions are required to satisfy their requests, as shown in the following example:

Because of the varied locations that this information could be found in a demand letter, these documents are often forwarded to an individual adjuster, who takes the time to read through the letter to determine all the necessary information required to proceed with a claim. The document may have multiple names, addresses, and requests that each need to be classified. If the client is mixed up with the beneficiary, or the addresses are switched, delays could add up and negative consequences could impact the company and customers. Because there are often small differences between categories like addresses and names, the documents are often processed by humans rather than using an IDP approach.

The preceding example document has many instances of overlapping entity values (entities that share similar properties but aren’t related). Examples of this are the address of the law office vs. the address of the insurance company or the names of the different individuals (attorney name, beneficiary, policy holder). Additionally, there is positional information (where the entity is positioned within the document) that a traditional text-only algorithm might miss. Therefore, traditional recognition techniques may not meet requirements.

In this post, we use named entity recognition in Amazon Comprehend to solve these challenges. The benefit of using this method is that the custom entity recognition model uses both the natural language and positional information of the text to accurately extract custom entities that may otherwise be impacted when flattening a document, as demonstrated in our preceding example of overlapping entity values. For this post, we use an AWS artificially created dataset of legal requisition and demand letters for life insurance, but you can use this approach across any industry and document that may benefit from spatial data in custom NER training. The following diagram depicts the solution architecture:

We implement the solution with the following high-level steps:

  1. Clone the repository containing the sample dataset.
  2. Create an Amazon Simple Storage Service (Amazon S3) bucket.
  3. Create and train your custom entity recognition model.
  4. Use the model by running an asynchronous batch job.

Prerequisites

You need to complete the following prerequisites to use this solution:

  1. Install Python 3.8.x.
  2. Make sure you have pip installed.
  3. Install and configure the AWS Command Line Interface (AWS CLI).
  4. Configure your AWS credentials.

Annotate your documents

To train a custom entity recognition model that can be used on your PDF, Word, and plain text documents, you need to first annotate PDF documents using a custom Amazon SageMaker Ground Truth annotation template that is provided by Amazon Comprehend. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.

We recommend a minimum of 250 documents and 100 annotations per entity to ensure good quality predictions. With more training data, you’re more likely to produce a higher-quality model.

When you’ve finished annotating, you can train a custom entity recognition model and use it to extract custom entities from PDF, Word, and plain text documents for batch (asynchronous) processing.

For this post, we have already labeled our sample dataset, and you don’t have to annotate the documents provided. However, if you want to use your own documents or adjust the entities, you have to annotate the documents. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.

We extract the following entities (which are case sensitive):

  • Law Firm
  • Law Office Address
  • Insurance Company
  • Insurance Company Address
  • Policy Holder Name
  • Beneficiary Name
  • Policy Number
  • Payout
  • Required Action
  • Sender

The dataset provided is entirely artificially generated. Any mention of names, places, and incidents are either products of the author’s imagination or are used fictitiously. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.

Clone the repository

Start by cloning the repository by running the following command:

git clone https://github.com/aws-samples/aws-legal-entity-extraction

The repository contains the following files:

aws-legal-entity-extraction
	/source
	/annotations
	output.manifest
	sample.pdf
	bucketnamechange.py

Create an S3 bucket

To create an S3 bucket to use for this example, complete the following steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Note the name of the bucket you just created.

To reuse the annotations that we already made for the dataset, we have to modify the output.manifest file and reference the bucket we just created.

  4. Modify the file by running the following commands:
    cd aws-legal-entity-extraction
    python3 bucketnamechange.py
    Enter the name of your bucket: <Enter the name of the bucket you created>

When the script is finished running, you receive the following message:

The manifest file is updated with the correct bucket

We can now begin training our model.

Create and train the model

To start training your model, complete the following steps:

  1. On the Amazon S3 console, upload the /source folder, /annotations folder, output.manifest, and sample.pdf files.

Your bucket should look similar to the following screenshot.

  2. On the Amazon Comprehend console, under Customization in the navigation pane, choose Custom entity recognition.
  3. Choose Create new model.
  4. For Model name, enter a name.
  5. For Language, choose English.
  6. For Custom entity type, add the following case-sensitive entities:
    1. Law Firm
    2. Law Office Address
    3. Insurance Company
    4. Insurance Company Address
    5. Policy Holder Name
    6. Beneficiary Name
    7. Policy Number
    8. Payout
    9. Required Action
    10. Sender

  7. In Data specifications, for Data format, select Augmented manifest to reference the manifest we created when we annotated the documents.
  8. For Training model type, select PDF, Word Documents.

This specifies the type of documents you’re using for training and inference.

  9. For SageMaker Ground Truth augmented manifest file S3 location, enter the location of the output.manifest file in your S3 bucket.
  10. For S3 prefix for Annotation data files, enter the path to the annotations folder.
  11. For S3 prefix for Source documents, enter the path to the source folder.
  12. For Attribute names, enter legal-entity-label-job-labeling-job-20220104T172242.

The attribute name corresponds to the name of the labeling job you create for annotating the documents. For the pre-annotated documents, we use the name legal-entity-label-job-labeling-job-20220104T172242. If you choose to annotate your documents, substitute this value with the name of your annotation job.

  13. Create a new AWS Identity and Access Management (IAM) role and give it permissions to the bucket that contains all your data.
  14. Finish creating the model (select the Autosplit option for your data source to see similar metrics to those in the following screenshots).
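If you'd rather create the recognizer programmatically, a boto3 sketch might look like the following; the Region, bucket name, and role ARN are placeholders that must match your own setup, and the attribute name is the one used above:

import boto3

comprehend = boto3.client('comprehend', region_name='us-east-1')

entity_types = [
    'Law Firm', 'Law Office Address', 'Insurance Company', 'Insurance Company Address',
    'Policy Holder Name', 'Beneficiary Name', 'Policy Number', 'Payout',
    'Required Action', 'Sender']

response = comprehend.create_entity_recognizer(
    RecognizerName='legal-entity-recognizer',
    LanguageCode='en',
    DataAccessRoleArn='arn:aws:iam::111122223333:role/ComprehendDataAccessRole',  # placeholder
    InputDataConfig={
        'DataFormat': 'AUGMENTED_MANIFEST',
        'EntityTypes': [{'Type': t} for t in entity_types],
        'AugmentedManifests': [{
            'S3Uri': 's3://YOUR_BUCKET/output.manifest',
            'AttributeNames': ['legal-entity-label-job-labeling-job-20220104T172242'],
            'AnnotationDataS3Uri': 's3://YOUR_BUCKET/annotations/',
            'SourceDocumentsS3Uri': 's3://YOUR_BUCKET/source/',
            'DocumentType': 'SEMI_STRUCTURED_DOCUMENT'}]})
print(response['EntityRecognizerArn'])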

Now your recognizer model is visible on the dashboard with the model training status and metrics.

The model may take several minutes to train.

The following screenshot shows your model metrics when the training is complete.

Use the custom entity recognition model

To use the custom entity recognition models trained on PDF documents, we create a batch job to process them asynchronously.

  1. On the Amazon Comprehend console, choose Analysis jobs.
  2. Choose Create job.
  3. Under Input data, enter the Amazon S3 location of the PDF documents to process (for this post, the sample.pdf file).
  4. For Input format, select One document per file.
  5. Under Output data, enter the Amazon S3 location where you want the results delivered. For this post, we create a new folder called analysis-output in the S3 bucket that contains all source PDF documents, annotated documents, and the manifest.
  6. Use an IAM role with permissions to the bucket that contains sample.pdf.

You can use the role created earlier.

  7. Choose Create job.
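The same job can be started programmatically; in the following boto3 sketch, the bucket, role ARN, and recognizer ARN are placeholders:

import boto3

comprehend = boto3.client('comprehend', region_name='us-east-1')

response = comprehend.start_entities_detection_job(
    JobName='legal-entity-analysis',
    LanguageCode='en',
    EntityRecognizerArn='arn:aws:comprehend:us-east-1:111122223333:entity-recognizer/legal-entity-recognizer',  # placeholder
    DataAccessRoleArn='arn:aws:iam::111122223333:role/ComprehendDataAccessRole',  # placeholder
    InputDataConfig={
        'S3Uri': 's3://YOUR_BUCKET/sample.pdf',
        'InputFormat': 'ONE_DOC_PER_FILE'},
    OutputDataConfig={'S3Uri': 's3://YOUR_BUCKET/analysis-output/'})
print(response['JobId'], response['JobStatus'])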

This is an asynchronous job, so it may take a few minutes to complete processing. When the job is complete, you get a link to the output. When you open this output, you see a series of files as follows:

You can open the file sample.pdf.out in your preferred text editor. If you search for the Entities Block, you can find the entities identified in the document. The following table shows an example.

Type                      | Text                                                | Score
Insurance Company         | Budget Mutual Insurance Company                     | 0.999984086
Insurance Company Address | 9876 Infinity Ave Springfield, MI 65541             | 0.999982051
Law Firm                  | Bill & Carr                                         | 0.99997298
Law Office Address        | 9241 13th Ave SW Spokane, Washington (WA), 99217    | 0.999274625
Beneficiary Name          | Laura Mcdaniel                                      | 0.999972464
Policy Holder Name        | Keith Holt                                          | 0.999781546
Policy Number             | (#892877136)                                        | 0.999950143
Payout                    | $15,000                                             | 0.999980728
Sender                    | Angela Berry                                        | 0.999723455
Required Action           | We are requesting that you forward the full policy amount of Please forward an acknowledgement of our demand and please forward the umbrella policy information if one is applicable. Please send my secretary any information regarding liens on his policy. | 0.999989449
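To post-process the results programmatically, a small sketch like the following could work; it assumes the .out file contains JSON with a top-level Entities list carrying the Type, Text, and Score fields shown in the table above:

import json

with open('sample.pdf.out') as f:
    result = json.load(f)

# Assumed structure: a top-level 'Entities' list with Type, Text, and Score fields
for entity in result.get('Entities', []):
    print(f"{entity['Type']}: {entity['Text']} (score {entity['Score']:.4f})")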

Expand the solution

You can choose from a myriad of possibilities for what to do with the detected entities, such as the following:

  • Ingest them into a backend system of record
  • Create a searchable index based on the extracted entities
  • Enrich machine learning and analytics using extracted entity values as parameters for model training and inference
  • Configure back-office flows and triggers based on detected entity value (such as specific law firms or payout values)

The following diagram depicts these options:

Conclusion

Complex document types can often be impediments to full-scale IDP automation. In this post, we demonstrated how you can build and use custom NER models directly from PDF documents. This method is especially powerful for cases where positional information is pertinent (similar entity values and varied document formats). Although we demonstrated this solution using legal requisition letters in insurance, you can extrapolate this use case across healthcare, manufacturing, retail, financial services, and many other industries.

To learn more about Amazon Comprehend, visit the Amazon Comprehend Developer Guide.


About the Authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Enzo Staton is a Solutions Architect with a passion for working with companies to increase their cloud knowledge. He works closely as a trusted advisor and industry specialist with customers around the country.

Read More

TFRT: A Progress Update

Posted by Mingsheng Hong, TFRT Tech Lead/Manager & Eric Johnson, TFRT Product Manager

Roughly two years ago, we announced an ambitious new Machine Learning (ML) runtime effort called TFRT (short for TensorFlow Runtime). We simultaneously provided a deep dive of the initial technical design and open-sourced its codebase.

Driven by trends in the ML ecosystem – ever-larger models, ML being deployed to more diverse execution environments, and the need to keep up with continued research and modeling innovations – TFRT was started with the following set of goals in mind:

  • Deliver faster and cheaper execution for ML models
  • Enable more flexible deployment
  • Provide more modular and extensible infrastructure to facilitate innovations in ML infra and modeling

In this post, we share our progress to date, the experiences and lessons we’ve learned over the past two years of development, as well as what you can expect going forward.

Progress to Date

The last two years of development have largely been focused on implementing and validating our ambitious ideas by enabling Google’s most important internal workloads for users such as Ads and Search. To date, we have deployed TFRT broadly inside Google on a variety of training and inference workloads, and obtained great results.

Technical Lessons

How have we been able to achieve the above? Here are some interesting technical lessons that we learned, beyond what was in the original design:

First, async support is important for some of the key workloads (e.g. overlapping compute and I/O, and driving heterogeneous devices), while fast sync execution is critical for many other workloads, including small, “embedded” ML models.

We spent a lot of effort in designing and refining AsyncValue, a key low level abstraction in TFRT, which allows the host runtime to asynchronously drive devices, as well as invoking kernels. This led to improved device utilization due to the ability to overlap more computation and communication across hosts and devices. For example, we were able to successfully run bulk inference of an 80B-parameter model on one TPU chip with good performance by splitting the model into multiple stages and using TFRT to overlap variable transfer of the next stage with TPU computation of the current stage.
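The overlap idea itself is independent of TFRT's internals. The following toy asyncio sketch, with sleeps standing in for host-to-device copies and device compute, shows how kicking off the next stage's transfer before awaiting the current stage's computation hides the copy latency:

import asyncio

async def transfer_weights(stage):
    await asyncio.sleep(0.1)   # stands in for a host-to-device copy
    return f'weights[{stage}]'

async def run_stage(stage, weights):
    await asyncio.sleep(0.2)   # stands in for device computation
    return f'activations[{stage}] via {weights}'

async def pipeline(num_stages):
    next_weights = asyncio.create_task(transfer_weights(0))
    for stage in range(num_stages):
        weights = await next_weights
        if stage + 1 < num_stages:
            # start copying the next stage's weights while this stage computes
            next_weights = asyncio.create_task(transfer_weights(stage + 1))
        print(await run_stage(stage, weights))

asyncio.run(pipeline(3))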

On the other hand, small CPU models that are embedded in application servers, invoked within the application process instead of via RPC/REST calls, remain critical for some of Google’s business workloads from users like Ads. For these models, the async-first internal design of TFRT initially caused a performance and resource regression. We worked with the Ads team to successfully address it, by extending the TFRT design with a synchronous interpreter, as well as an experimental memory planning optimization, to avoid heap allocation during kernel execution. We are working on productizing this extension.

The diagram below showcases the impact of the resulting TFRT design over a benchmark, as compared to “Current TF,” which ran the old runtime before TFRT’s deployment. This benchmark focused on executing a tiny CPU model, where a large number of small matmuls executed sequentially. Notably, the optimized execution in TFRT (265 ns) approaches the optimal baseline we set up (204 ns) via hand-written C++ code without any ML runtime overhead.

Second, while faster runtime execution is critical, optimizing the input program to reduce execution complexity is important as well.

Note that while compiler-based graph optimization should be performed when TF SavedModel is saved to the disk whenever possible, there are also important inference-time compiler optimizations that can only be performed with the knowledge of being in an inference context (e.g. when training variables remain constant).

As we were onboarding ML models onto TFRT, we had the chance to examine some of the models in depth, and identified new ways of rewriting and simplifying the program, before its execution. The simplified program, along with a faster execution of each kernel in the graph program, led to a nice compounding effect in the reduction of the execution latency and resource cost.

For example, in the left hand side graph program below, we were able to hoist the scalar op normalization computation (e.g. divide a float value by the max value of its domain), identical across all 18 input scalars, above the “concat” op, therefore enabling vectorized execution of the normalization, over a concatenated 1D float tensor.
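The effect of the rewrite can be illustrated with a toy NumPy example; the 18 scalars and max-value normalization come from the description above, while everything else is illustrative:

import numpy as np

scalars = [np.float32(v) for v in np.random.rand(18)]
max_value = np.float32(255.0)

# Before the rewrite: normalize each scalar output, then concatenate (18 scalar divides)
before = np.concatenate([np.atleast_1d(s / max_value) for s in scalars])

# After the rewrite: concatenate first, then normalize once over the 1D tensor (one vectorized divide)
after = np.concatenate([np.atleast_1d(s) for s in scalars]) / max_value

assert np.allclose(before, after)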

While it is possible to perform this optimization at model training time as well, the compiler+runtime used to produce the trained model did not include this optimization.

In addition, we also find it critical to hoist computation from model execution time to load time whenever possible (e.g. const folding).

Third, cost-based execution is not just for SQL queries.

We developed a simple compile-time cost model (analogous to a SQL query optimizer’s cost model) for TF op kernels and applied cost-based optimization to ML model execution (see stream analysis), achieving better load balancing of kernel execution across a set of threadpool threads. In contrast, TF1 has a runtime-based cost model, in which each operation’s runtime cost is profiled and used to guide that operation’s scheduling. In TFRT, we moved the cost analysis to compile time, thus removing this runtime cost. Also, our compiler approach allows the entire computational graph to be analyzed, thereby resulting in scheduling decisions that are optimal at a more global scope.
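As a toy illustration of compile-time, cost-based placement (not TFRT's actual stream analysis), a greedy longest-processing-time heuristic can assign each kernel to the currently least-loaded worker using only static cost estimates:

import heapq

def balance_by_cost(op_costs, num_workers):
    """Greedily assign each op to the least-loaded worker based on static cost estimates."""
    heap = [(0.0, w) for w in range(num_workers)]  # (accumulated cost, worker id)
    heapq.heapify(heap)
    assignment = {}
    for op, cost in sorted(op_costs.items(), key=lambda kv: -kv[1]):
        load, worker = heapq.heappop(heap)
        assignment[op] = worker
        heapq.heappush(heap, (load + cost, worker))
    return assignment

print(balance_by_cost({'matmul_0': 8.0, 'matmul_1': 7.0, 'relu': 1.0, 'concat': 2.0}, num_workers=2))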

See this tech talk for more similarities between data and ML infra.

Looking Ahead

While we’ve certainly made some strong progress, especially with respect to our first goal – faster and cheaper execution – we admittedly still have work to do on enabling a more modular design and enabling more flexible deployments via hardware integration.

In terms of modularity, with the initial integration successes such as JAX’s adoption of TFRT device runtimes (e.g. CPU), we will continue to explore how TFRT could support workloads beyond just TensorFlow. We expect some of the TFRT components will also benefit the PyTorch/XLA workloads going forward.

Moreover, we have successfully integrated CPU and TPU (with upcoming integration into Cloud TPU), the two most important device types at Google for ML computation, and NVIDIA GPU support is also in progress.

With respect to training workloads, TFRT has been used as a building block for Google’s large-scale distributed training framework, which is currently in active development.

As we look to the future, our organization has been exploring its integration with Pixel’s hardware SOC devices such as Google Tensor. In addition, due to TFRT’s proven success for Google’s internal workloads, it is also being integrated into new venues such as GCP’s Vertex AI and Waymo.

Special Thanks

The TFRT team has really enjoyed working on this new, ambitious infrastructure project. It has often felt like bootstrapping a new startup. With that in mind, we would like to give a huge shout out to everyone who has advised, contributed to and supported TFRT through this incredible 2-year journey:

(alphabetically) Adi Agrawal, Andrew Bernard, Andrew Leaver, Andy Selle, Ayush Dubey, Bangda Zhou, Bramandia Ramadhana, Catherine Payne, Ce Zheng, Chiachen Chou, Chao Xie, Christina Sorokin, Chuanhao Zhuge, Dan Hurt, Dong Lin, Eugene Zhulenev, Ewa Matejska, Hadi Hashemi, Haoliang Zhang, HanBin Yoon, Haoyu Zhang, Hongmin Fan, Jacques Pienaar, Jeff Dean, Jeremy Lau, Jordan Soyke, Jing Dong, Juanli Shen, Kemal El Moujahid, Kuangyuan Chen, Mehdi Amini, Ning Niu, Peter Gavin, Phil Sun, Pulkit Bhuwalka, Qiao Zhang, Raziel Alvarez, Russell Power, Sanjoy Das, Shengqi Zhu, Smit Hinsu, Tatiana Shpeisman, Tianrun Li, Tim Davis, Tom Black, Victor Akabutu, Vilobh Meshram, Xiao Yu, Xiaodan Song, Yiming Zhang, YC Ling, Youlong Chen, and Zhuoran Liu.

We would like to give special thanks to Chris Lattner for his initial technical leadership in bootstrapping this project, Martin Wicke for his support in TFRT throughout the first year, Alex Zaks for his support in TFRT during the second year and seeing through the impactful landing for Google’s ML serving workloads.

Read More

Startup Taps Finance Micromodels for Data Annotation Automation

After meeting at an entrepreneur matchmaking event, Ulrik Hansen and Eric Landau teamed up to parlay their experience in financial trading systems into a platform for faster data labeling.

In 2020, the pair of finance industry veterans founded Encord to adapt micromodels typical in finance to automated data annotation. Micromodels are neural networks that require less time to deploy because they’re trained on less data and used for specific tasks.

Encord’s NVIDIA GPU-driven service promises to automate as much as 99 percent of businesses’ manual data labeling with its micromodels.

“Instead of building one big model that does everything, we’re just combining a lot of smaller models together, and that’s very similar to how a lot of these trading systems work,” said Landau.

The startup, based in London, recently landed $12.5 million in Series A funding.

Encord is an NVIDIA Metropolis partner and a member of NVIDIA Inception, a program that offers go-to-market support, expertise and technology for AI, data science and HPC startups. NVIDIA Metropolis is an application framework that makes it easier for developers to combine video cameras and sensors with AI-enabled video analytics.

The company said it has attracted business in gastrointestinal endoscopy, radiology, thermal imaging, smart cities, agriculture, autonomous transportation and retail applications.

‘Augmenting Doctors’ for SurgEase

Back in 2021, the partners hunkered down near Laguna Beach, Calif., at the home of Landau’s parents, to build Encord while attending Y Combinator. And they had also just landed a first customer, SurgEase.

London-based SurgEase offers telepresence technology for gastroenterology. The company’s hardware device and software enable remote physicians to monitor high-definition images and video captured in colonoscopies.

“You could have a doctor in an emerging economy do the diagnostics or detection, as well as a doctor from one of the very best hospitals in the U.S.,” said Hansen.

To improve diagnostics, SurgEase is also applying video data to training AI models for detection. Encord’s micromodels are being applied to annotate the video data that’s used for SurgEase’s models. The idea is to give doctors a second set of eyes on procedures.

“Encord’s software has been instrumental in aiding us in solving some of the hardest problems in endoscopic disease assessment,” said SurgEase CEO Fareed Iqbal.

With AI-aided diagnostics, clinicians using SurgEase might spot more things sooner so that people don’t need more severe procedures down the line, said Hansen. Doctors also don’t always agree, so it can help cut through the noise with another opinion, said Landau.

“It’s really augmenting doctors,” said Landau.

King’s College of London: 6x Faster 

King’s College of London faced the challenge of annotating images in precancerous polyp videos. It turned to Encord for annotation automation because relying on highly skilled clinicians was costly on such large datasets.

The result was that the micromodels could annotate about 6.4x faster than manual labeling. They handled about 97 percent of the datasets with automated annotation, with the rest requiring manual labeling from clinicians.

Encord enabled King’s College of London to cut model development time from one year to two months, moving AI into production faster.

Triton: Quickly Into Inference

Encord was initially setting out to build its own inference engine, running on its API server. But Hansen and Landau decided using NVIDIA Triton would save a lot of engineering time and get them quickly into production.

Triton offers open-source software for taking AI into production by simplifying how models run in any framework and on any GPU or CPU for all inference types.

Also, it allowed them to focus on their early customers by not having to build inference engine architecture themselves.

People using Encord’s platform can train a micromodel and run inference very soon after that, enabled by Triton, Hansen said.

“With Triton, we get the native support for all these machine learning libraries like PyTorch and it’s compatible with CUDA,” said Hansen. “It saved us a lot of time and hassles.”

The post Startup Taps Finance Micromodels for Data Annotation Automation appeared first on The Official NVIDIA Blog.

Read More

Robot See, Robot Do

People learn to do things by watching others — from mimicking new dance moves, to watching YouTube cooking videos. We’d like robots to do the same, i.e., to learn new skills by watching people do things during training. Today, however, the predominant paradigm for teaching robots is to remote control them using specialized hardware for teleoperation and then train them to imitate pre-recorded demonstrations. This limits both who can provide the demonstrations (programmers & roboticists) and where they can be provided (lab settings). If robots could instead self-learn new tasks by watching humans, this capability could allow them to be deployed in more unstructured settings like the home, and make it dramatically easier for anyone to teach or communicate with them, expert or otherwise. Perhaps one day, they might even be able to use Youtube videos to grow their collection of skills over time.

Our motivation is to have robots watch people do tasks, naturally with their hands, and then use that data as demonstrations for learning. Video by Teh Aik Hui and Nathaniel Lim. License: CC-BY

However, an obvious but often overlooked problem is that a robot is physically different from a human, which means it often completes tasks differently than we do. For example, in the pen manipulation task below, the hand can grab all the pens together and quickly transfer them between containers, whereas the two-fingered gripper must transport one at a time. Prior research assumes that humans and robots can do the same task similarly, which makes manually specifying one-to-one correspondences between human and robot actions easy. But with stark differences in physique, defining such correspondences for seemingly easy tasks can be surprisingly difficult and sometimes impossible.

Physically different end-effectors (i.e., “grippers,” the part that interacts with the environment) induce different control strategies when solving the same task. Left: The hand grabs all pens and quickly transfers them between containers. Right: The two-fingered gripper transports one pen at a time.

In “XIRL: Cross-Embodiment Inverse RL”, presented as an oral paper at CoRL 2021, we explore these challenges further and introduce a self-supervised method for Cross-embodiment Inverse Reinforcement Learning (XIRL). Rather than focusing on how individual human actions should correspond to robot actions, XIRL learns the high-level task objective from videos, and summarizes that knowledge in the form of a reward function that is invariant to embodiment differences, such as shape, actions and end-effector dynamics. The learned rewards can then be used together with reinforcement learning to teach the task to agents with new physical embodiments through trial and error. Our approach is general and scales autonomously with data — the more embodiment diversity presented in the videos, the more invariant and robust the reward functions become. Experiments show that our learned reward functions lead to significantly more sample efficient (roughly 2 to 4 times) reinforcement learning on new embodiments compared to alternative methods. To extend and build on our work, we are releasing an accompanying open-source implementation of our method along with X-MAGICAL, our new simulated benchmark for cross-embodiment imitation.

Cross-Embodiment Inverse Reinforcement Learning (XIRL)
The underlying observation in this work is that in spite of the many differences induced by different embodiments, there still exist visual cues that reflect progression towards a common task objective. For example, in the pen manipulation task above, the presence of pens in the cup but not the mug, or the absence of pens on the table, are key frames that are common to different embodiments and indirectly provide cues for how close to being complete a task is. The key idea behind XIRL is to automatically discover these key moments in videos of different length and cluster them meaningfully to encode task progression. This motivation shares many similarities with unsupervised video alignment research, from which we can leverage a method called Temporal Cycle Consistency (TCC), which aligns videos accurately while learning useful visual representations for fine-grained video understanding without requiring any ground-truth correspondences.

We leverage TCC to train an encoder to temporally align video demonstrations of different experts performing the same task. The TCC loss tries to maximize the number of cycle-consistent frames (or mutual nearest-neighbors) between pairs of sequences using a differentiable formulation of soft nearest-neighbors. Once the encoder is trained, we define our reward function as simply the negative Euclidean distance between the current observation and the goal observation in the learned embedding space. We can subsequently insert the reward into a standard MDP and use an RL algorithm to learn the demonstrated behavior. Surprisingly, we find that this simple reward formulation is effective for cross-embodiment imitation.
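A minimal sketch of that reward, assuming a trained encoder that maps raw observations into the learned embedding space:

import numpy as np

def xirl_reward(encode, observation, goal_observation):
    """Negative Euclidean distance between the current and goal observations in the
    learned embedding space; `encode` is assumed to be the TCC-trained encoder."""
    return -float(np.linalg.norm(encode(observation) - encode(goal_observation)))

# toy usage with an identity 'encoder' on already-embedded observations
print(xirl_reward(lambda x: np.asarray(x), [0.0, 1.0], [0.0, 0.0]))  # -1.0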

XIRL self-supervises reward functions from expert demonstrations using temporal cycle consistency (TCC), then uses them for downstream reinforcement learning to learn new skills from third-person demonstrations.

X-MAGICAL Benchmark
To evaluate the performance of XIRL and baseline alternatives (e.g., TCN, LIFS, Goal Classifier) in a consistent environment, we created X-MAGICAL, which is a simulated benchmark for cross-embodiment imitation. X-MAGICAL features a diverse set of agent embodiments, with differences in their shapes and end-effectors, designed to solve tasks in different ways. This leads to differences in execution speeds and state-action trajectories, which poses challenges for current imitation learning techniques, e.g., ones that use time as a heuristic for weak correspondences between two trajectories. The ability to generalize across embodiments is precisely what X-MAGICAL evaluates.

The SweepToTop task we considered for our experiments is a simplified 2D equivalent of a common household robotic sweeping task, where an agent has to push three objects into a goal zone in the environment. We chose this task specifically because its long-horizon nature highlights how different agent embodiments can generate entirely different trajectories (shown below). X-MAGICAL features a Gym API and is designed to be easily extendable to new tasks and embodiments. You can try it out today with pip install x-magical.

Different agent shapes in the SweepToTop task in the X-MAGICAL benchmark need to use different strategies to reposition objects into the target area (pink), i.e., to “clear the debris”. For example, the long-stick can clear them all in one fell swoop, whereas the short-stick needs to do multiple consecutive back-and-forths.
Left: Heatmap of state visitation for each embodiment across all expert demonstrations. Right: Examples of expert trajectories for each embodiment.

Highlights
In our first set of experiments, we checked whether our learned embodiment-invariant reward function can enable successful reinforcement learning, when the expert demonstrations are provided through the agent itself. We find that XIRL significantly outperforms alternative methods especially on the tougher agents (e.g., short-stick and gripper).

Same-embodiment setting: Comparison of XIRL with baseline reward functions, using SAC for RL policy learning. XIRL is roughly 2 to 4 times more sample efficient than some of the baselines on the harder agents (short-stick and gripper).

We also find that our approach shows great potential for learning reward functions that generalize to novel embodiments. For instance, when reward learning is performed on embodiments that are different from the ones on which the policy is trained, we find that it results in significantly more sample efficient agents compared to the same baselines. Below, in the gripper subplot (bottom right) for example, the reward is first learned on demonstration videos from long-stick, medium-stick and short-stick, after which the reward function is used to train the gripper agent.

Cross-embodiment setting: XIRL performs favorably when compared with other baseline reward functions, trained on observation-only demonstrations from different embodiments. Each agent (long-stick, medium-stick, short-stick, gripper) had its reward trained using demonstrations from the other three embodiments.

We also find that we can train on real-world human demonstrations, and use the learned reward to train a Sawyer arm in simulation to push a puck to a designated target zone. In these experiments as well, our method outperforms baseline alternatives. For example, our XIRL variant trained only on the real-world demonstrations (purple in the plots below) reaches 80% of the total performance roughly 85% faster than the RLV baseline (orange).

What Do The Learned Reward Functions Look Like?
To further explore the qualitative nature of our learned rewards in more challenging real-world scenarios, we collect a dataset of the pen transfer task using various household tools.

Below, we show rewards extracted from a successful (top) and unsuccessful (bottom) demonstration. Both demonstrations follow a similar trajectory at the start of the task execution. The successful one nets a high reward for placing the pens consecutively into the mug then into the glass cup, while the unsuccessful one obtains a low reward because it drops the pens outside the glass cup towards the end of the execution (orange circle). These results are promising because they show that our learned encoder can represent fine-grained visual differences relevant to a task.

Conclusion
We highlighted XIRL, our approach to tackling the cross-embodiment imitation problem. XIRL learns an embodiment-invariant reward function that encodes task progress using a temporal cycle-consistency objective. Policies learned using our reward functions are significantly more sample-efficient than baseline alternatives. Furthermore, the reward functions do not require manually paired video frames between the demonstrator and the learner, giving them the ability to scale to an arbitrary number of embodiments or experts with varying skill levels. Overall, we are excited about this direction of work, and hope that our benchmark promotes further research in this area. For more details, please check out our paper and download the code from our GitHub repository.

Acknowledgments
Kevin and Andy summarized research performed together with Pete Florence, Jonathan Tompson, Jeannette Bohg (faculty at Stanford University) and Debidatta Dwibedi. All authors would additionally like to thank Alex Nichol, Nick Hynes, Sean Kirmani, Brent Yi, Jimmy Wu, Karl Schmeckpeper and Minttu Alakuijala for fruitful technical discussions, and Sam Toyer for invaluable help with setting up the simulated benchmark.

Read More