An easier way to teach robots new skills

With e-commerce orders pouring in, a warehouse robot picks mugs off a shelf and places them into boxes for shipping. Everything is humming along, until the warehouse processes a change and the robot must now grasp taller, narrower mugs that are stored upside down.

Reprogramming that robot involves hand-labeling thousands of images that show it how to grasp these new mugs, then training the system all over again.

But a new technique developed by MIT researchers would require only a handful of human demonstrations to reprogram the robot. This machine-learning method enables a robot to pick up and place never-before-seen objects that are in random poses it has never encountered. Within 10 to 15 minutes, the robot would be ready to perform a new pick-and-place task.

The technique uses a neural network specifically designed to reconstruct the shapes of 3D objects. With just a few demonstrations, the system uses what the neural network has learned about 3D geometry to grasp new objects that are similar to those in the demos.

In simulations and using a real robotic arm, the researchers show that their system can effectively manipulate never-before-seen mugs, bowls, and bottles, arranged in random poses, using only 10 demonstrations to teach the robot.

“Our major contribution is the general ability to much more efficiently provide new skills to robots that need to operate in more unstructured environments where there could be a lot of variability. The concept of generalization by construction is a fascinating capability because this problem is typically so much harder,” says Anthony Simeonov, a graduate student in electrical engineering and computer science (EECS) and co-lead author of the paper.

Simeonov wrote the paper with co-lead author Yilun Du, an EECS graduate student; Andrea Tagliasacchi, a staff research scientist at Google Brain; Joshua B. Tenenbaum, the Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Alberto Rodriguez, the Class of 1957 Associate Professor in the Department of Mechanical Engineering; and senior authors Pulkit Agrawal, a professor in CSAIL, and Vincent Sitzmann, an incoming assistant professor in EECS. The research will be presented at the International Conference on Robotics and Automation.

Grasping geometry

A robot may be trained to pick up a specific item, but if that object is lying on its side (perhaps it fell over), the robot sees this as a completely new scenario. This is one reason it is so hard for machine-learning systems to generalize to new object orientations.

To overcome this challenge, the researchers created a new type of neural network model, a Neural Descriptor Field (NDF), that learns the 3D geometry of a class of items. The model computes the geometric representation for a specific item using a 3D point cloud, which is a set of data points or coordinates in three dimensions. The data points can be obtained from a depth camera that provides information on the distance between the object and a viewpoint. While the network was trained in simulation on a large dataset of synthetic 3D shapes, it can be directly applied to objects in the real world.

The team designed the NDF with a property known as equivariance. With this property, if the model is shown an image of an upright mug, and then shown an image of the same mug on its side, it understands that the second mug is the same object, just rotated.

“This equivariance is what allows us to much more effectively handle cases where the object you observe is in some arbitrary orientation,” Simeonov says.
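
As a rough illustration of this property (a toy sketch, not the researchers’ model), consider a descriptor function that, by construction, returns the same values for a point on an object no matter how the object is rotated or translated. The hand-made descriptor below stands in for a learned Neural Descriptor Field:

import numpy as np

def random_rotation():
    # Build a random proper 3D rotation matrix via QR decomposition.
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    return q * np.sign(np.linalg.det(q))  # flip sign if needed so det = +1

def descriptor(point_cloud, query_point):
    # Stand-in for a learned descriptor: sorted distances from the query point
    # to every point in the cloud, which are unchanged by rotation/translation.
    return np.sort(np.linalg.norm(point_cloud - query_point, axis=1))

mug = np.random.rand(500, 3)           # stand-in point cloud for a mug
query = np.array([0.1, 0.2, 0.3])      # e.g., a point near the handle

R, t = random_rotation(), np.array([0.5, -0.2, 1.0])
mug_moved = mug @ R.T + t              # the same mug, rotated and translated
query_moved = R @ query + t            # the corresponding point on the moved mug

# The descriptor of the corresponding point on the moved mug matches the original,
# so demonstrations recorded in one pose carry over to other poses.
assert np.allclose(descriptor(mug, query), descriptor(mug_moved, query_moved))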

As the NDF learns to reconstruct shapes of similar objects, it also learns to associate related parts of those objects. For instance, it learns that the handles of mugs are similar, even if some mugs are taller or wider than others, or have smaller or longer handles.

“If you wanted to do this with another approach, you’d have to hand-label all the parts. Instead, our approach automatically discovers these parts from the shape reconstruction,” Du says.

The researchers use this trained NDF model to teach a robot a new skill with only a few physical examples. They move the hand of the robot onto the part of an object they want it to grip, like the rim of a bowl or the handle of a mug, and record the locations of the fingertips.

Because the NDF has learned so much about 3D geometry and how to reconstruct shapes, it can infer the structure of a new shape, which enables the system to transfer the demonstrations to new objects in arbitrary poses, Du explains.

Picking a winner

They tested their model in simulations and on a real robotic arm using mugs, bowls, and bottles as objects. Their method had a success rate of 85 percent on pick-and-place tasks with new objects in new orientations, while the best baseline was only able to achieve a success rate of 45 percent. Success means grasping a new object and placing it on a target location, like hanging mugs on a rack.

Many baselines use 2D image information rather than 3D geometry, which makes it more difficult for these methods to integrate equivariance. This is one reason the NDF technique performed so much better.

While the researchers were happy with its performance, their method only works for the particular object category on which it is trained. A robot taught to pick up mugs won’t be able to pick up boxes or headphones, since these objects have geometric features that are too different from what the network was trained on.

“In the future, scaling it up to many categories or completely letting go of the notion of category altogether would be ideal,” Simeonov says.

They also plan to adapt the system for nonrigid objects and, in the longer term, enable the system to perform pick-and-place tasks when the target area changes.

This work is supported, in part, by the Defense Advanced Research Projects Agency, the Singapore Defense Science and Technology Agency, and the National Science Foundation.

Read More

Pix2Seq: A New Language Interface for Object Detection

Object detection is a long-standing computer vision task that attempts to recognize and localize all objects of interest in an image. The complexity arises when trying to identify or localize all object instances while also avoiding duplication. Existing approaches, like Faster R-CNN and DETR, are carefully designed and highly customized in the choice of architecture and loss function. This specialization of existing systems has created two major barriers: (1) it adds complexity in tuning and training the different parts of the system (e.g., region proposal network, graph matching with GIOU loss, etc.), and (2) it can reduce the ability of a model to generalize, necessitating a redesign of the model for application to other tasks.

In “Pix2Seq: A Language Modeling Framework for Object Detection”, published at ICLR 2022, we present a simple and generic method that tackles object detection from a completely different perspective. Unlike existing approaches that are task-specific, we cast object detection as a language modeling task conditioned on the observed pixel inputs. We demonstrate that Pix2Seq achieves competitive results on the large-scale object detection COCO dataset compared to existing highly-specialized and well-optimized detection algorithms, and its performance can be further improved by pre-training the model on a larger object detection dataset. To encourage further research in this direction, we are also excited to release to the broader research community Pix2Seq’s code and pre-trained models along with an interactive demo.

Pix2Seq Overview
Our approach is based on the intuition that if a neural network knows where and what the objects in an image are, one could simply teach it how to read them out. By learning to “describe” objects, the model can learn to ground the descriptions on pixel observations, leading to useful object representations. Given an image, the Pix2Seq model outputs a sequence of object descriptions, where each object is described using five discrete tokens: the coordinates of the bounding box’s corners [ymin, xmin, ymax, xmax] and a class label.

Pix2Seq framework for object detection. The neural network perceives an image, and generates a sequence of tokens for each object, which correspond to bounding boxes and class labels.

With Pix2Seq, we propose a quantization and serialization scheme that converts bounding boxes and class labels into sequences of discrete tokens (similar to captions), and leverage an encoder-decoder architecture to perceive pixel inputs and generate the sequence of object descriptions. The training objective function is simply the maximum likelihood of tokens conditioned on pixel inputs and the preceding tokens.

Sequence Construction from Object Descriptions
In commonly used object detection datasets, images have variable numbers of objects, represented as sets of bounding boxes and class labels. In Pix2Seq, a single object, defined by a bounding box and class label, is represented as [ymin, xmin, ymax, xmax, class]. However, typical language models are designed to process discrete tokens (or integers) and are unable to comprehend continuous numbers. So, instead of representing image coordinates as continuous numbers, we normalize the coordinates between 0 and 1 and quantize them into one of a few hundred or thousand discrete bins. The coordinates are then converted into discrete tokens as are the object descriptions, similar to image captions, which in turn can then be interpreted by the language model. The quantization process is achieved by multiplying the normalized coordinate (e.g., ymin) by the number of bins minus one, and rounding it to the nearest integer (the detailed process can be found in our paper).
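
As a rough sketch (an illustration under the assumptions above, not the released Pix2Seq code), the quantization and de-quantization of a single coordinate can be written as follows for a 480 × 640 image and 500 bins:

NUM_BINS = 500  # number of discrete coordinate bins (illustrative choice)

def quantize(coord, image_size, num_bins=NUM_BINS):
    """Map a pixel coordinate to a discrete bin index in [0, num_bins - 1]."""
    normalized = coord / image_size              # normalize to [0, 1]
    return int(round(normalized * (num_bins - 1)))

def dequantize(token, image_size, num_bins=NUM_BINS):
    """Map a bin index back to an (approximate) pixel coordinate."""
    return token / (num_bins - 1) * image_size

# A box [ymin, xmin, ymax, xmax] on a 480 x 640 image becomes four coordinate tokens.
ymin, xmin, ymax, xmax = 12.0, 40.5, 300.0, 610.0
tokens = [quantize(ymin, 480), quantize(xmin, 640),
          quantize(ymax, 480), quantize(xmax, 640)]
print(tokens)  # [12, 32, 312, 476]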

Quantization of the coordinates of the bounding boxes with different numbers of bins on a 480 × 640 image. With a small number of bins/tokens, such as 500 bins (∼1 pixel/bin), it achieves high precision even for small objects.

After quantization, the object annotations provided with each training image are ordered into a sequence of discrete tokens (shown below). Since the order of the objects does not matter for the detection task per se, we randomize the order of objects each time an image is shown during training. We also append an End of Sequence (EOS) token at the end as​​ different images often have different numbers of objects, and hence sequence lengths.
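
Combining the quantization step with random object ordering and the EOS token, a hypothetical sequence-construction routine might look like the following (the class-token offset and EOS token ID are illustrative assumptions, not the paper’s exact vocabulary layout):

import random

NUM_BINS = 500
CLASS_OFFSET = NUM_BINS          # assume class tokens live after the coordinate bins
EOS_TOKEN = CLASS_OFFSET + 100   # assume 100 class slots, then the EOS token

def object_to_tokens(box, class_id, height, width, num_bins=NUM_BINS):
    ymin, xmin, ymax, xmax = box
    q = lambda v, size: int(round(v / size * (num_bins - 1)))
    return [q(ymin, height), q(xmin, width),
            q(ymax, height), q(xmax, width),
            CLASS_OFFSET + class_id]

def build_sequence(objects, height, width):
    # Randomize object order each time the image is seen, then append EOS.
    objects = list(objects)
    random.shuffle(objects)
    seq = []
    for box, class_id in objects:
        seq.extend(object_to_tokens(box, class_id, height, width))
    return seq + [EOS_TOKEN]

annotations = [([12, 40, 300, 610], 3), ([50, 80, 120, 200], 17)]
print(build_sequence(annotations, height=480, width=640))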

The bounding boxes and class labels for objects detected in the image on the left are represented in the sequences shown on the right. A random object ordering strategy is used in our work but other approaches to ordering could also be used.

The Model Architecture, Objective Function, and Inference
We treat the sequences that we constructed from object descriptions as a “dialect” and address the problem via a powerful and general language model with an image encoder and an autoregressive language decoder. Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss. At inference time, we sample tokens from the model likelihood. The sampled sequence ends when the EOS token is generated. Once the sequence is generated, we split it into chunks of 5 tokens for extracting and de-quantizing the object descriptions (i.e., obtaining the predicted bounding boxes and class labels). It is worth noting that both the architecture and loss function are task-agnostic in that they don’t assume prior knowledge about object detection (e.g., bounding boxes). We describe how we can incorporate task-specific prior knowledge with a sequence augmentation technique in our paper.
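
For the decoding step described above, a sketch of the post-processing (splitting the sampled sequence into chunks of 5 tokens and de-quantizing them) could look like this, reusing the illustrative constants from the previous snippets:

def decode_sequence(tokens, height, width, num_bins=500, class_offset=500, eos_token=600):
    """Turn a sampled token sequence back into boxes and class labels."""
    if eos_token in tokens:
        tokens = tokens[:tokens.index(eos_token)]           # drop everything after EOS
    detections = []
    for i in range(0, len(tokens) - len(tokens) % 5, 5):    # chunks of 5 tokens
        qy0, qx0, qy1, qx1, cls = tokens[i:i + 5]
        box = [qy0 / (num_bins - 1) * height, qx0 / (num_bins - 1) * width,
               qy1 / (num_bins - 1) * height, qx1 / (num_bins - 1) * width]
        detections.append((box, cls - class_offset))
    return detections

sample = [12, 32, 312, 476, 503, 600]   # one box, class 3, then EOS
print(decode_sequence(sample, height=480, width=640))  # approximately recovers the box quantized earlier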

Results
Despite its simplicity, Pix2Seq achieves impressive empirical performance on benchmark datasets. Specifically, we compare our method with well established baselines, Faster R-CNN and DETR, on the widely used COCO dataset and demonstrate that it achieves competitive average precision (AP) results.

Pix2Seq achieves competitive AP results compared to existing systems that require specialization during model design, while being significantly simpler. The best performing Pix2Seq model achieved an AP score of 45.

Since our approach incorporates minimal inductive bias or prior knowledge of the object detection task into the model design, we further explore how pre-training the model on a larger object detection dataset can impact its performance. Our results indicate that this training strategy (along with using bigger models) can further boost performance.

The average precision of the Pix2Seq model with pre-training followed by fine-tuning. The best performing Pix2Seq model without pre-training achieved an AP score of 45. When the model is pre-trained, we see an 11% improvement with an AP score of 50.

Pix2Seq can detect objects in densely populated and complex scenes, such as those shown below.

Example complex and densely populated scenes labeled by a trained Pix2Seq model. Try it out here.

Conclusion and Future Work
With Pix2Seq, we cast object detection as a language modeling task conditioned on pixel inputs for which the model architecture and loss function are generic, and have not been engineered specifically for the detection task. One can, therefore, readily extend this framework to different domains or applications, where the output of the system can be represented by a relatively concise sequence of discrete tokens (e.g., keypoint detection, image captioning, visual question answering), or incorporate it into a perceptual system supporting general intelligence, for which it provides a language interface to a wide range of vision and language tasks. We also hope that the release of Pix2Seq’s code, pre-trained models, and interactive demo will inspire further research in this direction.

Acknowledgements
This post reflects the combined work with our co-authors: Saurabh Saxena, Lala Li, Geoffrey Hinton. We would also like to thank Tom Small for the visualization of the Pix2Seq illustration figure.

Read More

Secure AWS CodeArtifact access for isolated Amazon SageMaker notebook instances

AWS CodeArtifact allows developers to connect internal code repositories to upstream code repositories like PyPI, Maven, or npm. AWS CodeArtifact is a powerful addition to CI/CD workflows on AWS, but it is similarly effective for codebases developed in Jupyter notebooks. This is a common development paradigm for machine learning (ML) developers who build and train ML models regularly.

In this post, we demonstrate how to securely connect to AWS CodeArtifact from an Internet-disabled SageMaker notebook instance. This post is for network and security architects who support decentralized data science teams on AWS.

In another post, we discussed how to create an Internet-disabled notebook in a private subnet of an Amazon VPC while maintaining connectivity to AWS services via AWS Private Link endpoints. The examples in this post will connect an Internet-disabled notebook instance to AWS CodeArtifact and download open-source code packages without needing to traverse the public internet.

Solution overview

The following diagram describes the solution we will implement. We create a SageMaker notebook instance in a private subnet of a VPC. We also create an AWS CodeArtifact domain and a repository. Access to the repository will be controlled by CodeArtifact repository policies and PrivateLink access policies.

The architecture allows our Internet-disabled SageMaker notebook instance to access CodeArtifact repositories without traversing the public internet. Because the network traffic doesn’t traverse the public internet, we improve the security posture of the notebook instance by ensuring that only users with the expected network access can access the notebook instance. Furthermore, this paradigm allows security administrators to restrict library consumption to only “approved” distributions of code packages. By combining network security with secure package management, security engineers can transparently manage open-source libraries for data scientists without impeding their ability to work.

Prerequisites

For this post, we need an Internet-disabled SageMaker notebook instance and a VPC with a private subnet. Visit this link to create a private subnet in a VPC, and see the previous post in this series to get started with these prerequisites.

We also need a CodeArtifact domain in the AWS Region where you created your Internet-disabled SageMaker notebook instance. The domain is a way to organize repositories logically in the CodeArtifact service. Name the domain, select AWS managed key for encryption, and create the domain. This link discusses how to create an AWS CodeArtifact Domain.

Configure AWS CodeArtifact

To configure AWS CodeArtifact, we create a repository in the domain. Before proceeding, be sure to select the same region where the notebook instance has been deployed. Perform the following steps to configure AWS CodeArtifact:

  1. On the AWS CodeArtifact console, choose Create repository.
  2. Give the repository a name and description. Select the public upstream repository you want to use. We use the pypi-store in this post.
  3. Choose Next.

  4. Choose the AWS account you are working in and the AWS CodeArtifact domain you use for this account.
  5. Choose Next.

  6. Review the repository information summary and choose Create repository.
    1. Note: In the review screen section marked “Package flow” there is a flowchart describing the flow of dependencies from external connections into the domain managed by AWS CodeArtifact.
    2. This flowchart describes what happens when we create our repository. We are actually creating two repositories. The first is the “pypi-store” repository that connects to the externally hosted pypi repository. This repository was created by AWS CodeArtifact and is used to stage connections to the upstream repository. The second repository, “isolatedsm”, connects to the “pypi-store” repository. This transitive repository lets us combine external connections and stage third-party libraries before using them.
    3. This form of transitive repository management allows us to enforce least privileged access on the libraries we use for data science workloads.

Alternatively, we can perform these steps with the AWS CLI using the following command:

aws codeartifact create-repository --domain <domain_name> --domain-owner <account_number> --repository <repo_name> --description <repo_description> --region <region_name>

The result at the end of this process is two repositories in our domain.
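
If you prefer to stay in Python (for example, in the same notebook environment), a roughly equivalent boto3 sketch is shown below. The domain, repository, and account values are placeholders, and the snippet assumes the pypi-store repository already exists in the domain so it can be attached as an upstream:

import boto3

codeartifact = boto3.client('codeartifact', region_name='<region_name>')

# Create the repository and attach the existing pypi-store repository as its upstream.
codeartifact.create_repository(
    domain='<domain_name>',
    domainOwner='<account_number>',
    repository='<repo_name>',
    description='<repo_description>',
    upstreams=[{'repositoryName': 'pypi-store'}],
)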

CodeArtifact Repository Policy

For brevity, we will grant a relatively open policy for our IsolatedSM repository allowing any of our developers to access the CodeArtifact repository. This policy should be modified for production use cases. Later in the post, we will discuss how to implement least-privilege access at the role level using an IAM policy attached to the notebook instance role. For now, navigate to the repository in the AWS Management Console and expand the Details section of the repository configuration page. Choose Apply a repository policy under Repository policy.

On the next screen, paste the following policy document in the text field marked Edit repository policy and then choose Save:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeartifact:AssociateExternalConnection",
                "codeartifact:CopyPackageVersions",
                "codeartifact:DeletePackageVersions",
                "codeartifact:DeleteRepository",
                "codeartifact:DeleteRepositoryPermissionsPolicy",
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:DisassociateExternalConnection",
                "codeartifact:DisposePackageVersions",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackages",
                "codeartifact:PublishPackageVersion",
                "codeartifact:PutPackageMetadata",
                "codeartifact:PutRepositoryPermissionsPolicy",
                "codeartifact:ReadFromRepository",
                "codeartifact:UpdatePackageVersionsStatus",
                "codeartifact:UpdateRepository"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Principal": {
                "AWS": "arn:aws:iam::329542461890:role/GeneralIsolatedNotebook"
            }
        }
    ]
}

This policy provides full access to the repository for the role attached to the isolated notebook instance. This is a sample policy that enables developer access to CodeArtifact. For more information on policy definitions for CodeArtifact repositories (especially if you need more restrictive role-based access), see the CodeArtifact User Guide. To configure the same repository policy using the AWS CLI, save the preceding policy document as policy.json and run the following command:

aws codeartifact put-repository-permissions-policy --domain <domain_name> --domain-owner <account_number> --repository <repo_name> --policy-document file:///PATH/TO/policy.json --region <region_name>
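
A roughly equivalent boto3 call is sketched below (placeholders as before); note that the SDK expects the policy as a JSON string rather than a file:// path:

import boto3

codeartifact = boto3.client('codeartifact', region_name='<region_name>')

# Read the saved policy document and apply it to the repository.
with open('policy.json') as f:
    policy_document = f.read()

codeartifact.put_repository_permissions_policy(
    domain='<domain_name>',
    domainOwner='<account_number>',
    repository='<repo_name>',
    policyDocument=policy_document,
)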

Configure access to AWS CodeArtifact

Connecting to a CodeArtifact repository requires logging into the repository as a user. By navigating into a repository and choosing View connection instructions, we can select the appropriate connection instructions for our package manager of choice. We will use pip in this post; from the drop-down, select pip and copy the connection instructions.

The AWS CLI command to use should be similar to the following:

aws codeartifact login --tool pip --repository <repo_name> --domain <domain_name> --domain-owner <account_number> --region <region_name>

This command is a CodeArtifact API call that returns an authorization token for the role that requested access to this repository. The command can be run in a Jupyter notebook to authenticate access to the CodeArtifact repository, and it configures the package manager to install libraries from that repository. We can test this in our Internet-disabled SageMaker notebook instance.
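
Under the hood, the login helper fetches a temporary authorization token and points pip at the repository’s CodeArtifact endpoint. The following boto3 sketch approximates what it does; the endpoint URL format and placeholder values are assumptions for illustration, and in practice the single login command above is all you need:

import subprocess
import boto3

domain, owner, repo, region = '<domain_name>', '<account_number>', '<repo_name>', '<region_name>'

codeartifact = boto3.client('codeartifact', region_name=region)

# Fetch a temporary authorization token for the caller's role.
token = codeartifact.get_authorization_token(domain=domain, domainOwner=owner)['authorizationToken']

# Point pip at the CodeArtifact PyPI endpoint for this repository (assumed URL format).
index_url = f"https://aws:{token}@{domain}-{owner}.d.codeartifact.{region}.amazonaws.com/pypi/{repo}/simple/"
subprocess.run(['pip', 'config', 'set', 'global.index-url', index_url], check=True)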

When we run the login command in our isolated notebook instance, nothing happens for some time. After a while (approximately 300 seconds), Jupyter outputs a connection timeout error because our notebook instance lives in an isolated network subnet. This is expected behavior; it confirms our notebook instance is Internet-disabled. We need to provision network access between this subnet and our CodeArtifact repository.

Create A PrivateLink connection between the Notebook and CodeArtifact

AWS PrivateLink is a networking service that creates VPC endpoints in your VPC for other AWS services like Amazon Elastic Compute Cloud (Amazon EC2), Amazon S3, and Amazon Simple Notification Service (Amazon SNS). Private endpoints facilitate API requests to other AWS services through your VPC instead of through the public internet. This is the crucial component that lets our solution privately and securely access the CodeArtifact repository we’ve created.

Before we create our PrivateLink endpoints, we must create a security group to associate with the endpoints. Before proceeding, make sure you are in the same region as the Internet-disabled SageMaker notebook instance.

  1. On the Amazon VPC console, choose Security Groups.
  2. Choose Create security group.
  3. Give the group an appropriate name and description.
  4. Select the VPC that you will deploy the PrivateLink endpoints to. This should be the same VPC that hosts the isolated SageMaker notebook.
  5. Under Inbound Rules, choose Add Rule and then permit All Traffic from the security group hosting the isolated SageMaker notebook.
  6. Outbound Rules should remain the default. Create the security group.

You can replicate these steps in the CLI with the following:

> aws ec2 create-security-group --group-name endpoint-sec-group --description "group for endpoint security" --vpc-id vpc_id --region region
{
    "GroupId": "endpoint-sec-group-id"
}
> aws ec2 authorize-security-group-ingress --group-id endpoint-sec-group-id --protocol all --port -1 --source-group isolated-SM-sec-group-id --region region

For this next step, we recommend that customers use the AWS CLI for simplicity. First, save the following policy document as policy.json in your local file system.

{
  "Statement": [
    {
      "Action": "codeartifact:*",
      "Effect": "Allow",
      "Resource": "*",
      "Principal": "*"
    },
    {
      "Effect": "Allow",
      "Action": "sts:GetServiceBearerToken",
      "Resource": "*",
      "Principal": "*"
    },
    {
      "Action": [
        "codeartifact:CreateDomain",
        "codeartifact:CreateRepository",
        "codeartifact:DeleteDomain",
        "codeartifact:DeleteDomainPermissionsPolicy",
        "codeartifact:DeletePackageVersions",
        "codeartifact:DeleteRepository",
        "codeartifact:DeleteRepositoryPermissionsPolicy",
        "codeartifact:PutDomainPermissionsPolicy",
        "codeartifact:PutRepositoryPermissionsPolicy",
        "codeartifact:UpdateRepository"
      ],
      "Effect": "Deny",
      "Resource": "*",
      "Principal": "*"
    }
  ]
}

Then, run the following commands using the AWS CLI to create the PrivateLink endpoints for CodeArtifact. This first command creates the VPC Endpoint for CodeArtifact repository API commands.

> aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-id \
--service-name com.amazonaws.region.codeartifact.repositories \
--subnet-ids [list-of-subnet-ids] \
--security-group-ids endpoint-sec-group-id \
--private-dns-enabled \
--policy-document file:///PATH/TO/policy.json \
--region region

This second command creates the VPC Endpoint for CodeArtifact non-repository API commands. Note that in this command, we do not enable private DNS for the endpoint. Note the output of this command, as we will use it to enable private DNS in a subsequent CLI command.

> aws ec2 create-vpc-endpoint \
--vpc-endpoint-type Interface \
--vpc-id vpc-id \
--service-name com.amazonaws.region.codeartifact.api \
--subnet-ids [list-of-subnet-ids] \
--security-group-ids endpoint-sec-group-id \
--no-private-dns-enabled \
--policy-document file:///PATH/TO/policy.json \
--region region
{
    "VpcEndpoint": {
        "VpcEndpointId": "vpc-endpoint-id",
        ...
    }
}

Once this VPC endpoint has been created, enable private DNS for the endpoint by running the following, final command:

> aws ec2 modify-vpc-endpoint \
--vpc-endpoint-id vpc-endpoint-id \
--private-dns-enabled \
--region region

This policy document permits common CodeArtifact operations performed by developers over this PrivateLink endpoint. This is acceptable for our use case because CodeArtifact lets us define access policies on the repositories themselves. We only block CodeArtifact administrative commands from being sent over this endpoint, because we do not want developers to perform administrative actions on the repository.

The following screenshot shows the API endpoint and the security group it belongs to.

The inbound rules on this security group should list one inbound rule, allowing All traffic from the isolated SageMaker notebook’s security group.

Network test

Once the security groups have been configured and the PrivateLink endpoints have been created, open a Jupyter notebook on the isolated SageMaker notebook instance. In a cell, run the connection instructions for the CodeArtifact repository we created earlier. Instead of a long pause with an eventual timeout error, we now get an AccessDenied exception.

Recall that CodeArtifact connection instructions can be found in the AWS Management Console by navigating to the CodeArtifact repository and selecting View connection instructions. For this post, select the connection instructions for pip.

At this point, our isolated SageMaker notebook instance can connect to the CodeArtifact service via PrivateLink. We now need to give our notebook instance’s role the relevant permissions required to interact with the service from a Jupyter notebook.

Modify notebook permissions

Once our CodeArtifact repository is configured, we need to modify the permissions on our isolated notebook instance role to allow our notebook to read from the artifact repository. In the AWS Management Console, navigate to the IAM service and, under Policies, choose Create Policy. Choose the JSON tab and paste the following JSON document in the text window:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "codeartifact:GetAuthorizationToken",
            "Resource": "arn:aws:codeartifact:<region>:<account_no>:domain/<domain_name>"
        },
        {
            "Effect": "Allow",
            "Action": "sts:GetServiceBearerToken",
            "Resource": "*"
        }
    ]
}

To create this policy document in the CLI, save this JSON as policy.json and run the following command:

aws iam create-policy --policy-name <policy_name> --policy-document file:///PATH/TO/policy.json

When attached to a role, this policy document permits the retrieval of an authorization token from the CodeArtifact service. Attach this policy to our notebook instance role by navigating to the IAM service in the AWS Management Console. Choose the notebook instance role you are using and attach this policy directly to the instance role. This can be done with the AWS CLI by running the following command:

aws iam attach-role-policy --role-name <role_name> --policy-arn <policy_arn>

Equipped with a role that allows authentication to CodeArtifact, we can now continue testing.

Permissions Test

In the AWS Management Console, navigate to the SageMaker service and open a Jupyter notebook from the Internet-disabled notebook instance. In the notebook cell, attempt to log into the CodeArtifact repository using the same command from the network test (found in the Network Test section).

Instead of an access denied exception, the output should show a successful authentication to the repository with an expiration on the token. Continue testing by using pip to download, install, and uninstall packages. These commands are authorized based on the policy attached to the CodeArtifact repository. If you want to restrict access to the repository based on the user, for example, restricting the ability to uninstall a package, modify the CodeArtifact repository policy.

We can confirm that the packages are installed by navigating to the repository in the AWS Management Console and searching for the installed package.

Clean up

When you destroy the VPC endpoints, the notebook instance loses access to the CodeArtifact repository, which reintroduces the timeout error from earlier in this post. This is expected behavior. You may also delete the CodeArtifact repository; CodeArtifact charges are based on the amount of data stored per month (in GB).

Conclusion

By combining VPC endpoints with SageMaker notebooks, we can extend the availability of other AWS services to Internet-disabled private notebook instances. This allows us to improve the security posture of our development environment, without sacrificing developer productivity.


About the Author

Dan Ferguson is a Solutions Architect at Amazon Web Services, focusing primarily on Private Equity & Growth Equity investments into late-stage startups.

Read More

A Platform for Continuous Construction and Serving of Knowledge At Scale

We introduce Saga, a next-generation knowledge construction and serving platform for powering knowledge-based applications at industrial scale. Saga follows a hybrid batch-incremental design to continuously integrate billions of facts about real-world entities and construct a central knowledge graph that supports multiple production use cases with diverse requirements around data freshness, accuracy, and availability. In this paper, we discuss the unique challenges associated with knowledge graph construction at industrial scale, and review the main components of Saga and how they address… (Apple Machine Learning Research)

By Land, Sea and Space: How 5 Startups Are Using AI to Help Save the Planet

Different parts of the globe are experiencing distinct climate challenges — severe drought, dangerous flooding, reduced biodiversity or dense air pollution.

The challenges are so great that no country can solve them on their own. But innovative startups worldwide are lighting the way, demonstrating how these daunting challenges can be better understood and addressed with AI.

Here’s how five — all among the 10,000+ members of NVIDIA Inception, a program designed to nurture cutting-edge startups — are looking out for the environment using NVIDIA-accelerated applications:

Blue Sky Analytics Builds Cloud Platform for Climate Action

India-based Blue Sky Analytics is building a geospatial intelligence platform that harnesses satellite data for environmental monitoring and climate risk assessment. The company provides developers with climate datasets to analyze air quality and estimate greenhouse gas emissions from fires  — with additional datasets in the works to forecast future biomass fires and monitor water capacity in lakes, rivers and glacial melts.

The company uses cloud-based NVIDIA GPUs to power its work. It’s a founding member of Climate TRACE, a global coalition led by Al Gore that aims to provide high-resolution global greenhouse gas emissions data in near real time. The startup leads Climate TRACE’s work examining how land use and land cover change due to fires.

Rhions Lab Protects Wildlife With Computer Vision

Kenya-based Rhions Lab uses AI to tackle challenges to biodiversity, including human-wildlife conflict, poaching and illegal wildlife trafficking. The company is adopting NVIDIA Jetson Nano modules for AI at the edge to support its conservation projects.

One of the company’s projects, Xoome, is an AI-powered camera trap that identifies wild animals, vehicles and civilians — sending alerts of poaching threats to on-duty wildlife rangers. Another initiative monitors beekeepers’ colonies with a range of sensors that capture acoustic data, vibrations, temperature and humidity within beehives. The platform can help beekeepers monitor bee colony health and fend off threats from thieves, whether honey badgers or humans.

TrueOcean Predicts Undersea Carbon Capture and Storage

German startup TrueOcean analyzes global-scale maritime data to inform innovation around natural ocean carbon sinks, renewable energy and shipping route optimization. The company is using AI to predict and quantify carbon absorption and storage in seagrass meadows and subsea geology. This makes it possible to greatly increase the carbon storage potential of Earth’s oceans.

TrueOcean uses AI solutions, including federated learning accelerated on NVIDIA DGX A100 systems, to help scientists predict, monitor and manage these sequestration efforts.

ASTERRA Saves Water With GPU-Accelerated Leak Detection

ASTERRA, based in Israel, has developed AI models that analyze satellite images to answer critical questions around water infrastructure. It’s equipping maintenance workers and engineers with the insights needed to find deficient water pipelines, assess underground moisture and locate leaks. The company uses NVIDIA GPUs through Amazon Web Services to develop and run its machine learning algorithms.

Since deploying its leak detection solution in 2016, ASTERRA has helped the water industry identify tens of thousands of leaks, conserving billions of gallons of drinkable water each year. Stopping leaks prevents ground pollution, reduces water wastage and even saves power. The company estimates its solution has reduced the water industry’s energy use by more than 420,000 megawatt hours since its launch.

Neu.ro Launches Renewable Energy-Powered AI Cloud

Another way to make a difference is by decreasing the carbon footprint of training AI models.

To help address this challenge, San Francisco-based Inception startup Neu.ro launched an NVIDIA DGX A100-powered AI cloud that’s powered entirely by geothermal and hydropower, with free-air cooling. Located in Iceland, the data center is being used for AI applications in telecommunications, retail, finance and healthcare.

The company has also developed a Green AI suite to help businesses monitor the environmental impact of AI projects, allowing developer teams to optimize compute usage to balance performance with carbon footprint.

Learn more about how GPU technology drives applications with social impact, including environmental projects. AI, data science and HPC startups can apply to join NVIDIA Inception.

The post By Land, Sea and Space: How 5 Startups Are Using AI to Help Save the Planet appeared first on NVIDIA Blog.

Read More

Hidden Interfaces for Ambient Computing

As consumer electronics and internet-connected appliances are becoming more common, homes are beginning to embrace various types of connected devices that offer functionality like music control, voice assistance, and home automation. A graceful integration of devices requires adaptation to existing aesthetics and user styles rather than simply adding screens, which can easily disrupt a visual space, especially when they become monolithic surfaces or black screens when powered down or not actively used. Thus there is an increasing desire to create connected ambient computing devices and appliances that can preserve the aesthetics of everyday materials, while providing on-demand access to interaction and digital displays.

Illustration of how hidden interfaces can appear and disappear in everyday surfaces, such as a mirror or the wood paneling of a home appliance.

In “Hidden Interfaces for Ambient Computing: Enabling Interaction in Everyday Materials through High-Brightness Visuals on Low-Cost Matrix Displays”, presented at ACM CHI 2022, we describe an interface technology that is designed to be embedded underneath materials and our vision of how such technology can co-exist with everyday materials and aesthetics. This technology makes it possible to have high-brightness, low-cost displays appear from underneath materials such as textile, wood veneer, acrylic or one-way mirrors, for on-demand touch-based interaction.

Hidden interface prototypes demonstrate bright and expressive rendering underneath everyday materials. From left to right: thermostat under textile, a scalable clock under wood veneer, and a caller ID display and a zooming countdown under mirrored surfaces.

Parallel Rendering: Boosting PMOLED Brightness for Ambient Computing
While many of today’s consumer devices employ active-matrix organic light-emitting diode (AMOLED) displays, their cost and manufacturing complexity are prohibitive for ambient computing. Other display technologies, such as E-ink and LCD, do not have sufficient brightness to penetrate materials.

To address this gap, we explore the potential of passive-matrix OLEDs (PMOLEDs), which are based on a simple design that significantly reduces cost and complexity. However, PMOLEDs typically use scanline rendering, where active display driver circuitry sequentially activates one row at a time, a process that limits display brightness and introduces flicker.

Instead, we propose a system that uses parallel rendering, where as many rows as possible are activated simultaneously in each operation by grouping rectilinear shapes of horizontal and vertical lines. For example, a square can be shown with just two operations, in contrast to traditional scanline rendering that needs as many operations as there are rows. With fewer operations, parallel rendering can output significantly more light in each instant to boost brightness and eliminate flicker. The technique is not strictly limited to lines and rectangles even if that is where we see the most dramatic performance increase. For example, one could add additional rendering steps for antialiasing (i.e., smoothing of) non-rectilinear content.
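
A toy model of why this matters for brightness: if a frame is divided evenly among rendering operations and a pixel is lit only during the operation that includes it, then fewer operations mean more on-time per pixel. The sketch below (an illustration of the argument, not the paper’s driver code) compares an unfilled rectangle rendered by scanline versus parallel rendering:

def duty_cycle(num_ops):
    # Each operation gets an equal share of the frame time, and a given pixel
    # is lit only during the operation that includes it.
    return 1.0 / num_ops

rows = 64  # height of an unfilled rectangle, in display rows

# Scanline rendering: one operation per row, so each lit pixel is on 1/64 of the frame.
scanline = duty_cycle(rows)

# Parallel rendering of an unfilled rectangle needs only two operations:
#   op 1: top and bottom rows x all columns of the rectangle
#   op 2: interior rows x the left and right columns
parallel = duty_cycle(2)

print(f"per-pixel duty cycle: scanline {scanline:.3f}, parallel {parallel:.3f} "
      f"({parallel / scanline:.0f}x more on-time per frame)")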

Illustration of scanline rendering (top) and parallel rendering (bottom) operations of an unfilled rectangle. Parallel rendering achieves bright, flicker-free graphics by simultaneously activating multiple rows.

Rendering User Interfaces and Text
We show that hidden interfaces can be used to create dynamic and expressive interactions. With a set of fundamental UI elements such as buttons, switches, sliders, and cursors, each interface can provide different basic controls, such as light switches, volume controls and thermostats. We created a scalable font (i.e., a set of numbers and letters) that is designed for efficient rendering in just a few operations. While we currently exclude letters “k, z, x” with their diagonal lines, they could be supported with additional operations. The per-frame-control of font properties coupled with the high frame rate of the display enables very fluid animations — this capability greatly expands the expressivity of the rectilinear graphics far beyond what is possible on fixed 7-segment LED displays.

In this work, we demonstrate various examples, such as a scalable clock, a caller ID display, a zooming countdown timer, and a music visualizer.

Realizing Hidden Interfaces with Interactive Hardware
To implement proof-of-concept hidden interfaces, we use a PMOLED display with 128×96 resolution that has all row and column drivers routed to a connector for direct access. We use a custom printed circuit board (PCB) with fourteen 16-channel digital-to-analog converters (DACs) to directly interface those 224 lines from a Raspberry Pi 3 A+. The touch interaction is enabled by a ring-shaped PCB surrounding the display with 12 electrodes arranged in arc segments.

Comparison to Existing Technologies
We compared the brightness of our parallel rendering to both the scanline on the same PMOLED and a small and large state-of-the-art AMOLED. We tested brightness through six common materials, such as wood and plastic. The material thickness ranged from 0.2 mm for the one-way mirror film to 1.6 mm for basswood. We measured brightness in lux (lx = light intensity as perceived by the human eye) using a light meter near the display. The environmental light was kept dim, slightly above the light meter’s minimum sensitivity. For simple rectangular shapes, we observed 5–40x brightness increase for the PMOLED in comparison to the AMOLED. The exception was the thick basswood, which didn’t let much light through for any rendering technology.

Example showing performance difference between parallel rendering on the PMOLED (this work) and a similarly sized modern 1.4″ AMOLED.

To validate the findings from our technical characterization with more realistic and complex content, we evaluate the number “2”, a grid of checkboxes, three progress bars, and the text “Good Life”. For this more complex content, we observed a 3.6–9.3x brightness improvement. These results suggest that our approach of parallel rendering on PMOLED enables display through several materials, and outperforms common state-of-the-art AMOLED displays, which seem to not be usable for the tested scenarios.

Brightness experiments with additional shapes that require different numbers of operations (ops). Measurements are shown in comparison to large state-of-the-art AMOLED displays.

What’s Next?
In this work, we enabled hidden interfaces that can be embedded in traditional materials and appear on demand. Our lab evaluation suggests unmet opportunities to introduce hidden displays with simple, yet expressive, dynamic and interactive UI elements and text in traditional materials, especially wood and mirror, to blend into people’s homes.

In the future, we hope to investigate more advanced parallel rendering techniques, using algorithms that could also support images and complex vector graphics. Furthermore, we plan to explore efficient hardware designs. For example, application-specific integrated circuits (ASICs) could enable an inexpensive and small display controller with parallel rendering instead of a large array of DACs. Finally, longitudinal deployment would enable us to go deeper into understanding user adoption and behavior with hidden interfaces.

Hidden interfaces demonstrate how control and feedback surfaces of smart devices and appliances could visually disappear when not in use and then appear when in the user’s proximity or touch. We hope this direction will encourage the community to consider other approaches and scenarios where technology can fade into the background for a more harmonious coexistence with traditional materials and human environments.

Acknowledgements
First and foremost, we would like to thank Ali Rahimi and Roman Lewkow for the collaboration, including providing the enabling technology. We also thank Olivier Bau, Aaron Soloway, Mayur Panchal and Sukhraj Hothi for their prototyping and fabrication contributions. We thank Michelle Chang and Mark Zarich for visual designs, illustrations and presentation support. We thank Google ATAP and the Google Interaction Lab for their support of the project. Finally, we thank Sarah Sterman and Mathieu Le Goc for helpful discussions and suggestions.

Read More

Specify and extract information from documents using the new Queries feature in Amazon Textract

Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract now offers the flexibility to specify the data you need to extract from documents using the new Queries feature within the Analyze Document API. You don’t need to know the structure of the data in the document (table, form, implied field, nested data) or worry about variations across document versions and formats.

In this post, we discuss the following topics:

  • Success stories from AWS customers and benefits of the new Queries feature
  • How the Analyze Document Queries API helps extract information from documents
  • A walkthrough of the Amazon Textract console
  • Code examples to utilize the Analyze Document Queries API
  • How to process the response with the Amazon Textract parser library

Benefits of the new Queries feature

Traditional OCR solutions struggle to extract data accurately from most semi-structured and unstructured documents because of significant variations in how the data is laid out across multiple versions and formats of these documents. You need to implement custom postprocessing code or manually review the extracted information from these documents. With the Queries feature, you can specify the information you need in the form of natural language questions (for example, “What is the customer name”) and receive the exact information (“John Doe”) as part of the API response. The feature uses a combination of visual, spatial, and language models to extract the information you seek with high accuracy. The Queries feature is pre-trained on a large variety of semi-structured and unstructured documents. Some examples include paystubs, bank statements, W-2s, loan application forms, mortgage notes, and vaccine and insurance cards.

“Amazon Textract enables us to automate the document processing needs of our customers. With the Queries feature, we will be able to extract data from a variety of documents with even greater flexibility and accuracy,” said Robert Jansen, Chief Executive Officer at TekStream Solutions. “We see this as a big productivity win for our business customers, who will be able to use the Queries capability as part of our IDP solution to quickly get key information out of their documents.”

“Amazon Textract enables us to extract text as well as structured elements like Forms and Tables from images with high accuracy. Amazon Textract Queries has helped us drastically improve the quality of information extraction from several business-critical documents such as safety data sheets or material specifications,” said Thorsten Warnecke, Principal | Head of PC Analytics, Camelot Management Consultants. “The natural language query system offers great flexibility and accuracy, which has reduced our post-processing load and enabled us to add new documents to our data extraction tools quicker.”

How the Analyze Document Queries API helps extract information from documents

Companies have increased their adoption of digital platforms, especially in light of the COVID-19 pandemic. Most organizations now offer a digital way to acquire their services and products utilizing smartphones and other mobile devices, which offers flexibility to users but also adds to the scale at which digital documents need to be reviewed, processed, and analyzed. In some workloads where, for example, mortgage documents, vaccination cards, paystubs, insurance cards, and other documents must be digitally analyzed, the complexity of data extraction can become exponentially aggravated because these documents lack a standard format or have significant variations in data format across different versions of the document.

Even powerful OCR solutions struggle to extract data accurately from these documents, and you may have to implement custom postprocessing for these documents. This includes mapping possible variations of form keys to customer-native field names or including custom machine learning to identify specific information in an unstructured document.

The new Analyze Document Queries API in Amazon Textract can take natural language written questions such as “What is the interest rate?” and perform powerful AI and ML analysis on the document to figure out the desired information and extract it from the document without any postprocessing. The Queries feature doesn’t require any custom model training or setting up of templates or configurations. You can quickly get started by uploading your documents and specifying questions on those documents via the Amazon Textract console, the AWS Command Line Interface (AWS CLI), or AWS SDK.

In subsequent sections of this post, we go through detailed examples of how to use this new functionality on common workload use cases and how to use the Analyze Document Queries API to add agility to the process of digitalizing your workload.

Use the Queries feature on the Amazon Textract console

Before we get started with the API and code samples, let’s review the Amazon Textract console. The following image shows an example of a vaccination card on the Queries tab for the Analyze Document API on the Amazon Textract console. After you upload the document to the Amazon Textract console, choose Queries in the Configure Document section. You can then add queries in the form of natural language questions. After you add all your queries, choose Apply Configuration. The answers to the questions are located on the Queries tab.

Code examples

In this section, we explain how to invoke the Analyze Document API with the Queries parameter to get answers to natural language questions about the document. The input document is either in a byte array format or located in an Amazon Simple Storage Service (Amazon S3) bucket. You pass image bytes to an Amazon Textract API operation by using the Bytes property. For example, you can use the Bytes property to pass a document loaded from a local file system. Image bytes passed by using the Bytes property must be base64 encoded. Your code might not need to encode document file bytes if you’re using an AWS SDK to call Amazon Textract API operations. Alternatively, you can pass images stored in an S3 bucket to an Amazon Textract API operation by using the S3Object property. Documents stored in an S3 bucket don’t need to be base64 encoded.
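
For example, a minimal call against a document stored in Amazon S3 might look like the following sketch (bucket, object, and query values are placeholders); the Bytes variant is shown in the paystub example below:

import boto3

textract = boto3.client('textract')

# Analyze a document stored in S3 without loading it locally or base64 encoding it.
response = textract.analyze_document(
    Document={'S3Object': {'Bucket': 'your-s3-bucket', 'Name': 'paystub.jpg'}},
    FeatureTypes=['QUERIES'],
    QueriesConfig={'Queries': [{'Text': 'What is the customer name?', 'Alias': 'CUSTOMER_NAME'}]}
)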

You can use the Queries feature to get answers from different types of documents like paystubs, vaccination cards, mortgage documents, bank statements, W-2 forms, 1099 forms, and others. In the following sections, we go over some of these documents and show how the Queries feature works.

Paystub

In this example, we walk through the steps to analyze a paystub using the Queries feature, as shown in the following example image.

We use the following sample Python code:

import boto3
import json

#create a Textract Client
textract = boto3.client('textract')

image_filename = "paystub.jpg"

response = None
with open(image_filename, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Textract AnalyzeDocument by passing a document from local disk
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [{
            "Text": "What is the year to date gross pay",
            "Alias": "PAYSTUB_YTD_GROSS"
        },
        {
            "Text": "What is the current gross pay?",
            "Alias": "PAYSTUB_CURRENT_GROSS"
        }]
    })

The following code is a sample AWS CLI command:

aws textract analyze-document --document '{"S3Object":{"Bucket":"your-s3-bucket","Name":"paystub.jpg"}}' --feature-types '["QUERIES"]' --queries-config '{"Queries":[{"Text":"What is the year to date gross pay", "Alias": "PAYSTUB_YTD_GROSS"}]}' 

Let’s analyze the response we get for the two queries we passed to the Analyze Document API in the preceding example. The following response has been trimmed to only show the relevant parts:

{
         "BlockType":"QUERY",
         "Id":"cbbba2fa-45be-452b-895b-adda98053153", #id of first QUERY
         "Relationships":[
            {
               "Type":"ANSWER",
               "Ids":[
                  "f2db310c-eaa6-481d-8d18-db0785c33d38" #id of first QUERY_RESULT
               ]
            }
         ],
         "Query":{
            "Text":"What is the year to date gross pay", #First Query
            "Alias":"PAYSTUB_YTD_GROSS"
         }
      },
      {
         "BlockType":"QUERY_RESULT",
         "Confidence":87.0,
         "Text":"23,526.80", #Answer to the first Query
         "Geometry":{...},
         "Id":"f2db310c-eaa6-481d-8d18-db0785c33d38" #id of first QUERY_RESULT
      },
      {
         "BlockType":"QUERY",
         "Id":"4e2a17f0-154f-4847-954c-7c2bf2670c52", #id of second QUERY
         "Relationships":[
            {
               "Type":"ANSWER",
               "Ids":[
                  "350ab92c-4128-4aab-a78a-f1c6f6718959"#id of second QUERY_RESULT
               ]
            }
         ],
         "Query":{
            "Text":"What is the current gross pay?", #Second Query
            "Alias":"PAYSTUB_CURRENT_GROSS"
         }
      },
      {
         "BlockType":"QUERY_RESULT",
         "Confidence":95.0,
         "Text":"$ 452.43", #Answer to the Second Query
         "Geometry":{...},
         "Id":"350ab92c-4128-4aab-a78a-f1c6f6718959" #id of second QUERY_RESULT
      }

The response has a BlockType of QUERY that shows the question that was asked and a Relationships section that has the ID for the block that has the answer. The answer is in the BlockType of QUERY_RESULT. The alias that is passed in as an input to the Analyze Document API is returned as part of the response and can be used to label the answer.
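
Before reaching for the parser library, you can also pair questions with answers directly from the raw response by following the ANSWER relationships, roughly like this (using the response object from the earlier call):

blocks_by_id = {block['Id']: block for block in response['Blocks']}

for block in response['Blocks']:
    if block['BlockType'] != 'QUERY':
        continue
    # Follow the ANSWER relationship(s) from each QUERY block to its QUERY_RESULT block(s).
    for rel in block.get('Relationships', []):
        if rel['Type'] == 'ANSWER':
            for answer_id in rel['Ids']:
                answer = blocks_by_id[answer_id]
                print(block['Query']['Text'], '->', answer['Text'])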

We use the Amazon Textract Response Parser to extract just the questions, the alias, and the corresponding answers to those questions:

import trp.trp2 as t2

d = t2.TDocumentSchema().load(response)
page = d.pages[0]

# get_query_answers returns a list of [query, alias, answer]
query_answers = d.get_query_answers(page=page)
for x in query_answers:
    print(f"{image_filename},{x[1]},{x[2]}")

from tabulate import tabulate
print(tabulate(query_answers, tablefmt="github"))

The preceding code returns the following results:

|------------------------------------|-----------------------|-----------|
| What is the current gross pay?     | PAYSTUB_CURRENT_GROSS | $ 452.43  |
| What is the year to date gross pay | PAYSTUB_YTD_GROSS     | 23,526.80 |

More questions and the full code can be found in the notebook on the GitHub repo.

Mortgage note

The Analyze Document Queries API also works well with mortgage notes like the following.

The process to call the API and process results is the same as the previous example. You can find the full code example on the GitHub repo.

The following table shows example responses obtained using the API:

| Query                                                      | Alias                            | Answer        |
|------------------------------------------------------------|----------------------------------|---------------|
| When is this document dated?                               | MORTGAGE_NOTE_DOCUMENT_DATE      | March 4, 2022 |
| What is the note date?                                     | MORTGAGE_NOTE_DATE               | March 4, 2022 |
| When is the Maturity date the borrower has to pay in full? | MORTGAGE_NOTE_MATURITY_DATE      | April, 2032   |
| What is the note city and state?                           | MORTGAGE_NOTE_CITY_STATE         | Anytown, ZZ   |
| what is the yearly interest rate?                          | MORTGAGE_NOTE_YEARLY_INTEREST    | 4.150%        |
| Who is the lender?                                         | MORTGAGE_NOTE_LENDER             | AnyCompany    |
| When does payments begin?                                  | MORTGAGE_NOTE_BEGIN_PAYMENTS     | April, 2022   |
| What is the beginning date of payment?                     | MORTGAGE_NOTE_BEGIN_DATE_PAYMENT | April, 2022   |
| What is the initial monthly payments?                      | MORTGAGE_NOTE_MONTHLY_PAYMENTS   | $ 2500        |
| What is the interest rate?                                 | MORTGAGE_NOTE_INTEREST_RATE      | 4.150%        |
| What is the principal amount borrower has to pay?          | MORTGAGE_NOTE_PRINCIPAL_PAYMENT  | $ 500,000     |

Vaccination card

The Amazon Textract Queries feature also works very well for extracting information from vaccination cards and similar-looking cards, like the following example.

The process to call the API and parse the results is the same as used for a paystub. After we process the response, we get the following information:

| Query                                                      | Alias                                | Answer       |
|------------------------------------------------------------|--------------------------------------|--------------|
| What is the patients first name                            | PATIENT_FIRST_NAME                   | Major        |
| What is the patients last name                             | PATIENT_LAST_NAME                    | Mary         |
| Which clinic site was the 1st dose COVID-19 administrated? | VACCINATION_FIRST_DOSE_CLINIC_SITE   | XYZ          |
| Who is the manufacturer for 1st dose of COVID-19?          | VACCINATION_FIRST_DOSE_MANUFACTURER  | Pfizer       |
| What is the date for the 2nd dose covid-19?                | VACCINATION_SECOND_DOSE_DATE         | 2/8/2021     |
| What is the patient number                                 | PATIENT_NUMBER                       | 012345abcd67 |
| Who is the manufacturer for 2nd dose of COVID-19?          | VACCINATION_SECOND_DOSE_MANUFACTURER | Pfizer       |
| Which clinic site was the 2nd dose covid-19 administrated? | VACCINATION_SECOND_DOSE_CLINIC_SITE  | CVS          |
| What is the lot number for 2nd dose covid-19?              | VACCINATION_SECOND_DOSE_LOT_NUMBER   | BB5678       |
| What is the date for the 1st dose covid-19?                | VACCINATION_FIRST_DOSE_DATE          | 1/18/21      |
| What is the lot number for 1st dose covid-19?              | VACCINATION_FIRST_DOSE_LOT_NUMBER    | AA1234       |
| What is the MI?                                            | MIDDLE_INITIAL                       | M            |

The full code can be found in the notebook on the GitHub repo.

Insurance card

The Queries feature also works well with insurance cards like the following.

The process to call the API and process the results is the same as shown earlier. The full code example is available in the notebook on the GitHub repo.

The following are the example responses obtained using the API:

| Query                               | Alias                             | Answer        |
|-------------------------------------|-----------------------------------|---------------|
| What is the insured name?           | INSURANCE_CARD_NAME               | Jacob Michael |
| What is the level of benefits?      | INSURANCE_CARD_LEVEL_BENEFITS     | SILVER        |
| What is medical insurance provider? | INSURANCE_CARD_PROVIDER           | Anthem        |
| What is the OOP max?                | INSURANCE_CARD_OOP_MAX            | $6000/$12000  |
| What is the effective date?         | INSURANCE_CARD_EFFECTIVE_DATE     | 11/02/2021    |
| What is the office visit copay?     | INSURANCE_CARD_OFFICE_VISIT_COPAY | $55/0%        |
| What is the specialist visit copay? | INSURANCE_CARD_SPEC_VISIT_COPAY   | $65/0%        |
| What is the member id?              | INSURANCE_CARD_MEMBER_ID          | XZ 9147589652 |
| What is the plan type?              | INSURANCE_CARD_PLAN_TYPE          | Pathway X-EPO |
| What is the coinsurance amount?     | INSURANCE_CARD_COINSURANCE        | 30%           |

Best practices for crafting queries

When crafting your queries, consider the following best practices:

  • In general, ask a natural language question that starts with “What is,” “Where is,” or “Who is.” The exception is when you’re trying to extract standard key-value pairs, in which case you can pass the key name as a query.
  • Avoid ill-formed or grammatically incorrect questions, because these could result in unexpected answers. For example, an ill-formed query is “When?” whereas a well-formed query is “When was the first vaccine dose administered?”
  • Where possible, use words from the document to construct the query. Although the Queries feature tries to do acronym and synonym matching for some common industry terms such as “SSN,” “tax ID,” and “Social Security number,” using language directly from the document improves results. For example, if the document says “job progress,” try to avoid using variations like “project progress,” “program progress,” or “job status.”
  • Construct a query that contains words from both the row header and column header. For example, in the preceding vaccination card, to find the date of the second dose, you can frame the query as “What date was the 2nd dose administered?”
  • Long answers increase response latency and can lead to timeouts. Try to ask questions whose answers are fewer than 100 words.
  • Passing only the key name as the question works when trying to extract standard key-value pairs from a form. We recommend framing full questions for all other extraction use cases.
  • Be as specific as possible. For example:
    • When the document contains multiple sections (such as “Borrower” and “Co-Borrower”) and both sections have a field called “SSN,” ask “What is the SSN for Borrower?” and “What is the SSN for Co-Borrower?”
    • When the document has multiple date-related fields, be specific in the query language and ask “What is the date the document was signed on?” or “What is the date of birth of the applicant?” Avoid asking ambiguous questions like “What is the date?” (See the sketch after this list for queries that apply this guidance.)
  • If you know the layout of the document beforehand, give location hints to improve accuracy of results. For example, ask “What is the date at the top?” or “What is the date on the left?” or “What is the date at the bottom?”
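The following is a minimal sketch of how this guidance might translate into a QueriesConfig passed to the Analyze Document API. The bucket, document name, and aliases are illustrative assumptions, not values taken from the preceding examples:

import boto3

textract = boto3.client("textract")

# Hypothetical document and queries that follow the best practices above:
# well-formed, specific questions that disambiguate repeated fields.
response = textract.analyze_document(
    Document={"S3Object": {"Bucket": "your-s3-bucket", "Name": "loan-application.png"}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        "Queries": [
            # Name the section to disambiguate repeated fields
            {"Text": "What is the SSN for Borrower?", "Alias": "BORROWER_SSN"},
            {"Text": "What is the SSN for Co-Borrower?", "Alias": "CO_BORROWER_SSN"},
            # Ask for a specific date instead of "What is the date?"
            {"Text": "What is the date the document was signed on?", "Alias": "DOCUMENT_SIGNED_DATE"},
        ]
    },
)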

For more information about the Queries feature, refer to [link to documentation].

Conclusion

In this post, we provided an overview of the new Queries feature of Amazon Textract to quickly and easily retrieve information from documents such as paystubs, mortgage notes, insurance cards, and vaccination cards based on natural language questions. We also described how you can parse the response JSON.

For more information, see Analyzing Documents, or check out the Amazon Textract console and try out this feature.


About the Authors

Uday Narayanan is a Sr. Solutions Architect at AWS. He enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are data analytics, big data systems, and machine learning. In his spare time, he enjoys playing sports, binge-watching TV shows, and traveling.

Rafael Caixeta is a Sr. Solutions Architect at AWS based in California. He has over 10 years of experience developing architectures for the cloud. His core areas are serverless, containers, and machine learning. In his spare time, he enjoys reading fiction books and traveling the world.

Navneeth Nair is a Senior Product Manager, Technical with the Amazon Textract team. He is focused on building machine learning-based services for AWS customers.

Martin Schade is a Senior ML Product SA with the Amazon Textract team. He has over 20 years of experience with internet-related technologies, engineering, and architecting solutions. He joined AWS in 2014, first guiding some of the largest AWS customers on the most efficient and scalable use of AWS services, and later focusing on AI/ML, particularly computer vision. Currently, he’s obsessed with extracting information from documents.
