August 2022 – Page 11

Empowering PyTorch on Intel® Xeon® Scalable processors with Bfloat16

Overview

Recent years, the growing complexity of AI models have been posing requirements on hardware for more and more compute capability. Reduced precision numeric format has been proposed to address this problem. Bfloat16 is a custom 16-bit floating point format for AI which consists of one sign bit, eight exponent bits, and seven mantissa bits. With the same dynamic range as float32, bfloat16 doesn’t require a special handling such as loss scaling. Therefore, bfloat16 is a drop-in replacement for float32 when running deep neural networks for both inference and training.

The 3rd Gen Intel^® Xeon^® Scalable processor (codenamed Cooper Lake), is the first general purpose x86 CPU with native bfloat16 support. Three new bfloat16 instructions were introduced in Intel^® Advanced Vector Extensions-512 (Intel^® AVX-512): VCVTNE2PS2BF16, VCVTNEPS2BF16, and VDPBF16PS. The first two instructions perform conversion from float32 to bfloat16, and the last one performs a dot product of bfloat16 pairs. Bfloat16 theoretical compute throughput is doubled over float32 on Cooper Lake. On the next generation of Intel^® Xeon^® Scalable Processors, bfloat16 compute throughput will be further enhanced through Advanced Matrix Extensions (Intel^® AMX) instruction set extension.

Intel and Meta previously collaborated to enable bfloat16 on PyTorch, and the related work was published in an earlier blog during launch of Cooper Lake. In that blog, we introduced the hardware advancement for native bfloat16 support and showcased a performance boost of 1.4x to 1.6x of bfloat16 over float32 from DLRM, ResNet-50 and ResNext-101-32x4d.

In this blog, we will introduce the latest software enhancement on bfloat16 in PyTorch 1.12, which would apply to much broader scope of user scenarios and showcase even higher performance boost.

Native Level Optimization on Bfloat16

On PyTorch CPU bfloat16 path, the compute intensive operators, e.g., convolution, linear and bmm, use oneDNN (oneAPI Deep Neural Network Library) to achieve optimal performance on Intel CPUs with AVX512_BF16 or AMX support. The other operators, such as tensor operators and neural network operators, are optimized at PyTorch native level. We have enlarged bfloat16 kernel level optimizations to majority of operators on dense tensors, both inference and training applicable (sparse tensor bfloat16 support will be covered in future work), specifically:

Bfloat16 vectorization: Bfloat16 is stored as unsigned 16-bit integer, which requires it to be casted to float32 for arithmetic operations such as add, mul, etc. Specifically, each bfloat16 vector will be converted to two float32 vectors, processed accordingly and then converted back. While for non-arithmetic operations such as cat, copy, etc., it is a straight memory copy and no data type conversion will be involved.
Bfloat16 reduction: Reduction on bfloat16 data uses float32 as accumulation type to guarantee numerical stability, e.g., sum, BatchNorm2d, MaxPool2d, etc.
Channels Last optimization: For vision models, Channels Last is the preferable memory format over Channels First from performance perspective. We have implemented fully optimized CPU kernels for all the commonly used CV modules on channels last memory format, taking care of both float32 and bfloat16.

Run Bfloat16 with Auto Mixed Precision

To run model on bfloat16, typically user can either explicitly convert the data and model to bfloat16, for example:

# with explicit conversion
input = input.to(dtype=torch.bfloat16)
model = model.to(dtype=torch.bfloat16)

or utilize torch.amp (Automatic Mixed Precision) package. The autocast instance serves as context managers or decorators that allow regions of your script to run in mixed precision, for example:

# with AMP
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input)

Generally, the explicit conversion approach and AMP approach have similar performance. Even though, we recommend run bfloat16 models with AMP, because:

Better user experience with automatic fallback: If your script includes operators that don’t have bfloat16 support, autocast will implicitly convert them back to float32 while the explicit converted model will give a runtime error.
Mixed data type for activation and parameters: Unlike the explicit conversion which converts all the model parameters to bfloat16, AMP mode will run in mixed data type. To be specific, input/output will be kept in bfloat16 while parameters, e.g., weight/bias, will be kept in float32. The mixed data type of activation and parameters will help improve performance while maintaining the accuracy.

Performance Gains

We benchmarked inference performance of TorchVision models on Intel® Xeon® Platinum 8380H CPU @ 2.90GHz (codenamed Cooper Lake), single instance per socket (batch size = 2 x number of physical cores). Results show that bfloat16 has 1.4x to 2.2x performance gain over float32.

The performance boost of bfloat16 over float32 primarily comes from 3 aspects:

The compute intensive operators take advantage of the new bfloat16 native instruction VDPBF16PS which doubles the hardware compute throughput.
Bfloat16 have only half the memory footprint of float32, so theoretically the memory bandwidth intensive operators will be twice faster.
On Channels Last, we intentionally keep the same parallelization scheme for all the memory format aware operators (can’t do this on Channels First though), which increases the data locality when passing each layer’s output to the next. Basically, it keeps the data closer to CPU cores while data would reside in cache anyway. And bfloat16 will have a higher cache hit rate compared with float32 in such scenarios due to smaller memory footprint.

Conclusion & Future Work

In this blog, we introduced recent software optimizations on bfloat16 introduced in PyTorch 1.12. Results on the 3rd Gen Intel^® Xeon^® Scalable processor show that bfloat16 has 1.4x to 2.2x performance gain over float32 on the TorchVision models. Further improvement is expected on the next generation of Intel^® Xeon^® Scalable Processors with AMX instruction support. Though the performance number for this blog is collected with TorchVision models, the benefit is broad across all topologies. And we will continue to extend the bfloat16 optimization effort to a broader scope in the future!

Acknowledgement

The results presented in this blog is a joint effort of Meta and Intel PyTorch team. Special thanks to Vitaly Fedyunin and Wei Wei from Meta who spent precious time and gave substantial assistance! Together we made one more step on the path of improving the PyTorch CPU eco system.

Reference

Customize your recommendations by promoting specific items using business rules with Amazon Personalize

Today, we are excited to announce Promotions feature in Amazon Personalize that allows you to explicitly recommend specific items to your users based on rules that align with your business goals. For instance, you can have marketing partnerships that require you to promote certain brands, in-house content, or categories that you want to improve the visibility of. Promotions give you more control over recommended items. You can define business rules to identify promotional items and showcase them across your entire user base, without any extra cost. You also control the percentage of the promoted content in your recommendations. Amazon Personalize automatically finds the relevant items within the set of promotional items that meet your business rule and distributes them within each user’s recommendations.

Amazon Personalize enables you to improve customer engagement by powering personalized product and content recommendations in websites, applications, and targeted marketing campaigns. You can get started without any prior machine learning (ML) experience, using APIs to easily build sophisticated personalization capabilities in a few clicks. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.

In this post, we demonstrate how to customize your recommendations with the new promotions feature for an ecommerce use case.

Solution overview

Different businesses can use promotions based on their individual goals for the type of content they want to increase engagement on. You can use promotions to have a percentage of your recommendations be of a particular type for any application regardless of the domain. For example, in ecommerce applications, you can use this feature to have 20% of recommended items be those marked as on sale, or from a certain brand, or category. For video-on-demand use cases, you can use this feature to fill 40% of a carousel with newly launched shows and movies that you want to highlight, or to promote live content. You can use promotions in domain dataset groups and custom dataset groups (User-Personalization and Similar-Items recipes).

Amazon Personalize makes configuring promotions simple: first, create a filter that selects the items you want promoted. You can use the Amazon Personalize console or API to create a filter with your logic using the Amazon Personalize DSL (domain-specific language). It only takes a few minutes. Then, when requesting recommendations, specify the promotion by specifying the filter, the percentage of the recommendations that should match that filter, and, if required, the dynamic filter parameters. The promoted items are randomly distributed in the recommendations, but any existing recommendations aren’t removed.

The following diagram shows how you can use promotions in recommendations in Amazon Personalize.

You define the items to promote in the catalog system, load them to the Amazon Personalize items dataset, and then get recommendations. Getting recommendations without specifying a promotion returns the most relevant items, and in this example, only one item from the promoted items. There is no guarantee of promoted items being returned. Getting recommendations with 50% promoted items returns half the items belonging to the promoted items.

This post walks you through the process of defining and applying promotions in your recommendations in Amazon Personalize to ensure the results from a campaign or recommender contain specific items that you want users to see. For this example, we create a retail recommender and promote items with CATEGORY_L2 as halloween, which corresponds to Halloween decorations. A code sample for this use case is available on GitHub.

Prerequisites

To use promotions, you first set up some Amazon Personalize resources on the Amazon Personalize console. Create your dataset group, load your data, and train a recommender. For full instructions, see Getting started.

Create a dataset group.

Create an Interactions dataset using the following schema:

{
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "USER_ID",
            "type": "string"
        },
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "TIMESTAMP",
            "type": "long"
        },
        {
            "name": "EVENT_TYPE",
            "type": "string"
        }
    ],
    "version": "1.0"
}

Import the interaction data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3). For this example, we use the following data file. We generated the synthetic data based on the code in the Retail Demo Store project. Refer to the GitHub repo to learn more about the data and potential uses.

Create an Items dataset using the following schema:

{
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {
            "name": "ITEM_ID",
            "type": "string"
        },
        {
            "name": "PRICE",
            "type": "float"
        },
        {
            "name": "CATEGORY_L1",
            "type": ["string"],
            "categorical": true
        },
        {
            "name": "CATEGORY_L2",
            "type": ["string"],
            "categorical": true
        },
        {
            "name": "GENDER",
            "type": ["string"],
            "categorical": true
        }
    ],
    "version": "1.0"
}

Import the item data to Amazon Personalize from Amazon S3. For this example, we use the following data file, based on the code in the Retail Demo Store project.For more information on formatting and importing your interactions and items data from Amazon S3, see Importing bulk records.
Create a recommender. In this example, we create a “Recommended for you” recommender.

Create a filter for your promotions

Now that you have set up your Amazon Personalize resources, you can create a filter that selects the items for your promotion.

You can create a static filter where all variables are hardcoded at filter creation. For example, to add all items that have CATEGORY_L2 as halloween, use the following filter expression:

INCLUDE ItemID WHERE Items.CATEGORY_L2 IN ("halloween")

You can also create dynamic filters. Dynamic filters are customizable in real time when you request the recommendations. To create a dynamic filter, you define your filter expression criteria using a placeholder parameter instead of a fixed value. This allows you to choose the values to filter by applying a filter to a recommendation request, rather than when you create your expression. You provide a filter when you call the GetRecommendations or GetPersonalizedRanking API operations, or as a part of your input data when generating recommendations in batch mode through a batch inference job.

For example, to select all items in a category chosen when you make your inference call with a filter applied, use the following filter expression:

INCLUDE ItemID WHERE Items.CATEGORY_L2 IN ($CATEGORY)

You can use the preceding DSL to create a customizable filter on the Amazon Personalize console. Complete the following steps:

On the Amazon Personalize console, on the Filters page, choose Create filter.
For Filter name, enter the name for your filter (for this post, we enter category_filter).
Select Build expression or add your expression manually to create your custom filter.
Build the expression “Include ItemID WHERE Items.CATEGORY_L2 IN $CATEGORY”For Value, you enter a value of $ plus a parameter name that is similar to your property name and easy to remember (for this example, $CATEGORY).
Optionally, to chain additional expressions with your filter, choose, the plus sign.
To add additional filter expressions, choose Add expression.
Choose Create filter.

You can also create filters via the createFilter API in Amazon Personalize. For more information, see CreateFilter.

Apply promotions to your recommendations

Applying a filter when getting recommendations is a good way to tailor your recommendations to specific criteria. However, using filters directly applies the filter to all the recommendations returned. When using promotions, you can select what percentage of the recommendations correspond to the promoted items, allowing you to mix and match personalized recommendations and the best items that match the promotion criteria for each user in the proportions that make sense for your business use case.

The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender:

{
    "recommenderArn" = "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    userId = "1",
    numResults = 20
}

This request returns personalized recommendations for the specified user. Of the items in the catalog, these are the 20 most relevant items for the user.

We can do the same call and apply a filter to return only items that match the filter. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and applies a dynamic filter to only return relevant items that have CATEGORY_L2 as halloween:

{
    "recommenderArn" = "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    userId = "1",
    numResults = 20,
    filterArn = "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
    filterValues={ "CATEGORY": ""halloween""}
}

This request returns personalized recommendations for the specified user that have CATEGORY_L2 as halloween. Out of the items in the catalog, these are the 20 most relevant items with CATEGORY_L2 as halloween for the user.

You can use promotions if you want a certain percentage of items to be of an attribute you want to promote, and the rest to be items that are the most relevant for this user out of all items in the catalog. We can do the same call and apply a promotion. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and applies a promotion to include a certain percentage of relevant items that have CATEGORY_L2 as halloween:

{
    recommenderArn = "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    userId = "1",
    numResults = 20,
    promotions = [{
        "name" : "halloween_promotion",
        "percentPromotedItems" : 20,
        "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
        "filterValues": {
            "CATEGORY" : ""halloween""
        }
    }]
}

This request returns 20% of recommendations that match the filter specified in the promotion: items with CATEGORY_L2 as halloween; and 80% personalized recommendations for the specified user that are the most relevant items for the user out of the items in the catalog.

You can use a filter combined with promotions. The filter in the top-level parameter block applies only to the non-promoted items.

The filter to select the promoted items is specified in the promotions parameter block. The following example code is a request body for the GetRecommendations API that gets recommendations for a user using the “Recommended for You” recommender and uses the dynamic filter we have been using twice. The first filter applies to non-promoted items, selecting items with CATEGORY_L2 as decorative, and the second filter applies to the promotion, promoting items with CATEGORY_L2 as halloween:

{
    recommenderArn = "arn:aws:personalize:us-west-2:000000000000:recommender/test-recommender",
    userId = "1",
    numResults = 20,
    "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
    "filterValues": {
        "CATEGORY" : ""decorative""
    }
    promotions = [{
        "name" : "halloween_promotion",
        "percentPromotedItems" : 20,
        "filterArn": "arn:aws:personalize:us-west-2:000000000000:filter/category_filter",
        "filterValues": {
            "CATEGORY" : ""halloween""
        }
    }]
}

This request returns 20% of recommendations that match the filter specified in the promotion: items with CATEGORY_L2 as halloween. The remaining 80% of recommended items are personalized recommendations for the specified user with CATEGORY_L2 as decorative. These are the most relevant items for the user out of the items in the catalog with CATEGORY_L2 as decorative.

Clean up

Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete filters, recommenders, datasets, and dataset groups via the AWS Management Console or using the Python SDK.

Summary

Adding promotions in Amazon Personalize allows you to customize your recommendations for each user by including items that you want to explicitly increase visibility and engagement on. Promotions also allow you to specify what percentage of the recommended items should be promoted items, which tailors the recommendations to meet your business objectives at no extra cost. You can use promotions for recommendations using the User-Personalization and Similar-Items recipes, as well as use case optimized recommenders.

For more information about Amazon Personalize, see What Is Amazon Personalize?

About the authors

Anna Gruebler is a Solutions Architect at AWS.

Alex Burkleaux is a Solutions Architect at AWS. She focuses on helping customers apply machine learning and data analytics to solve problems in the media and entertainment industry. In her free time, she enjoys spending time with family and volunteering as a ski patroller at her local ski hill.

Liam Morrison is a Solutions Architect Manager at AWS. He leads a team focused on Marketing Intelligence services. He has spent the last 5 years focused on practical applications of Machine Learning in Media & Entertainment, helping customers implement personalization, natural language processing, computer vision and more.

Amazon SageMaker JumpStart solutions now support custom IAM role settings

Amazon SageMaker JumpStart solutions are a feature within Amazon SageMaker Studio that allow a simple-click experience to set up your own machine learning (ML) workflows. When you launch a solution, various of AWS resources are set up in your account to demonstrate how the business problem can be solved using the pre-built architecture. The solutions use AWS CloudFormation templates for quick deployment, which means the resources are fully customizable. As of today, there are up to 18 end-to-end solutions that cover different aspects of real-world business problems, such as demand forecasting, product defect detection, and document understanding.

Starting today, we’re excited to announce that JumpStart solutions now supports custom AWS Identity and Access Management (IAM) roles be passed into services. This new feature enables you to take advantage of the rich security features offered by SageMaker and IAM.

In this post, we show you how to configure your SageMaker solution’s advanced parameters, and how this can benefit you when you use the pre-built solutions to start your ML journey.

New IAM advanced parameters

In order to allow JumpStart create the AWS resources for you, the IAM roles attached with Amazon managed policies are auto-created in your account. For the services created by JumpStart to be able to interact with each other, an IAM role needs to be passed into each service so they have the necessary permissions to call other services.

With the new Advanced Parameters option, you can select Default Roles, Find Roles, or Input Roles when you launch a solution. This means each service uses their own IAM role with dedicated IAM policy attached, and is fully customizable. This allows you to follow the least-privilege permissions principle, so that only the permissions required to perform a task are granted.

The policies attached to the default roles contain the least amount of permissions needed for the solution. In addition to the default roles, you can also select from a drop-down list, or input your own roles with the custom permissions you want to grant. This can greatly benefit you if you want to expand on the existing solution and perform even more tasks with these pre-built AWS services.

How to configure IAM advanced parameters

Before you use this feature, make sure you have the latest SageMaker domain enabled. You can create a new SageMaker domain if you haven’t done so, or update your SageMaker domain to create the default roles required for JumpStart solution. Then complete the following steps:

On the SageMaker console, choose Control Panel in the navigation pane.
Choose the gear icon to edit your domain settings.
In the General Settings section, choose Next.
In the SageMaker Projects and JumpStart section, select Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for this account and Enable Amazon SageMaker project templates and Amazon SageMaker JumpStart for Studio users.
Choose Next.
Done! Now you should be able to see the roles enabled on the SageMaker console.Now you can use JumpStart solutions with this new feature enabled.
On the Studio console, choose JumpStart in the navigation pane.
Choose Solutions.In the Launch Solution section, you can see a new drop-down menu called Advanced Parameters. Each solution requires different resources. Based on the services that the solution interacts with, there’s a dynamic list of roles you can pass in when launching the solution.
Select your preferred method to specify roles.
If you select Default Role, the roles are pre-populated for you. You can then proceed to launch the solution with one click. Under the hood, AWS CloudFormation uses a built-in template to provision all appropriate AWS resources, and the default roles are used by each service.If you select Find Role, you can select an existing IAM role in your account from the drop-down menu for each required service. In order to let the services work as they are designed, we recommend choosing a role that has the minimum permissions required. For more information about the permissions required for each service, refer to AWS Managed Policies for SageMaker projects and JumpStart.

You can have more flexibility by selecting Input Role, which allows you to enter a role name directly. This works best if you know which role you want to use, so you don’t need to choose it from the Find Role list.
After you specify the role you want to use for each service, launch the solution by choosing Launch.

The roles are passed into each service and grant each service permission to interact with other services. The CloudFormation template deploys these services in your account. You can then explore the ML solution for the business problem. Keep in mind that for each service, they now have the precise permissions you have granted them when you configured the advanced parameters. This gives you a fully controlled and secured environment when using JumpStart solutions.

Conclusion

Today, we announced support for configuring IAM roles when you launch a JumpStart solution. We also showed you how to configure the Advanced Parameters options before launching a solution.

Try out any JumpStart solution on Studio with this new feature enabled. If you have any questions and feedback regarding JumpStart solutions, please speak to your AWS support contact or post a message in the Amazon SageMaker discussion forums.

About the authors

Haotian An is a Software Development Engineer at Amazon SageMaker Jumpstart. He focuses on building tools and products to make machine learning easier to access for customers.

Manan Shah is a Software Development Manager at Amazon Web Services. He is a ML enthusiast and focuses on building no-code/low-code AI/ML products. I thrive empowering other talented, technical people to build great software.

Intelligent document processing with AWS AI services: Part 2

Amazon’s intelligent document processing (IDP) helps you speed up your business decision cycles and reduce costs. Across multiple industries, customers need to process millions of documents per year in the course of their business. For customers who process millions of documents, this is a critical aspect for the end-user experience and a top digital transformation priority. Because of the varied formats, most firms manually process documents such as W2s, claims, ID documents, invoices, and legal contracts, or use legacy OCR (optical character recognition) solutions that are time-consuming, error-prone, and costly. An IDP pipeline with AWS AI services empowers you to go beyond OCR with more accurate and versatile information extraction, process documents faster, save money, and shift resources to higher value tasks.

In this series, we give an overview of the IDP pipeline to reduce the amount of time and effort it takes to ingest a document and get the key information into downstream systems. The following figure shows the stages that are typically part of an IDP workflow.

In this two-part series, we discuss how you can automate and intelligently process documents at scale using AWS AI services. In part 1, we discussed the first three phases of the IDP workflow. In this post, we discuss the remaining workflow phases.

Solution overview

The following reference architecture shows how you can use AWS AI services like Amazon Textract and Amazon Comprehend, along with other AWS services to implement the IDP workflow. In part 1, we described the data capture and document classification stages, where we categorized and tagged documents such as bank statements, invoices, and receipt documents. We also discussed the extraction stage, where you can extract meaningful business information from your documents. In this post, we extend the IDP pipeline by looking at Amazon Comprehend default and custom entities in the extraction phase, perform document enrichment, and also briefly look at the capabilities of Amazon Augmented AI (Amazon A2I) to include a human review workforce in the review and validation stage.

We also use Amazon Comprehend Medical as part of this solution, which is a service to extract information from unstructured medical text accurately and quickly and identify relationships among extracted health information, and link to medical ontologies like ICD-10-CM, RxNorm, and SNOMED CT.

Amazon A2I is a machine learning (ML) service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers whether it runs on AWS or not. Amazon A2I integrates with Amazon Textract and Amazon Comprehend to provide you the ability to introduce human review steps within your IDP workflow.

Prerequisites

Before you get started, refer to part 1 for a high-level overview of IDP and details about the data capture, classification, and extraction stages.

Extraction phase

In part 1 of this series, we discussed how we can use Amazon Textract features for accurate data extraction for any type of documents. To extend this phase, we use Amazon Comprehend pre-trained entities and an Amazon Comprehend custom entity recognizer for further document extraction. The purpose of the custom entity recognizer is to identify specific entities and generate custom metadata regarding our documents in CSV or human readable format to be later analyzed by the business users.

Named entity recognition

Named entity recognition (NER) is a natural language processing (NLP) sub-task that involves sifting through text data to locate noun phrases, called named entities, and categorizing each with a label, such as brand, date, event, location, organizations, person, quantity, or title. For example, in the statement “I recently subscribed to Amazon Prime,” Amazon Prime is the named entity and can be categorized as a brand.

Amazon Comprehend enables you to detect such custom entities in your document. Each entity also has a confidence level score that Amazon Comprehend returns for each entity type. The following diagram illustrates the entity recognition process.

To get entities from the text document, we call the comprehend.detect_entities() method and configure the language code and text as input parameters:

def get_entities(text):
    try:
        #detect entities
        entities = comprehend.detect_entities(LanguageCode="en", Text=text)  
        df = pd.DataFrame(entities["Entities"], columns = ['Text', 'Type'])
        display(HTML(df.to_html(index=False)))
    except Exception as e:
        print(e)

We run the get_entities() method on the bank document and obtain the entity list in the results.

Although entity extraction worked fairly well in identifying the default entity types for everything in the bank document, we want specific entities to be recognized for our use case. More specifically, we need to identify the customer’s savings and checking account numbers in the bank statement. We can extract these key business terms using Amazon Comprehend custom entity recognition.

Train an Amazon Comprehend custom entity recognition model

To detect the specific entities that we’re interested in from the customer’s bank statement, we train a custom entity recognizer with two custom entities: SAVINGS_AC and CHECKING_AC.

Then we train a custom entity recognition model. We can choose one of two ways to provide data to Amazon Comprehend: annotations or entity lists.

The annotations method can often lead to more refined results for image files, PDFs, or Word documents because you train a model by submitting more accurate context as annotations along with your documents. However, the annotations method can be time-consuming and work-intensive. For simplicity of this blog post, we use the entity lists method, which you can only use for plain text documents. This method gives us a CSV file that should contain the plain text and its corresponding entity type, as shown in the preceding example. The entities in this file are going to be specific to our business needs (savings and checking account numbers).

For more details on how to prepare the training data for different use cases using annotations or entity lists methods, refer to Preparing the training data.

The following screenshot shows an example of our entity list.

Create an Amazon Comprehend custom NER real-time endpoint

Next, we create a custom entity recognizer real-time endpoint using the model that we trained. We use the CreateEndpoint API via the comprehend.create_endpoint() method to create the real-time endpoint:

#create comprehend endpoint
model_arn = entity_recognizer_arn
ep_name = 'idp-er-endpoint'

try:
    endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ER_ENDPOINT_ARN=endpoint_response['EndpointArn']
    print(f'Endpoint created with ARN: {ER_ENDPOINT_ARN}')
    %store ER_ENDPOINT_ARN
except Exception as error:
    if error.response['Error']['Code'] == 'ResourceInUseException':
        print(f'An endpoint with the name "{ep_name}" already exists.')
        ER_ENDPOINT_ARN = f'arn:aws:comprehend:{region}:{account_id}:entity-recognizer-endpoint/{ep_name}'
        print(f'The classifier endpoint ARN is: "{ER_ENDPOINT_ARN}"')
        %store ER_ENDPOINT_ARN
    else:
        print(error)

After we train a custom entity recognizer, we use the custom real-time endpoint to extract some enriched information from the document and then perform document redaction with the help of the custom entities recognized by Amazon Comprehend and bounding box information from Amazon Textract.

Enrichment phase

In the document enrichment stage, we can perform document enrichment by redacting personally identifiable information (PII) data, custom business term extraction, and so on. Our previous sample document (a bank statement) contains the customers’ savings and checking account numbers, which we want to redact. Because we already know these custom entities by means of our Amazon Comprehend custom NER model, we can easily use the Amazon Textract geometry data type to redact these PII entities wherever they appear in the document. In the following architecture, we redact key business terms (savings and checking accounts) from the bank statement document.

As you can see in the following example, the checking and savings account numbers are hidden in the bank statement now.

Traditional OCR solutions struggle to extract data accurately from most unstructured and semi-structured documents because of significant variations in how the data is laid out across multiple versions and formats of these documents. You may then need to implement custom preprocessing logic or even manually extract the information out of these documents. In this case, the IDP pipeline supports two features that you can use: Amazon Comprehend custom NER and Amazon Textract queries. Both these services use NLP to extract insights about the content of documents.

Extraction with Amazon Textract queries

When processing a document with Amazon Textract, you can add the new queries feature to your analysis to specify what information you need. This involves passing an NLP question, such as “What is the customer’s social security number?” to Amazon Textract. Amazon Textract finds the information in the document for that question and returns it in a response structure separate from the rest of the document’s information. Queries can be processed alone, or in combination with any other FeatureType, such as Tables or Forms.

With Amazon Textract queries, you can extract information with high accuracy irrespective of the how the data is laid out in a document structure, such as forms, tables, and checkboxes, or housed within nested sections in a document.

To demonstrate the queries feature, we extract valuable pieces of information like the patient’s first and last names, the dosage manufacturer, and so on from documents such as a COVID-19 vaccination card.

We use the textract.analyze_document() function and specify the FeatureType as QUERIES as well as add the queries in the form of natural language questions in the QueriesConfig.

The following code has been trimmed down for simplification purposes. For the full code, refer the GitHub sample code for analyze_document().

response = None
with open(image_filename, 'rb') as document:
    imageBytes = bytearray(document.read())

# Call Textract
response = textract.analyze_document(
    Document={'Bytes': imageBytes},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
            "Queries": [{
                "Text": "What is the date for the 1st dose covid-19?",
                "Alias": "COVID_VACCINATION_FIRST_DOSE_DATE"
            },
# code trimmed down for simplification
#..
]
})

For the queries feature, the textract.analyze_document() function outputs all OCR WORDS and LINES, geometry information, and confidence scores in the response JSON. However, we can just print out the information that we queried for.

Document is a wrapper function used to help parse the JSON response from the API. It provides a high-level abstraction and makes the API output iterable and easy to get information out of. For more information, refer to the Textract Response Parser and Textractor GitHub repos. After we process the response, we get the following information as shown in the screenshot.

import trp.trp2 as t2
from tabulate import tabulate

d = t2.TDocumentSchema().load(response)
page = d.pages[0]

query_answers = d.get_query_answers(page=page)

print(tabulate(query_answers, tablefmt="github"))

Review and validation phase

This is the final stage of our IDP pipeline. In this stage, we can use our business rules to check for completeness of a document. For example, from an insurance claims document, the claim ID is extracted accurately and successfully. We can use AWS serverless technologies such as AWS Lambda for further automation of these business rules. Moreover, we can include a human workforce for document reviews to ensure the predictions are accurate. Amazon A2I accelerates building workflows required for human review for ML predictions.

With Amazon A2I, you can allow human reviewers to step in when a model is unable to make a high confidence prediction or to audit its predictions on an ongoing basis. The goal of the IDP pipeline is to reduce the amount of human input required to get accurate information into your decision systems. With IDP, you can reduce the amount of human input for your document processes as well as the total cost of document processing.

After you have all the accurate information extracted from the documents, you can further add business-specific rules using Lambda functions and finally integrate the solution with downstream databases or applications.

For more information on how to create an Amazon A2I workflow, follow the instructions from the Prep for Module 4 step at the end of 03-idp-document-enrichment.ipynb in our GitHub repo.

Clean up

To prevent incurring future charges to your AWS account, delete the resources that we provisioned in the setup of the repository by navigating to the Cleanup section in our repo.

Conclusion

In this two-part post, we saw how to build an end-to-end IDP pipeline with little or no ML experience. We discussed the various stages of the pipeline and a hands-on solution with AWS AI services such as Amazon Textract, Amazon Comprehend, Amazon Comprehend Medical, and Amazon A2I for designing and building industry-specific use cases. In the first post of the series, we demonstrated how to use Amazon Textract and Amazon Comprehend to extract information from various documents. In this post, we did a deep dive into how to train an Amazon Comprehend custom entity recognizer to extract custom entities from our documents. We also performed document enrichment techniques like redaction using Amazon Textract as well as the entity list from Amazon Comprehend. Finally, we saw how you can use an Amazon A2I human review workflow for Amazon Textract by including a private work team.

For more information about the full code samples in this post, refer to the GitHub repo.

We recommend you review the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and follow the guidelines provided. Also, take a moment to review and understand the pricing for Amazon Textract, Amazon Comprehend, and Amazon A2I.

About the authors

Chin Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

Sonali Sahu is leading Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core areas of focus are artificial intelligence and machine learning for intelligent document processing.

Anjan Biswas is an AI/ML specialist Senior Solutions Architect. Anjan works with enterprise customers and is passionate about developing, deploying and explaining AI/ML, data analytics, and big data solutions. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS.

Suprakash Dutta is a Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Intelligent document processing with AWS AI services: Part 1

Organizations across industries such as healthcare, finance and lending, legal, retail, and manufacturing often have to deal with a lot of documents in their day-to-day business processes. These documents contain critical information that are key to making decisions on time in order to maintain the highest levels of customer satisfaction, faster customer onboarding, and lower customer churn. In most cases, documents are processed manually to extract information and insights, which is time-consuming, error-prone, expensive, and difficult to scale. There is limited automation available today to process and extract information from these documents. Intelligent document processing (IDP) with AWS artificial intelligence (AI) services helps automate information extraction from documents of different types and formats, quickly and with high accuracy, without the need for machine learning (ML) skills. Faster information extraction with high accuracy helps in making quality business decisions on time, while reducing overall costs.

Although the stages in an IDP workflow may vary and be influenced by use case and business requirements, the following figure shows the stages that are typically part of an IDP workflow. Processing documents such as tax forms, claims, medical notes, new customer forms, invoices, legal contracts, and more are just a few of the use cases for IDP.

In this two-part series, we discuss how you can automate and intelligently process documents at scale using AWS AI services. In this post, we discuss the first three phases of the IDP workflow. In part 2, we discuss the remaining workflow phases.

Solution overview

The following architecture diagram shows the stages of an IDP workflow. It starts with a data capture stage to securely store and aggregate different file formats (PDF, JPEG, PNG, TIFF) and layouts of documents. The next stage is classification, where you categorize your documents (such as contracts, claim forms, invoices, or receipts), followed by document extraction. In the extraction stage, you can extract meaningful business information from your documents. This extracted data is often used to gather insights via data analysis, or sent to downstream systems such as databases or transactional systems. The following stage is enrichment, where documents can be enriched by redacting protected health information (PHI) or personally identifiable information (PII) data, custom business term extraction, and so on. Finally, in the review and validation stage, you can include a human workforce for document reviews to ensure the outcome is accurate.

For the purposes of this post, we consider a set of sample documents such as bank statements, invoices, and store receipts. The document samples, along with sample code, can be found in our GitHub repository. In the following sections, we walk you through these code samples along with real practical application. We demonstrate how you can utilize ML capabilities with Amazon Textract, Amazon Comprehend, and Amazon Augmented AI (Amazon A2I) to process documents and validate the data extracted from them.

Amazon Textract is an ML service that automatically extracts text, handwriting, and data from scanned documents. It goes beyond simple optical character recognition (OCR) to identify, understand, and extract data from forms and tables. Amazon Textract uses ML to read and process any type of document, accurately extracting text, handwriting, tables, and other data with no manual effort.

Amazon Comprehend is a natural-language processing (NLP) service that uses ML to extract insights about the content of the documents. Amazon Comprehend can identify critical elements in documents, including references to language, people, and places, and classify them into relevant topics or clusters. It can perform sentiment analysis to determine the sentiment of a document in real time using single document or batch detection. For example, it can analyze the comments on a blog post to know if your readers like the post or not. Amazon Comprehend also detects PII like addresses, bank account numbers, and phone numbers in text documents in real time and asynchronous batch jobs. It can also redact PII entities in asynchronous batch jobs.

Amazon A2I is an ML service that makes it easy to build the workflows required for human review. Amazon A2I brings human review to all developers, removing the undifferentiated heavy lifting associated with building human review systems or managing large numbers of human reviewers, whether it runs on AWS or not. Amazon A2I integrates both with Amazon Textract and Amazon Comprehend to provide you the ability to introduce human review steps within your intelligent document processing workflow.

Data capture phase

You can store documents in a highly scalable and durable storage like Amazon Simple Storage Service (Amazon S3). Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. Amazon S3 is designed for 11 9’s of durability and stores data for millions of customers all around the world. Documents can come in various formats and layouts, and can come from different channels like web portals or email attachments.

Classification phase

In the previous step, we collected documents of various types and formats. In this step, we need to categorize the documents before we can do further extraction. For that, we use Amazon Comprehend custom classification. Document classification is a two-step process. First, you train an Amazon Comprehend custom classifier to recognize the classes that are of interest to you. Next, you deploy the model with a custom classifier real-time endpoint and send unlabeled documents to the real-time endpoint to be classified.

The following figure represents a typical document classification workflow.

To train the classifier, identify the classes you’re interested in and provide sample documents for each of the classes as training material. Based on the options you indicated, Amazon Comprehend creates a custom ML model that it trains based on the documents you provided. This custom model (the classifier) examines each document you submit. It returns either the specific class that best represents the content (if you’re using multi-class mode) or the set of classes that apply to it (if you’re using multi-label mode).

Prepare training data

The first step is to extract text from documents required for the Amazon Comprehend custom classifier. To extract the raw text information for all the documents in Amazon S3, we use the Amazon Textract detect_document_text() API. We also label the data according to the document type to be used to train a custom Amazon Comprehend classifier.

The following code has been trimmed down for simplification purposes. For the full code, refer to the GitHub sample code for textract_extract_text(). The function call_textract() is a wr4apper function that calls the AnalyzeDocument API internally, and the parameters passed to the method abstract some of the configurations that the API needs to run the extraction task.

def textract_extract_text(document, bucket=data_bucket):        
    try:
        print(f'Processing document: {document}')
        lines = ""
        row = []
        
        # using amazon-textract-caller
        response = call_textract(input_document=f's3://{bucket}/{document}') 
        # using pretty printer to get all the lines
        lines = get_string(textract_json=response, output_type=[Textract_Pretty_Print.LINES])
        
        label = [name for name in names if(name in document)]  
        row.append(label[0])
        row.append(lines)        
        return row
    except Exception as e:
        print (e)

Train a custom classifier

In this step, we use Amazon Comprehend custom classification to train our model for classifying the documents. We use the CreateDocumentClassifier API to create a classifier that trains a custom model using our labeled data. See the following code:

create_response = comprehend.create_document_classifier(
        InputDataConfig={
            'DataFormat': 'COMPREHEND_CSV',
            'S3Uri': f's3://{data_bucket}/{key}'
        },
        DataAccessRoleArn=role,
        DocumentClassifierName=document_classifier_name,
        VersionName=document_classifier_version,
        LanguageCode='en',
        Mode='MULTI_CLASS'
    )

Deploy a real-time endpoint

To use the Amazon Comprehend custom classifier, we create a real-time endpoint using the CreateEndpoint API:

endpoint_response = comprehend.create_endpoint(
        EndpointName=ep_name,
        ModelArn=model_arn,
        DesiredInferenceUnits=1,    
        DataAccessRoleArn=role
    )
    ENDPOINT_ARN=endpoint_response['EndpointArn']
print(f'Endpoint created with ARN: {ENDPOINT_ARN}')

Classify documents with the real-time endpoint

After the Amazon Comprehend endpoint is created, we can use the real-time endpoint to classify documents. We use the comprehend.classify_document() function with the extracted document text and inference endpoint as input parameters:

response = comprehend.classify_document(
      Text= document,
      EndpointArn=ENDPOINT_ARN
      )

Amazon Comprehend returns all classes of documents with a confidence score linked to each class in an array of key-value pairs (name-score). We pick the document class with the highest confidence score. The following screenshot is a sample response.

We recommend going through the detailed document classification sample code on GitHub.

Extraction phase

Amazon Textract lets you extract text and structured data information using the Amazon Textract DetectDocumentText and AnalyzeDocument APIs, respectively. These APIs respond with JSON data, which contains WORDS, LINES, FORMS, TABLES, geometry or bounding box information, relationships, and so on. Both DetectDocumentText and AnalyzeDocument are synchronous operations. To analyze documents asynchronously, use StartDocumentTextDetection.

Structured data extraction

You can extract structured data such as tables from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as TABLE to detect all tables in a document. The following figure illustrates this process.

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])

We run the analyze_document() method with the FeatureType as TABLES on the employee history document and obtain the table extraction in the following results.

Semi-structured data extraction

You can extract semi-structured data such as forms or key-value pairs from documents while preserving the data structure and relationships between detected items. You can use the AnalyzeDocument API with the FeatureType as FORMS to detect all forms in a document. The following diagram illustrates this process.

See the following code:

response = textract.analyze_document(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["FORMS"])

Here, we run the analyze_document() method with the FeatureType as FORMS on the employee application document and obtain the table extraction in the results.

Unstructured data extraction

Amazon Textract is optimal for dense text extraction with industry-leading OCR accuracy. You can use the DetectDocumentText API to detect lines of text and the words that make up a line of text, as illustrated in the following figure.

See the following code:

response = textract.detect_document_text(Document={'Bytes': imageBytes})

# Print detected text
for item in response["Blocks"]:
	if item["BlockType"] == "LINE":
 		print (item["Text"])

Now we run the detect_document_text() method on the sample image and obtain raw text extraction in the results.

Invoices and receipts

Amazon Textract provides specialized support to process invoices and receipts at scale. The AnalyzeExpense API can extract explicitly labeled data, implied data, and line items from an itemized list of goods or services from almost any invoice or receipt without any templates or configuration. The following figure illustrates this process.

See the following code:

response = textract.analyze_expense(
    Document={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    })

Amazon Textract can find the vendor name on a receipt even if it’s only indicated within a logo on the page without an explicit label called “vendor”. It can also find and extract expense items, quantity, and prices that aren’t labeled with column headers for line items.

Identity documents

The Amazon Textract AnalyzeID API can help you automatically extract information from identification documents, such as driver’s licenses and passports, without the need for templates or configuration. We can extract specific information, such as date of expiry and date of birth, as well as intelligently identify and extract implied information, such as name and address. The following diagram illustrates this process.

See the following code:

textract_client = boto3.client('textract')
j = call_textract_analyzeid(document_pages=["s3://amazon-textract-public-content/analyzeid/driverlicense.png"],boto3_textract_client=textract_client)

We can use tabulate to get a pretty printed output:

from tabulate import tabulate

print(tabulate([x[1:3] for x in result]))

We recommend going through the detailed document extraction sample code on GitHub. For more information about the full code samples in this post, refer to the GitHub repo.

Conclusion

In this first post of a two-part series, we discussed the various stages of IDP and a solution architecture. We also discussed document classification using an Amazon Comprehend custom classifier. Next, we explored the ways you can use Amazon Textract to extract information from unstructured, semi-structured, structured, and specialized document types.

In part 2 of this series, we continue the discussion with the extract and queries features of Amazon Textract. We look at how to use Amazon Comprehend pre-defined entities and custom entities to extract key business terms from documents with dense text, and how to integrate an Amazon A2I human-in-the-loop review in your IDP processes.

We recommend reviewing the security sections of the Amazon Textract, Amazon Comprehend, and Amazon A2I documentation and following the guidelines provided. Also, take a moment to review and understand the pricing for Amazon Textract, Amazon Comprehend, and Amazon A2I.

About the authors

Suprakash Dutta is a Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning.

Sonali Sahu is leading Intelligent Document Processing AI/ML Solutions Architect team at Amazon Web Services. She is a passionate technophile and enjoys working with customers to solve complex problems using innovation. Her core area of focus is artificial intelligence and machine learning for intelligent document processing.

Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and data analytics. Anjan is part of the world-wide AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing, and retail organizations, and is actively helping customers get started and scale on AWS AI services.

Chinmayee Rane is an AI/ML Specialist Solutions Architect at Amazon Web Services. She is passionate about applied mathematics and machine learning. She focuses on designing intelligent document processing solutions for AWS customers. Outside of work, she enjoys salsa and bachata dancing.

A Dense Material Segmentation Dataset for Indoor and Outdoor Scene Parsing

A key algorithm for understanding the world is material segmentation, which assigns a label (metal, glass, etc.) to each pixel. We find that a model trained on existing data underperforms in some settings and propose to address this with a large-scale dataset of 3.2 million dense segments on 44,560 indoor and outdoor images, which is 23x more segments than existing data. Our data covers a more diverse set of scenes, objects, viewpoints and materials, and contains a more fair distribution of skin types. We show that a model trained on our data outperforms a state-of-the-art model across…Apple Machine Learning Research

Amazon wins contest to control “formality” in machine translation

Data augmentation and post-editing strategies lift Amazon’s submission above competitors.Read More

Regularized Training of Nearest Neighbor Language Models

Including memory banks in a natural language processing architecture increases model capacity by equipping it with additional data at inference time. In this paper, we build upon kNN-LM, which uses a pre-trained language model together with an exhaustive kNN search through the training data (memory bank) to achieve state-of-the-art results. We investigate whether we can improve the kNN-LM performance by instead training a LM with the knowledge that we will be using a kNN post-hoc. We achieved significant improvement using our method on language modeling tasks on WIKI-2 and WIKI-103. The main…Apple Machine Learning Research

Registration now open for the 2022 Instagram Workshop on Recommendation Systems at Scale

Registration now open for the 2022 Instagram Workshop on Recommendation Systems at ScaleRead More

Erran Li receives 2022 SIGMOBILE test-of-time award

Li and co-authors honored for creating an antenna design that was essential to the growth of mobile devices.Read More

Overview

Native Level Optimization on Bfloat16

Run Bfloat16 with Auto Mixed Precision

Performance Gains

The performance boost of bfloat16 over float32 primarily comes from 3 aspects:

Conclusion & Future Work

Acknowledgement

Reference

Solution overview

Prerequisites

Create a filter for your promotions

Apply promotions to your recommendations

Clean up

Summary

About the authors

New IAM advanced parameters

How to configure IAM advanced parameters

Conclusion

About the authors

Solution overview

Prerequisites

Extraction phase

Named entity recognition

Train an Amazon Comprehend custom entity recognition model

Create an Amazon Comprehend custom NER real-time endpoint

Enrichment phase

Extraction with Amazon Textract queries

Review and validation phase

Clean up

Conclusion

About the authors

Solution overview

Data capture phase

Classification phase

Prepare training data

Train a custom classifier

Deploy a real-time endpoint

Classify documents with the real-time endpoint

Extraction phase

Structured data extraction

Semi-structured data extraction

Unstructured data extraction

Invoices and receipts

Identity documents

Conclusion

About the authors

Navigation

GenAI Vision Endless Possibilities

"I'm interested in things that change the world or that affect the future and wondrous, new technology where you see it, and you're like, 'Wow, how did that even happen? How is that possible?'" -- Elon Musk

Copyright © 2019-2025 Vedere AI. All Rights Reserved.