Amazon Science is now the destination for information on the SocialBot, TaskBot, and SimBot challenges, including FAQs, team updates, publications, and other program information.
Automate a shared bikes and scooters classification model with Amazon SageMaker Autopilot
Amazon SageMaker Autopilot makes it possible for organizations to quickly build and deploy an end-to-end machine learning (ML) model and inference pipeline with just a few lines of code, or even without any code at all, using Amazon SageMaker Studio. Autopilot offloads the heavy lifting of configuring infrastructure and building an entire pipeline, including feature engineering, model selection, and hyperparameter tuning.
In this post, we show how to go from raw data to a robust and fully deployed inference pipeline with Autopilot.
Solution overview
We use Lyft’s public dataset on bikesharing for this simulation to predict whether or not a user participates in the Bike Share for All program. This is a simple binary classification problem.
We want to showcase how easy it is to build an automated and real-time inference pipeline to classify users based on their participation in the Bike Share for All program. To this end, we simulate an end-to-end data ingestion and inference pipeline for an imaginary bikeshare company operating in the San Francisco Bay Area.
The architecture is broken down into two parts: the ingestion pipeline and the inference pipeline.
We primarily focus on the ML pipeline in the first section of this post, and review the data ingestion pipeline in the second part.
Prerequisites
To follow along with this example, complete the following prerequisites:
- Create a new SageMaker notebook instance.
- Create an Amazon Kinesis Data Firehose delivery stream with an AWS Lambda transform function. For instructions, see Amazon Kinesis Firehose Data Transformation with AWS Lambda. This step is optional and only needed to simulate data streaming.
Data exploration
Let’s download and visualize the dataset, which is located in a public Amazon Simple Storage Service (Amazon S3) bucket and static website:
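The download code isn’t shown in this excerpt; a minimal sketch with pandas, using a placeholder for the dataset’s public URL, might look like the following:

```python
import pandas as pd

# Placeholder URL -- substitute the dataset's actual public S3/static-website location
data_url = "https://<public-bucket>.s3.amazonaws.com/bikeshare/trips.csv"

data = pd.read_csv(data_url)
print(data.shape)
data.head()
```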
The following screenshot shows a subset of the data before transformation.
The last column of the data contains the target we want to predict, which is a binary variable taking either a Yes or No value, indicating whether the user participates in the Bike Share for All program.
Let’s take a look at the distribution of our target variable for any data imbalance.
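A quick way to inspect the class balance (the target column name is an assumption about the raw dataset):

```python
import matplotlib.pyplot as plt

# The target column name is an assumption about the raw dataset
target = "bike_share_for_all_trip"
print(data[target].value_counts())

data[target].value_counts().plot(kind="bar", title="Bike Share for All participation")
plt.show()
```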
As shown in the graph above, the data is imbalanced, with fewer people participating in the program.
We need to balance the data to prevent an over-representation bias. This step is optional because Autopilot also offers an internal approach to handle class imbalance automatically, which defaults to an F1 score validation metric. Additionally, if you choose to balance the data yourself, you can use more advanced techniques for handling class imbalance, such as SMOTE or GANs.
For this post, we downsample the majority class (No) as a data balancing technique:
The following code enriches the data and under-samples the overrepresented class:
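That code isn’t included in this excerpt; the following is a sketch of the idea, with assumed column names for the enrichment step:

```python
import pandas as pd
from sklearn.utils import resample

# Enrich: derive simple time-based features (column names are assumptions)
data["start_time"] = pd.to_datetime(data["start_time"])
data["hour"] = data["start_time"].dt.hour
data["weekday"] = data["start_time"].dt.day_name()

# Under-sample the over-represented "No" class down to the size of the "Yes" class
majority = data[data[target] == "No"]
minority = data[data[target] == "Yes"]
majority_downsampled = resample(
    majority, replace=False, n_samples=len(minority), random_state=42
)

balanced = (
    pd.concat([majority_downsampled, minority])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced[target].value_counts())
```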
We deliberately left our categorical features not encoded, including our binary target value. This is because Autopilot takes care of encoding and decoding the data for us as part of the automatic feature engineering and pipeline deployment, as we see in the next section.
The following screenshot shows a sample of our data.
The data in the following graphs looks otherwise normal, with a bimodal distribution representing the two peaks for the morning hours and the afternoon rush hours, as you would expect. We also observe low activities on weekends and at night.
In the next section, we feed the data to Autopilot so that it can run an experiment for us.
Build a binary classification model
Autopilot requires that we specify the input and output destination buckets. It uses the input bucket to load the data and the output bucket to save the artifacts, such as feature engineering and the generated Jupyter notebooks. We retain 5% of the dataset to evaluate and validate the model’s performance after the training is complete and upload 95% of the dataset to the S3 input bucket. See the following code:
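The upload code isn’t shown in this excerpt; a sketch of the split and upload, assuming the default SageMaker bucket and a prefix of our choosing:

```python
import sagemaker

session = sagemaker.Session()
bucket = session.default_bucket()  # or your own input/output bucket
prefix = "bikeshare-autopilot"

# Hold out 5% for testing; Autopilot trains on the remaining 95%
test_data = balanced.sample(frac=0.05, random_state=42)
train_data = balanced.drop(test_data.index)

train_data.to_csv("train.csv", index=False)
test_data.to_csv("test.csv", index=False)

train_s3_uri = session.upload_data("train.csv", bucket=bucket, key_prefix=f"{prefix}/input")
output_s3_uri = f"s3://{bucket}/{prefix}/output"
print(train_s3_uri, output_s3_uri)
```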
After we upload the data to the input destination, it’s time to start Autopilot:
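The SDK calls aren’t reproduced in this excerpt; a minimal sketch using the SageMaker Python SDK’s AutoML class might look like the following (the job name is an assumption, and the target column name carries over from the earlier sketches):

```python
from sagemaker import AutoML, get_execution_role

automl = AutoML(
    role=get_execution_role(),
    target_attribute_name=target,  # "bike_share_for_all_trip"
    output_path=output_s3_uri,
    max_candidates=30,
    sagemaker_session=session,
)

automl.fit(inputs=train_s3_uri, job_name="bikeshare-autopilot", wait=False, logs=False)
```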
All we need to start experimenting is to call the fit() method. Autopilot needs the input and output S3 location and the target attribute column as the required parameters. After feature processing, Autopilot calls SageMaker automatic model tuning to find the best version of a model by running many training jobs on your dataset. We added the optional max_candidates parameter to limit the number of candidates to 30, which is the number of training jobs that Autopilot launches with different combinations of algorithms and hyperparameters in order to find the best model. If you don’t specify this parameter, it defaults to 250.
We can observe the progress of Autopilot with the following code:
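One way to poll the job status with the SDK:

```python
import time

# Poll the Autopilot job until it finishes
while True:
    job = automl.describe_auto_ml_job()
    status = job["AutoMLJobStatus"]
    secondary = job["AutoMLJobSecondaryStatus"]
    print(f"{status} - {secondary}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)
```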
The training takes some time to complete. While it’s running, let’s look at the Autopilot workflow.
To find the best candidate, use the following code:
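A sketch using the SDK’s best_candidate() helper:

```python
best_candidate = automl.best_candidate()
print("Candidate name:", best_candidate["CandidateName"])
print("Objective metric:", best_candidate["FinalAutoMLJobObjectiveMetric"]["MetricName"])
print("Objective value:", best_candidate["FinalAutoMLJobObjectiveMetric"]["Value"])
```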
The following screenshot shows our output.
Our model achieved a validation accuracy of 96%, so we’re going to deploy it. We could add a condition such that we only use the model if the accuracy is above a certain level.
Inference pipeline
Before we deploy our model, let’s examine our best candidate and what’s happening in our pipeline. See the following code:
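A sketch of how to list the candidate’s inference containers with the SDK:

```python
# Autopilot candidates are deployed as an inference pipeline of containers:
# feature transform -> algorithm -> inverse label transform
for container in best_candidate["InferenceContainers"]:
    print(container["Image"])
    print(container["ModelDataUrl"])
    print("---")
```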
The following diagram shows our output.
Autopilot has built the model and has packaged it in three different containers, each sequentially running a specific task: transform, predict, and reverse-transform. This multi-step inference is possible with a SageMaker inference pipeline.
A multi-step inference can also chain multiple inference models. For instance, one container can perform principal component analysis before passing the data to the XGBoost container.
Deploy the inference pipeline to an endpoint
The deployment process involves just a few lines of code:
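The deployment code isn’t included in this excerpt; a sketch using the SDK’s deploy() method might look like the following (the endpoint name and instance type are assumptions):

```python
endpoint_name = "bikeshare-autopilot-endpoint"

automl.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",  # instance choice is an assumption
    endpoint_name=endpoint_name,
    inference_response_keys=["predicted_label", "probability"],
)
```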
Let’s configure our endpoint for prediction with a predictor:
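A possible predictor configuration, using CSV serialization to match the pipeline’s expected input and output (names carried over from the sketches above):

```python
from sagemaker.deserializers import CSVDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=session,
    serializer=CSVSerializer(),
    deserializer=CSVDeserializer(),
)
```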
Now that we have our endpoint and predictor ready, it’s time to use the testing data we set aside and test the accuracy of our model. We start by defining a utility function that sends the data one line at a time to our inference endpoint and gets a prediction in return. Because we have an XGBoost model, we drop the target variable before sending the CSV line to the endpoint. Additionally, we remove the header from the testing CSV before looping through the file, which is another requirement for XGBoost on SageMaker. See the following code:
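The original utility isn’t reproduced here; one possible sketch reads the held-out CSV with pandas, drops the target column by name, and sends each row to the endpoint as a headerless CSV string:

```python
import pandas as pd

test_df = pd.read_csv("test.csv")
actuals = test_df[target].tolist()
features_only = test_df.drop(columns=[target])

def predict_row(row):
    """Send one CSV-formatted row (no header, no target column) to the endpoint."""
    payload = ",".join(str(value) for value in row)
    result = predictor.predict(payload)
    # CSVDeserializer returns a list of rows; the first field is the predicted label
    return result[0][0]

predictions = [predict_row(row) for row in features_only.itertuples(index=False)]
print(predictions[:5])
```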
The following screenshot shows our output.
Now let’s calculate the accuracy of our model.
See the following code:
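One way to compute it from the predictions and labels collected above:

```python
correct = sum(1 for predicted, actual in zip(predictions, actuals) if predicted == actual)
accuracy = correct / len(actuals)
print(f"Test accuracy: {accuracy:.2%}")
```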
We get an accuracy of 92%. This is slightly lower than the 96% obtained during the validation step, but it’s still high enough. We don’t expect the accuracy to be exactly the same because the test is performed with a new dataset.
Data ingestion
We downloaded the data directly and configured it for training. In real life, you may have to send the data directly from the edge device into the data lake and have SageMaker load it directly from the data lake into the notebook.
Kinesis Data Firehose is a good option and the most straightforward way to reliably load streaming data into data lakes, data stores, and analytics tools. It can capture, transform, and load streaming data into Amazon S3 and other AWS data stores.
For our use case, we create a Kinesis Data Firehose delivery stream with a Lambda transformation function to do some lightweight data cleaning as it traverses the stream. See the following code:
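The function body isn’t included in this excerpt. The following minimal sketch shows the standard Kinesis Data Firehose record-transformation contract; the cleaning step is a placeholder that only trims whitespace around CSV fields:

```python
import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")

        # Placeholder cleaning step: trim whitespace around each CSV field
        cleaned = ",".join(field.strip() for field in payload.split(",")) + "\n"

        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(cleaned.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```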
This Lambda function performs light transformation of the data streamed from the devices onto the data lake. It expects a CSV formatted data file.
For the ingestion step, we download the data and simulate a data stream to Kinesis Data Firehose with a Lambda transform function and into our S3 data lake.
Let’s simulate streaming a few lines:
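A minimal way to push a few CSV lines into the delivery stream with boto3 (the stream name is an assumption):

```python
import boto3

firehose = boto3.client("firehose")
stream_name = "bikeshare-delivery-stream"  # assumed delivery stream name

with open("test.csv") as f:
    next(f)  # skip the header
    for i, line in enumerate(f):
        firehose.put_record(
            DeliveryStreamName=stream_name,
            Record={"Data": line.encode("utf-8")},
        )
        if i >= 9:  # stream only a few lines for the simulation
            break
```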
Clean up
It’s important to delete all the resources used in this exercise to minimize cost. The following code deletes the SageMaker inference endpoint we created as well as the training and testing data we uploaded:
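A minimal cleanup sketch, assuming the endpoint, bucket, and prefix names used in the earlier sketches:

```python
import boto3

# Remove the real-time endpoint
predictor.delete_endpoint()

# Remove the objects uploaded under our prefix (training data and Autopilot output)
s3 = boto3.resource("s3")
s3.Bucket(bucket).objects.filter(Prefix=prefix).delete()
```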
Conclusion
ML engineers, data scientists, and software developers can use Autopilot to build and deploy an inference pipeline with little to no ML programming experience. Autopilot saves time and resources, using data science and ML best practices. Large organizations can now shift engineering resources away from infrastructure configuration towards improving models and solving business use cases. Startups and smaller organizations can get started on machine learning with little to no ML expertise.
We recommend learning more about other important features SageMaker has to offer, such as the Amazon SageMaker Feature Store, which integrates with Amazon SageMaker Pipelines to create and reuse automated ML workflows with feature search and discovery. You can run multiple Autopilot simulations with different feature or target variants in your dataset. You could also approach this as a dynamic vehicle allocation problem in which your model tries to predict vehicle demand based on time (such as time of day or day of the week) or location, or a combination of both.
About the Authors
Doug Mbaya is a Senior Solutions Architect with a focus on data and analytics. Doug works closely with AWS partners, helping them integrate data and analytics solutions in the cloud. Doug’s prior experience includes supporting AWS customers in the ride sharing and food delivery segment.
Valerio Perrone is an Applied Science Manager working on Amazon SageMaker Automatic Model Tuning and Autopilot.
Amazon Scholar Ranjit Jhala named ACM Fellow
Jhala received the ACM honor for lifetime contributions to software verification, developing innovative tools to help computer programmers test their code.
Apply profanity masking in Amazon Translate
Amazon Translate is a neural machine translation service that delivers fast, high-quality, affordable, and customizable language translation. This post shows how you can mask profane words and phrases with a grawlix string (“?$#@$”).
Amazon Translate typically chooses clean words for your translation output. But in some situations, you want to prevent words that are commonly considered profane from appearing in the translated output. For example, when you’re translating video captions or subtitle content, or enabling in-game chat, and you want the translated content to be age appropriate and clear of any profanity, Amazon Translate allows you to mask the profane words and phrases using the profanity masking setting. You can apply profanity masking to both real-time translation and asynchronous batch processing in Amazon Translate. When using Amazon Translate with profanity masking enabled, the five-character sequence ?$#@$ is used to mask each profane word or phrase, regardless of the number of characters. Amazon Translate detects each profane word or phrase literally, not contextually.
Solution overview
To mask profane words and phrases in your translation output, you can enable the profanity option under the additional settings on the Amazon Translate console when you run the translations with Amazon Translate both through real-time and asynchronous batch processing requests. The following sections demonstrate using profanity masking for real-time translation requests via the Amazon Translate console, AWS Command Line Interface (AWS CLI), or with the Amazon Translate SDK (Python Boto3).
Amazon Translate console
To demonstrate handling profanity with real-time translation, we use the following sample text in French to be translated into English:
Complete the following steps on the Amazon Translate console:
- Choose French (fr) as the Source language.
- Choose English (en) as the Target language.
- Enter the preceding example text in the Source Language text area.
The translated text appears under Target language. It contains a word that is considered profane in English.
- Expand Additional settings and enable Profanity.
The word is now replaced with the grawlix string ?$#@$.
AWS CLI
Calling the `translate-text` AWS CLI command with `--settings Profanity=MASK` masks profane words and phrases in your translated text.
The following AWS CLI commands are formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^).
You get a response like the following snippet:
Amazon Translate SDK (Python Boto3)
The following Python 3 code uses the real-time translation call with the profanity setting:
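The code isn’t reproduced in this excerpt; a minimal boto3 sketch with a placeholder source string might look like the following:

```python
import boto3

translate = boto3.client("translate")

response = translate.translate_text(
    Text="<your French source text here>",  # placeholder
    SourceLanguageCode="fr",
    TargetLanguageCode="en",
    Settings={"Profanity": "MASK"},
)

print(response["TranslatedText"])
```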
Conclusion
You can use the profanity masking setting to mask words and phrases that are considered profane to keep your translated text clean and meet your business requirements. To learn more about all the ways you can customize your translations, refer to Customizing Your Translations using Amazon Translate.
About the Authors
Siva Rajamani is a Boston-based Enterprise Solutions Architect at AWS. He enjoys working closely with customers and supporting their digital transformation and AWS adoption journey. His core areas of focus are serverless, application integration, and security. Outside of work, he enjoys outdoors activities and watching documentaries.
Sudhanshu Malhotra is a Boston-based Enterprise Solutions Architect for AWS. He’s a technology enthusiast who enjoys helping customers find innovative solutions to complex business challenges. His core areas of focus are DevOps, machine learning, and security. When he’s not working with customers on their journey to the cloud, he enjoys reading, hiking, and exploring new cuisines.
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.
How Süddeutsche Zeitung optimized their audio narration process with Amazon Polly
This is a guest post by Jakob Kohl, a Software Developer at the Süddeutsche Zeitung. Süddeutsche Zeitung is one of the leading quality dailies in Germany when it comes to paid subscriptions and unique users. Its website, SZ.de, reaches more than 15 million monthly unique users as of October 2021.
Thanks to smart speakers and podcasts, the audio industry has experienced a real boom in recent years. At Süddeutsche Zeitung, we’re constantly looking for new ways to make our diverse journalism even more accessible. As pioneers in digital journalism, we want to open up more opportunities for Süddeutsche Zeitung readers to consume articles. We started looking for solutions that could provide high-quality audio narration for our articles. Our ultimate goal was to launch a “listen to the article” feature.
In this post, we share how we optimized our audio narration process with Amazon Polly, a service that turns text into lifelike speech using advanced deep learning technologies.
Why Amazon Polly?
We believe that Vicki, the German neural Amazon Polly voice, is currently the best German voice on the market. Amazon Polly offers an impressive ability to switch between languages, correctly pronouncing, for example, English movie titles as well as personal names in different languages (for an example, listen to the article Schall und Wahn on our website).
A big part of our infrastructure already runs on AWS, so using Amazon Polly was a perfect fit. We can combine Amazon Polly with the following components:
- An Amazon Simple Notification Service (Amazon SNS) topic to which we can subscribe for articles. The articles are sent to this topic by the CMS whenever they’re saved by an editor.
- An Amazon CloudFront distribution with Lambda@Edge to paywall premium articles, which we can reuse for audio versions of articles.
The Amazon Polly API is easy to use and well documented. It took us less than a week to get our proof of concept to work.
The challenge
Hundreds of new articles are published every day on SZ.de. After initial publication, they might get updated several times for various reasons—new paragraphs are added in news-driven articles, typos are fixed, teasers are changed, or metadata is optimized for search engines.
Generating speech for the initial publication of an article is straightforward, because the whole text needs to be synthesized. But how can we quickly generate the audio for updated versions of articles without paying twice for the same content? Our biggest challenge was to prevent sending the whole text to Amazon Polly repeatedly for every single update.
Our technical solution
Every time an editor saves an article, the new version of the article is published to an SNS topic. An AWS Lambda function is subscribed to this topic and called for every new version of an article. This function runs the following steps:
- Check if the new version of the article has already been completely synthesized. If so, the function stops immediately (this may happen when only metadata is changed that doesn’t affect the audio).
- Convert the article into multiple SSML documents, roughly one for each text paragraph.
- For each SSML document, the function checks if it has already been synthesized to audio using calculated hashes. For example:
- If an article is saved for the first time, all SSML documents must be synthesized.
- If a typo has been fixed in a single paragraph, only the SSML document for this paragraph must be re-synthesized.
- If a new paragraph is added to the article, only the SSML document for this new paragraph must be synthesized.
- Send all not-yet-synthesized SSML documents separately to Amazon Polly.
These checks help optimize performance and reduce cost by preventing the synthesis of an entire article multiple times. We avoid incurring additional charges due to minor changes such as a title edit or metadata adjustments for SEO reasons.
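The implementation isn’t part of this post, but the hashing idea can be sketched as follows. This is only an illustration, not Süddeutsche Zeitung’s actual code; the bucket names, bookkeeping scheme, and voice settings are assumptions:

```python
import hashlib
import json

import boto3

polly = boto3.client("polly")
s3 = boto3.client("s3")

HASH_BUCKET = "example-audio-hashes"      # assumed bookkeeping bucket
AUDIO_BUCKET = "example-audio-fragments"  # assumed output bucket for Polly

def synthesize_changed_paragraphs(article_id, ssml_documents):
    """Synthesize only the SSML documents whose content hash hasn't been seen before."""
    for index, ssml in enumerate(ssml_documents):
        digest = hashlib.sha256(ssml.encode("utf-8")).hexdigest()
        marker_key = f"{article_id}/{index}/{digest}.json"

        # Skip paragraphs that were already synthesized with identical content
        try:
            s3.head_object(Bucket=HASH_BUCKET, Key=marker_key)
            continue
        except s3.exceptions.ClientError:
            pass

        polly.start_speech_synthesis_task(
            Engine="neural",
            VoiceId="Vicki",
            TextType="ssml",
            Text=ssml,
            OutputFormat="mp3",
            OutputS3BucketName=AUDIO_BUCKET,
            OutputS3KeyPrefix=f"{article_id}/{index}-",
        )
        s3.put_object(Bucket=HASH_BUCKET, Key=marker_key, Body=json.dumps({"hash": digest}))
```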
The following diagram illustrates the solution workflow.
After Amazon Polly synthesizes the SSML documents, the audio files are sent to an output bucket in Amazon Simple Storage Service (Amazon S3). A second Lambda function is listening for object creation on that bucket, waits for the completion of all audio fragments of an article, and merges them into a final audio file using FFmpeg from a Lambda layer. This final audio is sent to another S3 bucket, which is used as the origin in our CloudFront distribution. In CloudFront, we reuse an existing paywall for premium articles for the corresponding audio version.
Based on our freemium model, we provide a shortened audio version of premium articles. Non-subscribers are able to listen to the first paragraph for free, but are required to purchase a subscription to access the full article.
Conclusion
Integration of Amazon Polly into our existing infrastructure was very straightforward. Our content requires minimal customization because we only include paragraphs and some additional breaks. The most challenging part was performance and cost optimization, which we achieved by splitting the article up into multiple SSML documents corresponding to paragraphs, checking for changes in each SSML document, and building the whole audio file by merging the fragments. With these optimizations, we are able to achieve the following:
- Decrease the amount of synthesized characters by at least 50% by only synthesizing real changes.
- Reduce the time it takes for a change in the article text to appear in the audio because there is less audio to synthesize.
- Add arbitrary audio files between paragraphs without re-synthesizing the whole article. For example, we can include a sound file in the shortened audio version of a premium article to separate the first paragraph from the ensuing note that a subscription is needed to listen to the full version.
In the first month after the launch of the “listen to the article” feature in our SZ.de articles, we received a lot of positive user feedback. We were able to reach almost 30,000 users during the first 2 months after launch. From these users, approximately 200 converted into a paid subscription only from listening to the teaser of an article behind our paywall. The “listen to the article” feature isn’t behind our paywall, but users can only listen to premium articles fully if they have a subscription. Our website also offers free articles without a paywall. In the future, we will expand the feature to other SZ platforms, especially our mobile news apps.
About the Author
Jakob Kohl is a Software Developer at the Süddeutsche Zeitung, where he enjoys working with modern technologies on an agile website team. He is one of the main developers of the “listen to an SZ article” feature. In his leisure time, he likes building wooden furniture, where technical and visual design is as important as in web development.
Alexa AI team discusses NeurIPS workshop best paper award
Paper deals with detecting and answering out-of-domain requests for task-oriented dialogue systems.
Reduce costs and complexity of ML preprocessing with Amazon S3 Object Lambda
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. Often, customers have objects in S3 buckets that need further processing to be used effectively by consuming applications. Data engineers must support these application-specific data views with trade-offs between persisting derived copies and transforming data at the consumer level. Neither solution is ideal, because it introduces operational complexity, causes data consistency challenges, and wastes more expensive computing resources.
These trade-offs broadly apply to many machine learning (ML) pipelines that train on unstructured data, such as audio, video, and free-form text, among other sources. In each example, the training job must download data from S3 buckets, prepare an application-specific view, and then use an AI algorithm. This post demonstrates a design pattern for reducing costs, complexity, and centrally managing this second step. It uses the concrete example of image processing, though the approach broadly applies to any workload. The economic benefits are also most pronounced when the transformation step doesn’t require GPU, but the AI algorithm does.
The proposed solution also centralizes data transformation code and enables just-in-time (JIT) transformation. Furthermore, the approach uses a serverless infrastructure to reduce operational overhead and undifferentiated heavy lifting.
Solution overview
When ML algorithms process unstructured data like images and video, the data requires various normalization tasks (such as grayscaling and resizing). This step exists to accelerate model convergence, avoid overfitting, and improve prediction accuracy. You often perform these preprocessing steps on instances that later run the AI training. That approach creates inefficiencies, because those resources typically have more expensive processors (for example, GPUs) than these tasks require. Instead, our solution externalizes those operations across economical, horizontally scalable Amazon S3 Object Lambda functions.
This design pattern has three critical benefits. First, it centralizes the shared data transformation steps, such as image normalization and removing ML pipeline code duplication. Next, S3 Object Lambda functions avoid data consistency issues in derived data through JIT conversions. Third, the serverless infrastructure reduces operational overhead, increases access time, and limits costs to the per-millisecond time running your code.
An elegant solution exists in which you can centralize these data preprocessing and data conversion operations with S3 Object Lambda. S3 Object Lambda enables you to add code that modifies data from Amazon S3 before returning it to an application. The code runs within an AWS Lambda function, a serverless compute service. Lambda can instantly scale to tens of thousands of parallel runs while supporting dozens of programming languages and even custom containers. For more information, see Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3.
The following diagram illustrates the solution architecture.
In this solution, you have an S3 bucket that contains the raw images to be processed. Next, you create an S3 Access Point for these images. If you build multiple ML models, you can create separate S3 Access Points for each model. Alternatively, AWS Identity and Access Management (IAM) policies for access points support sharing reusable functions across ML pipelines. Then you attach a Lambda function that has your preprocessing business logic to the S3 Access Point. After you retrieve the data, you call the S3 Access Point to perform JIT data transformations. Finally, you update your ML model to use the new S3 Object Lambda Access Point to retrieve data from Amazon S3.
Create the normalization access point
This section walks through the steps to create the S3 Object Lambda access point.
Raw data is stored in an S3 bucket. To provide the user with the right set of permissions to access this data, while avoiding complex bucket policies that can cause unexpected impact to another application, you need to create S3 Access Points. S3 Access Points are unique host names that you can use to reach S3 buckets. With S3 Access Points, you can create individual access control policies for each access point to control access to shared datasets easily and securely.
- Create your access point.
- Create a Lambda function that performs the image resizing and conversion. See the following Python code:
import boto3
import cv2
import numpy as np
import requests
import io

def lambda_handler(event, context):
    print(event)
    object_get_context = event["getObjectContext"]
    request_route = object_get_context["outputRoute"]
    request_token = object_get_context["outputToken"]
    s3_url = object_get_context["inputS3Url"]

    # Get object from S3
    response = requests.get(s3_url)
    nparr = np.fromstring(response.content, np.uint8)
    img = cv2.imdecode(nparr, flags=1)

    # Transform object
    new_shape = (256, 256)
    resized = cv2.resize(img, new_shape, interpolation=cv2.INTER_AREA)
    gray_scaled = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)

    # Transform object
    is_success, buffer = cv2.imencode(".jpg", gray_scaled)
    if not is_success:
        raise ValueError('Unable to imencode()')
    transformed_object = io.BytesIO(buffer).getvalue()

    # Write object back to S3 Object Lambda
    s3 = boto3.client('s3')
    s3.write_get_object_response(
        Body=transformed_object,
        RequestRoute=request_route,
        RequestToken=request_token)

    return {'status_code': 200}
- Create an Object Lambda access point using the supporting access point from Step 1.
The Lambda function uses the supporting access point to download the original objects.
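If you script this step instead of using the console, a minimal boto3 sketch might look like the following; the account ID, names, and ARNs are placeholders:

```python
import boto3

s3control = boto3.client("s3control")

# The account ID, names, and ARNs below are placeholders
response = s3control.create_access_point_for_object_lambda(
    AccountId="111122223333",
    Name="image-normalizer",
    Configuration={
        "SupportingAccessPoint": "arn:aws:s3:us-west-2:111122223333:accesspoint/raw-images",
        "TransformationConfigurations": [
            {
                "Actions": ["GetObject"],
                "ContentTransformation": {
                    "AwsLambda": {
                        "FunctionArn": "arn:aws:lambda:us-west-2:111122223333:function:image-normalizer"
                    }
                },
            }
        ],
    },
)
print(response["ObjectLambdaAccessPointArn"])
```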
- Update Amazon SageMaker to use the new S3 Object Lambda access point to retrieve data from Amazon S3. See the following bash code:
aws s3api get-object --bucket arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer --key images/test.png test.png
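Equivalently, a training script can read through the Object Lambda access point with boto3 by passing the access point ARN where a bucket name is expected (the ARN mirrors the placeholder in the CLI example above):

```python
import boto3

s3 = boto3.client("s3")

# An Object Lambda access point ARN can be passed wherever a bucket name is expected
object_lambda_arn = (
    "arn:aws:s3-object-lambda:us-west-2:12345678901:accesspoint/image-normalizer"
)

response = s3.get_object(Bucket=object_lambda_arn, Key="images/test.png")
normalized_image_bytes = response["Body"].read()
```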
Cost savings analysis
Traditionally, ML pipelines copy images and other files from Amazon S3 to SageMaker instances and then perform normalization. However, performing these transformations on training instances is inefficient. Lambda functions, by contrast, scale horizontally to handle bursts and then shrink elastically, charging per millisecond only while the code runs. Many preprocessing steps don’t require GPUs and can even use ARM64. That creates an incentive to move that processing to more economical compute such as Lambda functions powered by AWS Graviton2 processors.
Using an example from the Lambda pricing calculator, you can configure the function with 256 MB of memory and compare the costs for both x86 and Graviton (ARM64). We chose this size because it’s sufficient for many single-image data preparation tasks. Next, use the SageMaker pricing calculator to compute expenses for an ml.p2.xlarge instance. This is the smallest supported SageMaker training instance with GPU support. These results show up to 90% compute savings for operations that don’t use GPUs and can shift to Lambda. The following table summarizes these findings.
| | Lambda with x86 | Lambda with Graviton2 (ARM) | SageMaker ml.p2.xlarge |
| --- | --- | --- | --- |
| Memory (GB) | 0.25 | 0.25 | 61 |
| CPU | — | — | 4 |
| GPU | — | — | 1 |
| Cost/hour | $0.061 | $0.049 | $0.90 |
Conclusion
You can build modern applications to unlock insights into your data. These different applications have unique data view requirements, such as formatting and preprocessing actions. Addressing these other use cases can result in data duplication, increasing costs, and more complexity to maintain consistency. This post offers a solution for efficiently handling these situations using S3 Object Lambda functions.
Not only does this remove the need for duplication, it also forms a path to scale these actions horizontally across less expensive compute! Even optimizing the transformation code for the ml.p2.xlarge instance would still be significantly more costly because of the idle GPUs.
For more ideas on using serverless and ML, see Machine learning inference at scale using AWS serverless and Deploying machine learning models with serverless templates.
About the Authors
Nate Bachmeier is an AWS Senior Solutions Architect who nomadically explores New York, one cloud integration at a time. He specializes in migrating and modernizing customers’ workloads. Besides this, Nate is a full-time student and has two kids.
Marvin Fernandes is a Solutions Architect at AWS, based in the New York City area. He has over 20 years of experience building and running financial services applications. He is currently working with large enterprise customers to solve complex business problems by crafting scalable, flexible, and resilient cloud architectures.
Automated reasoning’s scientific frontiers
Distributing proof search, reasoning about distributed systems, and automating regulatory compliance are just three fruitful research areas.
Extract entities from insurance documents using Amazon Comprehend named entity recognition
Intelligent document processing (IDP) is a common use case for customers on AWS. You can use Amazon Comprehend and Amazon Textract for a variety of use cases, such as document extraction, data classification, and entity extraction. One specific industry that uses IDP is insurance. Insurers use IDP to automate data extraction for common use cases such as claims intake, policy servicing, quoting, payments, and next best actions. However, in some cases, an office receives a document with complex, label-less information. This is normally difficult for optical character recognition (OCR) software to capture, and identifying relationships and key entities becomes a challenge. The solution often requires manual human entry to ensure high accuracy.
In this post, we demonstrate how you can use named entity recognition (NER) for documents in their native formats in Amazon Comprehend to address these challenges.
Solution overview
In an insurance scenario, an insurer might receive a demand letter from an attorney’s office. The demand letter includes information such as what law office is sending the letter, who their client is, and what actions are required to satisfy their requests, as shown in the following example:
Because of the varied locations that this information could be found in a demand letter, these documents are often forwarded to an individual adjuster, who takes the time to read through the letter to determine all the necessary information required to proceed with a claim. The document may have multiple names, addresses, and requests that each need to be classified. If the client is mixed up with the beneficiary, or the addresses are switched, delays could add up and negative consequences could impact the company and customers. Because there are often small differences between categories like addresses and names, the documents are often processed by humans rather than using an IDP approach.
The preceding example document has many instances of overlapping entity values (entities that share similar properties but aren’t related). Examples of this are the address of the law office vs. the address of the insurance company or the names of the different individuals (attorney name, beneficiary, policy holder). Additionally, there is positional information (where the entity is positioned within the document) that a traditional text-only algorithm might miss. Therefore, traditional recognition techniques may not meet requirements.
In this post, we use named entity recognition in Amazon Comprehend to solve these challenges. The benefit of using this method is that the custom entity recognition model uses both the natural language and positional information of the text to accurately extract custom entities that may otherwise be impacted when flattening a document, as demonstrated in our preceding example of overlapping entity values. For this post, we use an artificially created dataset of legal requisition and demand letters for life insurance, but you can use this approach across any industry and document that may benefit from spatial data in custom NER training. The following diagram depicts the solution architecture:
We implement the solution with the following high-level steps:
- Clone the repository containing the sample dataset.
- Create an Amazon Simple Storage Service (Amazon S3) bucket.
- Create and train your custom entity recognition model.
- Use the model by running an asynchronous batch job.
Prerequisites
You need to complete the following prerequisites to use this solution:
- Install Python 3.8.x.
- Make sure you have pip installed.
- Install and configure the AWS Command Line Interface (AWS CLI).
- Configure your AWS credentials.
Annotate your documents
To train a custom entity recognition model that can be used on your PDF, Word, and plain text documents, you need to first annotate PDF documents using a custom Amazon SageMaker Ground Truth annotation template that is provided by Amazon Comprehend. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.
We recommend a minimum of 250 documents and 100 annotations per entity to ensure good quality predictions. With more training data, you’re more likely to produce a higher-quality model.
When you’ve finished annotating, you can train a custom entity recognition model and use it to extract custom entities from PDF, Word, and plain text documents for batch (asynchronous) processing.
For this post, we have already labeled our sample dataset, and you don’t have to annotate the documents provided. However, if you want to use your own documents or adjust the entities, you have to annotate the documents. For instructions, see Custom document annotation for extracting named entities in documents using Amazon Comprehend.
We extract the following entities (which are case sensitive):
Law Firm
Law Office Address
Insurance Company
Insurance Company Address
Policy Holder Name
Beneficiary Name
Policy Number
Payout
Required Action
Sender
The dataset provided is entirely artificially generated. Any mention of names, places, and incidents are either products of the author’s imagination or are used fictitiously. Any resemblance to actual events or locales or persons, living or dead, is entirely coincidental.
Clone the repository
Start by cloning the repository by running the following command:
The repository contains the following files:
Create an S3 bucket
To create an S3 bucket to use for this example, complete the following steps:
- On the Amazon S3 console, choose Buckets in the navigation pane.
- Choose Create bucket.
- Note the name of the bucket you just created.
To reuse the annotations that we already made for the dataset, we have to modify the `output.manifest` file and reference the bucket we just created.
- Modify the file by running the following commands:
When the script is finished running, you receive the following message:
We can now begin training our model.
Create and train the model
To start training your model, complete the following steps:
- On the Amazon S3 console, upload the `/source` folder, `/annotations` folder, `output.manifest`, and `sample.pdf` files.
Your bucket should look similar to the following screenshot.
- On the Amazon Comprehend console, under Customization in the navigation pane, choose Custom entity recognition.
- Choose Create new model.
- For Model name, enter a name.
- For Language, choose English.
- For Custom entity type, add the following case-sensitive entities:
Law Firm
Law Office Address
Insurance Company
Insurance Company Address
Policy Holder Name
Beneficiary Name
Policy Number
Payout
Required Action
Sender
- In Data specifications, for Data format, select Augmented manifest to reference the manifest we created when we annotated the documents.
- For Training model type, select PDF, Word Documents.
This specifies the type of documents you’re using for training and inference.
- For SageMaker Ground Truth augmented manifest file S3 location, enter the location of the `output.manifest` file in your S3 bucket.
- For S3 prefix for Annotation data files, enter the path to the `annotations` folder.
- For S3 prefix for Source documents, enter the path to the `source` folder.
- For Attribute names, enter `legal-entity-label-job-labeling-job-20220104T172242`.
The attribute name corresponds to the name of the labeling job you create for annotating the documents. For the pre-annotated documents, we use the name `legal-entity-label-job-labeling-job-20220104T172242`. If you choose to annotate your documents, substitute this value with the name of your annotation job.
- Create a new AWS Identity and Access Management (IAM) role and give it permissions to the bucket that contains all your data.
- Finish creating the model (select the Autosplit option for your data source to see similar metrics to those in the following screenshots).
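The console steps above can also be scripted. A minimal boto3 sketch might look like the following; the bucket paths and role ARN are placeholders, and the entity types mirror the list we defined earlier:

```python
import boto3

comprehend = boto3.client("comprehend")

# Bucket paths, the role ARN, and the recognizer name below are placeholders
response = comprehend.create_entity_recognizer(
    RecognizerName="insurance-demand-letter-recognizer",
    LanguageCode="en",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    InputDataConfig={
        "DataFormat": "AUGMENTED_MANIFEST",
        "EntityTypes": [
            {"Type": t} for t in [
                "Law Firm", "Law Office Address", "Insurance Company",
                "Insurance Company Address", "Policy Holder Name", "Beneficiary Name",
                "Policy Number", "Payout", "Required Action", "Sender",
            ]
        ],
        "AugmentedManifests": [
            {
                "S3Uri": "s3://<your-bucket>/output.manifest",
                "AttributeNames": ["legal-entity-label-job-labeling-job-20220104T172242"],
                "DocumentType": "SEMI_STRUCTURED_DOCUMENT",
                "AnnotationDataS3Uri": "s3://<your-bucket>/annotations/",
                "SourceDocumentsS3Uri": "s3://<your-bucket>/source/",
            }
        ],
    },
)
print(response["EntityRecognizerArn"])
```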
Now your recognizer model is visible on the dashboard with the model training status and metrics.
The model may take several minutes to train.
The following screenshot shows your model metrics when the training is complete.
Use the custom entity recognition model
To use the custom entity recognition models trained on PDF documents, we create a batch job to process them asynchronously.
- On the Amazon Comprehend console, choose Analysis jobs.
- Choose Create job.
- Under Input data, enter the Amazon S3 location of the annotated PDF documents to process (for this post, the `sample.pdf` file).
- For Input format, select One document per file.
- Under Output data, enter the Amazon S3 location where you want the results to be written. For this post, we create a new folder called `analysis-output` in the S3 bucket containing all source PDF documents, annotated documents, and the manifest.
- Use an IAM role with permissions to the `sample.pdf` folder.
You can use the role created earlier.
- Choose Create job.
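The same analysis job can be started programmatically. A minimal boto3 sketch, with placeholder ARNs and bucket names:

```python
import boto3

comprehend = boto3.client("comprehend")

# ARNs and bucket names below are placeholders
response = comprehend.start_entities_detection_job(
    JobName="insurance-demand-letter-analysis",
    LanguageCode="en",
    EntityRecognizerArn="arn:aws:comprehend:us-east-1:111122223333:entity-recognizer/insurance-demand-letter-recognizer",
    DataAccessRoleArn="arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
    InputDataConfig={
        "S3Uri": "s3://<your-bucket>/sample.pdf",
        "InputFormat": "ONE_DOC_PER_FILE",
    },
    OutputDataConfig={"S3Uri": "s3://<your-bucket>/analysis-output/"},
)
print(response["JobId"])
```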
This is an asynchronous job, so it may take a few minutes to complete processing. When the job is complete, you get a link to the output. When you open this output, you see a series of files as follows:
You can open the file `sample.pdf.out` in your preferred text editor. If you search for the Entities block, you can find the entities identified in the document. The following table shows an example.
| Type | Text | Score |
| --- | --- | --- |
| Insurance Company | Budget Mutual Insurance Company | 0.999984086 |
| Insurance Company Address | 9876 Infinity Ave\n Springfield, MI 65541 | 0.999982051 |
| Law Firm | Bill & Carr | 0.99997298 |
| Law Office Address | 9241 13th Ave SW\n Spokane, Washington (WA),99217 | 0.999274625 |
| Beneficiary Name | Laura Mcdaniel | 0.999972464 |
| Policy Holder Name | Keith Holt | 0.999781546 |
| Policy Number | (#892877136) | 0.999950143 |
| Payout | $15,000 | 0.999980728 |
| Sender | Angela Berry | 0.999723455 |
| Required Action | We are requesting that you forward the full policy amount of Please forward an\n acknowledgement of our demand and please forward the umbrella policy information if one is\n applicable. Please send my secretary any information regarding liens on his policy. | 0.999989449 |
Expand the solution
You can choose from a myriad of possibilities for what to do with the detected entities, such as the following:
- Ingest them into a backend system of record
- Create a searchable index based on the extracted entities
- Enrich machine learning and analytics using extracted entity values as parameters for model training and inference
- Configure back-office flows and triggers based on detected entity value (such as specific law firms or payout values)
The following diagram depicts these options:
Conclusion
Complex document types can often be impediments to full-scale IDP automation. In this post, we demonstrated how you can build and use custom NER models directly from PDF documents. This method is especially powerful for instances where positional information is especially pertinent (similar entity values and varied document formats). Although we demonstrated this solution by using legal requisition letters in insurance, you can extrapolate this use case across healthcare, manufacturing, retail, financial services, and many other industries.
To learn more about Amazon Comprehend, visit the Amazon Comprehend Developer Guide.
About the Authors
Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.
Enzo Staton is a Solutions Architect with a passion for working with companies to increase their cloud knowledge. He works closely as a trusted advisor and industry specialist with customers around the country.
New UW-Amazon Science Hub launches
The collaboration will focus on advancing innovation in core robotics and AI technologies and their applications.