Workshop at ICLR 2021 unites communities investigating synthetic data generation to improve machine learning and protect privacy.
Introducing hierarchical deletion to easily clean up unused resources in Amazon Forecast
Amazon Forecast just launched the ability to hierarchically delete resources at a parent level without having to locate the child resources. You can stay focused on building value-adding forecasting systems and not worry about trying to manage individual resources that are created in your workflow. Forecast uses machine learning (ML) to generate more accurate demand forecasts, without requiring any prior ML experience. Forecast brings the same technology used at Amazon.com to developers as a fully managed service, removing the need to manage resources or rebuild your systems.
When importing data, training a predictor, and creating forecasts, Forecast generates resources related to the dataset group. For example, when a predictor is generated using a dataset group, the predictor is the child resource and the dataset group is the parent resource. Previously, it was difficult to delete resources while building your forecasting system because you had to delete the child resources first, and then delete the parent resources. This was especially difficult and time-consuming because deleting resources required you to understand the various resource hierarchies, which weren’t immediately visible.
As you experiment and create multiple dataset groups, predictors, and forecasts, the resource hierarchy can become complicated. However, this streamlined hierarchical deletion method allows you to quickly clean up resources without having to worry about understanding the resource hierarchy.
In this post, we walk through the Forecast console experience of deleting all the resource types that are supported by Forecast. You can also perform hierarchical deletion by referencing the Deleting Resources page. To delete individual or child resources one at a time, you can continue to use the existing APIs such as DeleteDataset, DeleteDatasetGroup, DeleteDatasetImportJob, DeleteForecast, DeleteForecastExportJob, DeletePredictor, and DeletePredictorBacktestExportJob.
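As a minimal boto3 sketch (not the exact code from the documentation), the following deletes a dataset group and all of its child resources in one call, assuming the DeleteResourceTree API that backs this capability; the ARN is a placeholder you would replace with your own.

import boto3

forecast = boto3.client("forecast")

# Placeholder ARN: replace with the ARN of your own dataset group
dataset_group_arn = "arn:aws:forecast:us-east-1:123456789012:dataset-group/my_dataset_group"

# Deletes the dataset group and all of its child resources
# (predictors, predictor backtest exports, forecasts, and forecast exports)
forecast.delete_resource_tree(ResourceArn=dataset_group_arn)

# Child resources can still be deleted individually, for example:
# forecast.delete_forecast(ForecastArn="arn:aws:forecast:us-east-1:123456789012:forecast/my_forecast")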
Delete dataset group resources
To delete a dataset group when it doesn’t have any child resources, a simple dialog is displayed. You can delete the chosen resource by entering delete and choosing Delete.
When a dataset group has underlying child resources such as predictors, predictor backtest export jobs, forecasts, and forecast export jobs, a different dialog is displayed. After you enter delete and choose Delete, all these child resources are deleted, including the selected dataset group resource.
Delete dataset resources
For a dataset resource without child resources, a simple dialog is displayed during the delete operation.
When a dataset has child dataset import jobs, the following dialog is displayed.
Delete predictor resources
For a predictor resource without child resources, the following simple dialog is displayed.
When the predictor resource has underlying child resources such as predictor backtest export jobs, forecasts, or forecast export jobs, the following dialog is displayed. If you proceed with the delete action, all these child resources are deleted, including the selected predictor resource.
Delete a forecast resource
For a forecast resource without child resources, the following dialog is displayed.
When a forecast resource has underlying child resources such as forecast export jobs, the following dialog is displayed.
Delete dataset import job, predictor backtest export job, or forecast export job resources
The dataset import job, predictor backtest export job, and forecast export job resources don’t have any child resources. Therefore, when you choose to delete any of these resources via the Forecast console, a simple delete dialog is displayed. When you proceed with the delete, only the selected resources are deleted.
For example, when deleting a dataset import job resource, the following dialog is displayed.
Conclusion
You now have more flexibility when deleting a resource or an entire hierarchy of resources. To get started with this capability, see the Deleting Resources page and go through the notebook in our GitHub repo that walks you through how to perform hierarchical deletion. You can use this capability in all Regions where Forecast is publicly available. For more information about Region availability, see AWS Regional Services.
About the Authors
Alex Kim is a Sr. Product Manager for Amazon Forecast. His mission is to deliver AI/ML solutions to all customers who can benefit from it. In his free time, he enjoys all types of sports and discovering new places to eat.
Ranga Reddy Pallelra works as an SDE on the Amazon Forecast team. In his current role, he works on large-scale distributed systems with a focus on AI/ML. In his free time, he enjoys listening to music, watching movies, and playing racquetball.
Shannon Killingsworth is a UX Designer for Amazon Forecast and Amazon Personalize. His current work is creating console experiences that are usable by anyone, and integrating new features into the console experience. In his spare time, he is a fitness and automobile enthusiast.
Caltech names eight AI4Science fellows supported by Amazon
Amazon is collaborating with Caltech to support research, education, and outreach programs that help build bridges between AI and other areas of science and engineering.
Translate All: Automating multiple file type batch translation with AWS CloudFormation
This is a guest post by Cyrus Wong, an AWS Machine Learning Hero. You can learn more about and connect with AWS Machine Learning Heroes at the community page.
On July 29, 2020, AWS announced that Amazon Translate now supports Microsoft Office documents, including .docx, .xlsx, and .pptx.
The world is full of bilingual countries and cities like Hong Kong. I find myself always needing to prepare Office documents and presentation slides in both English and Chinese. Previously, it could be quite time-consuming to prepare the translated documents manually, and this approach can also lead to more errors. If I try to just select all, copy, and paste into a translation tool, then copy and paste the result into a new file, I lose all the formatting and images! My old method was to copy content piece by piece, translate it, then copy and paste it into the original document, over and over again. The new support for Office documents in Amazon Translate is really great news for teachers like me. It saves you a lot of time!
Still, we have to sort the documents by their file types and call Amazon Translate separately for different file types. For example, if I have notes in .docx files, presentations in .pptx files, and data in .xlsx files, I still have to sort them by their file type and send different TextTranslation API calls. In this post, I show how to sort content by document type and make batch translation calls. This solution automates the undifferentiated task of sorting files.
For my workflow, I need to upload all my course materials in a single Amazon Simple Storage Service (Amazon S3) bucket. This bucket often includes different types of files in one folder and subfolders.
However, when I start the Amazon Translate job on the console, I have to choose the file content type. The problem arises when different file types are in one folder without subfolders.
Therefore, our team developed a solution that we’ve called Translate All—a simple AWS serverless application to resolve those challenges and make it easier to integrate with other projects. A serverless architecture is a way to build and run applications and services without having to manage infrastructure. Your application still runs on servers, but all the server management is done by AWS. You no longer have to provision, scale, and maintain servers to run your applications, databases, and storage systems. For more information about serverless computing, see Serverless on AWS.
Solution overview
We use the following AWS services to run this solution:
- AWS Lambda – This serverless compute service lets you run code without provisioning or managing servers, creating workload-aware cluster scaling logic, maintaining event integrations, or managing runtimes. With Lambda, you can run code for virtually any type of application or backend service—all with zero administration.
- Amazon Simple Notification Service – Amazon SNS is a fully managed messaging service for both application-to-application (A2A) and application-to-person (A2P) communication. The A2A pub/sub functionality provides topics for high-throughput, push-based, many-to-many messaging between distributed systems, microservices, and event-driven serverless applications.
- Amazon Simple Queue Service – Amazon SQS is a fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications. Amazon SQS eliminates the complexity and overhead associated with managing and operating message-oriented middleware, and empowers developers to focus on differentiating work.
- AWS Step Functions – This serverless function orchestrator makes it easy to sequence Lambda functions and multiple AWS services into business-critical applications. Through its visual interface, you can create and run a series of checkpointed and event-driven workflows that maintain the application state. The output of one step acts as an input to the next. Each step in your application runs in order, as defined by your business logic.
In our solution, if a JSON message is sent to the SQS queue, it triggers a Lambda function to start a Step Functions state machine (see the following diagram).
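The Lambda function that bridges Amazon SQS and Step Functions can be very small. The following is a minimal sketch, not the actual function in the application; the STATE_MACHINE_ARN environment variable is a hypothetical name used here for illustration.

import os

import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # Each SQS record body is the JSON translation request shown later in this post
    for record in event["Records"]:
        sfn.start_execution(
            stateMachineArn=os.environ["STATE_MACHINE_ARN"],  # hypothetical env var
            input=record["body"],
        )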
The state machine includes the following high-level steps:
- In the Copy to Type Folder stage, the state machine gets all the keys under InputS3Uri and copies each type of file into a type-specific, individual folder. If subfolders exist, it replaces / with _ForwardSlash_.
The following screenshot shows the code for this step.
The following screenshot shows the output files.
- The Parallel Map stage arranges the data according to contentTypes and starts the translation job workflow in parallel.
- The Start Translation Job workflow starts the translation jobs and loops, checking completion status, until all translation jobs are complete (a sketch of the underlying Amazon Translate call follows this list).
- In the Copy to Parent Folder stage, the state machine reconstructs the original input folder structure and generates a signed URL that remains valid for 7 days.
- The final stage publishes the results to Amazon SNS.
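Under the hood, each branch of the Parallel Map stage calls the Amazon Translate batch API for its content type. The following is a minimal sketch of such a call, not the exact code from the application; the buckets, prefix, and role ARN are placeholders.

import boto3

translate = boto3.client("translate")

# Placeholders: substitute your own buckets, prefix, and IAM role ARN
translate.start_text_translation_job(
    JobName="testing-docx",
    InputDataConfig={
        "S3Uri": "s3://my-input-bucket/test/docx/",
        # One job per content type; this example is for .docx files
        "ContentType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    },
    OutputDataConfig={"S3Uri": "s3://my-output-bucket/test/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/TranslateDataAccessRole",
    SourceLanguageCode="en",
    TargetLanguageCodes=["zh-TW"],
)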
Additional considerations
When implementing this solution, consider the following:
- As of this writing, we just handle the happy path and assume that all jobs are in Completed status at the end
- The default job completion maximum period is 180 minutes; you can change the NumberOfIteration variable to extend it as needed
- You can't use the following reserved words in file or folder names: !!plain!!, !!html!!, !!document!!, !!presentation!!, !!sheet!!, or _ForwardSlash_
Deploy the solution
To deploy this solution, complete the following steps:
- Open the Serverless Application Repository link.
- Select I acknowledge that this app creates custom IAM roles.
- Choose Deploy.
- When the AWS CloudFormation console appears, note the input and output parameters on the Outputs tab.
Test the solution
In this section, we walk you through using the application.
- On the Amazon SNS console, subscribe your email address to TranslateCompletionSNSTopic.
- Upload all your files to the InputBucket bucket.
- Send a message to TranlateQueue. See the following example code:
{
"JobName": "testing",
"InputBucket": "//enter your InputBucket from the CloudFormation console//",
"InputS3Uri": "test",
"OutputBucket": "//enter your OutputBucket from the CloudFormation console//",
"SourceLanguageCode": "en",
"TargetLanguageCodes": [
"zh-TW"
]
}
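If you'd rather send this message programmatically than through the console, a minimal boto3 sketch follows; the bucket names and queue URL are placeholders you would take from the deployed stack's outputs.

import json

import boto3

sqs = boto3.client("sqs")

message = {
    "JobName": "testing",
    "InputBucket": "my-input-bucket",    # from the CloudFormation outputs
    "InputS3Uri": "test",
    "OutputBucket": "my-output-bucket",  # from the CloudFormation outputs
    "SourceLanguageCode": "en",
    "TargetLanguageCodes": ["zh-TW"],
}

# Placeholder queue URL: use the TranlateQueue URL created by the stack
sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/TranlateQueue",
    MessageBody=json.dumps(message),
)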
You receive the translation job result as an email.
The email contains a presigned URL with a 7-day validity period. You can share the translated file without having to sign in to the AWS Management Console.
Conclusion
With this solution, my colleagues and I easily resolved our course materials translation problem. We saved a lot of time compared to opening the files one by one and copy-and-pasting repeatedly. The translation quality is good and eliminates the potential for errors that can often come with undifferentiated manual workflows. Now we can just use the AWS Command Line Interface (AWS CLI) to run the Amazon S3 sync command to upload our files into an S3 bucket and translate all the course materials at once. Using this tool to leverage a suite of powerful AWS services has empowered my team to spend less time processing course materials and more time educating the next generation of cloud technology professionals!
Project collaborators include Mike Ng, Technical Program Intern at AWS, Brian Cheung, Sam Lam, and Pearly Law from the IT114115 Higher Diploma in Cloud and Data Centre Administration. This post was edited with Greg Rushing’s contribution.
The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.
About the Author
Cyrus Wong is Data Scientist of Cloud Innovation Centre at the IT Department of the Hong Kong Institute of Vocational Education (Lee Wai Lee). He has achieved all 13 AWS Certifications and actively promotes the use of AWS in different media and events. His projects received four Hong Kong ICT Awards in 2014, 2015, and 2016, and all winning projects are running solely on AWS with Data Science and Machine Learning.
Scale session-aware real-time product recommendations on Shopify with Amazon Personalize and Amazon EventBridge
This is a guest post by Jeff McKelvey, Principal Development Lead at HiConversion. The team at HiConversion has collaborated closely with James Jory, Applied AI Services Solutions Architect at AWS, and Matt Chwastek, Senior Product Manager for Amazon Personalize at AWS. In their own words, “HiConversion is the eCommerce Intelligence platform helping merchants personalize and optimize shopping experiences for every visitor session.”
Shopify powers over 1 million online businesses worldwide. It’s an all-in-one commerce platform to start, run, and grow a brand. Shopify’s mission is to reduce the barriers to business ownership, making commerce more equitable for everyone.
With over 50% of ecommerce sales coming from mobile shoppers, one of the challenges limiting future growth for Shopify’s merchants is effective product discovery. If visitors can’t quickly find products of interest on a merchant’s site, they leave, often for good.
That’s why we introduced HiConversion Recommend, a Shopify Plus certified application powered by Amazon Personalize. This application helps Shopify merchants deliver personalized product discovery experiences based on a user’s in-session behavior and interests directly on their own storefront.
We chose to integrate Amazon Personalize into the HiConversion Recommend application because it makes the same machine learning (ML) technology used by Amazon.com accessible to more Shopify merchants. This enables merchants to generate product recommendations that adapt to visitor actions and behavioral context in real time.
In this post, we describe the architectures used in our application for serving recommendations as well as synchronizing events and catalog updates in real time. We also share some of the results for session-based personalization from a customer using the application.
Private, fully managed recommendation systems
Amazon Personalize is an AI service from AWS that provides multiple ML algorithms purpose-built for personalized recommendation use cases. When a Shopify merchant installs the HiConversion Recommend application, HiConversion provisions a dedicated, private environment, represented as a dataset group within Amazon Personalize, for that merchant.
Then data from the merchant’s catalog as well as the browsing and purchase history of their shoppers is uploaded into datasets within the dataset group. Private ML models are trained using that data and deployed to unique API endpoints to provide real-time recommendations. Therefore, each merchant has their own private ML-based recommendation system, isolated from other merchants, which is fully managed by HiConversion.
HiConversion also creates and manages the resources needed to stream new events and catalog updates from a merchant’s Shopify storefront directly into Amazon Personalize. This enables the real-time capabilities of Amazon Personalize, such as learning the interests of new shoppers to the storefront, adapting to each evolving shopper intent, and incorporating new products in recommendations.
We can also apply business rules using Amazon Personalize filters, enabling merchants to tailor recommendations to a particular category of products, exclude recently purchased products from being recommended, and more.
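As an illustration (not the application's actual code), a filter that excludes recently purchased items could be created roughly like this; the filter name, dataset group ARN, and event type value are placeholders.

import boto3

personalize = boto3.client("personalize")

# Placeholder dataset group ARN for a single merchant's private environment
personalize.create_filter(
    name="exclude-purchased-items",
    datasetGroupArn="arn:aws:personalize:us-east-1:123456789012:dataset-group/merchant-123",
    # Exclude items the shopper has already purchased in this dataset group
    filterExpression='EXCLUDE ItemID WHERE Interactions.EVENT_TYPE IN ("purchase")',
)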
Serving millions of online shoppers in real time
Creating a premium, self-service Shopify application based on Amazon Personalize required the automation of many processes. Our goal was to democratize access to an advanced product discovery solution, making it easy to use by anyone running their store on Shopify.
To provide a seamless, real-time personalized user experience, an event-driven approach was needed to ensure that Shopify, Amazon Personalize, and HiConversion had the same picture of the visitor and product catalog at all times. For this, we chose to use Shopify’s integration with Amazon EventBridge as well as Amazon Simple Queue Service (Amazon SQS) and AWS Lambda.
The following high-level diagram illustrates how HiConversion Recommend manages the data connections between users and their product recommendations.
As shown in our diagram, AWS Lambda@Edge connects with three independent systems that provide the essential application capabilities:
- Amazon Personalize campaign – A custom API endpoint enabling real-time product recommendations based on Amazon Personalize algorithms.
- HiConversion Rich Data endpoint – This enables hybrid product recommendations based on a mix of HiConversion visitor and web analytics, and Amazon Personalize ranking algorithms.
- Amazon CloudFront endpoint – This enables rapid access to product metadata, like product images, pricing, and inventory, in combination with Amazon Simple Storage Service (Amazon S3).
When a Shopify merchant activates the HiConversion Recommend application, all of this infrastructure is automatically provisioned and training of the Amazon Personalize ML models is initiated—dramatically reducing the time to go live.
Why Lambda@Edge?
According to a 2017 Akamai study, a 100-millisecond delay in website load time can hurt conversion rates by up to 7%. Bringing an application to Shopify’s global network of stores means we had to prioritize performance globally.
We use Lambda@Edge on the front end of our application to track visitor activity and contextual data, allowing us to deliver product recommendations to visitors on Shopify-powered ecommerce sites with low latency. Putting our code as close as possible to shoppers improves overall performance and leads to reduced latency.
We chose Lambda@Edge to maximize the availability of our content delivery network. Lambda@Edge also removes the need to provision or manage infrastructure in multiple locations around the world; it allows us to reduce costs while providing a highly scalable system.
Scaling with EventBridge for Shopify
Our application launched during the busiest time of the year—the holiday shopping season. One thing that stood out was the massive increase in promotional and business activity from our live customers. Due to those promotions, the frequency of catalog changes in our clients’ Shopify stores rapidly increased.
Our original implementation relied on Shopify webhooks, allowing us to take specific actions in response to connected events. Due to the increasing volume of data through our application, we realized that keeping the product metadata and the real-time product recommendations in sync was becoming problematic.
This was particularly common when large merchants started to use our application, or when merchants launched flash sales. The subsequent firehose of incoming data meant that our application infrastructure was at risk of not being able to keep up with the onslaught of web traffic, leading to broken shopping experiences for customers.
We needed a separate, more scalable solution.
We needed a solution that would scale with our customer base and our customers’ traffic. Enter EventBridge: a serverless, event-driven alternative to receiving webhooks via standard HTTP. Integrating with EventBridge meant that Shopify could directly send event data securely to AWS, instead of handling all that traffic within our own application.
Event-driven solutions like EventBridge provide a scalable buffer between our application and our addressable market of hundreds of thousands of live Shopify stores. It allows us to process events at the rate that works for our tech stack without getting overwhelmed. It’s highly scalable and resilient, is able to accept more event-based traffic, and reduces our infrastructure cost and complexity.
The following diagram illustrates how HiConversion Recommend uses EventBridge to enable our real-time product recommendation architecture with Amazon Personalize.
The architecture includes the following components:
- The Amazon Personalize putEvents() API enables product recommendations that consider real-time visitor actions and context. Visitor activity and contextual data are captured by the HiConversion web analytics module and sent to a Lambda@Edge function. We then use Amazon SQS and a Lambda function to stream events to an Amazon Personalize event tracker endpoint (a sketch of this call follows the list).
- EventBridge notifies Amazon Personalize about product catalog changes via a Lambda function dedicated to that purpose. For example, Amazon Personalize can recommend new products even when they don’t have prior order history.
- EventBridge also keeps Shopify product metadata in sync with HiConversion’s metadata stored in Amazon S3 for real-time delivery via Amazon CloudFront.
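The following is a minimal sketch of the putEvents() call that streams a shopper interaction to an Amazon Personalize event tracker; the tracking ID, identifiers, and event type are placeholders, not values from the HiConversion application.

from datetime import datetime, timezone

import boto3

personalize_events = boto3.client("personalize-events")

# Placeholders: the tracking ID comes from the merchant's event tracker
personalize_events.put_events(
    trackingId="11111111-2222-3333-4444-555555555555",
    userId="visitor-123",
    sessionId="session-456",
    eventList=[
        {
            "eventType": "product_viewed",
            "itemId": "sku-789",
            "sentAt": datetime.now(timezone.utc),
        }
    ],
)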
Ultimately, EventBridge replaced an undifferentiated custom implementation within our architecture with a fully managed solution able to automatically scale, allowing us to focus on building features that deliver differentiated value to our customers.
Measuring the effectiveness of session-based product recommendations on Shopify
Many product recommendation solutions are available to Shopify merchants, each using different types of algorithms. To measure the effectiveness of session-based recommendations from Amazon Personalize, and to indulge our data-curious team culture, we ran an objective experiment.
Selecting a technology in today's economy is challenging, so we designed this experiment to assist merchants in determining for themselves the effectiveness of Amazon Personalize session-based algorithms.
We started with a hypothesis: If session-based recommendation algorithms can adapt to visitors’ actions and context in real time, they should produce improved results when visitor intent and preferences suddenly shift.
To test our hypothesis, we identified a predictable shift in intent and preferences driven by the following factors:
- Visitor preferences – Before the holiday season, visitors are typically buying something for themselves, whereas during the holiday season, visitors are also buying things for others.
- Visitor profiles – Visitor profiles before the holiday season are different than during the holidays. The holiday season sees more new visitors who have never purchased before and whose preferences are unknown.
- Brand promotions – During the holiday season, brands aggressively promote their offerings, which impact visitor behavior and decision-making.
To evaluate our hypothesis, we compared pre-holiday product recommendation results with results from the peak holiday period. One of our clients, a large and successful cosmetics brand, found that product recommendations improved Revenue Per Visitor (RPV) by 113% when compared to the pre-holiday period.
Pre-holiday product recommendation results
First, we looked at the percentage of overall revenue impacted by personalized product recommendations. For this client, only 7.7% of all revenues were influenced by personalized product recommendations compared to non-personalized experiences.
Second, we looked at the RPV—the most important metric for measuring new ecommerce revenue growth. Conversion Rate (CR) or Average Order Value (AOV) only tell us part of the story and can be misleading if relied on alone.
For example, a merchant can increase the site conversion rate with aggressive promotions that actually lead to a decline in the average order value, netting a drop in overall revenue.
Based on our learnings, at HiConversion we evangelize RPV as the metric to use when measuring the effectiveness of an ecommerce product recommendation solution.
In this example, visitors who engaged with recommended products had over 175% higher RPV than visitors who did not.
Our analysis illustrates that session-based product recommendations are very effective. If recommendations weren’t effective, visitors who engaged with recommendations wouldn’t have seen a higher RPV when compared with those that didn’t engage with recommended products.
Peak-holiday product recommendation results
A leading indicator that session-based recommendations were working was the increase in the percentage of overall sales influenced by personalized recommendations. It grew from 7.7% before the holidays to 14.6% during the holiday season.
This data is even more impressive when we looked at RPV lift. Visitors who engaged with personalized recommendations had over 259% higher RPV than those who didn’t.
In comparison, session-based recommendations during the holidays outperformed the pre-holiday RPV lift, as the following table shows.
| | Before Holidays | During Holidays | Relative Lift |
| --- | --- | --- | --- |
| RPV Lift | 175.04% | 258.61% | 47.74% |
New revenue calculations
Based on the preceding data points, we can calculate new revenues attributable directly to HiConversion Recommend.
| | Before Holidays | During Holidays |
| --- | --- | --- |
| RPV (personalized) | $6.84 | $14.70 |
| RPV (non-personalized) | $2.49 | $4.10 |
| Visits (personalized) | 33,862 | 97,052 |
| Visits (non-personalized) | 1,143,147 | 2,126,693 |
| Revenue attributable to HiConversion Recommend | $147,300 | $1,028,751 |
| % of all revenue attributable to HiConversion Recommend | 5% | 10% |
These calculations make a strong case for the high ROI of using HiConversion Recommend, when considering the new revenue potential created by session-based recommendations.
Conclusions
Product recommendations for Shopify—powered by Amazon Personalize—are an effective way of engaging and converting more new shoppers. To prove it, we have built a challenge to show you how quickly you can achieve measurable, positive ROI. To get started, sign up for the 7-Day Product Recommendation Challenge.
A well-designed and scalable solution is particularly important when serving a massive, global customer base. And because session-based, real-time personalization is a differentiator that drives ecommerce growth, it's extremely important to consider the best technology partner for your business.
About the Authors
Jeff McKelvey is the Principal Development Lead at HiConversion.
James Jory is a Solutions Architect in Applied AI with AWS. He has a special interest in personalization and recommender systems and a background in ecommerce, marketing technology, and customer data analytics. In his spare time, he enjoys camping and auto racing simulation.
Matt Chwastek is a Senior Product Manager for Amazon Personalize. He focuses on delivering products that make it easier to build and use machine learning solutions. In his spare time, he enjoys reading and photography.
Annotate dense point cloud data using SageMaker Ground Truth
Autonomous vehicle companies typically use LiDAR sensors to generate a 3D understanding of the environment around their vehicles. For example, they mount a LiDAR sensor on their vehicles to continuously capture point-in-time snapshots of the surrounding 3D environment. The LiDAR sensor output is a sequence of 3D point cloud frames (the typical capture rate is 10 frames per second). Amazon SageMaker Ground Truth makes it easy to label objects in a single 3D frame or across a sequence of 3D point cloud frames for building machine learning (ML) training datasets. Ground Truth also supports sensor fusion of camera and LiDAR data with up to eight video camera inputs.
As LiDAR sensors become more accessible and cost-effective, customers are increasingly using point cloud data in new spaces like robotics, signal mapping, and augmented reality. Some new mobile devices even include LiDAR sensors, one of which supplied the data for this post! The growing availability of LiDAR sensors has increased interest in point cloud data for ML tasks, like 3D object detection and tracking, 3D segmentation, 3D object synthesis and reconstruction, and even using 3D data to validate 2D depth estimation.
Although dense point cloud data is rich in information (often over 1 million points), it's challenging to label because labeling workstations often have limited memory and graphics capabilities, and annotators tend to be geographically distributed, which can increase latency. Even when large numbers of points are renderable in a labeler's workstation, labeler throughput can be reduced due to rendering time when dealing with multi-million-point clouds, greatly increasing labeling costs and reducing efficiency.
A way to reduce these costs and time is to convert point cloud labeling jobs into smaller, more easily rendered tasks that preserve most of the point cloud’s original information for annotation. We refer to these approaches broadly as downsampling, similar to downsampling in the signal processing domain. Like in the signal processing domain, point cloud downsampling approaches attempt to remove points while preserving the fidelity of the original point cloud. When annotating downsampled point clouds, you can use the output 3D cuboids for object tracking and object detection tasks directly for training or validation on the full-size point cloud with little to no impact on model performance while saving labeling time. For other modalities, like semantic segmentation, in which each point has its own label, you can use your downsampled labels to predict the labels on each point in the original point cloud, allowing you to perform a tradeoff between labeler cost (and therefore amount of labeled data) and a small amount of misclassifications of points in the full-size point cloud.
In this post, we walk through how to perform downsampling techniques to prepare your point cloud data for labeling, then demonstrate how to upsample your output labels to apply to your original full-size dataset using some in-sample inference with a simple ML model. To accomplish this, we use Ground Truth and Amazon SageMaker notebook instances to perform labeling and all preprocessing and postprocessing steps.
The data
The data we use in this post is a scan of an apartment building rooftop generated using the 3D Scanner App on an iPhone12 Pro. The app allows you to use the built-in LiDAR scanners on mobile devices to scan a given area and export a point cloud file. In this case, the point cloud data is in xyzrgb format, an accepted format for a Ground Truth point cloud. For more information about the data types allowed in a Ground Truth point cloud, see Accepted Raw 3D Data Formats.
The following image shows our 3D scan.
Methods
We first walk through a few approaches to reduce dataset size for labeling point clouds: tiling, fixed step sample, and voxel mean. We demonstrate why downsampling techniques can increase your labeling throughput without significantly sacrificing annotation quality, and then we demonstrate how to use labels created on the downsampled point cloud and apply them to your original point cloud with an upsampling approach.
Downsampling approaches
Downsampling is taking your full-size dataset and either choosing a subset of points from it to label, or creating a representative set of new points that aren’t necessarily in the original dataset, but are close enough to allow labeling.
Tiling
One naive approach is to break your point cloud space into 3D cubes, otherwise known as voxels, of (for example) 500,000 points each that are labeled independently in parallel. This approach, called tiling, effectively reduces the scene size for labeling.
However, it can greatly increase labeling time and costs, because a typical 8-million-point scene may need to be broken up into over 16 sub-scenes. The large number of independent tasks that result from this method means more annotator time is spent on context switching between tasks, and workers may lose context when the scene is too small, resulting in mislabeled data.
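The notebook in this post doesn't implement tiling, but a rough NumPy sketch of the idea, splitting a point cloud into fixed-size 3D tiles keyed by voxel index, might look like the following; the tile size is an arbitrary illustrative value.

import numpy as np

def tile_point_cloud(pc, tile_size=5.0):
    """Group points into cubic tiles of side length tile_size (same units as pc).

    pc is an (N, D) array whose first three columns are x, y, z.
    Returns a dict mapping an (i, j, k) tile index to the points in that tile.
    """
    tile_ids = np.floor(pc[:, :3] / tile_size).astype(int)
    tiles = {}
    for idx in np.unique(tile_ids, axis=0):
        mask = np.all(tile_ids == idx, axis=1)
        tiles[tuple(idx)] = pc[mask]
    return tiles

# Each tile could then be exported as its own, smaller Ground Truth labeling task
# tiles = tile_point_cloud(pc, tile_size=5.0)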
Fixed step sample
An alternative approach is to select or create a reduced number of points by a linear subsample, called a fixed step sample. Let's say you want to hit a target of 500,000 points (we have observed this is generally renderable on a consumer laptop—see Accepted Raw 3D Data Formats), but you have a point cloud with 10 million points. You can calculate your step size as step = 10,000,000 / 500,000 = 20. After you have a step size, you can select every 20th point in your dataset, creating a new point cloud. If your point cloud data is of high enough density, labelers should still be able to make out any relevant features for labeling even though you may only have 1 point for every 20 in the original scene.
The downside of this approach is that not all points contribute to the final downsampled result, meaning that if a point is one of few important ones, but not part of the sample, your annotators may miss the feature entirely.
Voxel mean
An alternate form of downsampling that uses all points to generate a downsampled point cloud is to perform grid filtering. Grid filtering means you break the input space into regular 3D boxes (or voxels) across the point cloud and replace all points within a voxel with a single representative point (the average point, for example). The following diagram shows an example voxel (the red box).
If no points exist from the input dataset within a given voxel, no point is added to the downsampled point cloud for that voxel. Grid filtering differs from a fixed step sample because you can use it to reduce noise and further tune it by adjusting the kernel size and averaging function to result in slightly different final point clouds. The following point clouds show the results of simple (fixed step sample) and advanced (voxel mean) downsampling. The point cloud downsampled using the advanced method is smoother; this is particularly noticeable when comparing the red brick wall in the back of both scenes.
Upsampling approach
After downsampling and labeling your data, you may want to see the labels produced on the smaller, downsampled point cloud projected onto the full-size point cloud, which we call upsampling. Object detection or tracking jobs don’t require post-processing to do this. Labels in the downsampled point cloud (like cuboids) are directly applicable to the larger point cloud because they’re defined in a world coordinate space shared by the full-size point cloud (x, y, z, height, width, length). These labels are minimally susceptible to very small errors along the boundaries of objects when a boundary point wasn’t in the downsampled dataset, but such occasional and minor errors are outweighed by the number of extra, correctly labeled points within the cuboid that can also be trained on.
For 3D point cloud semantic segmentation jobs, however, the labels aren’t directly applicable to the full-size dataset. We only have a subset of the labels, but we want to predict the rest of the full dataset labels based on this subset. To do this, we can use a simple K-Nearest Neighbors (K-NN) classifier with the already labeled points serving as the training set. K-NN is a simple supervised ML algorithm that predicts the label of a point using the “K” closest labeled points and a weighted vote. With K-NN, we can predict the point class of the rest of the unlabeled points in the full-size dataset based on the majority class of the three closest (by Euclidean distance) points. We can further refine this approach by varying the hyperparameters of a K-NN classifier, like the number of closest points to consider as well as the distance metric and weighting scheme of points.
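The notebook later shows a basic K-NN upsampling step. As an illustration of the hyperparameter tuning mentioned above (not code from the notebook), you could compare uniform and distance-weighted voting on a held-out labeled tile roughly like this; the input arrays are assumed to come from your own labeled data.

from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

def tune_knn(train_xyz, train_labels, tile_xyz, tile_labels):
    """Compare K-NN settings using a held-out labeled tile as ground truth."""
    results = {}
    for weights in ("uniform", "distance"):
        for k in (3, 5, 10):
            knn = KNeighborsClassifier(n_neighbors=k, weights=weights)
            knn.fit(train_xyz, train_labels)
            acc = accuracy_score(tile_labels, knn.predict(tile_xyz))
            results[(k, weights)] = acc
            print(f"k={k}, weights={weights}: accuracy={acc:.3f}")
    return results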
After you map the sample labels to the full dataset, you can visualize tiles within the full-size dataset to see how well the upsampling strategy worked.
Now that we’ve reviewed the methods used in this post, we demonstrate these techniques in a SageMaker notebook on an example semantic segmentation point cloud scene.
Prerequisites
To walk through this solution, you need the following:
- An AWS account.
- A notebook AWS Identity and Access Management (IAM) role with the permissions required to complete this walkthrough. Your IAM role must have the following AWS managed policies attached:
AmazonS3FullAccess
AmazonSageMakerFullAccess
- An Amazon Simple Storage Service (Amazon S3) bucket where the notebook artifacts (input data and labels) are stored.
- A SageMaker work team. For this post, we use a private work team. You can create a work team on the SageMaker console.
Notebook setup
We use the notebook ground_truth_annotation_dense_point_cloud_tutorial.ipynb
in the SageMaker Examples section of a notebook instance to demonstrate these downsampling and upsampling approaches. This notebook contains all code required to perform preprocessing, labeling, and postprocessing.
To access the notebook, complete the following steps:
- Create a notebook instance. You can use the ml.t2.xlarge instance type to launch the notebook instance. Choose an instance with at least 16 GB of RAM.
- You need to use the notebook IAM role you created earlier. This role allows your notebook to upload your dataset to Amazon S3 and call the solution APIs.
- Open Jupyter Lab or Jupyter to access your notebook instance.
- In Jupyter, choose the SageMaker Examples tab. In JupyterLab, choose the SageMaker icon.
- Choose Ground Truth Labeling Jobs and then choose the ipynb notebook.
- If you're using Jupyter, choose Use to copy the notebook to your instance and run it. If you're in JupyterLab, choose Create a Copy.
Provide notebook inputs
First, we modify the notebook to add our private work team ARN and the bucket location we use to store our dataset as well as our labels.
Section 1: Retrieve the dataset and visualize the point cloud
We download our data by running Section 1 of our notebook, which downloads our dataset from Amazon S3 and loads the point cloud into our notebook instance. We download custom prepared data from an AWS owned bucket. An object called rooftop_12_49_41.xyz
should be in the root of the S3 bucket. This data is a scan of an apartment building rooftop custom generated on a mobile device. In this case, the point cloud data is in xyzrgb format.
We can visualize our point cloud using the Matplotlib scatter3d function. The point cloud file contains all the correct points but isn’t rotated correctly. We can rotate the object around its axis by multiplying the point cloud by a rotation matrix. We can obtain a rotation matrix using scipy and specify the degree changes we want to make to each axis using the from_euler
method:
!aws s3 cp s3://smgt-downsampling-us-east-1-322552456788/rooftop_12_49_41.xyz pointcloud.xyz

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.transform import Rotation

# Let's read our dataset into a numpy array
pc = np.loadtxt("pointcloud.xyz", delimiter=",")
print(f"Loaded points of shape {pc.shape}")

# playing with view of 3D scene
def plot_pointcloud(pc, rot=[[30, 90, 60]], color=True, title="Simple Downsampling 1", figsize=(50, 25), verbose=False):
    if rot:
        rot1 = Rotation.from_euler('zyx', rot, degrees=True)
        R1 = rot1.as_matrix()
        if verbose:
            print('Rotation matrix:', '\n', R1)
        # matrix multiplication between our rotation matrix and pointcloud
        pc_show = np.matmul(R1, pc.copy()[:, :3].transpose()).transpose()
        if color:
            try:
                rot_color1 = np.matmul(R1, pc.copy()[:, 3:].transpose()).transpose().squeeze()
            except:
                rot_color1 = np.matmul(R1, np.tile(pc.copy()[:, 3], (3, 1))).transpose().squeeze()
    else:
        pc_show = pc
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(111, projection="3d")
    ax.set_title(title, fontdict={'fontsize': 20})
    if color:
        ax.scatter(pc_show[:, 0], pc_show[:, 1], pc_show[:, 2], c=rot_color1[:, 0], s=0.05)
    else:
        ax.scatter(pc_show[:, 0], pc_show[:, 1], pc_show[:, 2], c='blue', s=0.05)

# rotate in z direction 30 degrees, y direction 90 degrees, and x direction 60 degrees
rot1 = Rotation.from_euler('zyx', [[30, 90, 60]], degrees=True)
print('Rotation matrix:', '\n', rot1.as_matrix())

plot_pointcloud(pc, rot=[[30, 90, 60]], color=True, title="Full pointcloud", figsize=(50, 30))
Section 2: Downsample the dataset
Next, we downsample the dataset to less than 500,000 points, which is an ideal number of points for visualizing and labeling. For more information, see the Point Cloud Resolution Limits in Accepted Raw 3D Data Formats. Then we plot the results of our downsampling by running Section 2.
As we discussed earlier, the simplest form of downsampling is to choose values using a fixed step size based on how large we want our resulting point cloud to be.
A more advanced approach is to break the input space into cubes, otherwise known as voxels, and choose a single point per box using an averaging function. A simple implementation of this is shown in the following code.
You can tune the target number of points and box size used to see the reduction in point cloud clarity as more aggressive downsampling is performed.
import scipy.stats  # provides binned_statistic_dd for the grid (voxel) filter below

# Basic Approach
target_num_pts = 500_000
subsample = int(np.ceil(len(pc) / target_num_pts))
pc_downsample_simple = pc[::subsample]
print(f"We've subsampled to {len(pc_downsample_simple)} points")

# Advanced Approach
boxsize = 0.013  # 1.3 cm box size.
mins = pc[:, :3].min(axis=0)
maxes = pc[:, :3].max(axis=0)
volume = maxes - mins
num_boxes_per_axis = np.ceil(volume / boxsize).astype('int32').tolist()
num_boxes_per_axis.extend([1])
print(num_boxes_per_axis)

# For each voxel or "box", use the mean of the points in that box as the representative downsampled point.
means, _, _ = scipy.stats.binned_statistic_dd(
    pc[:, :4],
    [pc[:, 0], pc[:, 1], pc[:, 2], pc[:, 3]],
    statistic="mean",
    bins=num_boxes_per_axis,
)
x_means = means[0, ~np.isnan(means[0])].flatten()
y_means = means[1, ~np.isnan(means[1])].flatten()
z_means = means[2, ~np.isnan(means[2])].flatten()
c_means = means[3, ~np.isnan(means[3])].flatten()
pc_downsample_adv = np.column_stack([x_means, y_means, z_means, c_means])
print(pc_downsample_adv.shape)
Section 3: Visualize the 3D rendering
We can visualize point clouds using a 3D scatter plot of the points. Although our point clouds have color, our transforms have different effects on color, so comparing them in a single color provides a better comparison. We can see that the advanced voxel mean method creates a smoother point cloud because averaging has a noise reduction effect. In the following code, we can look at our point clouds from two separate perspectives by multiplying our point clouds by different rotation matrices.
When you run Section 3 in the notebook, you also see a comparison of a linear step approach versus a box grid approach, specifically in how the box grid filter has a slight smoothing effect on the overall point cloud. This smoothing could be important depending on the noise level of your dataset. Modifying the grid filtering function from mean to median or some other averaging function can also improve the final point cloud clarity. Look carefully at the back wall of the simple (fixed step size) and advanced (voxel mean) downsampled examples, and notice the smoothing effect the voxel mean method has compared to the fixed step size method.
rot1 = Rotation.from_euler('zyx', [[30,90,60]], degrees=True)
R1 = rot1.as_matrix()
simple_rot1 = pc_downsample_simple.copy()
simple_rot1 = np.matmul(R1, simple_rot1[:,:3].transpose() ).transpose()
advanced_rot1 = pc_downsample_adv.copy()
advanced_rot1 = np.matmul(R1, advanced_rot1[:,:3].transpose() ).transpose()
fig = plt.figure( figsize=(50, 30))
ax = fig.add_subplot(121, projection="3d")
ax.set_title("Simple Downsampling 1", fontdict={'fontsize':20})
ax.scatter(simple_rot1[:,0], simple_rot1[:,1], simple_rot1[:,2], c='blue', s=0.05)
ax = fig.add_subplot(122, projection="3d")
ax.set_title("Voxel Mean Downsampling 1", fontdict={'fontsize':20})
ax.scatter(advanced_rot1[:,0], advanced_rot1[:,1], advanced_rot1[:,2], c='blue', s=0.05)
# to look at any of the individual pointclouds or rotate the pointcloud, use the following function
plot_pointcloud(pc_downsample_adv, rot = [[30,90,60]], color=True, title="Advanced Downsampling", figsize=(50,30))
Section 4: Launch a Semantic Segmentation Job
Run Section 4 in the notebook to take this point cloud and launch a Ground Truth point cloud semantic segmentation labeling job using it. These cells generate the required input manifest file and format the point cloud in a Ground Truth compatible representation.
To learn more about the input format of Ground Truth as it relates to point cloud data, see Input Data and Accepted Raw 3D Data Formats.
In this section, we also perform the labeling in the worker portal. We label a subset of the point cloud to have some annotations to perform upsampling with. When the job is complete, we load the annotations from Amazon S3 into a NumPy array for our postprocessing. The following is a screenshot from the Ground Truth point cloud semantic segmentation tool.
Section 5: Perform label upsampling
Now that we have the downsampled labels, we train a K-NN classifier from SKLearn to predict the full dataset labels by treating our annotated points as training data and performing inference on the remainder of the unlabeled points in our full-size point cloud.
You can tune the number of points used as well as the distance metric and weighting scheme to influence how label inference is performed. If you label a few tiles in the full-size dataset, you can use those labeled tiles as ground truth to evaluate the accuracy of the K-NN predictions. You can then use this accuracy metric for hyperparameter tuning of K-NN or to try different inference algorithms to reduce your number of misclassified points between object boundaries, resulting in the lowest possible in-sample error rate. See the following code:
# There's a lot of room to tune K-NN further:
# 1) Prevent classification of points far away from all other points (random unfiltered ground points)
# 2) Perform a non-uniform weighted vote
# 3) Tweak the number of neighbors
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
print(f"Training on {len(pc_downsample_adv)} labeled points")
knn.fit(pc_downsample_adv[:, :3], annotations)

print(f"Upsampled to {len(pc)} labeled points")
annotations_full = knn.predict(pc[:, :3])
Section 6: Visualize the upsampled labels
Now that we have performed upsampling of our labeled data, we can visualize a tile of the original full-size point cloud. We aren’t rendering all of the full-size point cloud because that may prevent our visualization tool from rendering. See the following code:
pc_downsample_annotated = np.column_stack((pc_downsample_adv[:,:3], annotations))
pc_annotated = np.column_stack((pc[:,:3], annotations_full))
labeled_area = pc_downsample_annotated[pc_downsample_annotated[:,3] != 255]
min_bounds = np.min(labeled_area, axis=0)
max_bounds = np.max(labeled_area, axis=0)
min_bounds = [-2, -2, -4.5, -1]
max_bounds = [2, 2, -1, 256]
def extract_tile(point_cloud, min_bounds, max_bounds):
    return point_cloud[
        (point_cloud[:, 0] > min_bounds[0])
        & (point_cloud[:, 1] > min_bounds[1])
        & (point_cloud[:, 2] > min_bounds[2])
        & (point_cloud[:, 0] < max_bounds[0])
        & (point_cloud[:, 1] < max_bounds[1])
        & (point_cloud[:, 2] < max_bounds[2])
    ]
tile_downsample_annotated = extract_tile(pc_downsample_annotated, min_bounds, max_bounds)
tile_annotated = extract_tile(pc_annotated, min_bounds, max_bounds)
rot1 = Rotation.from_euler('zyx', [[30,90,60]], degrees=True)
R1 = rot1.as_matrix()
down_rot = tile_downsample_annotated.copy()
down_rot = np.matmul(R1, down_rot[:,:3].transpose() ).transpose()
down_rot_color = np.matmul(R1, np.tile(tile_downsample_annotated.copy()[:,3],(3,1))).transpose().squeeze()
full_rot = tile_annotated.copy()
full_rot = np.matmul(R1, full_rot[:,:3].transpose() ).transpose()
full_rot_color = np.matmul(R1, np.tile(tile_annotated.copy()[:,3],(3,1))).transpose().squeeze()
fig = plt.figure(figsize=(50, 20))
ax = fig.add_subplot(121, projection="3d")
ax.set_title("Downsampled Annotations", fontdict={'fontsize':20})
ax.scatter(down_rot[:,0], down_rot[:,1], down_rot[:,2], c=down_rot_color[:,0], s=0.05)
ax = fig.add_subplot(122, projection="3d")
ax.set_title("Upsampled Annotations", fontdict={'fontsize':20})
ax.scatter(full_rot[:,0], full_rot[:,1], full_rot[:,2], c=full_rot_color[:,0], s=0.05)
Because our dataset is dense, we can visualize the upsampled labels within a tile to see the downsampled labels upsampled to the full-size point cloud. Although a small number of misclassifications may exist along boundary regions between objects, you also have many more correctly labeled points in the full-size point cloud than the initial point cloud, meaning your overall ML accuracy may improve.
Cleanup
Notebook instance: You have two options if you do not want to keep the created notebook instance running. If you would like to save it for later, you can stop it rather than delete it.
- To stop a notebook instance: click the Notebook instances link in the left pane of the SageMaker console home page. Next, click the Stop link under the ‘Actions’ column to the left of your notebook instance’s name. After the notebook instance is stopped, you can start it again by clicking the Start link. Keep in mind that if you stop rather than delete it, you will be charged for the storage associated with it.
- To delete a notebook instance: first stop it per the instruction above. Next, click the radio button next to your notebook instance, then select Delete from the Actions drop down menu.
Conclusion
Downsampling point clouds can be a viable method when preprocessing data for object detection and object tracking labeling. It can reduce labeling costs while still generating high-quality output labels, especially for 3D object detection and tracking tasks. In this post, we demonstrated how the downsampling method can affect the clarity of the point cloud for workers, and showed a few approaches that have tradeoffs based on the noise level of the dataset.
Finally, we showed that you can perform 3D point cloud semantic segmentation jobs on downsampled datasets and map the labels to the full-size point cloud through in-sample prediction. We accomplished this by training a classifier to do inference on the remaining full dataset size points, using the already labeled points as training data. This approach enables cost-effective labeling of highly dense point cloud scenes while still maintaining good overall label quality.
Test out this notebook with your own dense point cloud scenes in Ground Truth, try out new downsampling techniques, and even try new models beyond K-NN for final in-sample prediction to see if downsampling and upsampling techniques can reduce your labeling costs.
About the Authors
Vidya Sagar Ravipati is a Deep Learning Architect at the Amazon ML Solutions Lab, where he leverages his vast experience in large-scale distributed systems and his passion for machine learning to help AWS customers across different industry verticals accelerate their AI and cloud adoption. Previously, he was a Machine Learning Engineer in Connectivity Services at Amazon who helped to build personalization and predictive maintenance platforms.
Isaac Privitera is a Machine Learning Specialist Solutions Architect and helps customers design and build enterprise-grade computer vision solutions on AWS. Isaac has a background in using machine learning and accelerated computing for computer vision and signals analysis. Isaac also enjoys cooking, hiking, and keeping up with the latest advancements in machine learning in his spare time.
Jeremy Feltracco is a Software Development Engineer with the Amazon ML Solutions Lab at Amazon Web Services. He uses his background in computer vision, robotics, and machine learning to help AWS customers accelerate their AI adoption.
ICLR: What representation learning means in the data center
Amazon Scholar Aravind Srinivasan on the importance of machine learning for real-time and offline resource management.
Two Amazon Scholars elected to prestigious science organizations
Yale economics professor Dirk Bergemann elected to American Academy of Arts & Sciences; University of Pennsylvania computer science professor Michael Kearns elected to National Academy of Sciences.
Amazon open-sources library for prediction over large output spaces
Framework improves efficiency, accuracy of applications that search for a handful of solutions in a huge space of candidates.
Intelligent governance of document processing pipelines for regulated industries
Processing large documents like PDFs and static images is a cornerstone of today’s highly regulated industries. From healthcare information like doctor-patient visits and bills of health, to financial documents like loan applications, tax filings, research reports, and regulatory filings, these documents are integral to how these industries conduct business. The mechanisms by which these documents are processed and analyzed, however, are often manual, time-consuming, error-prone, expensive, and not easily scalable.
Fortunately, recent innovations in this space are helping companies improve these methods. Machine learning (ML) techniques such as optical character recognition (OCR) and natural language processing (NLP) enable firms to digitize and extract text from millions of documents and understand the content, including contextual nuances of the language within them. Furthermore, you can then transform the extracted text by merging it with supplemental data to produce additional business insights.
This step-by-step method is called a document processing pipeline. The pipeline includes various components to extract, transform, enrich, and conform the data. New data domains are often introduced and used for numerous downstream business purposes. For example, in financial services, you could be identifying connected financial events, calculating environmental risk scores, and developing risk models. Because these documents help inform or even dictate such important data-driven decisions, it’s imperative for regulated industry companies to establish and maintain a robust data governance framework as part of these document processing pipelines. Without governance, pipelines become a dumping ground where documents are inconsistently stored, duplicated, and processed, and the business is unable to explain to potential auditors where the data that fed their decisions came from, or what that data was used for.
A data governance framework is made up of people, processes, and technology. It enables business users to work collaboratively with technologists to drive clean, certified, and trusted data. It consists of several components including data quality, data catalog, data ownership, data lineage, operation, and compliance. In this post, we discuss data catalog, data ownership, and data lineage, and how they tie together with building document processing pipelines for regulated industries.
For more information about design patterns on data quality, see How to Architect Data Quality on the AWS Cloud.
Data lineage
Data lineage is the part of data governance that provides GPS-like tracking for your data. At any point in time, it can explain where the data originated, what happened to it, what its latest status is, and where it's headed from this point on.
It provides visibility while simplifying the ability to trace financial numbers back to their origin, and provides transparency on potential errors and their root cause analyses.
Furthermore, you can use data lineage captured over time as analytical inputs to drive accuracy scores.
It’s imperative for a document processing pipeline to have a well-defined data lineage framework. The framework should include an end-to-end lifecycle, responsibility model, and the technology to enable data transformation transparency. Without lineage, the data can’t be trusted.
To illustrate this end-to-end data lineage concept, we walk you through creating an NLP-powered document search engine with built-in lineage at each step. Every object and piece of data processed by this ML pipeline can be traced back to the original document.
Each processing component can be replaced by your choice of tooling or bespoke ML model. Furthermore, you can customize the solution to include other use cases, such as central document data lakes or supplemental tabular data feed to an online transaction processing (OLTP) application.
The solution follows an event-driven architecture in which the completion of each stage within the pipeline triggers the next step, while providing self-service lineage for traceability and monitoring. In addition, hooks have been included to provide capabilities to extend the pipeline to additional workloads.
This design uses the following AWS services (you can also follow along in the GitHub repo):
- Amazon Comprehend – An NLP service that uses ML to find insights and relationships in text, and can do so in multiple languages.
- Amazon DynamoDB – A key-value and document database that delivers single-digit millisecond performance at any scale.
- Amazon DynamoDB Streams – A change data capture (CDC) service. It captures an ordered flow of information about changes to items in a DynamoDB table. When you enable a stream on a table, DynamoDB captures information about every modification to data items in the table.
- Amazon Elasticsearch Service (Amazon ES) – A fully managed service that makes it possible for you to deploy, secure, and run Elasticsearch cost-effectively and at scale. You can build, monitor, and troubleshoot your applications using the tools you love, at the scale you need.
- AWS Lambda – A serverless compute service that runs code in response to triggers such as changes in data, shifts in system state, or user actions. Because Amazon S3 can directly trigger a Lambda function, you can build a variety of real-time serverless data-processing systems.
- Amazon Simple Notification Service (Amazon SNS) – An AWS managed service for application-to-application communications, with a pub/sub model enabling high-throughput, low-latency message relaying.
- Amazon Simple Queue Service (Amazon SQS) – A fully managed message queuing service that enables you to decouple and scale microservices, distributed systems, and serverless applications.
- Amazon Simple Storage Service (Amazon S3) – An object storage service that stores your documents and allows for central management with fine-tuned access controls.
- Amazon Textract – A fully managed ML service that automatically extracts printed text, handwriting, and other data from scanned documents, going beyond simple OCR to identify, understand, and extract data from forms and tables.
Architecture
The overall design is grouped into five segments:
- Metadata services module
- Ingestion module
- OCR module
- NLP module
- Analytics module
All components interact via asynchronous events to allow for scalability. The following diagram illustrates the conceptual design.
The following diagram illustrates the physical design.
Metadata services
This is an encapsulated module to register, track, and trace incoming documents. It’s designed to be used across many different document processing pipelines. In your organization, one team might decide to use the OCR and NLP modules designed in this post. Another team might decide to use a different pipeline. However, governance practices of each pipeline should be consistent, and documents should be registered one time with full transparency on movement and downstream usage. Each document can be processed several times. You can extend the catalog and lineage services designed in this post to keep track of many pipelines, from multiple sources of data.
At the core, the metadata services module contains four reference tables, an SNS topic, three SQS queues, and three self-contained Lambda functions. Tables are created in DynamoDB, and schemas can be easily extended to include additional data attributes deemed important for your pipeline.
In addition, you can extend this design to include additional data governance components such as data quality.
The tables are defined as follows.
| Table Name | Purpose | DynamoDB Stream Enabled? | Data Governance Component | Sample Use |
| --- | --- | --- | --- | --- |
| Document Registry | Keeps track of all incoming documents. Each document is assigned a unique document ID and registered one time in this table. | Yes | Catalog | Provides the ability to quickly look up and understand the document source and context metadata. |
| Document Ownership | Covers the responsibility model of the data governance in which each document acquired by the pipeline has a defined owner. | No | Ownership | Provides notification services and can be extended to manage data quality controls. |
| Document Lineage | Keeps track of all data movements. It provides detailed lineage information that includes the source S3 bucket name, destination S3 bucket name, source file name, target file name, ARN of the AWS service that processed the document, and timestamp. | No | Lineage | A simple PartiQL query against this table based on the document ID returns every step the original document has taken. Query output can include the document ID, original document name, timestamp, source S3 bucket, source file name, destination S3 bucket, and destination file name. |
| Pipeline Operations | Keeps a record of all pipeline actions taken on a document ID, including the current pipeline stage and its status, and keeps a timeline of the stages in chronological order. | Yes | Operation | An operational query on a document ID to determine where in the pipeline the current document processing is. |
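As a concrete illustration of the lineage query described above, the following minimal sketch runs a PartiQL statement against a Document Lineage table with boto3. The table name (DocumentLineage) and attribute names are assumptions for illustration; the deployed pipeline in the GitHub repo may use different identifiers.

```python
import boto3

# Minimal sketch of the lineage trace query described above.
# Table and attribute names (DocumentLineage, DocumentId, Timestamp) are
# illustrative; use the names defined in your deployment of the pipeline.
dynamodb = boto3.client("dynamodb")

def trace_document(document_id: str):
    """Return every recorded movement for a document, oldest first."""
    response = dynamodb.execute_statement(
        Statement='SELECT * FROM "DocumentLineage" WHERE DocumentId = ?',
        Parameters=[{"S": document_id}],
    )
    # Each item carries the source/destination bucket and file names plus
    # the ARN of the AWS service that processed the document.
    items = response["Items"]
    return sorted(items, key=lambda item: item.get("Timestamp", {}).get("S", ""))

if __name__ == "__main__":
    for step in trace_document("doc-0001"):
        print(step)
```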
DynamoDB Streams allows downstream application code to react to updates to objects in DynamoDB. It provides a mechanism to keep an event-based microservices architecture in place by triggering subsequent steps of a workflow whenever new documents are written to our Document Registry table, and subsequently when new document references are created in the Pipeline Operations table.
In addition, DynamoDB Streams provides developer teams with an efficient way of connecting their application logic to various updates in the tables (for example, to keep track of a particular document ID based on owner tags, or to alert when certain unexpected problems arise while processing some documents).
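For illustration, here is a minimal sketch of a Lambda handler subscribed to the Document Registry table’s stream. The attribute names (DocumentId, ContentType) are hypothetical and stand in for whatever schema your registry table uses.

```python
import json

# Sketch of a Lambda handler subscribed to the Document Registry table's
# DynamoDB stream. Requires a stream view type that includes new images.
def handler(event, context):
    for record in event.get("Records", []):
        # Only react to newly registered documents.
        if record["eventName"] != "INSERT":
            continue
        new_image = record["dynamodb"]["NewImage"]
        document_id = new_image["DocumentId"]["S"]
        content_type = new_image.get("ContentType", {}).get("S", "unknown")
        # Downstream logic (for example, the Document Classification function)
        # would decide here whether the OCR segment should run for this document.
        print(json.dumps({"document_id": document_id, "content_type": content_type}))
    return {"processed": len(event.get("Records", []))}
```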
The Lambda functions provide microservices API call capabilities for the document pipeline to self-register its movements and actions undertaken by the pipeline code:
- Document Arrival Register API – Registers the incoming document’s metadata and location within Document Registry table
- Document Lineage API – Registers the lineage information within Document Lineage table
- Pipeline Operations API – Provides up-to-date information on the state of the pipeline
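As a sketch of what the Document Lineage API could record for each movement, the snippet below writes one lineage item to DynamoDB with boto3. The table and attribute names are illustrative assumptions, not the exact names used in the repo.

```python
import boto3
from datetime import datetime, timezone

# Sketch of the lineage-registration call pipeline functions could make.
# Table and attribute names are illustrative placeholders.
table = boto3.resource("dynamodb").Table("DocumentLineage")

def register_lineage(document_id, source_bucket, source_key,
                     dest_bucket, dest_key, processor_arn):
    """Record one movement of a document through the pipeline."""
    table.put_item(
        Item={
            "DocumentId": document_id,
            "Timestamp": datetime.now(timezone.utc).isoformat(),
            "SourceBucket": source_bucket,
            "SourceFileName": source_key,
            "DestinationBucket": dest_bucket,
            "DestinationFileName": dest_key,
            "ProcessorArn": processor_arn,
        }
    )
```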
The SNS topic is used as a sink for incoming messages from all pipeline movements and document registrations. It disseminates the messages to each downstream subscribed SQS queue according to what type of message was received. In this model, the number of consumers of the messages coming through the SNS topic could be greatly expanded as needed, and all messages are guaranteed to stay in order, because both the SNS topics and SQS queues are created in a First-In-First-Out (FIFO) configuration to prevent duplicates and maintain single-threaded processing in the pipeline.
Using Amazon SNS in the design provides scalability by creating a pub/sub architecture. A pub/sub architecture design is a pattern that provides a framework to decouple the services that produce an event from services that process the event. Many subscribers can subscribe to the same event and trigger different pipelines. For example, this design can easily be extended to process incoming XML file formats by subscribing an additional XML process pipeline for the same event.
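The following sketch shows one way to wire up that FIFO fan-out with boto3: a FIFO topic, a FIFO queue subscription, and an ordered publish keyed by document ID. The topic and queue names are placeholders, and the SQS access policy that authorizes the topic to deliver messages is omitted for brevity.

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# FIFO topic and queue names must end in ".fifo"; names here are illustrative.
topic = sns.create_topic(
    Name="document-pipeline-events.fifo",
    Attributes={"FifoTopic": "true", "ContentBasedDeduplication": "true"},
)
queue = sqs.create_queue(
    QueueName="lineage-registration.fifo",
    Attributes={"FifoQueue": "true", "ContentBasedDeduplication": "true"},
)
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# The queue also needs an access policy that allows this topic to send
# messages to it; that policy is omitted here for brevity.
sns.subscribe(TopicArn=topic["TopicArn"], Protocol="sqs", Endpoint=queue_arn)

# Publishing to a FIFO topic requires a MessageGroupId so ordering is
# preserved per document.
sns.publish(
    TopicArn=topic["TopicArn"],
    Message='{"DocumentId": "doc-0001", "Event": "REGISTERED"}',
    MessageGroupId="doc-0001",
)
```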
The following table provides schema information. The document ID is identical and unique for each document and is part of the composite primary key used to identify movement of each document within the pipeline.
The following diagram shows the architecture of our metadata services.
Ingestion module
The ingestion workload is triggered when a new document is uploaded to the NLP/Raw S3 bucket (or the bucket where raw documents are placed from users or front-end applications).
The ingestion module follows a four-step process (as shown in the following diagram):
- A document is uploaded to the NLP/Raw S3 bucket.
- The Document Registrar Lambda function is invoked, which calls the metadata services API to register the document and receive a unique ID. This ID is added to the document as a tag, and the metadata is registered within the DynamoDB table Document Registry.
- After the document metadata is registered with Metadata Services, the DynamoDB Document Registration stream is invoked to start the Document Classification Lambda function. This function examines the metadata registered on the document and determines if the downstream OCR segment should be invoked on this document. The result of this examination is written back to the metadata services.
- The metadata registration of the previous step invokes the DynamoDB Pipeline Operations stream, which invokes the Document Extension Detector Lambda function. This function examines the incoming file formats and separates the image files from PDF documents.
All steps are registered in metadata services. The red dotted lines in the following diagram represent the metadata asynchronous API calls.
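The following minimal sketch approximates the Document Registrar step: it assigns a UUID, records the document in a registry table, and tags the uploaded S3 object with the ID. The bucket, table, and attribute names are illustrative assumptions.

```python
import uuid
import boto3

s3 = boto3.client("s3")
registry = boto3.resource("dynamodb").Table("DocumentRegistry")

# Sketch of the Document Registrar step: an S3 upload event arrives, the
# document receives a unique ID, the ID is written to the registry table,
# and the S3 object is tagged with it. Names are illustrative.
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        document_id = str(uuid.uuid4())

        registry.put_item(
            Item={"DocumentId": document_id, "Bucket": bucket, "Key": key}
        )
        s3.put_object_tagging(
            Bucket=bucket,
            Key=key,
            Tagging={"TagSet": [{"Key": "document-id", "Value": document_id}]},
        )
```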
OCR module
This module detects the incoming file format and uses Amazon Textract in this implementation to convert the incoming documents into text. Amazon Textract can process image files synchronously, and PDF and other documents asynchronously, to allow time for the service to complete its analysis.
The OCR module consists of the following process, as illustrated in the architecture diagram:
- Image files are uploaded to the NLP/image S3 bucket and the Sync Processor Lambda function is invoked. The function synchronously points Amazon Textract to the S3 location of the image file, and waits for a response.
- Amazon Textract transforms the format to text and deposits the text output in the NLP/Textract S3 bucket. This step concludes OCR processing of the image file types.
- PDF files are placed within the NLP/PDF S3 bucket. This bucket invokes the Async Processor Lambda function, which submits the document to Amazon Textract for asynchronous analysis and registers that state with the metadata services.
- When the Amazon Textract document analysis is complete, an SNS message is sent to a specified SNS topic, notifying downstream consumers of the job completion. In this implementation, an SQS queue captures that message.
- The SQS queue message is the event that triggers the Result Processor Lambda function.
- The function retrieves the document analysis results from Amazon Textract and formats them according to the type of text analyzed (forms, tables, and raw text).
- The results are pushed to the NLP/Textract S3 bucket, page by page for every type of text, and as a complete JSON response.
All the progress is registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
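The sketch below shows the two Amazon Textract entry points this module relies on: a synchronous call for images and an asynchronous job for PDFs that reports completion to an SNS topic. The ARNs and function names are placeholders.

```python
import boto3

textract = boto3.client("textract")

# Synchronous path (image files): the Sync Processor waits for the response.
def process_image(bucket, key):
    return textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )

# Asynchronous path (PDFs): the Async Processor starts a job, and Textract
# notifies an SNS topic on completion; the ARNs below are placeholders.
def process_pdf(bucket, key):
    response = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],
        NotificationChannel={
            "SNSTopicArn": "arn:aws:sns:us-east-1:111122223333:textract-complete",
            "RoleArn": "arn:aws:iam::111122223333:role/TextractPublishRole",
        },
    )
    return response["JobId"]
```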
NLP module
This module detects key phrases and entities within the document by using the text output from the OCR module. A key phrase is a string containing a noun phrase that describes a particular thing. It generally consists of a noun and the modifiers that distinguish it. For example, “day” is a noun; “a beautiful day” is a noun phrase that includes an article (“a”) and an adjective (“beautiful”).
After key phrases are extracted, indexing them in an analytical tool lets you find the source document quickly and accurately. For example, if you want to analyze corporate social responsibility (CSR) reports, you can find attributes such as “reducing carbon footprints,” “improving labor policies,” “participating in fair-trade,” and “charitable giving” by indexing the results of this module.
We use Amazon Comprehend to perform this function in this pipeline. However, as we explained earlier, you can easily swap the tooling used for this design with your preferred tool. For example, you can replace Amazon Comprehend with an Amazon SageMaker custom model as an alternative to extract key phrases and entities in a more domain-focused way. SageMaker is an ML service that you can use to build, train, and deploy ML models for virtually any use case.
Amazon Comprehend is called on a synchronous basis to extract key phrases in the following steps (as illustrated in the following diagram):
- The incoming text file uploaded to the NLP/Textract S3 bucket invokes the Sync Comprehend Processor Lambda function.
- The function feeds the incoming file to Amazon Comprehend for processing.
- The results from Amazon Comprehend, in JSON format, are deposited in the NLP/JSON S3 bucket.
- The results from Amazon Comprehend are sent to Amazon ES, the service we incorporate as our document search engine.
All steps are registered in metadata services. The red dotted lines in the diagram represent the metadata asynchronous API calls.
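For reference, a synchronous Amazon Comprehend call for one chunk of OCR text might look like the following sketch, which builds the JSON document that is later stored and indexed. The field names are assumptions, and Comprehend’s per-request text size limits mean long documents need to be chunked first.

```python
import boto3

comprehend = boto3.client("comprehend")

def analyze_text(text, document_id):
    """Extract key phrases and entities for one chunk of OCR output.

    Comprehend's synchronous APIs have a per-request text size limit, so
    long documents should be chunked before calling this function.
    """
    key_phrases = comprehend.detect_key_phrases(Text=text, LanguageCode="en")
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")
    return {
        "DocumentId": document_id,
        "KeyPhrases": [kp["Text"] for kp in key_phrases["KeyPhrases"]],
        "Entities": [
            {"Text": e["Text"], "Type": e["Type"]} for e in entities["Entities"]
        ],
    }

# The resulting JSON document is what gets written to the NLP/JSON bucket and
# indexed into the Amazon ES domain (the index request itself must be signed
# with SigV4 or authorized per your domain's access policy).
```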
Analytics module
This module is responsible for the consumption and analytics segment of the pipeline. The steps are illustrated in the following diagram:
- The output from Amazon Comprehend, in JSON format, can be fed to Amazon Neptune, which allows end users to discover relationships across documents. This is an example of a downstream analytics application that is not implemented in this post.
- The end users have access to the original document in four formats (CSV, JSON, original, text), and can search key phrases using Amazon ES. They can identify relationships using Neptune. A JSON version of the document is available in the NLP/JSON S3 bucket. The original document is available in the NLP/Raw S3 bucket.
- Full lineage can be obtained from the Document Lineage table in DynamoDB.
The analytics module has many potential implementations. For example, you can use a relational datastore like Amazon Relational Database Service (Amazon RDS) or Amazon Aurora to analyze extracted tabular data using SQL.
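As an illustration of the key-phrase search mentioned above, the sketch below queries a hypothetical documents index on the Amazon ES domain through its _search API. The endpoint, index name, and field names are assumptions, and requests to a real domain must be signed or otherwise authorized per its access policy.

```python
import requests

# Illustrative search against the Amazon ES domain used by the pipeline.
# The endpoint, index name, and field names depend on your deployment;
# production requests must be signed (SigV4) or otherwise authorized.
ES_ENDPOINT = "https://my-nlp-domain.us-east-1.es.amazonaws.com"

def search_key_phrase(phrase):
    query = {"query": {"match": {"KeyPhrases": phrase}}}
    response = requests.post(f"{ES_ENDPOINT}/documents/_search", json=query)
    response.raise_for_status()
    return [hit["_source"] for hit in response.json()["hits"]["hits"]]

if __name__ == "__main__":
    print(search_key_phrase("reducing carbon footprints"))
```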
Conclusion
In this post, we architected an end-to-end document processing pipeline using AWS managed ML services. In addition, we introduced metadata services to help organizations create a centralized document repository where documents are stored one time but processed multiple times. A data governance framework as illustrated in this design provides the necessary guardrails to ensure documents are governed in a standard fashion across the organization, while giving lines of business the autonomy to choose their own NLP and OCR models and tooling.
The architecture discussed in this post has been coded and is available for deployment in the GitHub repo. You can download the code and create your pipeline within a few days.
About the Authors
David Kheyman is a Solutions Architect at Amazon Web Services based out of New York City, where he designs and implements repeatable AWS architecture patterns and solutions for large organizations.
Mojgan Ahmadi is a Principal Solutions Architect with Amazon Web Services based in New York, where she guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. She brings over 20 years of technology experience on Software Development and Architecture, Data Governance and Engineering, and IT Management.
Anirudh Menon is a Solutions Architect with Amazon Web Services based in New York, where he helps financial services customers drive innovation with AWS solutions and industry-specific patterns.