Build end-to-end document processing pipelines with Amazon Textract IDP CDK Constructs
Intelligent document processing (IDP) with AWS helps automate information extraction from documents of different types and formats, quickly and with high accuracy, without the need for machine learning (ML) skills. Faster information extraction with high accuracy can help you make quality business decisions on time, while reducing overall costs. For more information, refer to Intelligent document processing with AWS AI services: Part 1.
However, complexity arises when implementing real-world scenarios. Documents are often sent out of order, or they may be sent as a combined package with multiple form types. Orchestration pipelines need to be created to introduce business logic, and also account for different processing techniques depending on the type of form inputted. These challenges are only magnified as teams deal with large document volumes.
In this post, we demonstrate how to solve these challenges using Amazon Textract IDP CDK Constructs, a set of pre-built IDP constructs, to accelerate the development of real-world document processing pipelines. For our use case, we process an Acord insurance document to enable straight-through processing, but you can extend this solution to any use case, which we discuss later in the post.
Acord document processing at scale
Straight-through processing (STP) is a term used in the financial industry to describe the automation of a transaction from start to finish without the need for manual intervention. The insurance industry uses STP to streamline the underwriting and claims process. This involves the automatic extraction of data from insurance documents such as applications, policy documents, and claims forms. Implementing STP can be challenging due to the large amount of data and the variety of document formats involved. Insurance documents are inherently varied. Traditionally, this process involves manually reviewing each document and entering the data into a system, which is time-consuming and prone to errors. This manual approach is not only inefficient but can also lead to errors that can have a significant impact on the underwriting and claims process. This is where IDP on AWS comes in.
To achieve a more efficient and accurate workflow, insurance companies can integrate IDP on AWS into the underwriting and claims process. With Amazon Textract and Amazon Comprehend, insurers can read handwriting and different form formats, making it easier to extract information from various types of insurance documents. By implementing IDP on AWS into the process, STP becomes easier to achieve, reducing the need for manual intervention and speeding up the overall process.
This pipeline allows insurance carriers to easily and efficiently process their commercial insurance transactions, reducing the need for manual intervention and improving the overall customer experience. We demonstrate how to use Amazon Textract and Amazon Comprehend to automatically extract data from commercial insurance documents, such as Acord 140, Acord 125, Affidavit of Home Ownership, and Acord 126, and analyze the extracted data to facilitate the underwriting process. These services can help insurance carriers improve the accuracy and speed of their STP processes, ultimately providing a better experience for their customers.
Solution overview
The solution is built using the AWS Cloud Development Kit (AWS CDK), and consists of Amazon Comprehend for document classification, Amazon Textract for document extraction, Amazon DynamoDB for storage, AWS Lambda for application logic, and AWS Step Functions for workflow pipeline orchestration.
The pipeline consists of the following phases:
- Split the document package and classify each form type using Amazon Comprehend.
- Run the processing pipelines for each form type or page of a form with the appropriate Amazon Textract API (Signature Detection, Table Extraction, Forms Extraction, or Queries).
- Postprocess the Amazon Textract output into a machine-readable format.
The following screenshot of the Step Functions workflow illustrates the pipeline.
Prerequisites
To get started with the solution, ensure you have the following:
- AWS CDK version 2 installed
- Docker installed and running on your machine
- Appropriate access to Step Functions, DynamoDB, Lambda, Amazon Simple Queue Service (Amazon SQS), Amazon Textract, and Amazon Comprehend
Clone the GitHub repo
Start by cloning the GitHub repository:
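A minimal sketch of this step; the repository URL below is a placeholder, so substitute the repository linked in this post:

```bash
# Clone the repository (placeholder URL -- use the repository linked in this post)
git clone https://github.com/aws-samples/<repository-name>.git
cd <repository-name>
```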
Create an Amazon Comprehend classification endpoint
We first need to provide an Amazon Comprehend classification endpoint.
For this post, the endpoint detects the following document classes (ensure naming is consistent):

- acord125
- acord126
- acord140
- property_affidavit
You can create one by using the comprehend_acord_dataset.csv sample dataset in the GitHub repository. To train and create a custom classification endpoint using the sample dataset provided, follow the instructions in Train custom classifiers. If you would like to use your own PDF files, refer to the first workflow in the post Intelligently split multi-form document packages with Amazon Textract and Amazon Comprehend.
After training your classifier and creating an endpoint, you should have an Amazon Comprehend custom classification endpoint ARN that looks like the following code:
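The ARN follows the standard Amazon Comprehend format; the Region, account ID, and endpoint name below are placeholders:

```
arn:aws:comprehend:<region>:<account-id>:document-classifier-endpoint/<endpoint-name>
```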
Navigate to docsplitter/document_split_workflow.py and modify lines 27–28, which contain comprehend_classifier_endpoint. Enter your endpoint ARN in line 28.
Install dependencies
Now you install the project dependencies:
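Assuming the project declares its Python dependencies in a requirements.txt file, as AWS CDK Python projects typically do, the installation looks like this:

```bash
# Install the Python dependencies for the AWS CDK project
python -m pip install -r requirements.txt
```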
Initialize the account and Region for the AWS CDK. This will create the Amazon Simple Storage Service (Amazon S3) buckets and roles for the AWS CDK tool to store artifacts and be able to deploy infrastructure. See the following code:
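A typical bootstrap command looks like the following; replace the account ID and Region with your own:

```bash
# Bootstrap the AWS CDK in your account and Region
cdk bootstrap aws://<account-id>/<region>
```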
Deploy the AWS CDK stack
When the Amazon Comprehend classifier and document configuration table are ready, deploy the stack using the following code:
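For example, the following deploys the stacks defined in the CDK app without interactive approval prompts (add a stack name if you only want to deploy a specific one):

```bash
# Deploy all stacks defined in the CDK app
cdk deploy --all --require-approval never
```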
Upload the document
Verify that the stack is fully deployed.
Then, in the terminal window, run the aws s3 cp command to upload the document to the DocumentUploadLocation for the DocumentSplitterWorkflow:
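A sketch of that command; the bucket and prefix come from the DocumentUploadLocation output of the deployed stack, and the sample file name is a placeholder:

```bash
# Upload the sample document package to the workflow's upload location
aws s3 cp <sample-document>.pdf s3://<document-upload-bucket>/<upload-prefix>/
```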
We have created a sample 12-page document package that contains the Acord 125, Acord 126, Acord 140, and Property Affidavit forms. The following images show a 1-page excerpt from each document.
All data in the forms is synthetic, and the Acord standard forms are the property of the Acord Corporation, and are used here for demonstration only.
Run the Step Functions workflow
Now open the Step Functions workflow. You can get the Step Functions workflow link from the document_splitter_outputs.json file, the Step Functions console, or by using the following command:
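One option is to query the deployed stack's outputs with the AWS CLI; the stack name below is an assumption, so adjust it to match your deployment:

```bash
# List the stack outputs to find the Step Functions workflow ARN and console link
aws cloudformation describe-stacks \
  --stack-name DocumentSplitterWorkflow \
  --query "Stacks[0].Outputs"
```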
Depending on the size of the document package, the workflow time will vary. The sample document should take 1–2 minutes to process. The following diagram illustrates the Step Functions workflow.
When your job is complete, navigate to the input and output locations. From here you will see the machine-readable CSV files for each of the respective forms.
To download these files, open getfiles.py. Set files to be the list output by the state machine run. You can run this function by running python3 getfiles.py. This will generate the csvfiles_<TIMESTAMP> folder, as shown in the following screenshot.
Congratulations, you have now implemented an end-to-end processing workflow for a commercial insurance application.
Extend the solution for any type of form
In this post, we demonstrated how we could use the Amazon Textract IDP CDK Constructs for a commercial insurance use case. However, you can extend these constructs for any form type. To do this, we first retrain our Amazon Comprehend classifier to account for the new form type, and adjust the code as we did earlier.
For each form type you trained, you must specify its queries and textract_features in the generate_csv.py file. This customizes each form type’s processing pipeline by using the appropriate Amazon Textract API.
queries is a list of Amazon Textract queries to run against a page, for example, “What is the primary email address?” on page 2 of the sample document. For more information, see Queries.
textract_features is a list of the Amazon Textract features you want to extract from the document: TABLES, FORMS, QUERIES, or SIGNATURES. For more information, see FeatureTypes.
Navigate to generate_csv.py. Each document type needs its classification, queries, and textract_features configured by creating CSVRow instances.
For our example, we have four document types: acord125, acord126, acord140, and property_affidavit. We want to use the FORMS and TABLES features on the Acord documents, and the QUERIES and SIGNATURES features for the property affidavit, as in the following sketch.
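The following is a minimal sketch of that configuration; the exact CSVRow constructor arguments are assumptions, so check generate_csv.py in the repository for the real signature:

```python
# Inside generate_csv.py (illustrative only -- the actual CSVRow signature lives in the repository)
rows = [
    # Acord forms: extract key-value pairs and tables
    CSVRow(classification="acord125", queries=[], textract_features=["FORMS", "TABLES"]),
    CSVRow(classification="acord126", queries=[], textract_features=["FORMS", "TABLES"]),
    CSVRow(classification="acord140", queries=[], textract_features=["FORMS", "TABLES"]),
    # Property affidavit: answer targeted questions and detect signatures
    CSVRow(
        classification="property_affidavit",
        queries=["What is the primary email address?"],
        textract_features=["QUERIES", "SIGNATURES"],
    ),
]
```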
Refer to the GitHub repository for how this was done for the sample commercial insurance documents.
Clean up
To remove the solution, run the cdk destroy
command. You will then be prompted to confirm the deletion of the workflow. Deleting the workflow will delete all the generated resources.
Conclusion
In this post, we demonstrated how you can get started with Amazon Textract IDP CDK Constructs by implementing a straight-through processing scenario for a set of commercial Acord forms. We also demonstrated how you can extend the solution to any form type with simple configuration changes. We encourage you to try the solution with your respective documents. Please raise a pull request to the GitHub repo for any feature requests you may have. To learn more about IDP on AWS, refer to our documentation.
About the Authors
Raj Pathak is a Senior Solutions Architect and Technologist specializing in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM) and Machine Learning infrastructure and operations projects (MLOps).
Aditi Rajnish is a Second-year software engineering student at University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.
Enzo Staton is a Solutions Architect with a passion for working with companies to increase their cloud knowledge. He works closely as a trusted advisor and industry specialist with customers around the country.
Snapper provides machine learning-assisted labeling for pixel-perfect image object detection
Bounding box annotation is a time-consuming and tedious task that requires annotators to create annotations that tightly fit an object’s boundaries; for example, all edges of an annotated object must be enclosed in the annotation. In practice, creating annotations that are precise and well-aligned to object edges is a laborious process.
In this post, we introduce a new interactive tool called Snapper, powered by a machine learning (ML) model that reduces the effort required of annotators. The Snapper tool automatically adjusts noisy annotations, reducing the time required to annotate data at a high-quality level.
Overview of Snapper
Snapper is an interactive and intelligent system that automatically “snaps” object annotations to image-based objects in real time. With Snapper, annotators place bounding box annotations by drawing boxes, and then see immediate and automatic adjustments to their bounding box to better fit the bounded object.
The Snapper system is composed of two subsystems. The first subsystem is a front-end ReactJS component that intercepts annotation-related mouse events and handles the rendering of the model’s predictions. We integrate this front end with our Amazon SageMaker Ground Truth annotation UI. The second subsystem consists of the model backend, which receives requests from the front-end client, routes the requests to an ML model to generate adjusted bounding box coordinates, and sends the data back to the client.
ML model optimized for annotators
A tremendous number of high-performing object detection models have been proposed by the computer vision community in recent years. However, these state-of-the-art models are typically optimized for unguided object detection. To facilitate Snapper’s “snapping” functionality for adjusting users’ annotations, the input to our model is an initial bounding box, provided by the annotator, which can serve as a marker for the presence of an object. Furthermore, because the system has no intended object class it aims to support, Snapper’s adjustment model should be object-agnostic such that the system performs well on a range of object classes.
In general, these requirements diverge substantially from the use cases of typical ML object detection models. We note that the traditional object detection problem is formulated as “detect the object center, then regress the dimensions.” This is counterintuitive, because accurate predictions of bounding box edges rely crucially on first finding an accurate box center, and then trying to establish scalar distances to edges. Moreover, it doesn’t provide good confidence estimates that focus on the uncertainties of the edge locations, because only the classifier score is available for use.
To give our Snapper model the ability to adjust users’ annotations, we design and implement an ML model custom designed for bounding box adjustment. As input, the model takes an image and a corresponding bounding box annotation. The model extracts features from the image using a convolutional neural network. Following feature extraction, directional spatial pooling is applied to each dimension to aggregate the information needed to identify an appropriate edge location.
We formulate location prediction for bounding boxes as a classification problem over different locations. While seeing the whole object, we ask the machine to reason about the presence or absence of an edge directly at each pixel’s location as a classification task. This improves accuracy, as the reasoning for each edge uses image features from the immediate local neighborhood. Moreover, the scheme decouples the reasoning between different edges, which prevents unambiguous edge locations from being affected by the uncertain ones. Additionally, it provides us with edge-wise intuitive confidence estimates, as our model considers each edge of the object independently (like human annotators would) and provides an interpretable distribution (or uncertainty estimate) for each edge’s location. This allows us to highlight less confident edges for more efficient and precise human review.
Benchmarking and evaluating the Snapper tool
In practice, we find that the Snapper tool streamlines the bounding box annotation task and is very intuitive for users to pick up. We also conducted a quantitative analysis of Snapper to characterize the tool objectively. We evaluated Snapper’s adjustment model using an evaluation standard common to object detection models that employs two measures of validity: Intersection over Union (IoU), and edge and corner deviance. IoU calculates the alignment between two annotations by dividing the annotations’ area of overlap by the annotations’ area of union, yielding a metric that ranges from 0–1. Edge deviance and corner deviance are calculated as the fraction of edges and corners that deviate from the ground truth by more than a given number of pixels.
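For reference, the following self-contained sketch shows one way to compute these measures for a pair of boxes; the (x_min, y_min, x_max, y_max) box format and the pixel-threshold definition of the deviance metrics are assumptions made for illustration:

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes, in [0, 1]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def deviances(pred, truth, threshold_px=3):
    """Fraction of the 4 edges and 4 corners of `pred` that deviate from `truth`
    by more than `threshold_px` pixels (deviance definition assumed here)."""
    edge_dev = [abs(p - t) for p, t in zip(pred, truth)]  # left, top, right, bottom
    corners_pred = [(pred[0], pred[1]), (pred[2], pred[1]), (pred[0], pred[3]), (pred[2], pred[3])]
    corners_true = [(truth[0], truth[1]), (truth[2], truth[1]), (truth[0], truth[3]), (truth[2], truth[3])]
    corner_dev = [max(abs(px - tx), abs(py - ty))
                  for (px, py), (tx, ty) in zip(corners_pred, corners_true)]
    edge_frac = sum(d > threshold_px for d in edge_dev) / 4
    corner_frac = sum(d > threshold_px for d in corner_dev) / 4
    return edge_frac, corner_frac

pred = (102, 148, 181, 212)
truth = (100, 150, 180, 210)
print(f"IoU: {iou(pred, truth):.3f}, deviances: {deviances(pred, truth)}")
```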
To evaluate Snapper, we dynamically generated noisy annotation data by randomly adjusting the COCO ground truth bounding box coordinates with jitter. Our procedure for adding jitter first shifts the center of the bounding box by up to 10% of the corresponding bounding box dimension on each axis and then rescales the dimensions of the bounding box by a randomly sampled ratio between 0.9–1.1. Here, we apply these metrics to the validation set from the official MS-COCO dataset used for training. We specifically calculate the fraction of bounding boxes with IoU exceeding 90% alongside the fraction of edge deviations and corner deviations that deviate less than one or three pixels from the corresponding ground truth. The following table summarizes our findings.
As shown in the preceding table, Snapper’s adjustment model significantly improved the two sources of noisy data across each of the three metrics. With an emphasis on high precision annotations, we observe that applying Snapper to the jittered MS COCO dataset increases the fraction of bounding boxes with IoU exceeding 90% by upwards of 40%.
Conclusion
In this post, we introduced a new ML-powered annotation tool called Snapper. Snapper consists of a SageMaker model backend as well as a front-end component that we integrate into the Ground Truth labeling UI. We evaluated Snapper on simulated noisy bounding box annotations and found that it can successfully refine imperfect bounding boxes. The use of Snapper in labeling tasks can significantly reduce cost and increase accuracy.
To learn more, visit Amazon SageMaker Data Labeling and schedule a consultation today.
About the authors
Jonathan Buck is a Software Engineer at Amazon Web Services working at the intersection of machine learning and distributed systems. His work involves productionizing machine learning models and developing novel software applications powered by machine learning to put the latest capabilities in the hands of customers.
Alex Williams is an applied scientist in the human-in-the-loop science team at AWS AI where he conducts interactive systems research at the intersection of human-computer interaction (HCI) and machine learning. Before joining Amazon, he was a professor in the Department of Electrical Engineering and Computer Science at the University of Tennessee where he co-directed the People, Agents, Interactions, and Systems (PAIRS) research laboratory. He has also held research positions at Microsoft Research, Mozilla Research, and the University of Oxford.
Min Bai is an applied scientist at AWS, with a current specialization in 2D / 3D computer vision, with a focus on the fields of autonomous driving and user-friendly AI tools. When not at work, he enjoys exploring nature, especially off the beaten track.
Kumar Chellapilla is a General Manager and Director at Amazon Web Services and leads the development of ML/AI Services such as human-in-loop systems, AI DevOps, Geospatial ML, and ADAS/Autonomous Vehicle development. Prior to AWS, Kumar was a Director of Engineering at Uber ATG and Lyft Level 5 and led teams using machine learning to develop self-driving capabilities such as perception and mapping. He also worked on applying machine learning techniques to improve search, recommendations, and advertising products at LinkedIn, Twitter, Bing, and Microsoft Research.
Patrick Haffner is a Principal Applied Scientist with the AWS Sagemaker Ground Truth team. He has been working on human-in-the-loop optimization since 1995, when he applied the LeNet Convolutional Neural Network to check recognition. He is interested in holistic approaches where ML algorithms and labeling UIs are optimized together to minimize the labeling cost.
Erran Li is an applied science manager for human-in-the-loop services at AWS AI, Amazon. His research interests are 3D deep learning, and vision and language representation learning. Previously, he was a senior scientist at Alexa AI, the head of machine learning at Scale AI, and the chief scientist at Pony.ai. Before that, he was with the perception team at Uber ATG and the machine learning platform team at Uber, working on machine learning for autonomous driving, machine learning systems, and strategic AI initiatives. He started his career at Bell Labs and was an adjunct professor at Columbia University. He co-taught tutorials at ICML’17 and ICCV’19, and co-organized several workshops at NeurIPS, ICML, CVPR, and ICCV on machine learning for autonomous driving, 3D vision and robotics, machine learning systems, and adversarial machine learning. He has a PhD in computer science from Cornell University. He is an ACM Fellow and IEEE Fellow.
Recommend top trending items to your users using the new Amazon Personalize recipe
Amazon Personalize is excited to announce the new Trending-Now recipe to help you recommend items gaining popularity at the fastest pace among your users.
Amazon Personalize is a fully managed machine learning (ML) service that makes it easy for developers to deliver personalized experiences to their users. It enables you to improve customer engagement by powering personalized product and content recommendations in websites, applications, and targeted marketing campaigns. You can get started without any prior ML experience, using APIs to easily build sophisticated personalization capabilities in a few clicks. All your data is encrypted to be private and secure, and is only used to create recommendations for your users.
User interests can change based on a variety of factors, such as external events or the interests of other users. It’s critical for websites and apps to tailor their recommendations to these changing interests to improve user engagement. With Trending-Now, you can surface items from your catalog that are rising in popularity with higher velocity than other items, such as trending news, popular social content, or newly released movies. Amazon Personalize looks for items that are rising in popularity at a faster rate than other catalog items to help users discover items that are engaging their peers. Amazon Personalize also allows you to define the time periods over which trends are calculated depending on your unique business context, with options of every 30 minutes, 1 hour, 3 hours, or 1 day, based on the most recent interactions data from users.
In this post, we show how to use this new recipe to recommend top trending items to your users.
Solution overview
Trending-Now identifies the top trending items by calculating the increase in interactions that each item has over configurable intervals of time. The items with the highest rate of increase are considered trending items. The time is based on timestamp data in your interactions dataset. You can specify the time interval by providing a trend discovery frequency when you create your solution.
The Trending-Now recipe requires an interactions dataset, which contains a record of the individual user and item events (such as clicks, watches, or purchases) on your website or app along with the event timestamps. You can use the parameter Trend discovery frequency to define the time intervals over which trends are calculated and refreshed. For example, if you have a high traffic website with rapidly changing trends, you can specify 30 minutes as the trend discovery frequency. Every 30 minutes, Amazon Personalize looks at the interactions that have been ingested successfully and refreshes the trending items. This recipe also allows you to capture and surface any new content that has been introduced in the last 30 minutes and has seen a higher degree of interest from your user base than any preexisting catalog items. For any parameter values that are greater than 2 hours, Amazon Personalize automatically refreshes the trending item recommendations every 2 hours to account for new interactions and new items.
Datasets that have low traffic but use a 30-minute value can see poor recommendation accuracy due to sparse or missing interactions data. The Trending-Now recipe requires that you provide interaction data for at least two past time periods (this time period is your desired trend discovery frequency). If interaction data doesn’t exist for the last 2 time periods, Amazon Personalize will replace the trending items with popular items until the required minimum data is available.
The Trending-Now recipe is available for both custom dataset groups as well as video-on-demand domain dataset groups. In this post, we demonstrate how to tailor your recommendations for the fast-changing trends in user interest with this new Trending-Now feature for a media use case with a custom dataset group. The following diagram illustrates the solution workflow.
For example, in video-on-demand applications, you can use this feature to show what movies are trending in the last 1 hour by specifying 1 hour for your trend discovery frequency. For every 1 hour of data, Amazon Personalize identifies the items with the greatest rate of increase in interactions since the last evaluation. Available frequencies include 30 minutes, 1 hour, 3 hours, and 1 day.
Prerequisites
To use the Trending-Now recipe, you first need to set up Amazon Personalize resources on the Amazon Personalize console. Create your dataset group, import your data, train a solution version, and deploy a campaign. For full instructions, see Getting started.
For this post, we have followed the console approach to deploy a campaign using the new Trending-Now recipe. Alternatively, you can build the entire solution using the SDK approach with this provided notebook. For both approaches, we use the MovieLens public dataset.
Prepare the dataset
Complete the following steps to prepare your dataset:
- Create a dataset group.
- Create an interactions dataset using the following schema (a sketch is shown after this list):
- Import the interactions data to Amazon Personalize from Amazon Simple Storage Service (Amazon S3).
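A minimal sketch of such an interactions schema, created through boto3 with only the mandatory fields (the schema name is a placeholder):

```python
import json
import boto3

personalize = boto3.client("personalize")

# Minimal interactions schema with the mandatory USER_ID, ITEM_ID, and TIMESTAMP fields
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

response = personalize.create_schema(
    name="trending-now-interactions-schema",  # placeholder name
    schema=json.dumps(interactions_schema),
)
print(response["schemaArn"])
```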
For the interactions data, we use ratings history from the movies review dataset, MovieLens.
Use the following Python code to curate the interactions dataset from the MovieLens public dataset.
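The full curation code is in the accompanying notebook; the following is a minimal pandas sketch of the idea, assuming the MovieLens ratings file is available locally (the file path and output name are placeholders):

```python
import pandas as pd

# MovieLens ratings file with columns: userId, movieId, rating, timestamp
ratings = pd.read_csv("ml-latest/ratings.csv")

# Keep only the columns Amazon Personalize requires and rename them accordingly
interactions = ratings.rename(
    columns={"userId": "USER_ID", "movieId": "ITEM_ID", "timestamp": "TIMESTAMP"}
)[["USER_ID", "ITEM_ID", "TIMESTAMP"]]

# Write the curated interactions dataset, ready to upload to Amazon S3
interactions.to_csv("interactions.csv", index=False)
```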
The MovieLens dataset contains user_id, rating, item_id, the interactions between users and items, and the time each interaction took place (a timestamp, given as UNIX epoch time). The dataset also contains movie title information to map the movie ID to the actual title and genres. The following table is a sample of the dataset.
USER_ID | ITEM_ID | TIMESTAMP | TITLE | GENRES |
116927 | 1101 | 1105210919 | Top Gun (1986) | Action|Romance |
158267 | 719 | 974847063 | Multiplicity (1996) | Comedy |
55098 | 186871 | 1526204585 | Heal (2017) | Documentary |
159290 | 59315 | 1485663555 | Iron Man (2008) | Action|Adventure|Sci-Fi |
108844 | 34319 | 1428229516 | Island, The (2005) | Action|Sci-Fi|Thriller |
85390 | 2916 | 953264936 | Total Recall (1990) | Action|Adventure|Sci-Fi|Thriller |
103930 | 18 | 839915700 | Four Rooms (1995) | Comedy |
104176 | 1735 | 985295513 | Great Expectations (1998) | Drama|Romance |
97523 | 1304 | 1158428003 | Butch Cassidy and the Sundance Kid (1969) | Action|Western |
87619 | 6365 | 1066077797 | Matrix Reloaded, The (2003) | Action|Adventure|Sci-Fi|Thriller|IMAX |
The curated dataset includes USER_ID, ITEM_ID (movie ID), and TIMESTAMP to train the Amazon Personalize model. These are the mandatory fields required to train a model with the Trending-Now recipe. The following table is a sample of the curated dataset.
USER_ID | ITEM_ID | TIMESTAMP |
48953 | 529 | 841223587 |
23069 | 1748 | 1092352526 |
117521 | 26285 | 1231959564 |
18774 | 457 | 848840461 |
58018 | 179819 | 1515032190 |
9685 | 79132 | 1462582799 |
41304 | 6650 | 1516310539 |
152634 | 2560 | 1113843031 |
57332 | 3387 | 986506413 |
12857 | 6787 | 1356651687 |
Train a model
After the dataset import job is complete, you’re ready to train your model.
- On the Solutions tab, choose Create solution.
- Choose the new aws-trending-now recipe.
- In the Advanced configuration section, set Trend discovery frequency to 30 minutes.
- Choose Create solution to start training.
Create a campaign
In Amazon Personalize, you use a campaign to make recommendations for your users. In this step, you create a campaign using the solution you created in the previous step and get the Trending-Now recommendations:
- On the Campaigns tab, choose Create campaign.
- For Campaign name, enter a name.
- For Solution, choose the solution trending-now-solution.
- For Solution version ID, choose the solution version that uses the aws-trending-now recipe.
- For Minimum provisioned transactions per second, leave it at the default value.
- Choose Create campaign to start creating your campaign.
Get recommendations
After you create or update your campaign, you can get a recommended list of items that are trending, sorted from highest to lowest. On the campaign (trending-now-campaign) Personalization API tab, choose Get recommendations. The following screenshot shows the campaign detail page with results from a GetRecommendations call that includes the recommended items and the recommendation ID.
The results from the GetRecommendations call include the IDs of the recommended items. The following table is a sample after mapping the IDs to the actual movie titles for readability. The code to perform the mapping is provided in the attached notebook.
ITEM_ID | TITLE |
356 | Forrest Gump (1994) |
318 | Shawshank Redemption, The (1994) |
58559 | Dark Knight, The (2008) |
33794 | Batman Begins (2005) |
44191 | V for Vendetta (2006) |
48516 | Departed, The (2006) |
195159 | Spider-Man: Into the Spider-Verse (2018) |
122914 | Avengers: Infinity War – Part II (2019) |
91974 | Underworld: Awakening (2012) |
204698 | Joker (2019) |
Get trending recommendations
After you create a solution version using the aws-trending-now recipe, Amazon Personalize will identify the top trending items by calculating the increase in interactions that each item has over configurable intervals of time. The items with the highest rate of increase are considered trending items. The time is based on timestamp data in your interactions dataset.
Now let’s provide the latest interactions to Amazon Personalize to calculate the trending items. We can provide the latest interactions using real-time ingestion by creating an event tracker or through a bulk data upload with a dataset import job in incremental mode. In the notebook, we have provided sample code to individually import the latest real-time interactions data into Amazon Personalize using the event tracker.
For this post, we provide the latest interactions as a bulk data upload with a dataset import job in incremental mode. Use the following Python code to generate dummy incremental interactions and upload the incremental interactions data using a dataset import job.
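The exact code is in the accompanying notebook; the following is a minimal sketch of the idea, with the S3 bucket, dataset ARN, and role ARN shown as placeholders:

```python
import time
import random
import pandas as pd
import boto3

# Generate synthetic interactions for a few items with current timestamps
item_ids = ["153", "260", "1792", "2363", "2407", "2459", "3948", "6539", "8961", "61248"]
now = int(time.time())
records = [
    {
        "USER_ID": str(random.randint(1, 162000)),
        "ITEM_ID": random.choice(item_ids),
        "TIMESTAMP": now - random.randint(0, 15 * 60),  # within the last 15 minutes
    }
    for _ in range(2000)
]
pd.DataFrame(records).to_csv("incremental_interactions.csv", index=False)

# Upload to S3 and start an incremental dataset import job (placeholders below)
boto3.client("s3").upload_file(
    "incremental_interactions.csv", "<your-bucket>", "incremental_interactions.csv"
)
personalize = boto3.client("personalize")
personalize.create_dataset_import_job(
    jobName="trending-now-incremental-import",
    datasetArn="<interactions-dataset-arn>",
    dataSource={"dataLocation": "s3://<your-bucket>/incremental_interactions.csv"},
    roleArn="<personalize-import-role-arn>",
    importMode="INCREMENTAL",
)
```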
We have synthetically generated these interactions by randomly selecting a few values for USER_ID and ITEM_ID, and generating interactions between those users and items with the latest timestamps. The following table contains the randomly selected ITEM_ID values that are used for generating incremental interactions.
ITEM_ID | TITLE |
153 | Batman Forever (1995) |
260 | Star Wars: Episode IV – A New Hope (1977) |
1792 | U.S. Marshals (1998) |
2363 | Godzilla (Gojira) (1954) |
2407 | Cocoon (1985) |
2459 | Texas Chainsaw Massacre, The (1974) |
3948 | Meet the Parents (2000) |
6539 | Pirates of the Caribbean: The Curse of the Bla… |
8961 | Incredibles, The (2004) |
61248 | Death Race (2008) |
Upload the incremental interactions data by selecting Append to current dataset (or use incremental mode if using the APIs), as shown in the following screenshot.
After the incremental interactions dataset import job is complete, wait for the duration of the trend discovery frequency you configured for the new recommendations to be reflected.
Choose Get recommendations on the campaign API page to get the latest recommended list of items that are trending.
Now we see the latest list of recommended items. The following table contains the data after mapping the IDs to the actual movie titles for readability. The code to perform the mapping is provided in the attached notebook.
ITEM_ID | TITLE |
260 | Star Wars: Episode IV – A New Hope (1977) |
6539 | Pirates of the Caribbean: The Curse of the Bla… |
153 | Batman Forever (1995) |
3948 | Meet the Parents (2000) |
1792 | U.S. Marshals (1998) |
2459 | Texas Chainsaw Massacre, The (1974) |
2363 | Godzilla (Gojira) (1954) |
61248 | Death Race (2008) |
8961 | Incredibles, The (2004) |
2407 | Cocoon (1985) |
The preceding GetRecommendations call includes the IDs of the recommended items. Now we see that the recommended ITEM_ID values come from the incremental interactions dataset that we provided to the Amazon Personalize model. This is not surprising, because these are the only items that gained interactions in the most recent 30 minutes of our synthetic dataset.
You have now successfully trained a Trending-Now model to generate item recommendations that are becoming popular with your users and tailor the recommendations according to user interest. Going forward, you can adapt this code to create other recommenders.
You can also use filters along with the Trending-Now recipe to differentiate the trends between different types of content, like long vs. short videos, or apply promotional filters to explicitly recommend specific items based on rules that align with your business goals.
Clean up
Make sure you clean up any unused resources you created in your account while following the steps outlined in this post. You can delete filters, recommenders, datasets, and dataset groups via the AWS Management Console or using the Python SDK.
Summary
The new aws-trending-now recipe from Amazon Personalize helps you identify the items that are rapidly becoming popular with your users and tailor your recommendations for the fast-changing trends in user interest.
For more information about Amazon Personalize, see the Amazon Personalize Developer Guide.
About the authors
Vamshi Krishna Enabothala is a Sr. Applied AI Specialist Architect at AWS. He works with customers from different sectors to accelerate high-impact data, analytics, and machine learning initiatives. He is passionate about recommendation systems, NLP, and computer vision areas in AI and ML. Outside of work, Vamshi is an RC enthusiast, building RC equipment (planes, cars, and drones), and also enjoys gardening.
Anchit Gupta is a Senior Product Manager for Amazon Personalize. She focuses on delivering products that make it easier to build machine learning solutions. In her spare time, she enjoys cooking, playing board/card games, and reading.
Abhishek Mangal is a Software Engineer for Amazon Personalize and works on architecting software systems to serve customers at scale. In his spare time, he likes to watch anime and believes ‘One Piece’ is the greatest piece of story-telling in recent history.
Bundesliga Match Fact Ball Recovery Time: Quantifying teams’ success in pressing opponents on AWS
In football, ball possession is a strong predictor for team success. It’s hard to control the game without having control over the ball. In the past three Bundesliga seasons, as well as in the current season (at the time of this writing), Bayern Munich is ranked first in the table and in ball possession percentage, followed by Dortmund being second in both. The active tactics and playing styles that facilitate high possession values through ball retention have been widely discussed. Terms like Tiki-Taka were established to describe a playing style that is characterized by a precise short passing game with frequent long ball possessions of the attacking team. However, in order to arrive at high possession rates, teams also need to adapt their defense to quickly win back a ball lost to the opponent. Terms like high-press, middle-press, and low-press are often used to describe the amount of room a defending team is allowing their opponents when moving towards their goal before applying pressure on the ball.
The recent history of Bundesliga club FC Köln emphasizes the effect of different pressing styles on a team’s success. Since Steffen Baumgart took over as coach at FC Köln in 2021, the team has managed to lift themselves from the bottom and has established a steady position in the middle of the table. When analyzing the team statistics after the switch in coaches, one aspect stands out specifically: with 54 pressing situations per game, the team was ranked first in the league, being able to win the ball back in a third of those situations. This proved especially successful when attacking in the opponent’s half of the pitch. With an increased number of duels per match (+10% compared to the previous season), the Billy Goats managed to finish the last season in a strong seventh place, securing a surprising spot in the UEFA Europa Conference League.
Our previous Bundesliga Match Fact (BMF) Pressure Handling sheds light on how successful different players and teams are in withstanding this pressure while retaining the ball. To facilitate the understanding of how active and successful a defending team applies pressure, we need to understand how long it takes them to win back a lost ball. Which Bundesliga teams are fastest in winning back lost possessions? How does a team’s ability to quickly regain possession develop over the course of a match? Are their recovery times diminished when playing stronger teams? And finally, are short recovery times a necessary ingredient to a winning formula?
Introducing the new Bundesliga Match Fact: Ball Recovery Time.
How it works
Ball Recovery Time (BRT) calculates the amount of time it takes for a team to regain possession of the ball. It indicates how hungry a team is at winning the ball back and is measured in average ball recovery time in seconds.
Throughout a match, the positions of the players and the ball are tracked by cameras around the pitch and stored as coordinates in a positional data stream. This allows us to calculate which player has ball possession at any given moment in time. It’s no surprise that the ball possession alternates between the two teams over the course of a match. However, less obvious are the times where the ball possession is contested and can’t be directly assigned to any particular team. The timer for ball recovery starts counting from the moment the team loses possession until they regain it. The time when the ball’s possession is not clear is included in the timer, incentivizing teams to favor clear and fast recoveries.
The following example shows a sequence of alternating ball possessions between team A and B. At some point, team A loses ball possession to team B, which starts the ball recovery time for team A. The ball recovery time is calculated until team A regains the ball.
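As a simplified illustration of this logic, the following sketch computes recovery times from a stream of (timestamp, possessing team) samples, with None marking contested possession; the real implementation works on 25 Hz positional data and is more involved:

```python
def ball_recovery_times(possession_stream, team):
    """Compute recovery times (in seconds) for `team` from a list of
    (timestamp, possessing_team) samples; contested possession is None.
    The timer starts when `team` loses the ball to the opponent and stops
    when it regains possession; contested time in between is included."""
    recovery_times = []
    lost_at = None
    had_ball = False
    for timestamp, holder in possession_stream:
        if holder == team:
            if lost_at is not None:
                recovery_times.append(timestamp - lost_at)  # ball regained
                lost_at = None
            had_ball = True
        elif holder is not None and had_ball and lost_at is None:
            lost_at = timestamp  # ball clearly lost to the opponent
        # contested possession (None) leaves the running timer untouched
    return recovery_times

# Toy example: team A loses the ball at t=10 and regains it at t=23 -> 13 seconds
stream = [(0, "A"), (10, "B"), (15, None), (23, "A")]
print(ball_recovery_times(stream, "A"))  # [13]
```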
As already mentioned, FC Cologne has been the league leader in the number of pressing situations since Steffen Baumgart took office. This style of play is also evident when you look at the ball recovery times for the first 24 match days in the 2022/23 season. Cologne achieved an incredible ball recovery time of 13.4 seconds, which is the fourth fastest in the league. On average, it took them only 1.4 seconds longer to recover a lost ball than the fastest team in the league, Bayern Munich, who got the ball back from their opponents after an average of 12 seconds.
Let’s look at certain games played by Cologne in the 2022/23 season. The following chart shows the ball recovery times of Cologne for various games. At least two games stand out in particular. On the first match day, they faced FC Schalke—also known as the Miners—and managed an exceptionally low BRT of 8.3 seconds. This was aided by a red card for Schalke in the first half when the game was still tied 0:0. Cologne’s quick recovery of the ball subsequently helped them prevail 3:1 against the Miners.
Also worth mentioning is the Cologne derby against Borussia Mönchengladbach on the ninth match day. In that game, Cologne took 21.6 seconds to recover the ball, which is around 60% slower than their season average of 13.4 seconds. A yellow-red card just before halftime certainly made it difficult for the Billy Goats to speed up recovering the ball from their local rival Borussia. At the same time, Borussia managed to win the ball back from Cologne on average after just 13.7 seconds, resulting in a consistent 5:2 win for Borussia over their perennial rivals from Cologne.
How it’s implemented
Positional data from an ongoing match, which is recorded at a sampling rate of 25 Hz, is utilized to determine the time taken to recover the ball. To ensure real-time updates of ball recovery times, we have implemented Amazon Managed Streaming for Apache Kafka (Amazon MSK) as a central solution for data streaming and messaging. This allows for seamless communication of positional data and various outputs of Bundesliga Match Facts between containers in real time.
The following diagram illustrates the end-to-end workflow for Ball Recovery Time.
The match-related data is collected and ingested using DFL’s DataHub. Metadata of the match is processed within the AWS Lambda function MetaDataIngestion
, while positional data is ingested using the AWS Fargate container called MatchLink
. Both the Lambda function and the Fargate container publish the data for further consumption in the relevant MSK topics. The core of the Ball Recovery Time BMF resides within a dedicated Fargate container called BMF BallRecoveryTime
. This container operates throughout the corresponding match and obtains all necessary data continuously through Amazon MSK. Its logic responds instantly to positional changes and constantly computes the current ball recovery times.
After the ball recovery times have been computed, they’re transmitted back to the DataHub for distribution to other consumers of Bundesliga Match Facts. Additionally, the ball recovery times are sent to a specific topic in the MSK cluster, where they can be accessed by other Bundesliga Match Facts. A Lambda function retrieves all recovery times from the relevant Kafka topic and stores them in an Amazon Aurora Serverless database. This data is then utilized to create interactive, near-real-time visualizations with Amazon QuickSight.
Summary
In this post, we demonstrated how the new Bundesliga Match Fact Ball Recovery Time makes it possible to quantify and objectively compare the speed of different Bundesliga teams in winning back a lost ball possession. This allows commentators and fans to understand how early and successful teams apply pressure to their opponents.
The new Bundesliga Match Fact is the result of an in-depth analysis by a team of football experts and data scientists from the Bundesliga and AWS. Noteworthy ball recovery times are shown in the live ticker of the respective matches in the official Bundesliga app and website. During live matches, ball recovery times are also provided to commentators through the data story finder and visually shown to fans at key moments in broadcast.
We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!
We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.
About the Authors
Javier Poveda-Panter is a Senior Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.
Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.
Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.
Fotinos Kyriakides is an ML Engineer with AWS Professional Services. He focuses his efforts in the fields of machine learning, MLOps, and application development, in supporting customers to develop applications in the cloud that leverage and innovate on insights generated from data. In his spare time, he likes to run and explore nature.
Luuk Figdor is a Principal Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.
Bundesliga Match Fact Keeper Efficiency: Comparing keepers’ performances objectively using machine learning on AWS
The Bundesliga is renowned for its exceptional goalkeepers, making it potentially the most prominent among Europe’s top five leagues in this regard. Apart from the widely recognized Manuel Neuer, the Bundesliga has produced remarkable goalkeepers who have excelled in other leagues, including the likes of Marc-André ter Stegen, who is a superstar at Barcelona. In view of such steep competition, people are split on the question of who the most remarkable sweeper in the German top league is. As demonstrated by Yann Sommer’s stunning 19 saves (a Bundesliga record) against Bayern Munich last summer, which helped his former club Mönchengladbach earn a draw against the Bavarians, this league’s keepers are fiercely vying for the top spot.
We have witnessed time and time again that a keeper can make or break a win, yet it remains challenging to objectively quantify their effect on a team’s success. Who is the most efficient goal keeper in the Bundesliga? Who prevents more goals than the average? How can we even compare keepers with different playing styles? It’s about time to shed some light on our guardians’ achievements. Enter the brand-new Bundesliga Match Fact: Keeper Efficiency.
When talking about the best of the best shot-stoppers in the Bundesliga, the list is long and rarely complete. In recent years, one name has been especially dominant: Kevin Trapp. For years, Trapp has been regarded as one of the finest goalies in the Bundesliga. Not only was he widely considered the top-rated goalkeeper in the league during the 2021/22 season, but he also held that title back in 2018/19 when Eintracht Frankfurt reached the Europa League semifinals. Similar to Yann Sommer, Trapp often delivered his best performances on nights when his team was up against the Bavarians.
Many football enthusiasts would argue that Yann Sommer is the best keeper in Germany’s top league, despite also being the smallest. Sommer is highly skilled with the ball at his feet and has demonstrated his ability to produce jaw-dropping saves that are on par with others in the world elite. Although Sommer can genuinely match any goalkeeper’s level on his best days, he hasn’t had those best days frequently enough in the past. Although he has improved his consistency over time, he still makes occasional errors that can frustrate fans. Having been Switzerland’s well-deserved #1 since 2016, time will tell whether he pushes Manuel Neuer off the throne in Munich.
And let’s not forget about Gregor Kobel. Since joining Borussia Dortmund, Kobel, who has previously played for Hoffenheim, Augsburg, and VfB Stuttgart, has been a remarkable signing for the club. Although Jude Bellingham has possibly overtaken him as the team’s highest valued player, there is still a valid argument that Kobel is the most important player for Dortmund. At only 25 years old, Kobel is among the most promising young goalkeepers globally, with the ability to make quality saves and face a significant number of shots in the Bundesliga. The pressure to perform at Dortmund is immense, second only to their fierce rivals Bayern Munich (at the time of this writing), and Kobel doesn’t have the same defensive protection as any Bayern keeper would. In 2022/23 so far, he has almost secured a clean sheet every other match for Die Schwarzgelben, despite the team’s inconsistency and often poor midfield performance.
As these examples show, the ways in which keepers shine and compete are manifold. Therefore, it’s no surprise that determining the proficiency of goalkeepers in preventing the ball from entering the net is considered one of the most difficult tasks in football data analysis. Bundesliga and AWS have collaborated to perform an in-depth examination to study the quantification of achievements of Bundesliga’s keepers. The result is a machine learning (ML)-powered insight that allows fans to easily evaluate and compare the goalkeepers’ proficiencies. We’re excited to announce the new Bundesliga Match Fact: Keeper Efficiency.
How it works
The new Bundesliga Match Fact Keeper Efficiency allows fans to evaluate the proficiency of goalkeepers in terms of their ability to prevent shooters from scoring. Although tallying the total number of saves a goalkeeper makes during a match can be informative, it doesn’t account for variations in the difficulty of the shots faced. To avoid treating a routine catch of a 30-meter shot aimed directly at the goalkeeper as being equivalent to an exceptional save made from a shot taken from a distance of 5 meters, we assign each shot a value known as xSaves, which measures the probability that a shot will be saved by a Keeper. In other words, a shot with an xSaves value of 0.9 would be saved 9 out of 10 times.
An ML model is trained through Amazon SageMaker, using data from four seasons of the first and second Bundesliga, encompassing all shots that landed on target (either resulting in a goal or being saved). Using derived characteristics of a shot, the model generates the probability that the shot will be successfully saved by the goalkeeper. Some of the factors considered by the model are: distance to goal, distance to goalkeeper, shot angle, number of players between the shot location and the goal, goalkeeper positioning, and predicted shot trajectory. We utilize an extra model to predict the trajectory of the shot using the initial few frames of the observed shot. With the predicted trajectory of the shot and the goalkeeper’s position, the xSaves model can evaluate the probability of the goalkeeper saving the ball.
Adding up all xSaves values of saved and conceded shots by a goalkeeper yields the expected number of saves a goalkeeper should have during a match or season. Comparing that against the actual number of saves yields the Keeper Efficiency. In other words, a goalkeeper with a positive Keeper Efficiency rating indicates that the goalkeeper has saved more shots than expected.
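Conceptually, the aggregation can be sketched as follows (the xSaves values in the example are purely illustrative):

```python
def keeper_efficiency(shots_faced):
    """shots_faced: list of (xsaves, was_saved) for all on-target shots a keeper faced.
    Returns actual saves minus expected saves; a positive value means the keeper
    saved more shots than expected."""
    expected_saves = sum(xsaves for xsaves, _ in shots_faced)
    actual_saves = sum(1 for _, was_saved in shots_faced if was_saved)
    return actual_saves - expected_saves

# Illustrative match: three shots with save probabilities 0.9, 0.5, and 0.2,
# of which the keeper saves the first two and concedes the third
print(keeper_efficiency([(0.9, True), (0.5, True), (0.2, False)]))  # 0.4
```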
Keeper Efficiency in action
The following are a few estimates to showcase the Keeper Efficiency.
Example 1
Due to the large distance to the goal, and the relatively low distance and large number of defenders covering the goal, the probability that the shot will result in a goal is low. Because the goalkeeper saved the shot, he will receive a small increase in the Keeper Efficiency ranking.
Example 2
In this example, the striker is much closer to the goal, with only one defender between him and the goalkeeper, resulting in a lower save probability.
Example 3
In this example, the speed of the ball is much higher and the ball is higher off the ground, resulting in a very low probability that the ball will be saved. The goal was conceded, and therefore the goalkeeper will see a small decrease in his Keeper Efficiency statistic.
What makes a good save
The preceding video shows a medium difficulty shot with approximately a 50/50 chance of being saved, meaning that half the keepers in the league would save it and the other half concede the goal. What makes this save remarkable is the goalkeeper’s positioning, instinct, and reflexes. The goalkeeper remains focused on the ball even when his vision is obstructed by the defenders and changes his positioning multiple times according to where he thinks the biggest opening lies. Looking at it frame by frame, as soon as the attacking player winds up to take the shot, the goalkeeper makes a short hop backwards to better position himself for the jump to save the shot. The keeper’s reflexes are perfect, landing precisely at the moment when the striker kicks the ball. If he lands too late, he would be mid-air as the ball is flying towards the goal, wasting precious time. With both feet planted on the grass, he makes a strong jump, managing to save the shot.
How Keeper Efficiency is implemented
This Bundesliga Match Fact consumes both event and positional data. Positional data is information gathered by cameras on the positions of the players and ball at any moment during the match (x-y coordinates), arriving at 25 Hz. Event data consists of hand-labeled event descriptions with useful attributes, such as shot on target. When a shot on target (a scored or saved goal) event is received, it queries the stored positional data and finds a sync frame—a frame during which the timing and position of the ball match the event. This frame is used to synchronize the event data with the positional data. Once synchronized, the subsequent frames that track the ball trajectory are used to predict where the ball will enter the goal. Additionally, the goalkeeper position at the time of the shot is considered, as well as a number of other features such as the number of defenders between the ball and the goalpost and the speed of the ball. All this data is then passed to an ML model (XGBoost), which is deployed on Amazon SageMaker Serverless Inference to generate a prediction of the probability of the shot being saved.
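As an illustration of the inference step only, a caller could invoke such a serverless endpoint as in the following sketch; the endpoint name, feature order, and payload format are assumptions, not the actual implementation:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Example feature vector: distance to goal, distance to goalkeeper, shot angle,
# defenders between ball and goal, ball speed (order and units are assumptions)
features = "22.5,4.1,0.35,2,31.2"

response = runtime.invoke_endpoint(
    EndpointName="xsaves-serverless-endpoint",  # placeholder name
    ContentType="text/csv",
    Body=features,
)
xsaves = float(response["Body"].read().decode("utf-8"))
print(f"Probability that the shot is saved: {xsaves:.2f}")
```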
The BMF logic itself (except for the ML model) runs on an AWS Fargate container. For every xSaves prediction, it produces a message with the prediction as a payload, which then gets distributed by a central message broker running on Amazon Managed Streaming for Apache Kafka (Amazon MSK). The information also gets stored in a data lake for future auditing and model improvements. The contents of the Kafka messages then get written via an AWS Lambda function to an Amazon Aurora Serverless database to be presented in an Amazon QuickSight dashboard. The following diagram illustrates this architecture.
Summary
The new Bundesliga Match Fact Keeper Efficiency measures the shot-stopping skills of the Bundesliga’s goalies, which are considered to be among the finest in the world. This gives fans and commentators the unique opportunity to understand quantitatively how much a goalkeeper’s performance has contributed to a team’s match result or seasonal achievements.
This Bundesliga Match Fact was developed among a team of Bundesliga and AWS experts. Noteworthy goalkeeper performances are pushed into the Bundesliga live ticker in the mobile app and on the webpage. Match commentators can observe exceptional Keeper Efficiency through the data story finder, and visuals are presented to the fans as part of broadcasting streams.
We hope that you enjoy this brand-new Bundesliga Match Fact and that it provides you with new insights into the game. To learn more about the partnership between AWS and Bundesliga, visit Bundesliga on AWS!
We’re excited to learn what patterns you will uncover. Share your insights with us: @AWScloud on Twitter, with the hashtag #BundesligaMatchFacts.
About the Authors
Javier Poveda-Panter is a Senior Data Scientist for EMEA sports customers within the AWS Professional Services team. He enables customers in the area of spectator sports to innovate and capitalize on their data, delivering high-quality user and fan experiences through machine learning and data science. He follows his passion for a broad range of sports, music, and AI in his spare time.
Tareq Haschemi is a consultant within AWS Professional Services. His skills and areas of expertise include application development, data science, machine learning, and big data. He supports customers in developing data-driven applications within the cloud. Prior to joining AWS, he was also a consultant in various industries such as aviation and telecommunications. He is passionate about enabling customers on their data/AI journey to the cloud.
Jean-Michel Lourier is a Senior Data Scientist within AWS Professional Services. He leads teams implementing data-driven applications side by side with AWS customers to generate business value out of their data. He’s passionate about diving into tech and learning about AI, machine learning, and their business applications. He is also an enthusiastic cyclist, taking long bike-packing trips.
Fotinos Kyriakides is an ML Engineer with AWS Professional Services. He focuses on machine learning, MLOps, and application development, supporting customers in building cloud applications that leverage and innovate on insights generated from data. In his spare time, he likes to run and explore nature.
Uwe Dick is a Data Scientist at Sportec Solutions AG. He works to enable Bundesliga clubs and media to optimize their performance using advanced stats and data—before, after, and during matches. In his spare time, he settles for less and just tries to last the full 90 minutes for his recreational football team.
Luuk Figdor is a Principal Sports Technology Advisor in the AWS Professional Services team. He works with players, clubs, leagues, and media companies such as the Bundesliga and Formula 1 to help them tell stories with data using machine learning. In his spare time, he likes to learn all about the mind and the intersection between psychology, economics, and AI.
HAYAT HOLDING uses Amazon SageMaker to increase product quality and optimize manufacturing output, saving $300,000 annually
This is a guest post by Neslihan Erdogan, Global Industrial IT Manager at HAYAT HOLDING.
With the ongoing digitization of the manufacturing processes and Industry 4.0, there is enormous potential to use machine learning (ML) for quality prediction. Process manufacturing is a production method that uses formulas or recipes to produce goods by combining ingredients or raw materials.
Predictive quality comprises the use of ML methods in production to estimate and classify product-related quality based on manufacturing process data with the following goals[1]:
- Quality description – The identification of relationships between process variables and product quality. For instance, how does the volume of an adhesive ingredient affect quality parameters such as strength and elasticity?
- Quality prediction – The estimation of a quality variable based on process variables, for decision support or automation. For example, how many kg/m3 of an adhesive ingredient should be added to achieve a certain strength and elasticity?
- Quality classification – In addition to quality prediction, this involves estimation of certain product quality types.
In this post, we share how HAYAT HOLDING—a global player with 41 companies operating in different industries, including HAYAT, the world’s fourth-largest branded diaper manufacturer, and KEAS, the world’s fifth-largest wood-based panel manufacturer—collaborated with AWS to build a solution that uses Amazon SageMaker Model Training, Amazon SageMaker Automatic Model Tuning, and Amazon SageMaker Model Deployment to continuously improve operational performance, increase product quality, and optimize manufacturing output of medium-density fiberboard (MDF) wood panels.
Product quality prediction and adhesive consumption recommendation results can be observed by field experts through dashboards in near-real time, resulting in a faster feedback loop. Laboratory results indicate a significant impact equating to savings of $300,000 annually, reducing their carbon footprint in production by preventing unnecessary chemical waste.
ML-based predictive quality in HAYAT HOLDING
HAYAT is the world’s fourth-largest branded baby diaper manufacturer and the largest paper tissue manufacturer in EMEA. KEAS (Kastamonu Entegre Ağaç Sanayi) is a subsidiary of HAYAT HOLDING that produces wood-based panels, ranking fourth in Europe and fifth in the world.
Medium-density fiberboard (MDF) is an engineered wood product made by breaking down wood residuals into fibers, combining them with adhesives, and forming them into panels by applying high temperature and pressure. It has many application areas, such as furniture, cabinetry, and flooring.
Production of MDF wood panels requires extensive use of adhesives (double-digit tons consumed each year at HAYAT HOLDING).
In a typical production line, hundreds of sensors are used, and product quality is characterized by tens of parameters. Applying the correct volume of adhesive is an important cost item as well as a key driver of panel quality characteristics such as density, screw holding ability, tensile strength, modulus of elasticity, and bending strength. Excessive use of glue increases production costs unnecessarily, while insufficient glue causes quality problems. Incorrect usage can cause losses of up to tens of thousands of dollars in a single shift. The challenge is that the dependency of quality on the production process is complex and must be learned from process data through regression.
Human operators decide on the amount of glue to be used based on domain expertise. This know-how is solely empirical and takes years of expertise to build competence. To support the decision-making for the human operator, laboratory tests are performed on selected samples to precisely measure quality during production. The lab results provide feedback to the operators revealing product quality levels. Nevertheless, lab tests are not in real time and are applied with a delay of up to several hours. The human operator uses lab results to gradually adjust glue consumption to achieve the required quality threshold.
Overview of solution
Quality prediction using ML is powerful but requires effort and skill to design, integrate with the manufacturing process, and maintain. With the support of AWS Prototyping specialists, and AWS Partner Deloitte, HAYAT HOLDING built an end-to-end pipeline as follows:
- Ingest sensor data from production plant to AWS
- Perform data preparation and ML model generation
- Deploy models at the edge
- Create operator dashboards
- Orchestrate the workflow
The following diagram illustrates the solution architecture.
Data ingestion
HAYAT HOLDING has a state-of-the-art infrastructure for acquiring, recording, analyzing, and processing measurement data.
Two types of data sources exist for this use case. Process parameters are set for the production of a particular product and are usually not changed during production. Sensor data is taken during the manufacturing process and represents the actual condition of the machine.
Input data is streamed from the plant via OPC UA through an AWS IoT SiteWise Edge gateway running on AWS IoT Greengrass. In total, 194 sensors were imported and used to increase the accuracy of the predictions.
Model training and optimization with SageMaker automatic model tuning
Prior to the model training, a set of data preparation activities are performed. For instance, an MDF panel plant produces multiple distinct products on the same production line (multiple types and sizes of wood panels). Each batch is associated with a different product, with different raw materials and different physical characteristics. Although the equipment and process time series are recorded continuously and can be seen as a single-flow time series indexed by time, they need to be segmented by the batch they are associated with. For instance, in a shift, product panels may be produced for different durations. A sample of the produced MDF is sent to the laboratory for quality tests from time to time. Other feature engineering tasks include feature reduction, scaling, unsupervised dimensionality reduction using PCA (Principal Component Analysis), feature importance, and outlier detection.
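As an illustration of the kind of feature reduction mentioned above, the following sketch applies scikit-learn PCA to a batch of scaled sensor readings. The file name, column layout, and variance threshold are assumptions, not HAYAT HOLDING’s actual pipeline.

```python
# Illustrative sketch of unsupervised dimensionality reduction with PCA on
# batch-segmented sensor data. File name and columns are hypothetical.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# sensor_df: rows = timestamps within one production batch, columns = sensors
sensor_df = pd.read_csv("batch_42_sensors.csv", index_col="timestamp")  # hypothetical file

scaled = StandardScaler().fit_transform(sensor_df)

# Keep enough components to explain ~95% of the variance (assumed threshold)
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(scaled)
print(f"Reduced {sensor_df.shape[1]} sensors to {reduced.shape[1]} components")
```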
After the data preparation phase, a two-stage approach is used to build the ML models. Lab test samples are conducted by intermittent random product sampling from the conveyor belt. Samples are sent to a laboratory for quality tests. Because the lab results can’t be presented in real time, the feedback loop is relatively slow. The first model is trained to predict lab results for product quality parameters: density, elasticity, pulling resistance, swelling, absorbed water, surface durability, moisture, surface suction, and bending resistance. The second model is trained to recommend the amount of glue to be used in production, depending on the predicted output quality.
Setting up and managing custom ML environments can be time-consuming and cumbersome. Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and ML practitioners get started on training and deploying ML models quickly.
Multiple ML models were trained using SageMaker built-in algorithms for the top N most produced product types and for different quality parameters. The quality prediction models identify the relationships between glue usage and nine quality parameters. The recommendation models predict the minimum glue usage that satisfies the quality requirements using the following approach: the algorithm starts from the highest allowed glue amount and reduces it step by step as long as all requirements remain satisfied, down to the minimum allowed amount. If even the maximum amount of glue doesn’t satisfy all the requirements, it returns an error.
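The following simplified sketch illustrates this search logic. The quality predictor, thresholds, and step size are placeholders; the real recommendation model operates on the trained SageMaker models described above.

```python
# Simplified sketch of the recommendation search: start at the maximum allowed
# glue amount and step downward while predicted quality still meets every requirement.
def recommend_glue(predict_quality, requirements, max_glue, min_glue, step=0.5):
    """Return the smallest glue amount whose predicted quality meets all requirements."""
    def meets_all(amount):
        predicted = predict_quality(amount)  # dict: quality parameter -> predicted value
        return all(predicted[name] >= threshold for name, threshold in requirements.items())

    if not meets_all(max_glue):
        raise ValueError("Even the maximum allowed glue amount does not meet requirements")

    amount = max_glue
    while amount - step >= min_glue and meets_all(amount - step):
        amount -= step
    return amount
```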
SageMaker automatic model tuning, also known as hyperparameter tuning, finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.
With automatic model tuning, the team focused on defining the right objective and scoping the hyperparameters and the search space. Automatic model tuning takes care of the rest, including the infrastructure, running and orchestrating training jobs in parallel, and improving hyperparameter selection. Automatic model tuning supports a wide range of training instance types. The model was fine-tuned on ml.c5.2xlarge instances using the Bayesian optimization strategy, which is designed to find the best model in the shortest time.
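The following is a hedged sketch of how such a tuning job can be configured with the SageMaker Python SDK and the built-in XGBoost algorithm. The bucket paths, IAM role, and hyperparameter ranges are illustrative assumptions, not the project’s actual configuration.

```python
# Hedged sketch of SageMaker automatic model tuning with the Bayesian strategy.
# Role, bucket paths, and ranges are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # hypothetical role

xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")

estimator = Estimator(
    image_uri=xgb_image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",
    output_path="s3://my-bucket/quality-model/output",  # hypothetical bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="reg:squarederror", num_round=200)

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "subsample": ContinuousParameter(0.5, 1.0),
    },
    max_jobs=30,
    max_parallel_jobs=3,
    strategy="Bayesian",  # Bayesian search over the hyperparameter space
)

tuner.fit({
    "train": "s3://my-bucket/quality-model/train.csv",        # hypothetical input
    "validation": "s3://my-bucket/quality-model/validation.csv",
})
```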
Inference at the edge
Multiple methods are available for deploying ML models to get predictions.
SageMaker real-time inference is ideal for workloads with real-time, interactive, low-latency requirements. During the prototyping phase, HAYAT HOLDING deployed models to SageMaker hosting services and got endpoints that are fully managed by AWS. SageMaker multi-model endpoints provide a scalable and cost-effective solution for deploying large numbers of models. They use the same fleet of resources and a shared serving container to host all your models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because SageMaker manages loading models in memory and scaling them based on the traffic patterns to your endpoint.
SageMaker real-time inference is used with multi-model endpoints for cost optimization and for making all models available at all times during development. Although using an ML model for each product type results in higher inference accuracy, the cost of developing and testing these models increases accordingly, and it also becomes difficult to manage multiple models. SageMaker multi-model endpoints address these pain points and give the team a rapid and cost-effective solution to deploy multiple ML models.
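As a sketch, invoking a multi-model endpoint only requires specifying the target model artifact at request time; the endpoint name, artifact key, and payload below are assumptions.

```python
# Illustrative sketch of invoking a SageMaker multi-model endpoint, selecting
# the per-product-type model via TargetModel. Names and payload are placeholders.
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="mdf-quality-mme",              # hypothetical multi-model endpoint
    TargetModel="product-type-17/model.tar.gz",  # hypothetical model artifact key
    ContentType="text/csv",
    Body="0.74,182.0,6.1,210.5",                 # illustrative sensor feature vector
)
print(response["Body"].read().decode())
```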
Amazon SageMaker Edge provides model management for edge devices so you can optimize, secure, monitor, and maintain ML models on fleets of edge devices. Operating ML models on edge devices is challenging, because devices, unlike cloud instances, have limited compute, memory, and connectivity. After the model is deployed, you need to continuously monitor it, because model drift can cause the quality of the model to decay over time. Monitoring models across your device fleets is difficult because you need to write custom code to collect data samples from your devices and recognize skew in predictions.
For production, the SageMaker Edge Manager agent is used to make predictions with models loaded onto an AWS IoT Greengrass device.
Conclusion
HAYAT HOLDING was evaluating an advanced analytics platform as part of their digital transformation strategy and wanted to bring AI to the organization for quality prediction in production.
With the support of AWS Prototyping specialists and AWS Partner Deloitte, HAYAT HOLDING built a unique data platform architecture and an ML pipeline to address long-term business and technical needs.
HAYAT KIMYA integrated the ML solution in one of its plants. Laboratory results indicate a significant impact equating to savings of $300,000 annually, reducing their carbon footprint in production by preventing unnecessary chemical waste. The solution provides a faster feedback loop to the human operators by presenting product quality predictions and adhesive consumption recommendation results through dashboards in near-real time. The solution will eventually be deployed across HAYAT HOLDING’s other wood panel plants.
ML is a highly iterative process; over the course of a single project, data scientists train hundreds of models with different datasets and parameters in search of maximum accuracy. SageMaker offers the most complete set of tools to harness the power of ML. It lets you organize, track, compare, and evaluate ML experiments at scale. You can boost the bottom-line impact of your ML teams and achieve significant productivity improvements using SageMaker built-in algorithms, automatic model tuning, real-time inference, and multi-model endpoints.
Accelerate time to results and optimize operations by modernizing your business approach from edge to cloud using Machine Learning on AWS. Take advantage of industry-specific innovations and solutions using AWS for Industrial.
Share your feedback and questions in the comments.
About HAYAT HOLDING
HAYAT HOLDING, whose foundations were laid in 1937, is a global player today, with 41 companies operating in different industries, including HAYAT in the fast-moving consumer goods sector, KEAS (Kastamonu Entegre Ağaç Sanayi) in the wood-based panel sector, and LIMAS in the port management sector, with a workforce of over 17,000 people. HAYAT HOLDING delivers 49 brands produced with advanced technologies in 36 production facilities in 13 countries to millions of consumers worldwide.
Operating in the fast-moving consumer goods sector, Hayat was founded in 1987. Today, rapidly advancing on the path of globalization with 21 production facilities in 8 countries around the world, Hayat is the world’s fourth-largest branded diaper manufacturer and the largest tissue producer in the Middle East, Eastern Europe, and Africa, and a major player in the fast-moving consumer goods sector. With its 16 powerful brands, including Molfix, Bebem, Molped, Joly, Bingo, Test, Has, Papia, Familia, Teno, Focus, Nelex, Goodcare, and Evony in the hygiene, home care, tissue, and personal health categories, Hayat brings HAYAT* to millions of homes in more than 100 countries.
Kastamonu Entegre Ağaç Sanayi (KEAS), the first investment of HAYAT HOLDING in its industrialization move, was founded in 1969. Continuing its uninterrupted growth towards becoming a global power in its sector, it ranks fourth in Europe and fifth in the world. KEAS ranks first in the industry with its approximately 7,000 employees and exports to more than 100 countries.
*“Hayat” means “life” in Turkish.
References
- [1] Tercan H., “Machine learning and deep learning based predictive quality in manufacturing: a systematic review”, Journal of Intelligent Manufacturing, 2022.
About the authors
Neslihan Erdoğan (BSc and MSc in Electrical Engineering) has held various technical and business roles as a specialist, architect, and manager in information technologies. She works at HAYAT as the Global Industrial IT Manager and has led Industry 4.0, digital transformation, OT security, and data and AI projects.
Çağrı Yurtseven (BSc in Electrical-Electronics Engineering, Bogazici University) is the Enterprise Account Manager at Amazon Web Services. He is leading Sustainability and Industrial IOT initiatives in Turkey while helping customers realize their full potential by showing the art of the possible on AWS.
Cenk Sezgin (PhD – Electrical Electronics Engineering) is a Principal Manager at AWS EMEA Prototyping Labs. He supports customers with exploration, ideation, engineering and development of state-of-the-art solutions using emerging technologies such as IoT, Analytics, AI/ML & Serverless.
Hasan-Basri AKIRMAK (BSc and MSc in Computer Engineering and Executive MBA in Graduate School of Business) is a Principal Solutions Architect at Amazon Web Services. He is a business technologist advising enterprise segment clients. His area of specialty is designing architectures and business cases for large-scale data processing systems and machine learning solutions. Hasan has delivered business development, systems integration, and program management for clients in Europe, the Middle East, and Africa. Since 2016, he has mentored hundreds of entrepreneurs at startup incubation programs pro bono.
Mustafa Aldemir (BSc in Electrical-Electronics Engineering, MSc in Mechatronics and PhD-candidate in Computer Science) is the Robotics Prototyping Lead at Amazon Web Services. He has been designing and developing Internet of Things and Machine Learning solutions for some of the biggest customers across EMEA and leading their teams in implementing them. Meanwhile, he has been delivering AI courses at Amazon Machine Learning University and Oxford University.
Achieve effective business outcomes with no-code machine learning using Amazon SageMaker Canvas
On November 30, 2021, we announced the general availability of Amazon SageMaker Canvas, a visual point-and-click interface that enables business analysts to generate highly accurate machine learning (ML) predictions without having to write a single line of code. With Canvas, you can take ML mainstream throughout your organization so business analysts without data science or ML experience can use accurate ML predictions to make data-driven decisions.
ML is becoming ubiquitous in organizations across industries to gather valuable business insights using predictions from existing data quickly and accurately. The key to scaling the use of ML is making it more accessible. This means empowering business analysts to use ML on their own, without depending on data science teams. Canvas helps business analysts apply ML to common business problems without having to know the details such as algorithm types, training parameters, or ensemble logic. Today, customers are using Canvas to address a wide range of use cases across verticals including churn detection, sales conversion, and time series forecasting.
In this post, we discuss key Canvas capabilities.
Get started with Canvas
Canvas offers an interactive tour to help you navigate through the visual interface, starting with importing data from the cloud or on-premises sources. Getting started with Canvas is quick; we offer sample datasets for multiple use cases, including predicting customer churn, estimating loan default probabilities, forecasting demand, and predicting supply chain delivery times. These datasets cover all the use cases currently supported by Canvas, including binary classification, multi-class classification, regression, and time series forecasting. To learn more about navigating Canvas and using the sample datasets, see Amazon SageMaker Canvas accelerates onboarding with new interactive product tours and sample datasets.
Exploratory data analysis
After you import your data, Canvas allows you to explore and analyze it, before building predictive models. You can preview your imported data and visualize the distribution of different features. You can then choose to transform your data to make it suitable to address your problem. For example, you may choose to drop columns, extract date and time, impute missing values, or replace outliers with standard or custom values. These activities are recorded in a model recipe, which is a series of steps towards data preparation. This recipe is maintained throughout the lifecycle of a particular ML model from data preparation to generating predictions. See Amazon SageMaker Canvas expands capabilities to better prepare and analyze data for machine learning to learn more about preparing and analyzing data within Canvas.
Visualize your data
Canvas also offers the ability to define and create new features in your data through mathematical operators and logical functions. You can visualize and explore your data through box plots, bar graphs, and scatterplots by dragging and dropping features directly on charts. In addition, Canvas provides correlation matrices for numerical and categorical variables to understand the relationships between features in your data. This information can be used to refine your input data and drive more accurate models. For more details on data analysis capabilities in Canvas, see Use Amazon SageMaker Canvas for exploratory data analysis. To learn more about mathematical functions and operators in Canvas, see Amazon SageMaker Canvas supports mathematical functions and operators for richer data exploration.
After you prepare and explore your data, Canvas gives you an option to validate your datasets so you can proactively check for data quality issues. Canvas validates the data on your behalf and surfaces issues such as missing values in any row or column and too many unique labels in the target column compared to the number of rows. In addition, Canvas provides you with options to fix these issues before you build your ML model. For a deeper dive into data validation capabilities, refer to Identifying and avoiding common data issues while building no code ML models with Amazon SageMaker Canvas.
Build ML models
The first step towards building ML models in Canvas is to define the target column for the problem. For example, you could choose the sale price as the target column in a housing model to predict home prices. Alternatively, you could use churn as the target column to determine the probability of losing customers under different conditions. After you select the target column, Canvas automatically determines the type of problem for the model to be built.
Prior to building an ML model, you can get directional insights into the model’s estimated accuracy and how each feature influenced results by running a preview analysis. Based on these insights, you can further prepare, analyze, or explore your data to get the desired accuracy for model predictions.
Canvas offers two methods to train ML models: Quick build and Standard build. Both methods deliver a fully trained ML model with complete transparency into the importance of each feature towards the model outcome. Quick build focuses on speed and experimentation, whereas Standard build focuses on the highest levels of accuracy by going through multiple iterations of data preprocessing, choosing the right algorithm, exploring the hyperparameter space, and generating multiple candidate models before selecting the best performing model. This process is done behind the scenes by Canvas without the need to write code.
New performance improvements deliver up to three times faster ML model training time, enabling rapid prototyping and faster time-to-value for business outcomes. To learn more, see Amazon SageMaker Canvas announces up to 3x faster ML model training time.
Model analysis
After you build the model, Canvas provides detailed model accuracy metrics and feature explainability.
Canvas also presents a Sankey chart depicting the flow of the data from one value into the other, including false positives and false negatives.
For users interested in analyzing more advanced metrics, Canvas provides F1 scores that combine precision and recall, an accuracy metric quantifying how many times the model made a correct prediction across the entire dataset, and the Area Under the Curve (AUC), which measures how well the model separates the categories in the dataset.
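Canvas computes all of these metrics for you without code; purely for intuition, the following sketch shows how the same metrics can be reproduced with scikit-learn on a toy set of labels and scores.

```python
# Toy illustration of the metrics Canvas reports; values are made up.
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0]                   # actual labels
y_scores = [0.9, 0.2, 0.7, 0.4, 0.3, 0.6, 0.8, 0.1]   # model scores
y_pred   = [1 if s >= 0.5 else 0 for s in y_scores]   # thresholded predictions

print("F1:      ", f1_score(y_true, y_pred))          # balances precision and recall
print("Accuracy:", accuracy_score(y_true, y_pred))    # share of correct predictions
print("AUC:     ", roc_auc_score(y_true, y_scores))   # class separation across thresholds
```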
Model predictions
With Canvas, you can run real-time predictions on the trained model with interactive what-if analyses by analyzing the impact of different feature values on the model accuracy.
Furthermore, you can run batch predictions on any validation dataset as a whole. These predictions can be previewed and downloaded for use with downstream applications.
Sharing and collaboration
Canvas allows you to continue the ML journey by sharing your models with your data science teams for review, feedback, and updates. You can share your models with other users using Amazon SageMaker Studio, a fully integrated development environment (IDE) for ML. Studio users can review the model and, if needed, update data transformations, retrain the model, and share back the updated version of the model with Canvas users who can then use it to generate predictions.
In addition, data scientists can share models built outside of Amazon SageMaker with Canvas users, removing the heavy lifting to build a separate tool or user interface to share models between different teams. With the bring your own model (BYOM) approach, you can now use ML models built by your data science teams in other environments and generate predictions within minutes directly in Canvas. This seamless collaboration between business and technical teams helps democratize ML across the organization by bringing transparency to ML models and accelerating ML deployments. To learn more about sharing and collaboration between business and technical teams using Canvas, see New – Bring ML Models Built Anywhere into Amazon SageMaker Canvas and Generate Predictions.
Conclusion
Get started today with Canvas and take advantage of ML to achieve your business outcomes without writing a line of code. Learn more from the interactive tutorial or MOOC course on Coursera. Happy innovating!
About the author
Shyam Srinivasan is on the AWS low-code/no-code ML product team. He cares about making the world a better place through technology and loves being part of this journey. In his spare time, Shyam likes to run long distances, travel around the world, and experience new cultures with family and friends.
How the UNDP Independent Evaluation Office is using AWS AI/ML services to enhance the use of evaluation to support progress toward the Sustainable Development Goals
The United Nations (UN) was founded in 1945 by 51 original Member States committed to maintaining international peace and security, developing friendly relations among nations, and promoting social progress, better living standards, and human rights. The UN is currently made up of 193 Member States and has evolved over the years to keep pace with a rapidly changing world. The United Nations Development Programme (UNDP) is the UN’s development agency and operates in over 170 countries and territories. It plays a critical role in helping countries achieve the Sustainable Development Goals (SDGs), which are a global call to action to end poverty, protect the planet, and ensure all people enjoy peace and prosperity.
As a learning organization, the UNDP highly values the evaluation function. Each UNDP program unit commissions evaluations to assess the performance of its projects and programs. The Independent Evaluation Office (IEO) is a functionally independent office within the UNDP that supports the oversight and accountability functions of the Executive Board and management of the UNDP, UNCDF, and UNV. The core functions of the IEO are to conduct independent programmatic and thematic evaluations that are of strategic importance to the organization—like its support for the COVID-19 pandemic recovery.
In this post, we discuss how the IEO developed UNDP’s artificial intelligence and machine learning (ML) platform—named Artificial Intelligence for Development Analytics (AIDA)— in collaboration with AWS, UNDP’s Information and Technology Management Team (UNDP ITM), and the United Nations International Computing Centre (UNICC). AIDA is a web-based platform that allows program managers and evaluators to expand their evidence base by searching existing data in a smarter, more efficient, and innovative way to produce insights and lessons learned. By searching at the granular level of paragraphs, AIDA finds pieces of evidence that would not be found using conventional searches. The creation of AIDA aligns with the UNDP Strategic Plan 2022–2025 to use digitization and innovation for greater development impact.
The challenge
The IEO is the custodian of the UNDP Evaluation Resource Center (ERC). The ERC is a repository of over 6,000 evaluation reports that cover every aspect of the organization’s work, everywhere it has worked, since 1997. The findings and recommendations of the evaluation reports inform UNDP management, donor, and program staff to better design future interventions, take course-correction measures in their current programs, and make funding and policy decisions at every level.
Before AIDA, the process to extract evaluative evidence and generate lessons and insights was manual, resource-intensive, and time-consuming. Moreover, traditional search methods didn’t work well with unstructured data, therefore the evidence base was limited. To address this challenge, the IEO decided to use AI and ML to better mine the evaluation database for lessons and knowledge.
The AIDA team was mindful of the challenging task of extracting evidence from unstructured data such as evaluation reports. Evaluation reports are usually 80–100 pages long, are written in multiple languages, and contain findings, conclusions, and recommendations. Even though evaluations are guided by the UNDP Evaluation Guidelines, there is no standard written format, and the aforementioned sections may occur at different locations in a document, or some may be missing entirely. Therefore, accurately extracting evaluative evidence at the paragraph level and applying appropriate labels was a significant ML challenge.
Solution overview
The AIDA technical solution was developed by AWS Professional Services and the UNICC. The core technology platform was designed and developed by the AWS ProServe team. The UNICC was responsible for developing the AIDA web portal and human-in-the-loop interface. The AIDA platform was envisioned to provide a simple and highly accurate mechanism to search UNDP evaluation reports across various themes and export them for further analysis. AIDA’s architecture needed to address several requirements:
- Automate the extraction and labeling of evaluation data
- Process thousands of reports
- Allow the IEO to add new labels without calling on the expertise of data scientists and ML experts
To deliver the requirements, the components were designed with these tenets in mind:
- Technically and environmentally sustainable
- Cost conscious
- Extensible to allow for future expansion
The resulting solution can be broken down to three components, as shown in the following architecture diagram:
- Data ingestion and extraction
- Data classification
- Intelligent search
The following sections describe these components in detail.
Data ingestion and extraction
Evaluation reports are prepared and submitted by UNDP program units across the globe—there is no standard report layout template or format. The data ingestion and extraction component ingests and extracts content from these unstructured documents.
Amazon Textract is used to extract data from PDF documents. This solution uses the asynchronous StartDocumentTextDetection API to build the document processing workflow that handles Amazon Textract asynchronous invocation, raw response extraction, and persistence in Amazon Simple Storage Service (Amazon S3). The solution adds an Amazon Textract postprocessing component to handle paragraph-based text extraction. The postprocessing component uses bounding box metadata from Amazon Textract for intelligent data extraction, and is capable of extracting data from complex, multi-format, multi-page PDF files with varying headers, footers, footnotes, and multi-column layouts. The Apache Tika open-source Python library is used for data extraction from Microsoft Word documents.
The following diagram illustrates this workflow, orchestrated with AWS Step Functions.
This workflow has the following steps:
- TextractCompleted is the first step, ensuring documents are not processed multiple times with Amazon Textract. This avoids unnecessary processing time and cost from duplicate processing.
- TextractAsyncCallTask submits the documents to be processed by Amazon Textract using the asynchronous StartDocumentTextDetection API. This API processes the documents and stores the JSON output files in Amazon S3 for postprocessing.
- TextractAsyncSNSListener is an AWS Lambda function that handles the Amazon Textract job completion event and returns the metadata to the workflow for further processing.
- TextractPostProcessorTask is an AWS Lambda function that uses the metadata and processes the JSON output files produced by Amazon Textract to extract meaningful paragraphs.
- TextractQAValidationTask is an AWS Lambda function that performs simple text validations on the extracted paragraphs and collects metrics such as the number of complete or incomplete paragraphs. These metrics are used to measure the quality of the text extractions.
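For readers who prefer to see the underlying API, the following is a hedged sketch of the asynchronous Textract invocation that TextractAsyncCallTask performs. Bucket, topic, and role names are placeholders; in AIDA this call is wrapped by Step Functions and the CDK construct mentioned below rather than made directly.

```python
# Hedged sketch of the asynchronous Textract call at the heart of this workflow.
# All resource names are placeholders.
import boto3

textract = boto3.client("textract")

response = textract.start_document_text_detection(
    DocumentLocation={
        "S3Object": {"Bucket": "aida-evaluation-reports", "Name": "reports/evaluation-2021.pdf"}
    },
    OutputConfig={"S3Bucket": "aida-textract-output", "S3Prefix": "raw-json/"},
    NotificationChannel={
        "SNSTopicArn": "arn:aws:sns:us-east-1:123456789012:textract-job-complete",
        "RoleArn": "arn:aws:iam::123456789012:role/TextractSNSPublishRole",
    },
)
print("Started Textract job:", response["JobId"])
# A listener (TextractAsyncSNSListener in the workflow above) handles the
# completion event and triggers paragraph-level postprocessing.
```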
Please refer to TextractAsync, an IDP CDK construct that abstracts the invocation of the Amazon Textract Async API, handling Amazon Simple Notification Service (Amazon SNS) messages and workflow processing to accelerate your development.
Data classification
The data classification component identifies the critical parts of the evaluation reports, and further classifies them into a taxonomy of categories organized around the various themes of the Sustainable Development Goals. We have built one multi-class and two multi-label classification models with Amazon Comprehend.
Extracted paragraphs are processed using Step Functions, which integrates with Amazon Comprehend to perform classification in batch mode. Paragraphs are classified into findings, recommendations, and conclusions (FRCs) using a custom multi-class model, which helps identify the critical sections of the evaluation reports. For the identified critical sections, we identify the categories (thematic and non-thematic) using a custom multi-label classification model. Thematic and non-thematic classification is used to identify and align the evaluation reports with Sustainable Development Goals like no poverty (SDG-1), gender equality (SDG-5), clean water and sanitation (SDG-6), and affordable and clean energy (SDG-7).
The following figure depicts the Step Functions workflow to process text classification.
To reduce cost on the classification process, we have created the workflow to submit Amazon Comprehend jobs in batch mode. The workflow waits for all the Amazon Comprehend jobs to complete and performs data refinement by aggregating the text extraction and Amazon Comprehend results to filter the paragraphs that aren’t identified as FRC, and aggregates the thematic and non-thematic classification categories by paragraphs.
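The following sketch shows how a batch classification job can be submitted to an Amazon Comprehend custom classifier; the classifier ARN, S3 paths, and IAM role are illustrative placeholders.

```python
# Illustrative sketch of submitting extracted paragraphs to an Amazon Comprehend
# custom classifier as an asynchronous (batch) job. Resource names are placeholders.
import boto3

comprehend = boto3.client("comprehend")

response = comprehend.start_document_classification_job(
    JobName="frc-classification-batch-001",
    DocumentClassifierArn=(
        "arn:aws:comprehend:us-east-1:123456789012:document-classifier/frc-classifier"
    ),
    InputDataConfig={
        "S3Uri": "s3://aida-paragraphs/batch-001/",   # one paragraph per line
        "InputFormat": "ONE_DOC_PER_LINE",
    },
    OutputDataConfig={"S3Uri": "s3://aida-classification-results/batch-001/"},
    DataAccessRoleArn="arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
)
print("Comprehend classification job:", response["JobId"])
```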
Extracted paragraphs with their classification categories are stored in Amazon RDS for PostgreSQL. This is a staging database to preserve all the extraction and classification results. We also use this database to further enrich the results to aggregate the themes of the paragraphs, and filter paragraphs that are not FRC. Enriched content is fed to Amazon Kendra.
For the first release, we had over 2 million paragraphs extracted and classified. With the help of FRC custom classification, we were able to accurately narrow down the paragraphs to over 700,000 from 2 million. The Amazon Comprehend custom classification model helped accurately present the relevant content and substantially reduced the cost on Amazon Kendra indexes.
Amazon DynamoDB is used for storing document metadata and keeping track of the document processing status across all key components. Metadata tracking is particularly useful to handle errors and retries.
Intelligent search
The intelligent search capability allows the users of the AIDA platform to intuitively search for evaluative evidence on UNDP program interventions contained within all the evaluation reports. The following diagram illustrates this architecture.
Amazon Kendra is used for intelligent searches. Enriched content from Amazon RDS for PostgreSQL is ingested into Amazon Kendra for indexing. The web portal layer uses the intelligent search capability of Amazon Kendra to intuitively search the indexed content. Labelers use the human-in-the-loop user interface to update the text classification generated by Amazon Comprehend for any extracted paragraphs. Changes to the classification are immediately reflected in the web portal, and human-updated feedback is extracted and used for Amazon Comprehend model training to continuously improve the custom classification model.
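As an illustration of the search layer, the following sketch queries an Amazon Kendra index for evaluative evidence. The index ID and the custom attribute used for filtering are assumptions, not AIDA’s actual schema.

```python
# Hedged sketch of querying Amazon Kendra for evidence paragraphs.
# Index ID and the "sdg_theme" attribute are hypothetical.
import boto3

kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="11111111-2222-3333-4444-555555555555",   # hypothetical index ID
    QueryText="gender equality outcomes in water and sanitation programmes",
    AttributeFilter={                                  # assumed custom attribute
        "EqualsTo": {"Key": "sdg_theme", "Value": {"StringValue": "SDG-5"}}
    },
)

for item in response["ResultItems"]:
    print(item["DocumentTitle"]["Text"], "-", item["DocumentExcerpt"]["Text"][:120])
```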
AIDA incorporates a human-in-the-loop functionality, which boosts AIDA’s capacity to correct classifications (FRC, thematic, non-thematic) and data extractions errors. Labels, updated by the humans performing the human-in-the-loop function, are augmented to the training dataset and used to retrain the Amazon Comprehend models to continuously improve classification accuracy.
Conclusion
In this post, we discussed how evaluators, through the IEO’s AIDA platform, are using Amazon AI and ML services like Amazon Textract, Amazon Comprehend, and Amazon Kendra to build a custom document processing system that identifies, extracts, and classifies data from unstructured documents. Using Amazon Textract for PDF text extraction improved paragraph-level evidence extraction from under 60% to over 80% accuracy. Additionally, multi-label classification improved from under 30% to 90% accuracy by retraining models in Amazon Comprehend with improved training datasets.
This platform enabled evaluators to intuitively search relevant content quickly and accurately. Transforming unstructured data to semi-structured data empowers the UNDP and other UN entities to make informed decisions based on a corpus of hundreds or thousands of data points about what works, what doesn’t work, and how to improve the impact of UNDP operations for the people it serves.
For more information about the intelligent document processing reference architecture, refer to Intelligent Document Processing. Please share your thoughts with us in the comments section.
About the Authors
Oscar A. Garcia is the Director of the Independent Evaluation Office (IEO) of the United Nations Development Program (UNDP). As Director, he provides strategic direction, thought leadership, and credible evaluations to advance UNDP work in helping countries progress towards national SDG achievement. Oscar also currently serves as the Chairperson of the United Nations Evaluation Group (UNEG). He has more than 25 years of experience in areas of strategic planning, evaluation, and results-based management for sustainable development. Prior to joining the IEO as Director in 2020, he served as Director of IFAD’s Independent Office of Evaluation (IOE), and Head of Advisory Services for Green Economy, UNEP. Oscar has authored books and articles on development evaluation, including one on information and communication technology for evaluation. He is an economist with a master’s degree in Organizational Change Management, New School University (NY), and an MBA from Bolivian Catholic University, in association with the Harvard Institute for International Development.
Sathya Balakrishnan is a Sr. Customer Delivery Architect in the Professional Services team at AWS, specializing in data and ML solutions. He works with US federal financial clients. He is passionate about building pragmatic solutions to solve customers’ business problems. In his spare time, he enjoys watching movies and hiking with his family.
Thuan Tran is a Senior Solutions Architect in the World Wide Public Sector supporting the United Nations. He is passionate about using AWS technology to help customers conceptualize the art of the possible. In his spare time, he enjoys surfing, mountain biking, axe throwing, and spending time with family and friends.
Prince Mallari is an NLP Data Scientist in the Professional Services team at AWS, specializing in applications of NLP for public sector customers. He is passionate about using ML as a tool to allow customers to be more productive. In his spare time, he enjoys playing video games and developing one with his friends.
Enable predictive maintenance for line of business users with Amazon Lookout for Equipment
Predictive maintenance is a data-driven maintenance strategy for monitoring industrial assets in order to detect anomalies in equipment operations and health that could lead to equipment failures. Through proactive monitoring of an asset’s condition, maintenance personnel can be alerted before issues occur, thereby avoiding costly unplanned downtime, which in turn leads to an increase in Overall Equipment Effectiveness (OEE).
However, building the necessary machine learning (ML) models for predictive maintenance is complex and time consuming. It requires several steps, including preprocessing of the data, building, training, evaluating, and then fine-tuning multiple ML models that can reliably predict anomalies in your asset’s data. The finished ML models then need to be deployed and provided with live data for online predictions (inferencing). Scaling this process to multiple assets of various types and operating profiles is often too resource intensive to make broader adoption of predictive maintenance viable.
With Amazon Lookout for Equipment, you can seamlessly analyze sensor data for your industrial equipment to detect abnormal machine behavior—with no ML experience required.
When customers implement predictive maintenance use cases with Lookout for Equipment, they typically choose between three options to deliver the project: build it themselves, work with an AWS Partner, or use AWS Professional Services. Before committing to such projects, decision-makers such as plant managers, reliability or maintenance managers, and line leaders want to see evidence of the potential value that predictive maintenance can uncover in their lines of business. Such an evaluation is usually performed as part of a proof of concept (POC) and is the basis for a business case.
This post is directed to both technical and non-technical users: it provides an effective approach for evaluating Lookout for Equipment with your own data, allowing you to gauge the business value it provides your predictive maintenance activities.
Solution overview
In this post, we guide you through the steps to ingest a dataset in Lookout for Equipment, review the quality of the sensor data, train a model, and evaluate the model. Completing these steps will help derive insights into the health of your equipment.
Prerequisites
All you need to get started is an AWS account and a history of sensor data for assets that can benefit from a predictive maintenance approach. The sensor data should be stored as CSV files in an Amazon Simple Storage Service (Amazon S3) bucket from your account. Your IT team should be able to meet these prerequisites by referring to Formatting your data. To keep things simple, it’s best to store all the sensor data in one CSV file where the rows are timestamps and the columns are individual sensors (up to 300).
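If your sensor history is exported as one long table of readings, the following minimal sketch shows one way to reshape it into that layout and upload it to Amazon S3. It assumes a pandas-readable export with Timestamp, SensorName, and Value columns; the file, column, and bucket names are hypothetical.

```python
# Minimal sketch: pivot a long-format historian export into the wide CSV layout
# Lookout for Equipment expects (rows = timestamps, columns = sensors), then upload it.
import boto3
import pandas as pd

df = pd.read_csv("historian_export.csv", parse_dates=["Timestamp"])  # hypothetical export

# One row per timestamp, one column per sensor
wide = df.pivot_table(index="Timestamp", columns="SensorName", values="Value")
wide.to_csv("pump-001.csv", date_format="%Y-%m-%dT%H:%M:%S.%f")

boto3.client("s3").upload_file("pump-001.csv", "my-l4e-bucket", "pump-001/pump-001.csv")
```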
Once you have your dataset available on Amazon S3, you can follow along with the rest of this post.
Add a dataset
Lookout for Equipment uses projects to organize the resources for evaluating pieces of industrial equipment. To create a new project, complete the following steps:
- On the Lookout for Equipment console, choose Create project.
- Enter a project name and choose Create project.
After the project is created, you can ingest a dataset that will be used to train and evaluate a model for anomaly detection.
- On the project page, choose Add dataset.
- For S3 location, enter the S3 location (excluding the file name) of your data.
- For Schema detection method, select By filename, which assumes that all sensor data for an asset is contained in a single CSV file at the specified S3 location.
- Keep the other settings as default and choose Start ingestion to start the ingestion process.
Ingestion may take around 10–20 minutes to complete. In the background, Lookout for Equipment performs the following tasks:
- It detects the structure of the data, such as sensor names and data types.
- The timestamps between sensors are aligned and missing values are filled (using the latest known value).
- Duplicate timestamps are removed (only the last value for each timestamp is kept).
- Lookout for Equipment uses multiple types of algorithms for building the ML anomaly detection model. During the ingestion phase, it prepares the data so it can be used for training those different algorithms.
- It analyzes the measurement values and grades each sensor as high, medium, or low quality.
- When the dataset ingestion is complete, inspect it by choosing View dataset under Step 2 of the project page.
When creating an anomaly detection model, selecting the best sensors (the ones containing the highest data quality) is often critical to training models that deliver actionable insights. The Dataset details section shows the distribution of sensor gradings (between high, medium, and low), while the table displays information on each sensor separately (including the sensor name, date range, and grading for the sensor data). With this detailed report, you can make an informed decision about which sensors you will use to train your models. If a large proportion of sensors in your dataset are graded as medium or low, there might be a data issue needing investigation. If necessary, you can reupload the data file to Amazon S3 and ingest the data again by choosing Replace dataset.
By choosing the sensor grade entry in the details table, you can review details of the validation errors that resulted in a given grade. Displaying and addressing these details helps ensure the information provided to the model is of high quality. For example, you might see that a signal has unexpectedly large chunks of missing values. Is this a data transfer issue, or was the sensor malfunctioning? Time to dive deeper into your data!
To learn more about the different types of sensor issues Lookout for Equipment addresses when grading your sensors, refer to Evaluating sensor grades. Developers can also extract these insights using the ListSensorStatistics API.
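For example, the following hedged sketch retrieves sensor statistics for a dataset with boto3; the dataset name is a placeholder, and only the missing-value percentage is printed here.

```python
# Hedged sketch of pulling sensor-grading insights programmatically with the
# ListSensorStatistics API. The dataset name is a placeholder.
import boto3

l4e = boto3.client("lookoutequipment")

next_token = None
while True:
    kwargs = {"DatasetName": "pump-001-dataset", "MaxResults": 50}
    if next_token:
        kwargs["NextToken"] = next_token
    response = l4e.list_sensor_statistics(**kwargs)

    for stats in response["SensorStatisticsSummaries"]:
        missing = stats.get("MissingValues", {}).get("Percentage", 0.0)
        print(f"{stats['SensorName']}: {missing:.1f}% missing values")

    next_token = response.get("NextToken")
    if not next_token:
        break
```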
When you’re happy with your dataset, you can move to the next step of training a model for predicting anomalies.
Train a model
Lookout for Equipment allows the training of models for specific sensors. This gives you the flexibility to experiment with different sensor combinations or exclude sensors with a low grading. Complete the following steps:
- In the Details by sensor section on the dataset page, select the sensors to include in your model and choose Create model.
- For Model name, enter a model name, then choose Next.
- In the Training and evaluation settings section, configure the model input data.
To effectively train models, the data needs to be split into separate training and evaluation sets. You can define date ranges for this split in this section, along with a sampling rate for the sensors. How do you choose this split? Consider the following:
- Lookout for Equipment expects at least 3 months of data in the training range, but the optimal amount of data is driven by your use case. More data may be necessary to account for any type of seasonality or operational cycles your production goes through.
- There are no constraints on the evaluation range. However, we recommend setting up an evaluation range that includes known anomalies. This way, you can test if Lookout for Equipment is able to capture any events of interest leading to these anomalies.
By specifying the sample rate, Lookout for Equipment effectively downsamples the sensor data, which can significantly reduce training time. The ideal sampling rate depends on the types of anomalies you suspect in your data: for slow-trending anomalies, a sampling rate between 1–10 minutes is usually a good starting point. Choosing a shorter sampling interval (more data points) results in longer training times, whereas a longer interval (fewer data points) shortens training time, at the risk of cutting out leading indicators in your data that are relevant to predicting the anomalies.
To train only on relevant portions of your data where the industrial equipment was in operation, you can perform off-time detection by selecting a sensor and defining a threshold that indicates whether the equipment was in an on or off state. This is critical because it allows Lookout for Equipment to filter out time periods when the machine is off, so the model learns only from relevant operational states.
- Specify your off-time detection, then choose Next.
Optionally, you can provide data labels, which indicate maintenance periods or known equipment failure times. If you have such data available, you can create a CSV file with the data in a documented format, upload it to Amazon S3, and use it for model training. Providing labels can improve the accuracy of the trained model by telling Lookout for Equipment where it should expect to find known anomalies.
- Specify any data labels, then choose Next.
- Review your settings in the final step. If everything looks fine, you can start the training.
Depending on the size of your dataset, the number of sensors, and the sampling rate, training the model may take a few moments or up to a few hours. For example, if you use 1 year of data at a 5-minute sampling rate with 100 sensors and no labels, training a model will take less than 15 minutes. On the other hand, if your data contains a large number of labels, training time could increase significantly. In such a situation, you can decrease training time by merging adjacent label periods to decrease their number.
You have just trained your first anomaly detection model without any ML knowledge! Now let’s look at the insights you can get from a trained model.
Evaluate a trained model
When model training has finished, you can view the model’s details by choosing View models on the project page, and then choosing the model’s name.
In addition to general information like name, status, and training time, the model page summarizes model performance data like the number of labeled events detected (assuming you provided labels), the average forewarning time, and the number of anomalous equipment events detected outside of the label ranges. The following screenshot shows an example. For better visibility, the detected events are visualized (the red bars on the top of the ribbon) along with the labeled events (the blue bars at the bottom of the ribbon).
You can select detected events by choosing the red areas representing anomalies in the timeline view to get additional information. This includes:
- The event start and end times along with its duration.
- A bar chart with the sensors the model believes are most relevant to why an anomaly occurred. The percentage scores represent each sensor’s calculated contribution to the detected anomaly.
These insights allow you to work with your process or reliability engineers to do further root cause evaluation of events and ultimately optimize maintenance activities, reduce unplanned downtimes, and identify suboptimal operating conditions.
To support predictive maintenance with real-time insights (inference), Lookout for Equipment supports live evaluation of online data via inference schedules. This requires that sensor data is uploaded to Amazon S3 periodically, and then Lookout for Equipment performs inference on the data with the trained model, providing real-time anomaly scoring. The inference results, including a history of detected anomalous events, can be viewed on the Lookout for Equipment console.
The results are also written to files in Amazon S3, allowing integration with other systems, for example a computerized maintenance management system (CMMS), or to notify operations and maintenance personnel in real time.
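For teams automating this setup, the following sketch shows how an inference schedule can be created with boto3; the model, scheduler, bucket, and role names are placeholders for your own resources.

```python
# Illustrative sketch of configuring a real-time inference schedule with boto3.
# All resource names are placeholders.
import boto3

l4e = boto3.client("lookoutequipment")

l4e.create_inference_scheduler(
    ModelName="pump-001-model",
    InferenceSchedulerName="pump-001-hourly-scheduler",
    DataUploadFrequency="PT1H",           # new sensor CSVs land in S3 every hour
    DataDelayOffsetInMinutes=5,           # tolerate small upload delays
    DataInputConfiguration={
        "S3InputConfiguration": {"Bucket": "my-l4e-bucket", "Prefix": "inference/input/"}
    },
    DataOutputConfiguration={
        "S3OutputConfiguration": {"Bucket": "my-l4e-bucket", "Prefix": "inference/output/"}
    },
    RoleArn="arn:aws:iam::123456789012:role/L4EInferenceRole",
)
```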
As you increase your Lookout for Equipment adoption, you’ll need to manage a larger number of models and inference schedules. To make this process easier, the Inference schedules page lists all schedulers currently configured for a project in a single view.
Clean up
When you’re finished evaluating Lookout for Equipment, we recommend cleaning up any resources. You can delete the Lookout for Equipment project along with the dataset and any models created by selecting the project, choosing Delete, and confirming the action.
Summary
In this post, we walked through the steps of ingesting a dataset in Lookout for Equipment, training a model on it, and evaluating its performance to understand the value it can uncover for individual assets. Specifically, we explored how Lookout for Equipment can inform predictive maintenance processes that result in reduced unplanned downtime and higher OEE.
If you followed along with your own data and are excited about the prospects of using Lookout for Equipment, the next step is to start a pilot project, with the support of your IT organization, your key partners, or our AWS Professional Services teams. This pilot should target a limited number of industrial equipment and then scale up to eventually include all assets in scope for predictive maintenance.
About the authors
Johann Füchsl is a Solutions Architect with Amazon Web Services. He guides enterprise customers in the manufacturing industry in implementing AI/ML use cases, designing modern data architectures, and building cloud-native solutions that deliver tangible business value. Johann has a background in mathematics and quantitative modeling, which he combines with 10 years of experience in IT. Outside of work, he enjoys spending time with his family and being out in nature.
Michaël Hoarau is an industrial AI/ML Specialist Solution Architect at AWS who alternates between data scientist and machine learning architect, depending on the moment. He is passionate about bringing the power of AI/ML to the shop floors of his industrial customers and has worked on a wide range of ML use cases, ranging from anomaly detection to predictive product quality or manufacturing optimization. When not helping customers develop the next best machine learning experiences, he enjoys observing the stars, traveling, or playing the piano.