Personalize cross-channel customer experiences with Amazon SageMaker, Amazon Personalize, and Twilio Segment
Today, customers interact with brands over an increasingly large digital and offline footprint, generating a wealth of interaction data known as behavioral data. As a result, marketers and customer experience teams must work with multiple overlapping tools to engage and target those customers across touchpoints. This increases complexity, creates multiple views of each customer, and makes it more challenging to provide an individual experience with relevant content, messaging, and product suggestions to each customer. In response, marketing teams use customer data platforms (CDPs) and cross-channel campaign management tools (CCCMs) to simplify the process of consolidating multiple views of their customers. These technologies provide non-technical users with an accelerated path to enable cross-channel targeting, engagement, and personalization, while reducing marketing teams’ dependency on technical teams and specialist skills to engage with customers.
Despite this, marketers find themselves with blind spots in customer activity when these technologies aren’t integrated with systems from other parts of the business. This is particularly true with non-digital channels, for example, in-store transactions or customer feedback from customer support. Marketing teams and their customer experience counterparts also struggle to integrate predictive capabilities developed by data scientists into their cross-channel campaigns or customer touchpoints. As a result, customers receive messaging and recommendations that aren’t relevant or are inconsistent with their expectations.
This post outlines how cross-functional teams can work together to address these challenges using an omnichannel personalization use case. We use a fictional retail scenario to illustrate how those teams interlock to provide a personalized experience at various points along the customer journey. We use Twilio Segment in our scenario, a customer data platform built on AWS. There are more than 12 CDPs in the market to choose from, many of which are also AWS partners, but we use Segment in this post because they provide a self-serve free tier that allows you to explore and experiment. We explain how to combine the output from Segment with in-store sales data, product metadata, and inventory information. Building on this, we explain how to integrate Segment with Amazon Personalize to power real-time recommendations. We also describe how we create scores for churn and repeat-purchase propensity using Amazon SageMaker. Lastly, we explore how to target new and existing customers in three ways:
- With banners on third-party websites, also known as display advertising, using a propensity-to-buy score to attract similar customers.
- On web and mobile channels presented with personalized recommendations powered by Amazon Personalize, which uses machine learning (ML) algorithms to create content recommendations.
- With personalized messaging using Amazon Pinpoint, an outbound and inbound marketing communications service. These messages target disengaged customers and those showing a high propensity to churn.
Solution overview
Imagine you are a product owner leading the charge on cross-channel customer experience for a retail company. The company has a diverse set of online and offline channels, but sees digital channels as its primary opportunity for growth. They want to grow the size and value of their customer base with the following methods:
- Attract new, highly qualified customers who are more likely to convert
- Increase the average order value of all their customers
- Re-attract disengaged customers to return and hopefully make repeat purchases
To ensure those customers receive a consistent experience across channels, you as a product owner need to work with teams such as digital marketing, front-end development, mobile development, campaign delivery, and creative agencies. To ensure customers receive relevant recommendations, you also need to work with data engineering and data science teams. Each of these teams is responsible for interacting with or developing features within the architecture illustrated in the following diagram.
The solution workflow contains the following high-level steps:
- Collect data from multiple sources to store in Amazon Simple Storage Service (Amazon S3).
- Use AWS Step Functions to orchestrate data onboarding and feature engineering.
- Build segments and predictions using SageMaker.
- Use propensity scores for display targeting.
- Send personalized messaging using Amazon Pinpoint.
- Integrate real-time personalized suggestions using Amazon Personalize.
In the following sections, we walk through each step, explain the activities of each team at a high level, provide references to related resources, and share hands-on labs that provide more detailed guidance.
Collect data from multiple sources
Digital marketing, front-end, and mobile development teams can configure Segment to capture and integrate web and mobile analytics, digital media performance, and online sales sources using Segment Connections. Segment Personas allows digital marketing teams to resolve the identity of users by stitching together interactions across these sources into a single user profile with one persistent identifier. These profiles, along with calculated metrics called Computed Traits and raw events, can be exported to Amazon S3. The following screenshot shows how identity rules are set up in Segment Personas.
In parallel, engineering teams can use AWS Database Migration Service (AWS DMS) to replicate in-store sales, product metadata, and inventory data sources from databases such as Microsoft SQL Server or Oracle and store the output in Amazon S3.
Data onboarding and feature engineering
After data is collected and stored in the landing zone on Amazon S3, data engineers can use components from the serverless data lake framework (SDLF) to accelerate data onboarding and build out the foundational structure of a data lake. With SDLF, engineers can automate the preparation of user-item data used to train Amazon Personalize or create a single view of customer behavior by joining online and offline behavioral data and sales data, using attributes such as customer ID or email address as a common identifier.
Step Functions is the key orchestrator driving these transformation jobs within SDLF. You can use Step Functions to build and orchestrate both scheduled and event-driven data workflows. The engineering team can orchestrate the tasks of other AWS services within a data pipeline. The outputs from this process are stored in a trusted zone on Amazon S3 to use for ML development. For more information on implementing the serverless data lake framework, see AWS serverless data analytics pipeline reference architecture.
Build segments and predictions
The process of building segments and predictions can be broken down into three steps: access the environment, build propensity models, and create output files.
Access the environment
After the engineering team has prepared and transformed the ML development data, the data science team can build propensity models using SageMaker. First, they build, train, and test an initial set of ML models. This allows them to see early results, decide which direction to go next, and reproduce experiments.
The data science team needs an active Amazon SageMaker Studio instance, an integrated development environment (IDE) for rapid ML experimentation. It unifies all the key features of SageMaker and offers an environment to manage the end-to-end ML pipelines. It removes complexity and reduces the time it takes to build ML models and deploy them into production. Developers can use SageMaker Studio notebooks, which are one-click Jupyter notebooks that you can quickly spin up to enable the entire ML workflow from data preparation to model deployment. For more information on SageMaker for ML, see Amazon SageMaker for Data Science.
Build the propensity models
To estimate churn and repeat-purchase propensity, the customer experience and data science teams should agree on the known driving factors for either outcome.
The data science team validates these known factors while also discovering unknown factors through the modeling process. An example of a factor driving churn can be the number of returns in the last 3 months. An example of a factor driving repurchases can be the number of items saved on the website or mobile app.
For our use case, we assume that the digital marketing team wants to create a target audience using lookalike modeling to find customers most likely to repurchase in the next month. We also assume that the campaign team wants to send an email offer to customers who will likely end their subscription in the next 3 months to encourage them to renew their subscription.
The data science team can start by exploring the data (features) and summarizing the main characteristics of the dataset to understand its key behaviors. They can then shuffle and split the data into training and test sets and upload these datasets to the trusted zone. An algorithm such as the XGBoost classifier can be used to train the model; its built-in feature importance helps identify the best set of candidate features for determining the propensity scores (or predicted values).
You can then tune the model by optimizing its hyperparameters within the ranges recommended for the XGBoost framework. Test data is used to evaluate the model's performance and estimate how well it generalizes to new data. For more information on evaluation metrics, see Tune an XGBoost Model.
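As an illustration, the following is a minimal sketch of this step, assuming the prepared features are available as a Pandas DataFrame with a binary repurchase label; the file name and column names are hypothetical, not taken from this post.

import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the prepared features exported from the trusted zone (file name is hypothetical)
df = pd.read_parquet("customer_features.parquet")

X = df.drop(columns=["customer_id", "repurchase_label"])
y = df["repurchase_label"]

# Shuffle and split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# Train an XGBoost classifier
model = xgb.XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1)
model.fit(X_train, y_train)

# Propensity scores are the predicted probabilities of the positive class
scores = model.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, scores))

# Feature importance helps explain which attributes drive the scores
print(sorted(zip(model.feature_importances_, X.columns), reverse=True)[:10])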
Lastly, the propensity scores are calculated for each customer and stored in the trusted S3 zone to be accessed, reviewed, and validated by the marketing and campaign teams. This process also provides a prioritized evaluation of feature importance, which helps to explain how the scores were produced.
Create the output files
After the data science team has completed the model training and tuning, they work with the engineering team to deploy the best model to production. We can use SageMaker batch transform to run predictions as new data is collected and generate scores for each customer. The engineering team can orchestrate and automate the ML workflow using Amazon SageMaker Pipelines, a purpose-built continuous integration and continuous delivery (CI/CD) service for ML, which offers an environment to manage the end-to-end ML workflow. It saves time and reduces errors typically caused by manual orchestration.
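As a rough sketch (the model artifact path, container version, and bucket names are assumptions, not from this post), a batch transform job can be launched with the SageMaker Python SDK as follows:

import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Wrap the trained model artifact in the built-in XGBoost container (paths are hypothetical)
container = image_uris.retrieve("xgboost", session.boto_region_name, version="1.5-1")
model = Model(
    image_uri=container,
    model_data="s3://trusted-zone/models/propensity/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Score customers in batch as new feature data lands in Amazon S3
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://trusted-zone/propensity-scores/",
    accept="text/csv",
)
transformer.transform(
    data="s3://trusted-zone/customer-features/latest/",
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()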
The output of the ML workflow is imported by Amazon Pinpoint for sending personalized messaging and exported to Segment to use when targeting on display channels. The following illustration provides a visual overview of the ML workflow.
The following screenshot shows an example output file.
Use propensity scores for display targeting
The engineering and digital marketing teams can create the reverse data flow back to Segment to increase reach. This uses a combination of AWS Lambda and Amazon S3. Every time a new output file is generated by the ML workflow and saved in the trusted S3 bucket, a Lambda function is invoked that triggers an export to Segment. Digital marketing can then use regularly updated propensity scores as customer attributes to build and export audiences to Segment destinations (see the following screenshot). For more information on the file structure of the Segment export, see Amazon S3 from Lambda.
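A minimal sketch of such a Lambda function is shown below; it assumes the scores are written as CSV and forwarded to Segment with the analytics-python library, and the write key, file layout, and trait names are all illustrative assumptions.

import csv
import io
import boto3
import analytics  # Segment's analytics-python library

analytics.write_key = "SEGMENT_WRITE_KEY"  # placeholder; store securely, for example in AWS Secrets Manager
s3 = boto3.client("s3")

def handler(event, context):
    # Invoked by the S3 put event for a new scores file in the trusted bucket
    record = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=record["bucket"]["name"], Key=record["object"]["key"])
    rows = csv.DictReader(io.StringIO(obj["Body"].read().decode("utf-8")))

    for row in rows:
        # Attach the latest propensity scores to the customer's Segment profile as traits
        analytics.identify(
            user_id=row["customer_id"],
            traits={
                "repurchase_propensity": float(row["repurchase_score"]),
                "churn_propensity": float(row["churn_score"]),
            },
        )
    analytics.flush()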
When the data is available in Segment, digital marketing can see the propensity scores developed in SageMaker as attributes when they create customer segments. They can generate lookalike audiences to target them with digital advertising. To create a feedback loop, digital marketing must ensure that impressions, clicks, and campaigns are being ingested back into Segment to optimize performance.
Send personalized outbound messaging
The campaign delivery team can implement and deploy AI-driven win-back campaigns to re-engage customers at risk of churn. These campaigns use the list of customer contacts generated in SageMaker as segments while integrating with Amazon Personalize to present personalized product recommendations. See the following diagram.
The digital marketing team can experiment using Amazon Pinpoint journeys to split win-back segments into subgroups and reserve a percentage of users as a control group that isn’t exposed to the campaign. This allows them to measure the campaign’s impact and creates a feedback loop.
Integrate real-time recommendations
To personalize inbound channels, the digital marketing and engineering teams work together to integrate and configure Amazon Personalize to provide product recommendations at different points in the customer’s journey. For example, they can deploy a similar item recommender on product detail pages to suggest complementary items (see the following diagram). Additionally, they can deploy a content-based filtering recommender in the checkout journey to remind customers of products they would typically buy before completing their order.
First, the engineering team needs to create RESTful microservices that respond to web, mobile, and other channel application requests with product recommendations. These microservices call Amazon Personalize to get recommendations, resolve product IDs into more meaningful information like name and price, check inventory stock levels, and determine which Amazon Personalize campaign endpoint to query based on the user’s current page or screen.
The front-end and mobile development teams need to add tracking events for specific customer actions to their applications. They can then use Segment to send those events directly to Amazon Personalize in real time. These tracking events are the same as the user-item data we extracted earlier. They allow Amazon Personalize solutions to refine recommendations based on live customer interactions. It’s essential to capture impressions, product views, cart additions, and purchases because these events create a feedback loop for the recommenders. Lambda is an intermediary, collecting user events from Segment and sending them to Amazon Personalize. Lambda also facilitates the reverse data exchange, relaying updated recommendations for the user back to Segment. For more information on configuring real-time recommendations with Segment and Amazon Personalize, see the Segment Real-time data and Amazon Personalize Workshop.
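A minimal sketch of such a microservice call is shown below; the campaign ARN is a placeholder, and the product lookup and inventory check are stand-in functions rather than anything defined in this post.

import boto3

personalize_runtime = boto3.client("personalize-runtime")
CAMPAIGN_ARN = "arn:aws:personalize:eu-west-1:123456789012:campaign/similar-items"  # placeholder

def get_product_recommendations(user_id, num_results=10):
    # Query the Amazon Personalize campaign for this user
    response = personalize_runtime.get_recommendations(
        campaignArn=CAMPAIGN_ARN,
        userId=user_id,
        numResults=num_results,
    )

    products = []
    for item in response["itemList"]:
        product = lookup_product(item["itemId"])  # resolve the item ID to name, price, and image (stand-in)
        if is_in_stock(product):                  # filter out products with no inventory (stand-in)
            products.append(product)
    return products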
Conclusion
This post described how to deliver an omnichannel customer experience using a combination of Segment customer data platform and AWS services such as Amazon SageMaker, Amazon Personalize, and Amazon Pinpoint. We explored the role cross-functional teams play at each stage in the customer journey and in the data value chain. The architecture and approach discussed are focused on a retail environment, but you can apply it to other verticals such as financial services or media and entertainment. If you’re interested in trying out some of what we discussed, check out the Retail Demo Store, where you can find hands-on workshops that include Segment and other AWS partners.
Additional references
For additional information, see the following resources:
- Omnichannel personalization with Segment
- Omnichannel personalization with Amazon Personalize
- Build, tune, and deploy an end-to-end churn prediction model using Amazon SageMaker Pipelines
- Automate feature engineering pipelines with Amazon SageMaker
- Monitor model quality
About Segment
Segment is an AWS Advanced Technology Partner and holder of the following AWS Independent Software Vendor (ISV) competencies: Data & Analytics, Digital Customer Experience, Retail, and Machine Learning. Brands such as Atlassian and Digital Ocean use real-time analytics solutions powered by Segment.
About the Authors
Dwayne Browne is a Principal Analytics Platform Specialist at AWS based in London. He is part of the Data-Driven Everything (D2E) customer program, where he helps customers become more data-driven and customer experience focused. He has a background in digital analytics, personalization, and marketing automation. In his spare time, Dwayne enjoys indoor climbing and exploring nature.
Hara Gavriliadi is a Senior Data Analytics Strategist at AWS Professional Services based in London. She helps customers transform their business using data, analytics, and machine learning. She specializes in customer analytics and data strategy. Hara loves countryside walks and enjoys discovering local bookstores and yoga studios in her free time.
Kenny Rajan is a Senior Partner Solution Architect. Kenny helps customers get the most from AWS and its partners by demonstrating how AWS partners and AWS services work better together. He’s interested in machine learning, data, ERP implementation, and voice-based solutions on the cloud. Outside of work, Kenny enjoys reading books and helping with charity activities.
Automated, scalable, and cost-effective ML on AWS: Detecting invasive Australian tree ferns in Hawaiian forests
This blog post is co-written by Theresa Cabrera Menard, an Applied Scientist/Geographic Information Systems Specialist at The Nature Conservancy (TNC) in Hawaii.
In recent years, Amazon and AWS have developed a series of sustainability initiatives with the overall goal of helping preserve the natural environment. As part of these efforts, AWS Professional Services establishes partnerships with organizations such as The Nature Conservancy (TNC), offering financial support and consulting services towards environmental preservation efforts. The advent of big data technologies is rapidly scaling up ecological data collection, while machine learning (ML) techniques are increasingly utilized in ecological data analysis. AWS is in a unique position to help with data storage and ingestion as well as with data analysis.
Hawaiian forests are essential as a source of clean water and for preservation of traditional cultural practices. However, they face critical threats from deforestation, species extinction, and displacement of native species by invasive plants. The state of Hawaii spends about half a billion dollars yearly fighting invasive species. TNC is helping to address the invasive plant problem through initiatives such as the Hawaii Challenge, which allows anyone with a computer and internet access to participate in tagging invasive weeds across the landscape. AWS has partnered with TNC to build upon these efforts and develop a scalable, cloud-based solution that automates and expedites the detection and localization of invasive ferns.
Among the most aggressive species invading the Hawaiian forests is the Australian tree fern, originally introduced as an ornamental, but now rapidly spreading across several islands by producing numerous spores that are easily transported by the wind. The Australian tree fern is fast growing and outcompetes other plants, smothering the canopy and affecting several native species, resulting in a loss of biological diversity.
Currently, detection of the ferns is accomplished by capturing images from fixed-wing aircraft surveying the forest canopy. The imagery is manually inspected by human labelers. This process takes significant effort and time, potentially delaying mitigation efforts by ground crews by weeks or longer. One of the advantages of utilizing a computer vision (CV) algorithm is the potential time savings, because inference is expected to take only a few hours.
Machine learning pipeline
The following diagram shows the overall ML workflow of this project. The first goal of the AWS-TNC partnership was to automate the detection of ferns from aerial imagery. A second goal was to evaluate the potential of CV algorithms to reliably classify ferns as either native or invasive. The CV model inference can then form the basis of a fully automated, AWS Cloud-native solution that enhances TNC's capacity to detect invasive ferns in a timely manner and direct resources to highly affected areas. The following diagram illustrates this architecture.
In the following sections, we cover the following topics:
- The data processing and analysis tools utilized.
- The fern detection model pipeline, including training and evaluation.
- How native and invasive ferns are classified.
- The benefits TNC experienced through this implementation.
Data processing and analysis
Aerial footage is acquired by TNC contractors flying fixed-wing aircraft above affected areas within the Hawaiian Islands. Heavy and persistent cloud cover prevents the use of satellite imagery. The data available to TNC and AWS consists of raw images and metadata allowing the geographical localization of the inferred ferns.
Images and geographical coordinates
Images received from aerial surveys are in the range of 100,000 x 100,000 pixels and are stored in the JPEG2000 (JP2) format, which incorporates geolocation and other metadata. Each pixel can be associated with specific Universal Transverse Mercator (UTM) geospatial coordinates. The UTM coordinate system divides the world into north-south zones, each 6 degrees of longitude wide. The first UTM coordinate (northing) refers to the distance between a geographical position and the equator, measured with north as the positive direction. The second coordinate (easting) measures the distance, in meters, towards the east, starting from a central meridian that is uniquely assigned to each zone. By convention, the central meridian in each zone has an easting value of 500,000, so a point one meter east of the central meridian has an easting of 500,001. To convert between pixel coordinates and UTM coordinates, we utilize the affine transform as outlined in the following equation, where x', y' are UTM coordinates and x, y are pixel coordinates. The parameters a, b, c, d, e, and f of the affine transform are provided as part of the JP2 file metadata.
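The equation itself is not reproduced in the source; based on the parameter definitions above, the standard affine mapping takes the form:

x' = a·x + b·y + c
y' = d·x + e·y + f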
For the purposes of labeling, training, and inference, the raw JP2 files are divided into non-overlapping 512 x 512-pixel JPG files. Extracting these smaller sub-images from the original JP2 requires creating an individual affine transform for each extracted JPG file. These operations were performed with the rasterio and affine Python packages on AWS Batch, and they facilitated the reporting of the position of inferred ferns in UTM coordinates.
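A minimal sketch of the tiling step with rasterio follows; the input file name and how tiles are saved are illustrative, not taken from the project code.

import rasterio
from rasterio.windows import Window, transform as window_transform

TILE = 512

with rasterio.open("survey_area.jp2") as src:  # hypothetical input file
    for row_off in range(0, src.height, TILE):
        for col_off in range(0, src.width, TILE):
            window = Window(col_off, row_off, TILE, TILE)
            tile = src.read(window=window)

            # Affine transform for this tile: maps tile pixel (col, row) to UTM (x, y)
            tile_affine = window_transform(window, src.transform)
            x_utm, y_utm = tile_affine * (0, 0)  # UTM coordinates of the tile's top-left corner

            # Saving the tile as JPG and recording tile_affine for later georeferencing is omitted here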
Data labeling
Visual identification of ferns in the aerial images is complicated by several factors. Most of the information is concentrated in the green channel, and there is a high density of foliage, with frequent partial occlusion of ferns by both nearby ferns and other vegetation. The information of interest to TNC is the relative density of ferns per acre; therefore, it's important to count each individual fern even in the presence of occlusion. Given these goals and constraints, we chose to utilize an object detection CV framework.
To label the data, we set up an Amazon SageMaker Ground Truth labeling job. Each bounding box was intended to be centered on the fern and to cover most of the fern's branches, while at the same time minimizing the inclusion of other vegetation. The labeling was performed by the authors following consultation with TNC domain experts. The initial labeled dataset included 500 images, each typically containing several ferns, as shown in the following example images. In this initial labeled set we did not distinguish between native and invasive ferns.
Fern object detection model training and update
In this section we discuss training the initial fern detection model, data labeling in Ground Truth and the model update through retraining. We also discuss using Amazon Augmented AI (Amazon A2I) for the model update, and using AWS Step Functions for the overall fern detection inference pipeline.
Initial fern detection model training
We utilized the Amazon SageMaker object detection algorithm because it provides state-of-the-art performance and can be easily integrated with other SageMaker services such as Ground Truth, endpoints, and batch transform jobs. We utilized the Single Shot MultiBox Detector (SSD) framework with a vgg-16 base network, which comes pre-trained on millions of images and thousands of classes from the ImageNet dataset. We broke all the given TNC JP2 images into 512 x 512-pixel tiles as the training dataset. There are about 5,000 small JPG images, and we randomly selected 4,500 images as the training dataset and 500 images as the validation dataset. After hyperparameter tuning, we chose the following hyperparameters for model training: class=1, overlap_threshold=0.3, learning_rate=0.001, and epochs=50. The initial model's mean average precision (mAP) computed on the validation set is 0.49. After checking the detection results against the TNC labels, we discovered that many ferns detected by our object detection model were not labeled as ferns in the TNC labels, as shown in the following images.
Therefore, we decided to use Ground Truth to relabel a subset of the fern dataset in an attempt to improve model performance, and then compare the resulting inference results with those of this initial model to check which approach is better.
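For reference, a training job for the built-in object detection algorithm can be configured roughly as follows with the SageMaker Python SDK; the bucket paths are placeholders, the hyperparameter names mirror the values quoted above (the post's class=1 presumably corresponds to num_classes=1), and num_classes and num_training_samples are required by the algorithm.

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in object detection (SSD) container
container = image_uris.retrieve("object-detection", session.boto_region_name)

estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://fern-detection/models/",  # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    base_network="vgg-16",
    num_classes=1,               # a single "fern" class
    num_training_samples=4500,
    epochs=50,
    learning_rate=0.001,
    overlap_threshold=0.3,
    mini_batch_size=16,
)

estimator.fit({
    "train": "s3://fern-detection/train/",
    "validation": "s3://fern-detection/validation/",
    "train_annotation": "s3://fern-detection/train_annotation/",
    "validation_annotation": "s3://fern-detection/validation_annotation/",
})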
Data labeling in Ground Truth
To label the fern dataset, we set up a Ground Truth job of 500 randomly selected 512 x 512-pixel images. Each bounding box was intended to be centered on the fern and to cover most of the fern's branches, while at the same time minimizing the inclusion of other vegetation. The labeling was performed by AWS data scientists following consultation with TNC domain experts. In this labeled dataset, we didn't distinguish between native and invasive ferns.
Retraining the fern detection model
The first model training iteration utilized a set of 500 labeled images, of which 400 were in the training set and 100 in the validation set. This model achieved a mAP score (computed on the validation set) of 0.46, which isn't very high. We next used this initial model to produce predictions on a larger set of 3,888 JPG images extracted from the available JP2 data and used those predictions as labels for retraining. With this larger training set, the model achieved a mAP score of 0.87. This marked improvement (as shown in the following example images) illustrates the value of automated labeling and model iteration.
Based on these findings we determined that Ground Truth labeling plus automated labeling and model iteration appear to significantly increase prediction performance. To further quantify the performance of the resulting model, a set of 300 images were randomly selected for an additional round of validation. We found that when utilizing a threshold of 0.3 for detection confidence, 84% of the images were deemed by the labeler to have the correct number of predicted ferns, with 6.3% being overcounts and 9.7% being undercounts. In most cases, the over/undercounting was off by only one or two ferns out of five or six present in an image, and is therefore not expected to significantly affect the overall estimation of fern density per acre.
Amazon A2I for fern detection model update
One challenge for this project is that the images coming in every year are taken from aircraft, so the altitude, angles, and light conditions of the images may differ. A model trained on a previous dataset needs to be retrained to maintain good performance, but labeling ferns for a new dataset is labor-intensive. Therefore, we used Amazon A2I to integrate human review to ensure accuracy with new data. We used 360 images as a test dataset; 35 images were sent back for review because these images didn't have predictions with a confidence score over 0.3. We relabeled these 35 images and retrained the model using incremental learning in Amazon A2I. The retrained model showed significant improvement over the previous model in many respects, such as detections under darker light conditions, as shown in the following images. These improvements made the new model perform fairly well on the new dataset with very little human review and relabeling work.
Fern detection inference pipeline
The overall goal of the TNC-AWS partnership is the creation of an automated pipeline that takes as input the JP2 files and produces as output UTM coordinates of the predicted ferns. There are three main tasks:
- The first is the ingestion of the large JP2 file and its division into smaller 512 x 512 JPG files. Each of these has an associated affine transform that can generate UTM coordinates from the pixel coordinates.
- The second task is the actual inference and detection of potential ferns and their locations.
- The final task assembles the inference results into a single CSV file that is delivered to TNC.
The orchestration of the pipeline was implemented using Step Functions. As is the case for the inference, this choice automates many aspects of provisioning and releasing computing resources on an as-needed basis. Additionally, the pipeline architecture can be visually inspected, which makes it easier to share with the customer. Finally, as updated models become available in the future, they can be swapped in with little or no disruption to the workflow. The following diagram illustrates this workflow.
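A simplified sketch of such a state machine definition is shown below; the state names and resource ARNs are illustrative, the choice of AWS Batch for the tiling step is an assumption, and required task parameters are omitted for brevity.

import json

# Simplified Amazon States Language definition for the three pipeline tasks (illustrative only)
state_machine_definition = {
    "Comment": "Fern detection inference pipeline",
    "StartAt": "TileJP2Images",
    "States": {
        "TileJP2Images": {
            "Type": "Task",
            # AWS Batch job that splits each JP2 into 512 x 512 JPG tiles (job parameters omitted)
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Next": "RunBatchTransform",
        },
        "RunBatchTransform": {
            "Type": "Task",
            # SageMaker batch transform that runs fern detection on the tiles (job parameters omitted)
            "Resource": "arn:aws:states:::sagemaker:createTransformJob.sync",
            "Next": "AssembleResults",
        },
        "AssembleResults": {
            "Type": "Task",
            # Lambda function that assembles predictions into a single CSV of UTM coordinates
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:assemble-fern-csv",
            "End": True,
        },
    },
}

print(json.dumps(state_machine_definition, indent=2))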
When the inference pipeline was used in batch mode on a source image of 10,000 x 10,000 pixels, with an m4.large instance allocated to the SageMaker batch transform, the whole inference workflow ran within 25 minutes. Of these, 10 minutes were taken by the batch transform, and the rest by Step Functions steps and AWS Lambda functions. TNC expects sets of up to 24 JP2 images at a time, about twice a year. By adjusting the size and number of instances used by the batch transform, we expect that the inference pipeline can be fully run within 24 hours.
Fern classification
In this section, we discuss how we applied the SageMaker Principal Component Analysis (PCA) algorithm to the bounding boxes and validated the classification results.
Application of PCA to fern bounding boxes
To determine whether it is possible to distinguish between the Australian tree fern and native ferns without the substantial effort of labeling a large set of images, we implemented an unsupervised image analysis procedure. For each predicted fern, we extracted the region inside the bounding box and saved it as a separate image. Next, these images were embedded in a high-dimensional vector space using the img2vec approach. This procedure generated a 2,048-dimensional vector for each input image. These vectors were analyzed using Principal Component Analysis as implemented in the SageMaker PCA algorithm. We retained for further analysis the top three components, which together accounted for more than 85% of the variance in the vector data.
For each of the top three components, we extracted the associated images with the highest and lowest scores along the component. These images were visually inspected by AWS data scientists and TNC domain experts, with the goal of identifying whether the highest and lowest scores are associated with native or invasive ferns. We further quantified the classification power of each principal component by manually labeling a small set of 100 fern images as either invasive or native and using scikit-learn to obtain metrics such as the area under the precision-recall curve for each of the three PCA components. When the PCA scores were used as inputs to a binary classifier (see the following graph), we found that PCA2 was the most discriminative, followed by PCA3, with PCA1 displaying only modest performance in distinguishing between native and invasive ferns.
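A minimal sketch of this evaluation is shown below; it uses a pre-trained ResNet-50 from torchvision as a stand-in for the img2vec embedding and scikit-learn PCA in place of the SageMaker PCA algorithm, and it assumes a list of crop file paths and an array of manual labels already exist.

import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms
from sklearn.decomposition import PCA
from sklearn.metrics import average_precision_score

# Embed each fern crop with a pre-trained ResNet-50, keeping the 2048-dimensional pooled features
backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()  # drop the classification head
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

with torch.no_grad():
    vectors = np.stack([
        backbone(preprocess(Image.open(p).convert("RGB")).unsqueeze(0)).squeeze(0).numpy()
        for p in crop_paths  # crop_paths: bounding-box crops (assumed to exist)
    ])

# Reduce to the top three principal components
scores = PCA(n_components=3).fit_transform(vectors)

# labels: 1 = invasive, 0 = native, for the manually labeled crops (assumed to exist)
for i in range(3):
    auprc = average_precision_score(labels, scores[:, i])
    print(f"PCA{i + 1}: area under the precision-recall curve = {auprc:.2f}")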
Validation of classification results
We then examined images with the biggest and smallest PCA2 values with TNC domain experts to check if the algorithm can differentiate native and invasive ferns effectively. After going over 100 sample fern images, TNC experts determined that the images with the smallest PCA2 values are very likely to be native ferns, and the images with the largest PCA2 values are very likely to be invasive ferns (see the following example images). We would like to further investigate this approach with TNC in the near future.
Conclusion
The major benefits to TNC from adopting the inference pipeline proposed in this post are twofold. First, substantial cost savings are achieved by replacing months-long efforts by human labelers with an automatic pipeline that incurs minimal inference costs. Although exact costs depend on several factors, we estimate the cost reduction to be at least an order of magnitude. The second benefit is the reduction of time from data collection to the initiation of mitigation efforts. Currently, manual labeling of a dozen large JP2 files takes several weeks to complete, whereas the inference pipeline is expected to take a matter of hours, depending on the number and size of inference instances allocated. A faster turnaround time would improve TNC's capacity to plan routes for the crews responsible for treating the invasive ferns in a timely manner, and potentially to find appropriate treatment windows given the seasonality and weather patterns on the islands.
To get started using Ground Truth, see Build a highly accurate training dataset with Amazon SageMaker Ground Truth. Also learn more about Amazon ML by going to the Amazon SageMaker product page, and explore visual workflows for modern applications by going to the AWS Step Functions product page.
About the Authors
Dan Iancu is a data scientist with AWS. He joined AWS three years ago and has worked with a variety of customers in healthcare and life sciences, the space industry, and the public sector. He believes in the importance of bringing value to the customer as well as contributing to environmental preservation by utilizing ML tools.
Kara Yang is a Data Scientist in AWS Professional Services. She is passionate about helping customers achieve their business goals with AWS cloud services. She has helped organizations build ML solutions across multiple industries such as manufacturing, automotive, environmental sustainability and aerospace.
Arkajyoti Misra is a Data Scientist at Amazon LastMile Transportation. He is passionate about applying Computer Vision techniques to solve problems that help the earth. He loves to work with non-profit organizations and is a founding member of ekipi.org.
Annalyn Ng is a Senior Solutions Architect based in Singapore, where she designs and builds cloud solutions for public sector agencies. Annalyn graduated from the University of Cambridge, and blogs about machine learning at algobeans.com. Her book, Numsense! Data Science for the Layman, has been translated into multiple languages and is used in top universities as reference text.
Theresa Cabrera Menard is an Applied Scientist/Geographic Information Systems Specialist at The Nature Conservancy (TNC) in Hawai`i, where she manages a large dataset of high-resolution imagery from across the Hawaiian Islands. She was previously involved with the Hawai`i Challenge that used armchair conservationists to tag imagery for weeds in the forests of Kaua`i.
Veronika Megler is a Principal Consultant, Big Data, Analytics & Data Science, for AWS Professional Services. She holds a PhD in Computer Science, with a focus on spatio-temporal data search. She specializes in technology adoption, helping customers use new technologies to solve new problems and to solve old problems more efficiently and effectively.
Automatically generate model evaluation metrics using SageMaker Autopilot Model Quality Reports
Amazon SageMaker Autopilot helps you complete an end-to-end machine learning (ML) workflow by automating the steps of feature engineering, training, tuning, and deploying an ML model for inference. You provide SageMaker Autopilot with a tabular data set and a target attribute to predict. Then, SageMaker Autopilot automatically explores your data, trains, tunes, ranks and finds the best model. Finally, you can deploy this model to production for inference with one click.
What’s new?
The newly launched SageMaker Autopilot Model Quality Reports feature surfaces your model's metrics to provide better visibility into your model's performance for regression and classification problems. You can use these metrics to gather more insights about the best model in the Model leaderboard.
These metrics and reports, available in a new Performance tab under the best model's Model details, include confusion matrices, the area under the receiver operating characteristic curve (AUC-ROC), and the area under the precision-recall curve (AUC-PR). They help you understand false positives and false negatives (FPs/FNs), the trade-offs between true positives (TPs) and false positives (FPs), and the trade-offs between precision and recall, so you can assess the best model's performance characteristics.
Running the SageMaker Autopilot experiment
The Data Set
We use UCI’s bank marketing data set to demonstrate SageMaker Autopilot Model Quality Reports. This data contains customer attributes, such as age, job type, marital status, and others that we’ll use to predict if the customer will open an account with the bank. The data set refers to this account as a term deposit. This makes our case a binary classification problem – the prediction will either be “yes” or “no”. SageMaker Autopilot will generate several models on our behalf to best predict potential customers. Then, we’ll examine the Model Quality Report for SageMaker Autopilot’s best model.
Prerequisites
To initiate a SageMaker Autopilot experiment, you must first place your data in an Amazon Simple Storage Service (Amazon S3) bucket. Specify the bucket and prefix that you want to use for training. Make sure that the bucket is in the same Region as the SageMaker Autopilot experiment. You must also make sure that the AWS Identity and Access Management (IAM) role that Autopilot uses has permissions to access the data in Amazon S3.
Creating the experiment
You have several options for creating a SageMaker Autopilot experiment in SageMaker Studio. From a new Launcher tab, you can access SageMaker Autopilot directly. Alternatively, select the SageMaker resources icon on the left-hand side, and then choose Experiments and trials from the drop-down menu.
- Give your experiment a name.
- Connect to your data source by selecting the Amazon S3 bucket and file name.
- Choose the output data location in Amazon S3.
- Select the target column for your data set. In this case, we’re targeting the “y” column to indicate yes/no.
- Optionally, provide an endpoint name if you wish to have SageMaker Autopilot automatically deploy a model endpoint.
- Leave all of the other advanced settings as default, and select Create Experiment.
Once the experiment completes, you can see the results in SageMaker Studio. SageMaker Autopilot will present the best model among the different models that it trains. You can view details and results for different trials, but we’ll use the best model to demonstrate the use of Model Quality Reports.
- Select the model, and right-click to Open in model details.
- Within the model details, select the Performance tab. This shows model metrics through visualizations and plots.
- Under Performance, select Download Performance Reports as PDF.
Interpreting the SageMaker Autopilot Model Quality Report
The Model Quality Report summarizes the SageMaker Autopilot job and model details. We'll focus on the report's PDF format, but you can also access the results as JSON. Because SageMaker Autopilot identified our dataset as a binary classification problem, SageMaker Autopilot aimed to maximize the F1 quality metric to find the best model. SageMaker Autopilot chooses this metric by default; however, you have the flexibility to choose other objective metrics, such as accuracy and AUC. Our model's F1 score is 0.61. To interpret an F1 score, it helps to first understand a confusion matrix, which the Model Quality Report explains in the outputted PDF.
Confusion Matrix
A confusion matrix helps to visualize model performance by comparing different classes and labels. The SageMaker Autopilot experiment created a confusion matrix that shows the actual labels as rows and the predicted labels as columns in the Model Quality Report. The upper-left box shows customers that didn't open an account with the bank and were correctly predicted as 'no' by the model. These are true negatives (TN). The lower-right box shows customers that did open an account with the bank and were correctly predicted as 'yes' by the model. These are true positives (TP).
The bottom-left corner shows the number of false negatives (FN). The model predicted that the customer wouldn’t open an account, but the customer did. The upper-right corner shows the number of false positives (FP). The model predicted that the customer would open an account, but the customer did not actually do so.
Model Quality Report Metrics
The Model Quality Report explains how to calculate the false positive rate (FPR) and the true positive rate (TPR).
The False Positive Rate (FPR) measures the proportion of actual negatives that were falsely predicted as opening an account (positives). The range is 0 to 1, and a smaller value indicates better predictive accuracy.
Note that the FPR is also expressed as 1-Specificity, where Specificity or True Negative Rate (TNR) is the proportion of the TNs correctly identified as not opening an account (negatives).
Recall, also called Sensitivity or the True Positive Rate (TPR), measures the fraction of actual positives that were predicted as opening an account. The range is also 0 to 1, and a larger value indicates better predictive accuracy. This measure expresses the ability to find all of the relevant instances in a dataset.
Precision measures the fraction of predicted positives that are actual positives. The range is 0 to 1, and a larger value indicates better accuracy. Precision expresses the proportion of the data points our model flagged as relevant that actually were relevant. Precision is a good measure to consider, especially when the cost of an FP is high – for example, with email spam detection.
Our model shows a precision of 0.53 and a recall of 0.72.
The F1 score is our target metric: the harmonic mean of precision and recall. Because our dataset is imbalanced in favor of 'no' labels, F1 is a useful objective; it takes both FPs and FNs into account and gives equal weight to precision and recall.
The report explains how to interpret these metrics. This can help if you’re unfamiliar with these terms. In our example, precision and recall are important metrics for a binary classification problem, as they’re used to calculate the F1 score. The report explains that an F1 score can vary between 0 and 1. The best possible performance will score 1, whereas 0 will indicate the worst. Remember that our model’s F1 score is 0.61.
The Fβ score is the weighted harmonic mean of precision and recall, and the F1 score is the special case of Fβ with β = 1. The report provides the Fβ score of the classifier for β values of 0.5, 1, and 2.
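For reference, the general formula (a standard definition, not reproduced in the report excerpt above) is:

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)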
Metrics Table
Depending on the problem, you may find that SageMaker Autopilot maximizes another metric, such as accuracy, for a multi-class classification problem. Regardless of the problem type, Model Quality Reports produce a table that summarizes your model’s metrics available both inline and in the PDF report. You can learn more about the metric table in the documentation.
The best constant classifier – a classifier that serves as a simple baseline to compare against more complex classifiers – always predicts a constant majority label that is provided by the user. In our case, a 'constant' model would predict 'no', since that is the most frequent class and is considered a negative label. The metrics for the trained classifier models (such as f1, f2, or recall) can be compared to those of the constant classifier, i.e., the baseline. This comparison confirms that the trained model performs better than the constant classifier. Fβ scores (f0_5, f1, and f2, where β takes the values of 0.5, 1, and 2, respectively) are the weighted harmonic mean of precision and recall. This reaches its optimal value at 1 and its worst value at 0.
In our case, the best constant classifier always predicts 'no'. Therefore, accuracy is high at 0.89, but the recall, precision, and Fβ scores are 0. If the dataset were perfectly balanced, with no single majority or minority class, we would see much more interesting possibilities for the precision, recall, and Fβ scores of the constant classifier.
Furthermore, you can view these results in JSON format, as shown in the following sample. You can access both the PDF and JSON files through the UI, as well as through the Amazon SageMaker Python SDK, using the S3OutputPath element in the OutputDataConfig structure of the CreateAutoMLJob/DescribeAutoMLJob API response.
{
  "version" : 0.0,
  "dataset" : {
    "item_count" : 9152,
    "evaluation_time" : "2022-03-16T20:49:18.661Z"
  },
  "binary_classification_metrics" : {
    "confusion_matrix" : {
      "no" : {
        "no" : 7468,
        "yes" : 648
      },
      "yes" : {
        "no" : 295,
        "yes" : 741
      }
    },
    "recall" : {
      "value" : 0.7152509652509652,
      "standard_deviation" : 0.00439996600081394
    },
    "precision" : {
      "value" : 0.5334773218142549,
      "standard_deviation" : 0.007335840278445563
    },
    "accuracy" : {
      "value" : 0.8969624125874126,
      "standard_deviation" : 0.0011703516093899595
    },
    "recall_best_constant_classifier" : {
      "value" : 0.0,
      "standard_deviation" : 0.0
    },
    "precision_best_constant_classifier" : {
      "value" : 0.0,
      "standard_deviation" : 0.0
    },
    "accuracy_best_constant_classifier" : {
      "value" : 0.8868006993006993,
      "standard_deviation" : 0.0016707401772078998
    },
    "true_positive_rate" : {
      "value" : 0.7152509652509652,
      "standard_deviation" : 0.00439996600081394
    },
    "true_negative_rate" : {
      "value" : 0.9201577131591917,
      "standard_deviation" : 0.0010233756436643213
    },
    "false_positive_rate" : {
      "value" : 0.07984228684080828,
      "standard_deviation" : 0.0010233756436643403
    },
    "false_negative_rate" : {
      "value" : 0.2847490347490348,
      "standard_deviation" : 0.004399966000813983
    },
    …
ROC and AUC
Depending on the problem type, you may have varying thresholds for what’s acceptable as an FPR. For example, if you’re trying to predict if a customer will open an account, then it may be more acceptable to the business to have a higher FP rate. It can be riskier to miss extending offers to customers who were incorrectly predicted ‘no’, as opposed to offering customers who were incorrectly predicted ‘yes’. Changing these thresholds to produce different FPRs requires you to create new confusion matrices.
Classification algorithms return continuous values known as prediction probabilities. These probabilities must be transformed into a binary value (for binary classification). In binary classification problems, a threshold (or decision threshold) is a value that dichotomizes the probabilities into a simple binary decision. For normalized predicted probabilities in the range of 0 to 1, the threshold is set to 0.5 by default.
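As a small illustration (the probability values are made up):

import numpy as np

probabilities = np.array([0.12, 0.47, 0.55, 0.91])  # predicted probabilities of "yes"
threshold = 0.5                                     # default decision threshold
predictions = (probabilities >= threshold).astype(int)
print(predictions)  # [0 0 1 1]; lowering the threshold labels more customers as "yes"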
For binary classification models, a useful evaluation metric is the area under the Receiver Operating Characteristic (ROC) curve. The Model Quality Report includes a ROC graph with the TP rate as the y-axis and the FPR as the x-axis. The area under the receiver operating characteristic (AUC-ROC) represents the trade-off between the TPRs and FPRs.
You create a ROC curve by taking a binary classification predictor that uses a threshold value to assign labels from prediction probabilities. As you vary the threshold for a model, you move between two extremes: when the TPR and the FPR are both 0, everything is labeled "no", and when both the TPR and FPR are 1, everything is labeled "yes".
A random predictor that labels “Yes” half of the time and “No” the other half of the time would have a ROC that’s a straight diagonal line (red-dotted line). This line cuts the unit square into two equally-sized triangles. Therefore, the area under the curve is 0.5. An AUC-ROC value of 0.5 would mean that your predictor was no better at discriminating between the two classes than randomly guessing whether a customer would open an account or not. The closer the AUC-ROC value is to 1.0, the better its predictions are. A value below 0.5 indicates that we could actually make our model produce better predictions by reversing the answer that it gives us. For our best model, the AUC is 0.93.
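Outside the report, you can reproduce these curve metrics on a held-out set with scikit-learn; the following sketch assumes arrays of true labels (y_true) and predicted probabilities (y_score) already exist.

from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

# y_true: 0/1 labels, y_score: predicted probabilities of "yes"
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC-ROC:", roc_auc_score(y_true, y_score))           # trade-off between TPR and FPR
print("AUC-PR:", average_precision_score(y_true, y_score))  # area under the precision-recall curve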
Precision Recall Curve
The Model Quality Report also includes a Precision-Recall (PR) curve that plots precision (y-axis) against recall (x-axis) for different thresholds, much like the ROC curve. PR curves, often used in information retrieval, are an alternative to ROC curves for classification problems with a large skew in the class distribution.
For these class imbalanced datasets, PR Curves especially become useful when the minority positive class is more interesting than the majority negative class. Remember that our model shows a precision of 0.53 and a recall of 0.72. Furthermore, remember that the best constant classifier can’t discriminate between ‘yes’ and ‘no’. It would predict a random class or a constant class every time.
For such a baseline on a dataset balanced between 'yes' and 'no', the PR curve would be a horizontal line at a precision of 0.5, and would therefore have an area under the PR curve (AUPRC) of 0.5. To create the PR curve, we plot the model's precision and recall at varying thresholds, in the same way as the ROC curve. For our data, the AUPRC is 0.61.
Model Quality Report Output
You can find the Model Quality Report in the Amazon S3 bucket that you specified when designating the output path before running the SageMaker Autopilot experiment. You'll find the reports under the documentation/model_monitor/output/<autopilot model name>/ prefix, saved as a PDF.
Conclusion
SageMaker Autopilot Model Quality Reports makes it easy for you to quickly see and share the results of a SageMaker Autopilot experiment. You can easily complete model training and tuning using SageMaker Autopilot, and then reference the generated reports to interpret the results. Whether you end up using SageMaker Autopilot's best model or another candidate, these results can be a helpful starting point for evaluating a preliminary model training and tuning job. SageMaker Autopilot Model Quality Reports helps reduce the time needed to write code and produce visuals for performance evaluation and comparison.
You can easily incorporate autoML into your business cases today without having to build a data science team. SageMaker documentation provides numerous samples to help you get started.
About the Authors
Peter Chung is a Solutions Architect for AWS, and is passionate about helping customers uncover insights from their data. He has been building solutions to help organizations make data-driven decisions in both the public and private sectors. He holds all AWS certifications as well as two GCP certifications. He enjoys coffee, cooking, staying active, and spending time with his family.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Ali Takbiri is an AI/ML specialist Solutions Architect, and helps customers by using Machine Learning to solve their business challenges on the AWS Cloud.
Pradeep Reddy is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot and SageMaker Automatic Model Tuning. Outside of work, Pradeep enjoys reading, running, and geeking out with palm-sized computers like the Raspberry Pi and other home automation tech.
Build a mental health machine learning risk model using Amazon SageMaker Data Wrangler
This post is co-written by Shibangi Saha, Data Scientist, and Graciela Kravtzov, Co-Founder and CTO, of Equilibrium Point.
Many individuals are experiencing new symptoms of mental illness, such as stress, anxiety, depression, substance use, and post-traumatic stress disorder (PTSD). According to the Kaiser Family Foundation, about half of adults (47%) nationwide have reported negative mental health impacts during the pandemic, a significant increase from pre-pandemic levels. Also, certain genders and age groups are among the most likely to report stress and worry, at rates much higher than others. Additionally, a few specific ethnic groups are more likely to report a "major impact" to their mental health than others.
Several surveys, including those collected by the Centers for Disease Control and Prevention (CDC), have shown substantial increases in self-reported behavioral health symptoms. According to one CDC report, which surveyed adults across the US in late June of 2020, 31% of respondents reported symptoms of anxiety or depression, 13% reported having started or increased substance use, 26% reported stress-related symptoms, and 11% reported having serious thoughts of suicide in the past 30 days.
Self-reported data, while absolutely critical in diagnosing mental health disorders, can be subject to influences related to the continuing stigma surrounding mental health and mental health treatment. Rather than rely solely on self-reported data, we can estimate and forecast mental distress using data from health records and claims data to try to answer a fundamental question: can we predict who will likely need mental health help before they need it? If these individuals can be identified, early intervention programs and resources can be developed and deployed to respond to any new or increased underlying symptoms, mitigating the effects and costs of mental disorders.
Easier said than done for those who have struggled with managing and processing large volumes of complex, gap-riddled claims data! In this post, we share how Equilibrium Point IoT used Amazon SageMaker Data Wrangler to streamline claims data preparation for our mental health use case, while ensuring data quality throughout each step in the process.
Solution overview
Data preparation or feature engineering is a tedious process that requires experienced data scientists and engineers to spend a lot of time and energy formulating recipes for the various transformations (steps) needed to get the data into the right shape. In fact, research shows that data preparation for machine learning (ML) consumes up to 80% of data scientists' time. Typically, scientists and engineers use various data processing frameworks, such as Pandas, PySpark, and SQL, to code their transformations and create distributed processing jobs. With Data Wrangler, you can automate this process. Data Wrangler is a component of Amazon SageMaker Studio that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data flow into your existing ML workflows to simplify and streamline data processing and feature engineering using little to no coding.
In this post, we walk through the steps to transform original raw datasets into ML-ready features to use for building the prediction models in the next stage. First, we delve into the nature of the various datasets used for our use case and how we joined these datasets via Data Wrangler. After the joins and the dataset consolidation, we describe the individual transformations we applied on the dataset like de-duplication, handling missing values, and custom formulas, followed by how we used the built-in Quick Model analysis to validate the current state of transformations for predictions.
Datasets
For our experiment, we first downloaded patient data from our behavioral health client. This data includes the following:
- Claims data
- Emergency room visit counts
- Inpatient visit counts
- Drug prescription counts related to mental health
- Hierarchical condition coding (HCC) diagnoses counts related to mental health
The goal was to join these separate datasets based on patient ID and utilize the data to predict a mental health diagnosis. We used Data Wrangler to create a massive dataset of several million rows of data, which is a join of five separate datasets. We also used Data Wrangler to perform several transformations to allow for column calculations. In the following sections, we describe the various data preparation transformations that we applied.
Drop duplicate columns after a join
Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning, transforming, and featurizing your data. When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe. Data Wrangler includes built-in transforms, which you can use to transform columns without any code. You can also add custom transformations using PySpark, Pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.
In our experiments, each join on the patient ID left us with duplicate patient ID columns, which we needed to drop. We dropped the right-hand patient ID column using the pre-built Manage Columns → Drop column transform, keeping only one patient ID column (patient_id) in the final dataset.
Pivot a dataset using Pandas
The claims datasets were at the patient level, with emergency room (ER) visit, inpatient (IP) visit, prescription counts, and diagnosis data already grouped by their corresponding HCC codes (approximately 189 codes). To build a patient datamart, we aggregated the claims HCC codes by patient and pivoted the HCC codes from rows to columns. We used Pandas to pivot the dataset, count the number of HCC codes by patient, and then join to the primary dataset on patient ID. We used the custom transform option in Data Wrangler, choosing Python (Pandas) as the framework.
The following code snippet shows the transformation logic to pivot the table:
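(The original snippet isn’t reproduced here. The sketch below shows one way this step could look in a Data Wrangler custom Pandas transform, where the input dataframe is exposed as df; the column names patient_id and hcc_code are illustrative assumptions.)

```python
# Count HCC code occurrences per patient, then pivot the codes from rows to columns.
pivoted = (
    df.groupby(["patient_id", "hcc_code"])
      .size()
      .reset_index(name="count")
      .pivot(index="patient_id", columns="hcc_code", values="count")
      .fillna(0)
      .add_prefix("hcc_")
      .reset_index()
)

# The pivoted, patient-level frame is the output of this step; it is later
# joined back to the primary dataset on patient_id.
df = pivoted
```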
Create new columns using custom formulas
We studied the research literature to determine which HCC codes are indicative of a mental health diagnosis. We then wrote this logic using a Data Wrangler custom formula transform, which uses a Spark SQL expression to calculate a Mental Health Diagnosis target column (MH) that we added to the end of the DataFrame.
We used the following transformation logic:
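(The exact expression depends on the HCC codes identified in the literature review and isn’t reproduced here. As an illustration only, a custom formula of this kind is a Spark SQL expression along the lines of the following sketch, where the hcc_* columns are hypothetical placeholders for the mental-health-related HCC count columns.)

```sql
-- Illustrative only: flag a patient as having a mental health diagnosis (MH = 1)
-- when any of the selected mental-health-related HCC counts is greater than zero.
CASE WHEN (hcc_57 + hcc_58 + hcc_59 + hcc_60) > 0 THEN 1 ELSE 0 END
```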
Drop columns from the DataFrame using PySpark
After calculating the target (MH) column, we dropped all the unnecessary duplicate columns, preserving only the patient ID and MH columns to join back to our primary dataset. This was done with a custom SQL transform that uses PySpark SQL as the framework.
We used the following logic:
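(The original query isn’t reproduced here. In a Data Wrangler custom SQL transform the current dataset is referenced as df, so the logic would look something like the following sketch, using the column names from this example.)

```sql
-- Keep only the join key and the calculated target column.
SELECT patient_id, MH
FROM df
```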
Move the MH column to start
Our ML algorithm requires that the label is in the first column. Therefore, we moved the calculated MH column to the start of the DataFrame so it’s ready for export.
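This can be done either with Data Wrangler’s built-in column-management transforms or with a short custom Pandas step; a minimal sketch of the latter (assuming the label column is named MH and the input dataframe is exposed as df) might look like the following:

```python
# Reorder the columns so the MH label comes first; all other columns keep their order.
cols = ["MH"] + [c for c in df.columns if c != "MH"]
df = df[cols]
```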
Fill in blanks with 0 using Pandas
Our ML algorithm also requires that the input data has no empty fields. Therefore, we filled the final dataset’s empty fields with 0s. We can easily do this via a custom transform (Pandas) in Data Wrangler.
We used the following logic:
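(The original snippet isn’t reproduced here; in a Data Wrangler custom Pandas transform, where the input dataframe is exposed as df, the logic amounts to a one-line fill.)

```python
# Replace every missing value in the final dataset with 0.
df = df.fillna(0)
```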
Cast column from float to long
You can also parse and cast a column to any new data type easily in Data Wrangler. For memory optimization purposes, we cast our mental health label column from float to long.
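No custom code is strictly needed for this step, because Data Wrangler can cast column types directly; for reference, a minimal Pandas sketch of the equivalent cast on the label column (assuming it’s named MH) would be:

```python
# Cast the label column from float to a 64-bit integer (long) for memory efficiency.
df["MH"] = df["MH"].astype("int64")
```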
Quick Model analysis: Feature importance graph
After creating our final dataset, we used the Quick Model analysis type in Data Wrangler to quickly identify data inconsistencies and check whether our model accuracy was in the expected range, or whether we needed to continue feature engineering before spending time training a model. The model returned an F1 score of 0.901, with 1 being the highest possible score. The F1 score combines the precision and recall of the model and is defined as the harmonic mean of the two. After inspecting these initial positive results, we were ready to export the data and proceed with model training using the exported dataset.
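For reference, the F1 score is computed as F1 = 2 × (precision × recall) / (precision + recall). Because the harmonic mean is pulled toward the smaller of the two values, a score of 0.901 implies that both precision and recall were high for this quick model.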
Export the final dataset to Amazon S3 via a Jupyter notebook
As a final step, to export the dataset in its current (transformed) form to Amazon Simple Storage Service (Amazon S3) for future use in model training, we use the Save to Amazon S3 (via Jupyter Notebook) export option. This notebook starts a distributed and scalable Amazon SageMaker Processing job that applies the created recipe (data flow) to specified inputs (usually larger datasets) and saves the results in Amazon S3. You can also export your transformed columns (features) to Amazon SageMaker Feature Store, export the transformations as a pipeline using Amazon SageMaker Pipelines, or simply export the transformations as Python code.
To export data to Amazon S3, you have three options:
- Export the transformed data directly to Amazon S3 via the Data Wrangler UI.
- Export the transformations as a SageMaker Processing job via a Jupyter notebook (as we do in this post).
- Export the transformations to Amazon S3 via a destination node. A destination node tells Data Wrangler where to store the data after you’ve processed it. After you create a destination node, you create a processing job to output the data.
Conclusion
In this post, we showcased how Equilibrium Point IoT uses Data Wrangler to speed up loading and preparing large volumes of claims data for cleaning and transformation ahead of ML. We also demonstrated how to incorporate feature engineering with custom transformations using Pandas and PySpark in Data Wrangler, which allowed us to export data step by step (after each join) for quality assurance purposes. The application of these easy-to-use transforms in Data Wrangler cut the time spent on end-to-end data transformation by nearly 50%. Also, the Quick Model analysis feature in Data Wrangler allowed us to easily validate the state of the transformations as we cycled through the process of data preparation and feature engineering.
Now that we have prepared the data for our mental health risk modeling use case, our next step is to build an ML model using SageMaker and its built-in algorithms, using our claims dataset to identify members who are likely to need mental health services before they reach the point of needing them. Stay tuned!
About the Authors
Shibangi Saha is a Data Scientist at Equilibrium Point. She combines her expertise in healthcare payor claims data and machine learning to design, implement, automate, and document health data pipelines, reporting, and analytics processes that drive insights and actionable improvements in the healthcare delivery system. Shibangi received her Master of Science in Bioinformatics from Northeastern University College of Science and a Bachelor of Science in Biology and Computer Science from Khoury College of Computer Science and Information Sciences.
Graciela Kravtzov is the Co-Founder and CTO of Equilibrium Point. Grace has held C-level/VP leadership positions within Engineering, Operations, and Quality, and served as an executive consultant for business strategy and product development within the healthcare and education industries and the IoT industrial space. Grace received a Master of Science degree in Electromechanical Engineering from the University of Buenos Aires and a Master of Science degree in Computer Science from Boston University.
Arunprasath Shankar is an Artificial Intelligence and Machine Learning (AI/ML) Specialist Solutions Architect with AWS, helping global customers scale their AI solutions effectively and efficiently in the cloud. In his spare time, Arun enjoys watching sci-fi movies and listening to classical music.
Ajai Sharma is a Senior Product Manager for Amazon SageMaker where he focuses on SageMaker Data Wrangler, a visual data preparation tool for data scientists. Prior to AWS, Ajai was a Data Science Expert at McKinsey and Company where he led ML-focused engagements for leading finance and insurance firms worldwide. Ajai is passionate about data science and loves to explore the latest algorithms and machine learning techniques.
Improve search accuracy with Spell Checker in Amazon Kendra
Amazon Kendra is an intelligent search service powered by machine learning. With Amazon Kendra Spell Checker, you can receive spelling suggestions for misspelled terms in your queries. Spell Checker helps reduce the frequency of queries that return irrelevant results by providing spelling suggestions for unrecognized terms.
In this post, we explore how to use Amazon Kendra Spell Checker on the AWS Management Console, as well as how to enable Spell Checker in an Amazon Kendra-powered search application through the AWS Command Line Interface (AWS CLI) and AWS SDK.
Use Amazon Kendra Spell Checker on the console
You can automatically receive spelling suggestions for your misspelled Amazon Kendra queries when querying through the console.
On the Amazon Kendra console, choose your desired index, then choose Search indexed content in the navigation pane. Make sure that the selected index has ingested documents; in this post, we use the sample AWS documentation found in the Data sources section of the navigation pane.
On the Amazon Kendra search console, simply submit a query as you usually would. Misspelled terms in the query are substituted with suggested terms in the “Did you mean” section of the search console.
Choosing the suggested query submits a new query with the corrected spelling.
As you can see, the query results provided through the suggested query are significantly more relevant, thanks to Spell Checker!
Use Amazon Kendra Spell Checker in search applications
Search applications powered by Amazon Kendra can quickly and easily enable Spell Checker through the AWS CLI or AWS SDK, which we walk through in this section. Additionally, we go over an example of how to process the Spell Checker response.
AWS CLI
Let’s look at how AWS CLI users can opt in to Amazon Kendra Spell Checker to receive spelling suggestions for misspelled query terms. We use the AWS CLI to query Amazon Kendra as usual, with only one small change: we include the --spell-correction-configuration IncludeQuerySpellCheckSuggestions=true argument:
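(The original command isn’t reproduced here; a query of this kind might look like the following, where the index ID and the deliberately misspelled query text are placeholders.)

```bash
# Example Amazon Kendra query with spell check suggestions enabled.
aws kendra query \
    --index-id "<your-index-id>" \
    --query-text "what is amazon kendra spel chekcer" \
    --spell-correction-configuration IncludeQuerySpellCheckSuggestions=true
```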
In addition to the normal query results, the response from Amazon Kendra now contains a SpellCorrectedQueries object if there are any spelling suggestions for the query. For more information, see SpellCorrectedQuery.
AWS SDK
Next, let’s walk through how Amazon Kendra provides spell check functionality for AWS SDK users. For this example, we use Python 3. We submit a query with a few spelling errors and print out the SpellCorrectedQueries object in the response:
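(The original snippet isn’t reproduced here; the following is a minimal boto3 sketch, with the index ID and the misspelled query text as placeholders.)

```python
import boto3

# Create an Amazon Kendra client and submit a query with spell check enabled.
kendra = boto3.client("kendra")

response = kendra.query(
    IndexId="<your-index-id>",
    QueryText="what is amazon kendra spel chekcer",
    SpellCorrectionConfiguration={"IncludeQuerySpellCheckSuggestions": True},
)

# Spelling suggestions, if any, are returned under SpellCorrectedQueries.
print(response.get("SpellCorrectedQueries", []))
```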
The response from Amazon Kendra now contains the expected spelling suggestions:
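(Illustrative output for the placeholder query above; the exact terms and offsets depend on your query and index.)

```python
[
    {
        "SuggestedQueryText": "what is amazon kendra spell checker",
        "Corrections": [
            {"BeginOffset": 22, "EndOffset": 27, "Term": "spel", "CorrectedTerm": "spell"},
            {"BeginOffset": 28, "EndOffset": 35, "Term": "chekcer", "CorrectedTerm": "checker"}
        ]
    }
]
```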
Process the Amazon Kendra Spell Check response
Now that we’ve gone over how to programmatically get spelling suggestions through either the AWS CLI or AWS SDK, we can examine how to turn the response into a human-readable suggested query. For this example, we use the sample output from the previous section.
Each SpellCorrectedQuery has two keys: SuggestedQueryText and Corrections.
- SuggestedQueryText maps to a string containing the updated query with the suggested spelling corrections.
- Corrections maps to a list of Correction objects, each of which contains the beginning and ending offset of the correction, as well as the original term from the query and the spelling suggestion for that term.
For our example, we want to show the suggested query text with the newly suggested terms italicized, similar to what is done on the Amazon Kendra console. To achieve this, we can add an HTML italics opening tag <i> at the BeginOffset of each Correction and an HTML italics closing tag </i> at the EndOffset of each Correction in the Corrections list. Note that BeginOffset and EndOffset are based on the length of the corrected terms, not the original terms.
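One way to implement this, sketched against the illustrative response from the previous section (the variable names are our own):

```python
# Take the first spell-corrected query from the response shown earlier.
suggested = response["SpellCorrectedQueries"][0]
text = suggested["SuggestedQueryText"]

# Insert the tags from the end of the string backwards so that the offsets of
# earlier corrections remain valid after each insertion.
for correction in sorted(suggested["Corrections"],
                         key=lambda c: c["BeginOffset"], reverse=True):
    begin, end = correction["BeginOffset"], correction["EndOffset"]
    text = text[:begin] + "<i>" + text[begin:end] + "</i>" + text[end:]

print(text)
```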
Adding the italics tags to SuggestedQueryText gives us the following suggested query text:
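Continuing the illustrative example from the previous sections, the result would look like this: what is amazon kendra <i>spell</i> <i>checker</i>.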
As you can see, Amazon Kendra Spell Checker makes it simple to add spell check functionality to your search application.
Conclusion
Spell Checker is a powerful new feature of Amazon Kendra that offers a simple, effective way to reduce the number of unhelpful queries by providing end-users with spelling suggestions for misspelled terms.
Spell Checker is available in all AWS Regions where Amazon Kendra is available, and supports all languages currently supported by Amazon Kendra.
To learn more about Amazon Kendra, visit the Amazon Kendra product page.
About the Author
Matthew Peretick is a Software Development Engineer at Amazon Web Services based in New York City. Matthew is a member of the Amazon Kendra team focused on enhancing the Amazon Kendra query experience.
How AWS uses graph neural networks to meet customer needs
Information extraction, drug discovery, and software analysis are just a few applications of this versatile tool.Read More