Automated system tripled the number of facts in a product graph.
Amazon Textract now available in Asia Pacific (Mumbai) and EU (Frankfurt) Regions
You can now use Amazon Textract, a machine learning (ML) service that quickly and easily extracts text and data from forms and tables in scanned documents, for workloads in the AWS Asia Pacific (Mumbai) and EU (Frankfurt) Regions.
Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented. The Amazon Textract API supports multiple image formats like scans, PDFs, and photos, and you can use it with other AWS ML services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Augmented AI, and Amazon Translate to derive deeper meaning from the extracted text and data. You can also use this text and data to build smart searches on large archives of documents, or load it into a database for use by applications, such as accounting, auditing, and compliance software.
An in-country infrastructure is critical for customers with data residency requirements and regulations, such as those operating in government, insurance, healthcare, and financial services. With this launch, customers in the AWS Asia Pacific (Mumbai) and EU (Frankfurt) Regions can benefit from Amazon Textract while complying with data residency requirements and integrating with other services and applications available in these Regions.
Perfios is a leading product technology company enabling businesses to aggregate, curate, and analyze structured and unstructured data to help in decision-making. “We have been testing Amazon Textract since its early days and are very excited to see it launch in India to help us address data sovereignty requirements for the region, which now unblocks us to use it at scale,” says Ramgopal Cillanki, Vice President, Head of Engineering at Perfios Software. “We believe that the service will help to transform the banking, financial services, and insurance (BFSI) industry from operations-heavy, human-in-the-loop processes to machine learning-powered API automation with minimal manual operations. Textract will not only help us reduce lenders’ decision-making turnaround time but also create business impact for our end-users in the long run.”
For more information about Amazon Textract and its Region availability, see Amazon Textract FAQs. To get started with Amazon Textract, see the Amazon Textract Developer Guide.
About the Author
Raj Copparapu is a Product Manager focused on putting machine learning in the hands of every developer.
Accessing data sources from Amazon SageMaker R kernels
Amazon SageMaker notebooks now support R out-of-the-box, without needing you to manually install R kernels on the instances. Also, the notebooks come pre-installed with the reticulate library, which offers an R interface for the Amazon SageMaker Python SDK and enables you to invoke Python modules from within an R script. You can easily run machine learning (ML) models in R using the Amazon SageMaker R kernel to access the data from multiple data sources. The R kernel is available by default in all Regions that Amazon SageMaker is available in.
R is a programming language built for statistical analysis and is very popular in data science communities. In this post, we show you how to connect to the following data sources from the Amazon SageMaker R kernel using Java Database Connectivity (JDBC): Hive and Presto on Amazon EMR, Amazon Athena, Amazon Redshift, and Amazon Aurora MySQL-compatible clusters.
For more information about using Amazon SageMaker features using R, see R User Guide to Amazon SageMaker.
Solution overview
To build this solution, we first need to create a VPC with public and private subnets. This will allow us to securely communicate with different resources and data sources inside an isolated network. Next, we create the data sources in the custom VPC and the notebook instance with all necessary configuration and access to connect various data sources using R.
To make sure that the data sources are not reachable from the Internet, we create them inside a private subnet of the VPC. For this post, we create the following:
- Amazon EMR cluster inside a private subnet with Hive and Presto installed. For instructions, see Getting Started: Analyzing Big Data with Amazon EMR.
- Athena resources. For instructions, see Getting Started.
- Amazon Redshift cluster inside a private subnet. For instructions, see Create a sample Amazon Redshift cluster.
- Amazon Aurora MySQL-compatible cluster running inside a private subnet. For instructions, see Creating an Amazon Aurora DB Cluster.
Connect to the Amazon EMR cluster inside the private subnet using AWS Systems Manager Session Manager to create Hive tables.
To run the code using the R kernel in Amazon SageMaker, create an Amazon SageMaker notebook. Download the JDBC drivers for the data sources. Create a lifecycle configuration for the notebook containing the setup script for R packages, and attach the lifecycle configuration to the notebook on create and on start to make sure the setup is complete.
Finally, we can use the AWS Management Console to navigate to the notebook to run code using the R kernel and access the data from various sources. The entire solution is also available in the GitHub repository.
Solution architecture
The following architecture diagram shows how you can use Amazon SageMaker to run code using the R kernel by establishing connectivity to various sources. You can also use the Amazon Redshift query editor or Amazon Athena query editor to create data resources. You need to use the Session Manager in AWS Systems Manager to SSH to the Amazon EMR cluster to create Hive resources.
Launching the AWS CloudFormation template
To automate resource creation, you run an AWS CloudFormation template. The template gives you the option to create an Amazon EMR cluster, Amazon Redshift cluster, or Amazon Aurora MySQL-compatible cluster automatically, as opposed to executing each step manually. It will take a few minutes to create all the resources.
- Choose the following link to launch the CloudFormation stack, which creates the required AWS resources to implement this solution:
- On the Create stack page, choose Next.
- Enter a stack name.
- You can change the default values for the following stack details:
Stack Details | Default Values |
Choose Second Octet for Class B VPC Address (10.xxx.0.0/16) | 0 |
SageMaker Jupyter Notebook Instance Type | ml.t2.medium |
Create EMR Cluster Automatically? | “Yes” |
Create Redshift Cluster Automatically? | “Yes” |
Create Aurora MySQL DB Cluster Automatically? | “Yes” |
- Choose Next.
- On the Configure stack options page, choose Next.
- Select I acknowledge that AWS CloudFormation might create IAM resources.
- Choose Create stack.
You can now see the stack being created, as in the following screenshot.
When stack creation is complete, the status shows as CREATE_COMPLETE.
- On the Outputs tab, record the keys and their corresponding values.
You use the following keys later in this post:
- AuroraClusterDBName – Aurora cluster database name
- AuroraClusterEndpointWithPort – Aurora cluster endpoint address with port number
- AuroraClusterSecret – Aurora cluster credentials secret ARN
- EMRClusterDNSAddress – EMR cluster DNS name
- EMRMasterInstanceId – EMR cluster primary instance ID
- PrivateSubnets – Private subnets
- PublicSubnets – Public subnets
- RedshiftClusterDBName – Amazon Redshift cluster database name
- RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address with port number
- RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN
- SageMakerNotebookName – Amazon SageMaker notebook instance name
- SageMakerRS3BucketName – Amazon SageMaker S3 data bucket
- VPCandCIDR – VPC ID and CIDR block
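If you prefer to collect these outputs programmatically rather than copying them from the console, the following Python sketch shows one way to flatten them into a dictionary. It assumes a response shaped like the one returned by the CloudFormation describe_stacks call; the sample response, the endpoint host name, and the helper names outputs_to_dict and split_endpoint are illustrative and not part of the original post.

```python
def outputs_to_dict(describe_stacks_response):
    """Flatten the Outputs list of the first stack into {key: value}."""
    outputs = describe_stacks_response["Stacks"][0].get("Outputs", [])
    return {o["OutputKey"]: o["OutputValue"] for o in outputs}


def split_endpoint(endpoint_with_port):
    """Split 'host:port' into (host, port) at the last colon."""
    host, _, port = endpoint_with_port.rpartition(":")
    return host, int(port)


# Hypothetical sample shaped like a describe_stacks response.
# With AWS credentials you could obtain a real one with:
#   boto3.client("cloudformation").describe_stacks(StackName="<your stack>")
sample_response = {
    "Stacks": [{
        "Outputs": [
            {"OutputKey": "RedshiftClusterDBName", "OutputValue": "redshiftdb"},
            {"OutputKey": "RedshiftClusterEndpointWithPort",
             "OutputValue": "example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439"},
        ]
    }]
}

stack_outputs = outputs_to_dict(sample_response)
host, port = split_endpoint(stack_outputs["RedshiftClusterEndpointWithPort"])
print(host, port)
```

Splitting at the last colon keeps the helper usable for any of the EndpointWithPort values in the list above.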
Creating your notebook with necessary R packages and JAR files
JDBC is an application programming interface (API) for the programming language Java, which defines how you can access a database. RJDBC is a package in R that allows you to connect to various data sources using the JDBC interface. The notebook instance that the CloudFormation template created ensures that the necessary JAR files for Hive, Presto, Amazon Athena, Amazon Redshift and MySQL are present in order to establish a JDBC connection.
- In the Amazon SageMaker Console, under Notebook, choose Notebook instances.
- Search for the notebook that matches the SageMakerNotebookName key you recorded earlier.
- Select the notebook instance.
- Click on “Open Jupyter” under “Actions” to locate the “jdbc” directory.
The CloudFormation template downloads the JAR files for Hive, Presto, Athena, Amazon Redshift, and Amazon Aurora MySQL-compatible inside the “jdbc” directory.
- Locate the lifecycle configuration attached.
A lifecycle configuration allows you to install packages or sample notebooks on your notebook instance, configure networking and security for it, or otherwise use a shell script for customization. A lifecycle configuration provides shell scripts that run when you create the notebook instance or when you start the notebook.
- Inside the Lifecycle configuration section, choose View script to see the lifecycle configuration script that sets up the R kernel in Amazon SageMaker to make JDBC connections to data sources using R.
It installs the RJDBC package and dependencies in the Anaconda environment of the Amazon SageMaker notebook.
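For readers who create lifecycle configurations themselves instead of through the CloudFormation template, the following Python sketch shows how such a request could be assembled. It assumes the SageMaker create_notebook_instance_lifecycle_config API, which expects base64-encoded script content. The script body is a placeholder and not the script from the template, and the configuration name r-jdbc-setup is hypothetical.

```python
import base64

# Placeholder only: the actual setup script from the template is not reproduced here.
SETUP_SCRIPT = """#!/bin/bash
set -e
echo "install R packages such as RJDBC here"
"""


def build_lifecycle_config_request(name, script):
    """Build keyword arguments for create_notebook_instance_lifecycle_config."""
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    return {
        "NotebookInstanceLifecycleConfigName": name,
        # Run the same script when the instance is created and each time it starts.
        "OnCreate": [{"Content": encoded}],
        "OnStart": [{"Content": encoded}],
    }


request = build_lifecycle_config_request("r-jdbc-setup", SETUP_SCRIPT)
# With AWS credentials configured, the request could be sent with:
#   boto3.client("sagemaker").create_notebook_instance_lifecycle_config(**request)
print(sorted(request))
```

Attaching the script to both hooks mirrors the on-create and on-start setup described earlier in this post.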
Connecting to Hive and Presto
Amazon EMR is the industry-leading cloud big data platform for processing vast amounts of data using open source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, Apache Hudi, and Presto.
You can create a test table in Hive by logging in to the EMR master node from the AWS console using the Session Manager capability in Systems Manager. Systems Manager gives you visibility and control of your infrastructure on AWS. Systems Manager also provides a unified user interface so you can view operational data from multiple AWS services and allows you to automate operational tasks across your AWS resources. Session Manager is a fully managed Systems Manager capability that lets you manage your Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises instances, and virtual machines (VMs) through an interactive, one-click browser-based shell or through the AWS Command Line Interface (AWS CLI).
You use the following values from the AWS CloudFormation Outputs tab in this step:
- EMRClusterDNSAddress – EMR cluster DNS name
- EMRMasterInstanceId – EMR cluster primary instance ID
- SageMakerNotebookName – Amazon SageMaker notebook instance name
- On the Systems Manager Console, under Instances & Nodes, choose Session Manager.
- Choose Start Session.
- Start an SSH session with the EMR primary node by locating the instance ID as specified by the value of the key EMRMasterInstanceId.
This starts the browser-based shell.
- Run the following SSH commands:
    # change user to hadoop
    whoami
    sudo su - hadoop
- Create a test table in Hive from the EMR master node as you have already logged in using SSH:
    # Run on the EMR master node to create a table called students in Hive
    hive -e "CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));"
    # Run on the EMR master node to insert data to students created above
    hive -e "INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);"
    # Verify
    hive -e "SELECT * from students;"
    exit
    exit
The following screenshot shows the view in the browser-based shell.
- Close the browser after exiting the shell.
To query the data from Amazon EMR using the Amazon SageMaker R kernel, you open the notebook the CloudFormation template created.
- On the Amazon SageMaker Console, under Notebook, choose Notebook instances.
- Find the notebook as specified by the value of the key SageMakerNotebookName.
- Choose Open Jupyter.
- To demonstrate connectivity from the Amazon SageMaker R kernel, choose Upload and upload the ipynb notebook.
- Alternatively, from the New drop-down menu, choose R to open a new notebook.
- Enter the code as mentioned in “hive_connect.ipynb”, replacing the emr_dns value with the value from key EMRClusterDNSAddress.
- Run all the cells in the notebook to connect to Hive on Amazon EMR using the Amazon SageMaker R console.
You follow similar steps to connect Presto:
- On the Amazon SageMaker Console, open the notebook you created.
- Choose Open Jupyter.
- Choose Upload to upload the ipynb notebook.
- Alternatively, from the New drop-down menu, choose R to open a new notebook.
- Enter the code as mentioned in “presto_connect.ipynb”, replacing the emr_dns value with the value from key EMRClusterDNSAddress.
- Run all the cells in the notebook to connect to PrestoDB on Amazon EMR using the Amazon SageMaker R console.
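The notebooks referenced above are written in R and are not reproduced here. As a language-neutral illustration of what the emr_dns value is used for, the following Python sketch assembles JDBC connection strings for Hive and Presto. It assumes the usual Amazon EMR defaults (HiveServer2 on port 10000, Presto on port 8889, the hive catalog, and the default schema); the DNS name and the helper names are hypothetical, so check them against your cluster configuration.

```python
def hive_jdbc_url(emr_dns, port=10000, database="default"):
    """JDBC URL for HiveServer2; port 10000 is an assumed EMR default."""
    return f"jdbc:hive2://{emr_dns}:{port}/{database}"


def presto_jdbc_url(emr_dns, port=8889, catalog="hive", schema="default"):
    """JDBC URL for Presto; port 8889 is an assumed EMR default."""
    return f"jdbc:presto://{emr_dns}:{port}/{catalog}/{schema}"


# Hypothetical value standing in for the EMRClusterDNSAddress output.
emr_dns = "ip-10-0-1-23.ec2.internal"

print(hive_jdbc_url(emr_dns))
print(presto_jdbc_url(emr_dns))
```

A string of this shape is what a JDBC client such as RJDBC would typically receive as its connection URL.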
Connecting to Amazon Athena
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Amazon Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. To connect to Amazon Athena from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Athena JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.
You also need to set the query result location in Amazon S3. For more information, see Working with Query Results, Output Files, and Query History.
- On the Amazon Athena Console, choose Get Started.
- Choose Set up a query result location in Amazon S3.
- For Query result location, enter the Amazon S3 location as specified by the value of the key SageMakerRS3BucketName.
- Optionally, add a prefix, such as results.
- Choose Save.
- Create a database or schema and table in Athena with the example Amazon S3 data.
- Similar to connecting to Hive and Presto, to establish a connection from Athena to Amazon SageMaker using the R kernel, you can upload the ipynb notebook.
- Alternatively, open a new notebook and enter the code in “athena_connect.ipynb”, replacing the s3_bucket value with the value from key SageMakerRS3BucketName.
- Run all the cells in the notebook to connect to Amazon Athena from the Amazon SageMaker R console.
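To make the role of the s3_bucket value concrete, the following Python sketch builds the query result location and a connection string in the style used by the Athena JDBC driver. It assumes the driver's AwsRegion and S3OutputLocation connection properties; the bucket name, Region, and helper names are hypothetical, and the exact URL format can differ between driver versions.

```python
def athena_output_location(s3_bucket, prefix=""):
    """Return the S3 URI where Athena writes query results."""
    prefix = prefix.strip("/")
    return f"s3://{s3_bucket}/{prefix}/" if prefix else f"s3://{s3_bucket}/"


def athena_jdbc_url(region, output_location):
    """Connection string using the AwsRegion and S3OutputLocation properties (assumed)."""
    return f"jdbc:awsathena://AwsRegion={region};S3OutputLocation={output_location};"


# Hypothetical values; use the SageMakerRS3BucketName output and your own Region.
location = athena_output_location("example-sagemaker-r-bucket", "results")
print(athena_jdbc_url("us-east-1", location))
```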
Connecting to Amazon Redshift
Amazon Redshift is a fast, fully managed cloud data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution. To connect to Amazon Redshift from the Amazon SageMaker R kernel using RJDBC, we use the Amazon Redshift JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.
You need the following keys and their values from the AWS CloudFormation Outputs tab:
- RedshiftClusterDBName – Amazon Redshift cluster database name
- RedshiftClusterEndpointWithPort – Amazon Redshift cluster endpoint address with port number
- RedshiftClusterSecret – Amazon Redshift cluster credentials secret ARN
The CloudFormation template creates a secret for the Amazon Redshift cluster in AWS Secrets Manager, which is a service that helps you protect secrets needed to access your applications, services, and IT resources. Secrets Manager lets you easily rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle.
- On the AWS Secrets Manager Console, choose Secrets.
- Choose the secret denoted by the
RedshiftClusterSecret
key value.
- In the Secret value section, choose Retrieve secret value to get the user name and password for the Amazon Redshift cluster.
- On the Amazon Redshift Console, choose Editor (which is essentially the Amazon Redshift query editor).
- For Database name, enter redshiftdb.
- For Database password, enter your password.
- Choose Connect to database.
- Run the following SQL statements to create a table and insert a couple of records:
    CREATE TABLE public.students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
    INSERT INTO public.students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
- On the Amazon SageMaker Console, open your notebook.
- Choose Open Jupyter.
- Upload the ipynb notebook.
- Alternatively, open a new notebook and enter the code as mentioned in “redshift_connect.ipynb”, replacing the values for RedshiftClusterEndpointWithPort, RedshiftClusterDBName, and RedshiftClusterSecret.
- Run all the cells in the notebook to connect to Amazon Redshift on the Amazon SageMaker R console.
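The three values above come together when the connection is opened. The following Python sketch is an illustration, not the notebook's R code: it parses a secret string of the kind Secrets Manager returns and builds an Amazon Redshift JDBC URL. The username and password key names, the sample secret, and the endpoint are assumptions, so inspect your own secret to confirm its layout.

```python
import json


def parse_db_secret(secret_string):
    """Extract credentials from a JSON secret; the key names are an assumption."""
    secret = json.loads(secret_string)
    return secret["username"], secret["password"]


def redshift_jdbc_url(endpoint_with_port, db_name):
    """Build a JDBC URL from the endpoint (host:port) and database name."""
    return f"jdbc:redshift://{endpoint_with_port}/{db_name}"


# Hypothetical secret; a real one could be fetched with
#   boto3.client("secretsmanager").get_secret_value(SecretId="<secret ARN>")["SecretString"]
sample_secret = '{"username": "awsuser", "password": "example-password"}'
user, password = parse_db_secret(sample_secret)

url = redshift_jdbc_url(
    "example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439",  # hypothetical endpoint
    "redshiftdb",
)
print(user, url)
```

Reading the credentials from the secret at run time avoids hardcoding the password in the notebook.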
Connecting to Amazon Aurora MySQL-compatible
Amazon Aurora is a MySQL-compatible relational database built for the cloud, which combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases. To connect to Amazon Aurora from the Amazon SageMaker R kernel using RJDBC, we use the MariaDB JDBC driver, which is already downloaded to the notebook instance via the lifecycle configuration script.
You need the following keys and their values from the AWS CloudFormation Outputs tab:
- AuroraClusterDBName – Aurora cluster database name
- AuroraClusterEndpointWithPort – Aurora cluster endpoint address with port number
- AuroraClusterSecret – Aurora cluster credentials secret ARN
The CloudFormation template creates a secret for the Aurora cluster in Secrets Manager.
- On the AWS Secrets Manager Console, locate the secret as denoted by the
AuroraClusterSecret
key value.
- In the Secret value section, choose Retrieve secret value to get the user name and password for the Aurora cluster.
To connect to the cluster, you follow similar steps as with other services.
- On the Amazon SageMaker Console, open your notebook.
- Choose Open Jupyter.
- Upload the ipynb notebook.
- Alternatively, open a new notebook and enter the code as mentioned in “aurora_connect.ipynb”, replacing the values for AuroraClusterEndpointWithPort, AuroraClusterDBName, and AuroraClusterSecret.
- Run all the cells in the notebook to connect to Amazon Aurora on the Amazon SageMaker R console.
Conclusion
In this post, we demonstrated how to connect to various data sources in your environment, such as Hive and PrestoDB on Amazon EMR, Amazon Athena, Amazon Redshift, and Amazon Aurora MySQL-compatible clusters, to analyze, profile, and run statistical computations using R from Amazon SageMaker. You can extend this method to other data sources via JDBC.
Author Bio
Kunal Ghosh is a Solutions Architect at AWS. His passion is building efficient and effective solutions on the cloud, especially involving analytics, AI, data science, and machine learning. Besides family time, he likes reading, swimming, biking, and watching movies, and he is a foodie.
Gagan Brahmi is a Specialist Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.
Training a custom single class object detection model with Amazon Rekognition Custom Labels
Customers often need to identify single objects in images; for example, to identify their company’s logo, find a specific industrial or agricultural defect, or locate a specific event, like hurricanes, in satellite scans. In this post, we showcase how to train a custom model to detect a single object using Amazon Rekognition Custom Labels.
Amazon Rekognition is a fully managed service that provides computer vision (CV) capabilities for analyzing images and video at scale, using deep learning technology without requiring machine learning (ML) expertise. Amazon Rekognition Custom Labels lets you extend the detection and classification capabilities of the Amazon Rekognition pre-trained APIs by using data to train a custom CV model specific to your business needs. With the latest update to support single object training, Amazon Rekognition Custom Labels now lets you create a custom object detection model with single object classes.
Solution overview
To show you how the single class object detection feature works, we create a custom model to detect pizza in images. Because we only care about finding pizza in our images, we don’t want to create labels for other food types or create a “not pizza” label.
To create our custom model, we follow these steps:
- Create a project in Amazon Rekognition Custom Labels.
- Create a dataset with images containing one or more pizzas.
- Label the images by applying bounding boxes on all pizzas in the images using the user interface provided by Amazon Rekognition Custom Labels.
- Train the model and evaluate the performance.
- Test the new custom model using the automatically generated API endpoint.
Amazon Rekognition Custom Labels lets you manage the ML model training process on the Amazon Rekognition console, which simplifies the end-to-end process.
Creating your project
To create your pizza-detection project, complete the following steps:
- On the Amazon Rekognition console, choose Custom Labels.
- Choose Get Started.
- For Project name, enter PizzaDetection.
- Choose Create project.
You can also create a project on the Projects page. You can access the Projects page via the left navigation pane.
Creating your dataset
To create your pizza model, you first need to create a dataset to train the model with. For this post, our dataset is composed of 39 images that contain pizza. We sourced our images from pexels.com.
To create your dataset:
- Choose Create dataset.
- Select Upload images from your computer.
- Choose Add Images.
- Upload your images. You can always add more images later.
Labeling the images with bounding boxes
You’re now ready to label the images by applying bounding boxes on all images with pizza.
- Add Pizza as a label to your dataset via the labels list on the left side of the gallery.
- Apply the label to the pizzas in the images by selecting all the images with pizza and choosing Draw Bounding Box.
You can use the Shift key to automatically select multiple images between the first and last selected images.
Make sure to draw a bounding box that covers the pizza as tightly as possible.
Training your model
After you label your images, you’re ready to train your model.
- Choose Train Model.
- For Choose project, choose your PizzaDetection project.
- For Choose training dataset, choose your PizzaImages dataset.
As part of the training, Amazon Rekognition Custom Labels requires a labeled test dataset. You use the test dataset to verify how well the trained model predicts the correct labels and to generate evaluation metrics. You don’t use the images in the test dataset to train your model; they should represent the types of images you want your model to analyze.
- For Create test set, choose how you want to provide your test dataset.
Amazon Rekognition Custom Labels provides three options:
- Choose an existing test dataset
- Create a new test dataset
- Split training dataset
For this post, we select Split training dataset and let Amazon Rekognition hold back 20% of the images for testing and use the remaining 80% of the images to train the model.
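To get a feel for what an 80/20 split means for a dataset of 39 images, the following Python sketch performs a simple random split. It only illustrates the arithmetic; it is not how Amazon Rekognition Custom Labels selects its test images, and the file names, seed, and rounding rule are assumptions.

```python
import random


def split_dataset(items, test_fraction=0.2, seed=0):
    """Shuffle a copy of items and hold back a fraction for testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names for the 39 images in the dataset.
images = [f"pizza_{i:02d}.jpg" for i in range(39)]
train, test = split_dataset(images)
print(len(train), len(test))
```

With this rounding rule, 39 images yield 31 training images and 8 test images.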
Our model took approximately 1 hour to train. The training time required for your model depends on many factors, including the number of images provided in the dataset and the complexity of the model.
When training is complete, Amazon Rekognition Custom Labels outputs key metrics with every training, including F1 score, precision, recall, and the assumed threshold for each label. For more information about metrics, see Metrics for Evaluating Your Model.
Looking at our evaluation results, our model has a precision of 1.0, which means that no objects were mistakenly identified as pizza (false positives) in our test set. Our model did miss some pizzas in our test set (false negatives), which is reflected in our recall score of 0.81. You can often use the F1 score as an overall quality score because it takes both precision and recall into account. Finally, we see that our assumed threshold to generate the F1 score, precision, and recall metrics for Pizza is 0.61. By default, our model returns predictions above this assumed threshold. We can increase the recall for this model if we lower the confidence threshold. However, this would most likely cause a drop in precision.
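The F1 score mentioned above is the harmonic mean of precision and recall. As a worked example, the following Python sketch computes it from the precision and recall reported for our model; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the Pizza label in this post.
f1 = f1_score(1.0, 0.81)
print(round(f1, 3))  # 0.895
```

Because the harmonic mean is pulled toward the lower of the two values, the missed pizzas (recall of 0.81) keep the F1 score below 0.9 even though precision is perfect.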
We can also choose View Test Results to see each test image and how our model performed. The following screenshot shows an example of a correctly identified image of pizza during the model testing (true positive).
Testing your model
Your custom pizza detection model is now ready for use. Amazon Rekognition Custom Labels provides the API calls for starting and using the model; you don’t need to deploy, provision, or manage any infrastructure. The following screenshot shows the API calls for using the model.
By using the API, we tried our model on a new test set of images from pexels.com.
For example, the following image shows a pizza on a table with other objects.
The model detects the pizza with a confidence of 91.72% and a correct bounding box. The following code is the JSON response received by the API call:
{
"CustomLabels": [
{
"Name": "Pizza",
"Confidence": 91.7249984741211,
"Geometry": {
"BoundingBox": {
"Width": 0.7824199795722961,
"Height": 0.3644999861717224,
"Left": 0.11868999898433685,
"Top": 0.37672001123428345
}
}
}
]
}
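The bounding box values in this response are expressed as ratios of the image width and height. The following Python sketch converts them to pixel coordinates so that you can draw the box on the image. The 1000 x 800 image size and the helper names are assumptions for illustration only.

```python
# Response from the API call shown above.
response = {
    "CustomLabels": [
        {
            "Name": "Pizza",
            "Confidence": 91.7249984741211,
            "Geometry": {
                "BoundingBox": {
                    "Width": 0.7824199795722961,
                    "Height": 0.3644999861717224,
                    "Left": 0.11868999898433685,
                    "Top": 0.37672001123428345,
                }
            },
        }
    ]
}


def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a relative bounding box to (left, top, width, height) in pixels."""
    return (
        round(bounding_box["Left"] * image_width),
        round(bounding_box["Top"] * image_height),
        round(bounding_box["Width"] * image_width),
        round(bounding_box["Height"] * image_height),
    )


def pixel_boxes(response, image_width, image_height):
    """Return (label name, pixel box) for every detected label that has a box."""
    return [
        (label["Name"],
         to_pixel_box(label["Geometry"]["BoundingBox"], image_width, image_height))
        for label in response["CustomLabels"]
        if "Geometry" in label
    ]


# Assumed image size of 1000 x 800 pixels.
boxes = pixel_boxes(response, 1000, 800)
print(boxes)  # [('Pizza', (119, 301, 782, 292))]
```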
The following image has a confidence score of 98.40.
The following image has a confidence score of 96.51.
The following image has an empty JSON result, as expected, because the image doesn’t contain pizza.
The following image also has an empty JSON result.
In addition to using the API, you can also use the Custom Labels Demonstration. This AWS CloudFormation template enables you to set up a custom, password-protected UI where you can start and stop your models and run demonstration inferences.
Conclusion
In this post, we showed you how to create a single class object detection model with Amazon Rekognition Custom Labels. This feature makes it easy to train a custom model that can detect an object class without needing to specify other objects or losing accuracy in its results.
For more information about using custom labels, see What Is Amazon Rekognition Custom Labels?
About the Author
Woody Borraccino is a Senior AI Solutions Architect at AWS.
Amazon’s Machine Learning University is making its online courses available to the public
Classes previously only available to Amazon employees will now be available to the community.
Increasing the relevance of your Amazon Personalize recommendations by leveraging contextual information
Getting relevant recommendations in front of your users at the right time is a crucial step for the success of your personalization strategy. However, your customer’s decision-making process shifts depending on the context at the time when they’re interacting with your recommendations. In this post, I show you how to set up and query a context-aware Amazon Personalize deployment.
Amazon Personalize allows you to easily add sophisticated personalization capabilities to your applications by using the same machine learning (ML) technology used on Amazon.com for over 20 years. No ML experience is required. Amazon Personalize supports the automatic adjustment of recommendations based on contextual information about your user, such as device type, location, time of day, or other information you provide.
The Harvard study How Context Affects Choice defines context as factors that can influence the choice outcome by altering the process by which a decision is made. As a business owner, you can identify this context by analyzing how your customers shop differently when accessing your catalog from a phone vs. a computer, or seeing the shift in your customer’s content consumption on rainy vs. sunny days.
Leveraging your user’s context allows you to provide a more personalized experience for existing users and helps decrease the cold-start phase for new or unidentified users. The cold-start phase refers to the period when your recommendation engine provides non-personalized recommendations due to the lack of historical information regarding that user.
Adding context to Amazon Personalize
You can set up and use context in Amazon Personalize in four simple steps:
- Include your user’s context in the historical user-item interactions dataset.
- Train a context aware solution with a User Personalization or Personalized Ranking recipe. A recipe refers to the algorithm your recommender is trained on using the behavioral data specified in your interactions dataset plus any user or items metadata.
- Specify the user’s context when querying for real-time recommendations using the GetRecommendations or GetPersonalizedRanking API.
- Include your user’s context when recording events using the event tracker.
The following diagram illustrates the architecture of these steps.
You want to be explicit about the context to consider when constructing datasets. A common example of context customers actively use is device type, such as a phone, tablet, or desktop. The study The Effect of Device Type on Buying Behavior in Ecommerce: An Exploratory Study from the University of Twente in the Netherlands has proven that device type has an influence on buying behavior and people might postpone a buying decision if they’re online with the wrong device type. Embedding device type context in your datasets allows Amazon Personalize to learn this pattern and, at inference time, recommend the most appropriate content with awareness of the user’s context.
Recommendations use case
For this use case, a travel enthusiast is our potential customer. They look at a few things when deciding which airline to travel with to their given destination. For example, is it a short or a long flight? Will the trip be booked with cash or with miles? Are they traveling alone? Where are they departing from and returning to? After they answer these initial questions, the next big decision is picking the cabin type to fly in. If our travel enthusiast is flying in a high-end cabin type, we can assume they’re looking at which airline provides the best experience possible. Now that we have a good idea of what our user is looking for, it’s shopping time!
Consider some of the variables that go into the decision-making process of this use case. We can’t control many of these factors, but we can use some to tailor our recommendations. First, identify common denominators that might affect a user’s behavior. In this case, flight duration and cabin type are good candidates to use as context, and traveler type and traveler residence are good candidates for user metadata when building our recommendation datasets. Metadata is information you know about your users and items that stays somewhat constant over a period of time, whereas context is environmental information that can shift rapidly across time, influencing your customer’s perception and behavior.
Selecting the most relevant metadata fields in your training datasets and enriching your interactions datasets with context is important for generating relevant user recommendations. In this post, we build an Amazon Personalize deployment that returns a list of airline recommendations for a customer. We add cabin type as the context and traveler residence as the metadata field and observe how recommendations shift based on context and metadata.
Prerequisites
We first need to set up the following Amazon Personalize resources. For full instructions, see Getting Started (Console) to complete the following steps:
- Create a dataset group. In this post, we name it airlines-blog-example.
- Create an Interactions dataset using the following schema and import data using the interactions_dataset.csv file:
{
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        { "name": "ITEM_ID", "type": "string" },
        { "name": "USER_ID", "type": "string" },
        { "name": "TIMESTAMP", "type": "long" },
        { "name": "CABIN_TYPE", "type": "string", "categorical": true },
        { "name": "EVENT_TYPE", "type": "string" },
        { "name": "EVENT_VALUE", "type": "float" }
    ],
    "version": "1.0"
}
- Create a Users dataset using the following schema and import data using the users_dataset.csv file:
{
    "type": "record",
    "name": "Users",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        { "name": "USER_ID", "type": "string" },
        { "name": "USER_RESIDENCE", "type": "string", "categorical": true }
    ],
    "version": "1.0"
}
- Create a solution. In this post, we use the default solution configurations, except for the following:
- Recipe – aws-hrnn-metadata
- Event type – RATING
- Perform HPO – True
Hyperparameter optimization (HPO) is recommended if you want Amazon Personalize to run parallel trainings and experiments to identify the most performant hyperparameters. For more information, see Hyperparameters and HPO.
- Create a campaign.
You can set up the preceding resources on the Amazon Personalize console or by following the Jupyter notebook personalize_hrnn_metadata_contextual_example.ipynb example on the GitHub repo.
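If you script the setup instead of using the console, the two schemas above can be assembled and serialized in Python. The following is a minimal sketch, not the code from the notebook: build_schema and register_schema are hypothetical helper names, and register_schema assumes you pass in a Boto3 Personalize client, whose create_schema call takes the schema as a JSON string.

```python
import json


def build_schema(name, fields):
    """Return a schema JSON string in the shape used in this post."""
    return json.dumps({
        "type": "record",
        "name": name,
        "namespace": "com.amazonaws.personalize.schema",
        "fields": fields,
        "version": "1.0",
    })


# CABIN_TYPE is the contextual field; it is marked as categorical.
INTERACTIONS_FIELDS = [
    {"name": "ITEM_ID", "type": "string"},
    {"name": "USER_ID", "type": "string"},
    {"name": "TIMESTAMP", "type": "long"},
    {"name": "CABIN_TYPE", "type": "string", "categorical": True},
    {"name": "EVENT_TYPE", "type": "string"},
    {"name": "EVENT_VALUE", "type": "float"},
]

# USER_RESIDENCE is the user metadata field.
USERS_FIELDS = [
    {"name": "USER_ID", "type": "string"},
    {"name": "USER_RESIDENCE", "type": "string", "categorical": True},
]

interactions_schema = build_schema("Interactions", INTERACTIONS_FIELDS)
users_schema = build_schema("Users", USERS_FIELDS)


def register_schema(personalize_client, schema_name, schema_json):
    """Register a schema; personalize_client is assumed to be boto3.client("personalize")."""
    response = personalize_client.create_schema(name=schema_name, schema=schema_json)
    return response["schemaArn"]
```

Keeping the field lists as plain Python data makes it easy to add another contextual field later without hand-editing JSON.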
Exploring your Amazon Personalize resources
We have now created several Amazon Personalize resources, including a dataset group called airlines-blog-example. The dataset group contains two datasets, interactions and users, which contain the data used to train your Amazon Personalize model (also known as a solution). We also created a campaign to provide real-time recommendations.
We can now explore how the interactions and users dataset schemas help our model learn from the context and metadata embedded in the datasets.
Interactions dataset
We provide Amazon Personalize an interactions dataset with a numeric rating (combination of EVENT_TYPE + EVENT_VALUE) that a user (USER_ID) has given an airline (ITEM_ID) when flying in a certain cabin type (CABIN_TYPE) at a given time (TIMESTAMP). By providing this information to Amazon Personalize in the dataset and schema, we can add CABIN_TYPE as the context when querying the recommendations for a user and recording new interactions through the event tracker. At training time, the model automatically identifies important features from this data (for our use case, the highest rated airlines across cabin types).
The following screenshot showcases a small portion of the interactions_dataset.csv file.
User dataset
We also provide Amazon Personalize a user dataset with the users (USER_ID) who provided the ratings in the interactions dataset, assuming that they gave the rating from their country of residence (USER_RESIDENCE). In this use case, USER_RESIDENCE is the metadata we picked for these users. By providing USER_RESIDENCE as user metadata, the model can learn which airlines are interacted with the most by users across countries and regions, so when we query for recommendations, it takes USER_RESIDENCE into consideration. For example, users in Asia see different airline options compared to users in South America or Europe.
The following screenshot shows a small portion of the users_dataset.csv file.
The raw dataset of user airline ratings from Skytrax contains 20 columns with over 40,000 records. In this post, we use a modified version of this dataset and split the most relevant columns of the raw dataset into two datasets (users and interactions). For more information about splitting the data in a Jupyter notebook, see personalize_hrnn_metadata_contextual_example.ipynb on the GitHub repo.
The next section shows how context and metadata influence the real-time recommendations provided by your Amazon Personalize campaign.
Applying context to your Amazon Personalize real-time recommendations queries
During this test, we observe the effect that context has on the recommendations provided to users. In our use case, we have an interactions dataset of numerical airline ratings from multiple users. In our schemas, the cabin type is included as a categorical value for the interactions dataset and the user residence as a metadata field in the users dataset. Our theory is that by adding the cabin type as context, the airline recommendations will shift to account for it.
- On your Amazon Personalize dataset group dashboard, choose View campaigns.
- Choose your newly created campaign.
- For User ID, enter JDowns.
- Choose Get recommendations.
You should see a Test campaign results page similar to the following screenshot.
We initially queried a list of airlines for our user without any context. We now focus on the top 10 recommendations and verify that they shift based on the context. We can add the context via the console by providing a key and value pair. In our use case, the key is CABIN_TYPE and the value can be one of the following:
- Economy
- Premium Economy
- Business Class
- First Class
The following two screenshots show our results for querying recommendations for the same user with Economy and First Class as values for the CABIN_TYPE context. The economy context doesn’t shift the top 10 list, but the first class context does have an effect, bumping Alaska Airlines to first place on the list.
You can explore your users_dataset.csv file for additional users to test your recommendations API, and you should observe a similar shift in recommendations based on the context you include in the API call. You can also find that the airlines list shifts based on the USER_RESIDENCE metadata field. For example, the following screenshots show the top 10 recommendations for our JDowns user, who has United States as the value for USER_RESIDENCE, compared to the PhillipHarris user, who has France as the value for USER_RESIDENCE.
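The same test can be run outside the console. The following is a minimal sketch that assumes a Boto3 personalize-runtime client and its get_recommendations call, which accepts an optional context map. The helper names are hypothetical, and the campaign ARN is a placeholder you would replace with your own.

```python
def build_recommendation_request(campaign_arn, user_id, cabin_type=None, num_results=10):
    """Build keyword arguments for get_recommendations, with optional CABIN_TYPE context."""
    request = {
        "campaignArn": campaign_arn,
        "userId": user_id,
        "numResults": num_results,
    }
    if cabin_type is not None:
        # The context key must match the contextual field in the Interactions schema.
        request["context"] = {"CABIN_TYPE": cabin_type}
    return request


def top_airlines(runtime_client, campaign_arn, user_id, cabin_type=None):
    """Return recommended item IDs; runtime_client is assumed to be
    boto3.client("personalize-runtime")."""
    request = build_recommendation_request(campaign_arn, user_id, cabin_type)
    response = runtime_client.get_recommendations(**request)
    return [item["itemId"] for item in response["itemList"]]
```

Calling top_airlines once without cabin_type and once with "First Class" for the same user lets you compare the two lists side by side, as we did on the console.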
Conclusion
As shown in this post, adding context to your recommendation strategy is a powerful and easy-to-implement exercise when using Amazon Personalize. Enriching your recommendations with context can increase user engagement, which in turn can increase the revenue influenced by your recommendations.
This post showed you how to create an Amazon Personalize context-aware deployment and an end-to-end test of getting real-time recommendations applying context via the Amazon Personalize console. For instructions on using a Jupyter environment to set up the Amazon Personalize infrastructure and get recommendations using the Boto3 Python SDK, see personalize_hrnn_metadata_contextual_example.ipynb on the GitHub repo.
There’s even more that you can do with Amazon Personalize. For more information about core use cases and automation examples, see the GitHub repo.
If this post helps you or inspires you to solve a problem, share your thoughts and questions in the comments.
About the Author
Luis Lopez Soria is an AI/ML specialist solutions architect working with the AWS machine learning team. He works with AWS customers to help them adopt machine learning on a large scale. He enjoys playing sports, traveling around the world, and exploring new foods and cultures.
Amazon Forecast can now use Convolutional Neural Networks (CNNs) to train forecasting models up to 2X faster with up to 30% higher accuracy
We’re excited to announce that Amazon Forecast can now use Convolutional Neural Networks (CNNs) to train forecasting models up to 2X faster with up to 30% higher accuracy. CNN algorithms are a class of neural network-based machine learning (ML) algorithms that play a vital role in Amazon.com’s demand forecasting system and enable Amazon.com to predict demand for over 400 million products every day. For more information about Amazon.com’s journey building demand forecasting technology using CNN models, watch the re:MARS 2019 keynote video. Forecast brings the same technology used at Amazon.com into the hands of everyday developers as a fully managed service. Anyone can start using Forecast, without any prior ML experience, by using the Forecast console or the API.
Forecasting is the science of predicting the future. By examining historical trends, businesses can make a call on what might happen and when, and build that into their future plans for everything from product demand to inventory to staffing. Given the consequences of forecasting, accuracy matters. If a forecast is too high, businesses over-invest in products and staff, which ends up as wasted investment. If the forecast is too low, they under-invest, which leads to a shortfall in inventory and a poor customer experience. Today, businesses try to use everything from simple spreadsheets to complex financial planning software to generate forecasts, but high accuracy remains elusive for two reasons:
- Traditional forecasts struggle to incorporate very large volumes of historical data, missing out on important signals from the past that are lost in the noise.
- Traditional forecasts rarely incorporate related but independent data, which can offer important context (such as sales, holidays, locations, and marketing promotions). Without the full history and the broader context, most forecasts fail to predict the future accurately.
At Amazon, we have learned over the years that no one algorithm delivers the most accurate forecast for all types of data. Traditional statistical models have been useful in predicting demand for products that have regular demand patterns, such as sunscreen lotions in the summer and woolen clothes in the winter. However, statistical models can’t deliver accurate forecasts for more complex scenarios, such as frequent price changes, differences between regional versus national demand, products with different selling velocities, and the addition of new products. Sophisticated deep learning models can provide higher accuracy in these use cases. Forecast automatically examines your data and selects the best algorithm across a set of statistical and deep learning algorithms to train the most accurate forecasting model for your data. With the addition of the CNN-based deep learning algorithm, Forecast can now further improve accuracy by up to 30% and train models up to 2X faster compared to the currently supported algorithms. This new algorithm can more accurately detect leading indicators of demand, such as pre-order information, product page visits, price changes, and promotional spikes, to build more accurate forecasts.
More Retail, a market leader in the fresh food and grocery category in India, participated in a beta test of the new CNN algorithm, with the help of Ganit, an analytics partner. Supratim Banerjee, Chief Transformation Officer at More Retail Limited, says, “At More, we rapidly innovate to sustain our business and beat competition. We have been looking for opportunities to reduce wastage due to over stocking, while continuing to meet customer demand. In our experiments for the fresh produce category, we found the new CNN algorithm in Amazon Forecast to be 1.7X more accurate compared to our existing forecasting system. This translates into massive cost savings for our business.”
Training a CNN predictor and creating forecasts
You can start using CNNs in Forecast through the CreatePredictor API or on the Forecast console. In this section, we walk through a series of steps required to train a CNN predictor and create forecasts within Forecast.
- On the Forecast console, create a dataset group.
- Upload your dataset.
- Choose Predictors from the navigation pane.
- Choose Train predictor.
- For Algorithm selection, select Manual.
- For Algorithm, choose CNN-QR.
To manually select CNN-QR through the CreatePredictor API, use arn:aws:forecast:::algorithm/CNN-QR for the AlgorithmArn.
When you choose CNN-QR from the drop-down menu, the Advanced Configuration section auto-expands.
- To let Forecast train the most optimized and accurate CNN model for your data, select Perform hyperparameter optimization (HPO).
- After you enter all your details on the Predictors page, choose Train predictor.
After your predictor is trained, you can view its details by choosing your predictor on the Predictors page. On the predictor’s details page, you can view the accuracy metrics and optimized hyperparameters for your model.
- Now that your model is trained, choose Forecasts from the navigation pane.
- Choose Create a forecast.
- Create a forecast using your trained predictor.
You can generate forecasts at any quantile to balance your under-forecasting and over-forecasting costs.
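The console steps above map onto the CreatePredictor and CreateForecast APIs. The following is a minimal sketch of the request parameters, assuming a Boto3 forecast client. The helper names, the daily frequency default, and the three example quantiles are our assumptions for illustration; only the CNN-QR algorithm ARN comes from this post.

```python
CNN_QR_ARN = "arn:aws:forecast:::algorithm/CNN-QR"


def build_cnn_qr_predictor_request(predictor_name, dataset_group_arn,
                                   forecast_horizon, forecast_frequency="D",
                                   perform_hpo=True):
    """Build keyword arguments for create_predictor with CNN-QR selected manually."""
    return {
        "PredictorName": predictor_name,
        "AlgorithmArn": CNN_QR_ARN,
        "ForecastHorizon": forecast_horizon,
        "PerformAutoML": False,  # manual algorithm selection
        "PerformHPO": perform_hpo,
        "InputDataConfig": {"DatasetGroupArn": dataset_group_arn},
        "FeaturizationConfig": {"ForecastFrequency": forecast_frequency},
    }


def build_forecast_request(forecast_name, predictor_arn, quantiles=("0.1", "0.5", "0.9")):
    """Build keyword arguments for create_forecast at the chosen quantiles."""
    return {
        "ForecastName": forecast_name,
        "PredictorArn": predictor_arn,
        "ForecastTypes": list(quantiles),
    }


def train_cnn_qr_predictor(forecast_client, **kwargs):
    """Start training; forecast_client is assumed to be boto3.client("forecast")."""
    response = forecast_client.create_predictor(**build_cnn_qr_predictor_request(**kwargs))
    return response["PredictorArn"]
```

Training is asynchronous, so the returned predictor ARN is not usable for create_forecast until the predictor reaches an active state.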
Choosing the most accurate model with Forecast
With this launch, Forecast now supports one proprietary CNN model, one proprietary RNN model, and four other statistical models: Prophet, NPTS (Amazon proprietary), ARIMA, and ETS. The new CNN model is part of AutoML. We recommend always starting your experimentation with AutoML, in which Forecast finds the most optimized and accurate model for your dataset.
- On the Train predictor page, for Algorithm selection, select Automatic (AutoML).
- After your predictor is trained using AutoML, choose the predictor to see more details on the chosen algorithm.
- On the predictor’s details page, in the Algorithm metrics section, choose different algorithms from the drop-down menu to view their accuracy for comparison.
Tips and best practices
As you begin to experiment with CNNs and build your demand planning solutions on top of Forecast, consider the following tips and best practices:
- For experimentation, start by identifying the item IDs that matter most to your business and for which you want to improve forecasting accuracy. Measure the accuracy of your existing forecasting methodology as a baseline.
- Use Forecast with only your target time series and assess the wQuantileLoss accuracy metric. We recommend selecting AutoML in Forecast to find the most optimized and accurate model for your data. For more information, see Evaluating Predictor Accuracy.
- AutoML optimizes for accuracy and not training time, so AutoML may take longer to optimize your model. If training time is a concern for you, we recommend manually selecting CNN-QR and assessing its accuracy and training time. A slight degradation in accuracy may be an acceptable trade-off for considerable gains in training time.
- After you see an increase in accuracy over your baseline, we recommend experimenting to find the right forecasting quantile that balances your under-forecasting and over-forecasting costs to your business.
- We recommend deploying your model as a continuous workload within your systems to start reaping the benefits of more accurate forecasts. You can continue to experiment by adding related time series and item metadata to further improve the accuracy.
- Incrementally add related time series or item metadata to train your model to assess whether additional information improves accuracy. Different combinations of related time series and item metadata can give you different results.
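To compare your baseline against Forecast on the same footing, it can help to compute the weighted quantile loss yourself. The following sketch implements the metric as we understand its documented definition: twice the summed quantile (pinball) loss, normalized by the sum of absolute actual values. The function name and the sample numbers are ours, for illustration only.

```python
def weighted_quantile_loss(actuals, quantile_forecasts, tau):
    """Weighted quantile loss at quantile tau (0 < tau < 1).

    Under-forecasts are weighted by tau and over-forecasts by (1 - tau),
    so a high tau penalizes under-forecasting more heavily.
    """
    if len(actuals) != len(quantile_forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    total_abs = sum(abs(y) for y in actuals)
    if total_abs == 0:
        raise ValueError("sum of absolute actuals is zero; metric is undefined")
    penalty = sum(
        tau * max(y - q, 0.0) + (1.0 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, quantile_forecasts)
    )
    return 2.0 * penalty / total_abs


# Illustrative numbers only: three periods of demand and a P90 forecast.
demand = [10, 20, 30]
p90_forecast = [12, 18, 33]
p90_loss = weighted_quantile_loss(demand, p90_forecast, 0.9)
```

Lower values are better. Evaluating the same forecasts at several values of tau shows how the cost of under-forecasting versus over-forecasting changes with the quantile you choose.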
Conclusion
The new CNN algorithm is available in all Regions where Forecast is publicly available. For more information about Region availability, see Region Table. For more information about the CNN algorithm, see CNN-QR algorithm documentation.
About the authors
Namita Das is a Sr. Product Manager for Amazon Forecast. Her current focus is to democratize machine learning by building no-code/low-code ML services. She frequently advises startups and has started dabbling in baking.
Danielle Robinson is an Applied Scientist on the Amazon Forecast team. Her research is in time series forecasting and in particular how we can apply new neural network-based algorithms within Amazon Forecast. Her thesis research was focused on developing new, robust, and physically accurate numerical models for computational fluid dynamics. Her hobbies include cooking, swimming, and hiking.
Aaron Spieler is a working student in the Amazon Forecast team. He is starting his master’s degree at the University of Tuebingen, and studied Data Engineering at Hasso Plattner Institute after obtaining a BS in Computer Science from University of Potsdam. His research interests span time series forecasting (especially using neural network models), machine learning, and computational neuroscience.
Gunjan Garg is a Sr. Software Development Engineer in the AWS Vertical AI team. In her current role at Amazon Forecast, she focuses on engineering problems and enjoys building scalable systems that provide the most value to end-users. In her free time, she enjoys playing Sudoku and Minesweeper.
Chinmay Bapat is a Software Development Engineer in the Amazon Forecast team. His interests lie in the applications of machine learning and building scalable distributed systems. Outside of work, he enjoys playing board games and cooking.
Securing Amazon Comprehend API calls with AWS PrivateLink
Amazon Comprehend now supports Amazon Virtual Private Cloud (Amazon VPC) endpoints via AWS PrivateLink so you can securely initiate API calls to Amazon Comprehend from within your VPC and avoid using the public internet.
Amazon Comprehend is a fully managed natural language processing (NLP) service that uses machine learning (ML) to find meaning and insights in text. You can use Amazon Comprehend to analyze text documents and identify insights such as sentiment, people, brands, places, and topics in text. No ML expertise required.
Using AWS PrivateLink, you can access Amazon Comprehend easily and securely by keeping your network traffic within the AWS network, while significantly simplifying your internal network architecture. It enables you to privately access Amazon Comprehend APIs from your VPC in a scalable manner by using interface VPC endpoints. A VPC endpoint is an elastic network interface in your subnet with a private IP address that serves as the entry point for all Amazon Comprehend API calls.
In this post, we show you how to set up a VPC endpoint and enforce the use of this private connectivity for all requests to Amazon Comprehend using AWS Identity and Access Management (IAM) policies.
Prerequisites
For this example, you should have an AWS account and sufficient access to create resources in the following services:
- Amazon Comprehend
- AWS IAM
- AWS Lambda
- AWS PrivateLink
- Amazon Simple Storage Service (Amazon S3)
Solution overview
The walkthrough includes the following high-level steps:
- Deploy your resources.
- Create VPC endpoints.
- Enforce private connectivity with IAM.
- Use Amazon Comprehend via AWS PrivateLink.
Deploying your resources
For your convenience, we have supplied an AWS CloudFormation template to automate the creation of all prerequisite AWS resources. We use the us-east-2 Region in this post, so the console and URLs may differ depending on the Region you select. To use this template, complete the following steps:
- Choose Launch Stack:
- Confirm the following parameters, which you can leave at the default values:
- SubnetCidrBlock1 – The primary IPv4 CIDR block assigned to the first subnet. The default value is 10.0.1.0/24.
- SubnetCidrBlock2 – The primary IPv4 CIDR block assigned to the second subnet. The default value is 10.0.2.0/24.
- Acknowledge that AWS CloudFormation may create additional IAM resources.
- Choose Create stack.
The creation process should take roughly 10 minutes to complete.
The CloudFormation template creates the following resources on your behalf:
- A VPC with two private subnets in separate Availability Zones
- VPC endpoints for private Amazon S3 and Amazon Comprehend API access
- IAM roles for use by Lambda and Amazon Comprehend
- An IAM policy to enforce the use of VPC endpoints to interact with Amazon Comprehend
- An IAM policy for Amazon Comprehend to access data in Amazon S3
- An S3 bucket for storing open-source data
The next two sections detail how to manually create a VPC endpoint for Amazon Comprehend and enforce usage with an IAM policy. If you deployed the CloudFormation template and prefer to skip to testing the API calls, you can advance to the Using Amazon Comprehend via AWS PrivateLink section.
Creating VPC endpoints
To create a VPC endpoint, complete the following steps:
- On the Amazon VPC console, choose Endpoints.
- Choose Create Endpoint.
- For Service category, select AWS services.
- For Service Name, choose com.amazonaws.us-east-2.comprehend.
- For VPC, enter the VPC you want to use.
- For Availability Zone, select your preferred Availability Zones.
- For Enable DNS name, select Enable for this endpoint.
This creates a private hosted zone that enables you to access the resources in your VPC using custom DNS domain names, such as example.com, instead of using private IPv4 addresses or private DNS hostnames provided by AWS. The Amazon Comprehend DNS hostname that the AWS Command Line Interface (CLI) and Amazon Comprehend SDKs use by default (https://comprehend.Region.amazonaws.com) resolves to your VPC endpoint.
- For Security group, choose the security group to associate with the endpoint network interface.
If you don’t specify a security group, the default security group for your VPC is associated.
- Choose Create Endpoint.
When the Status changes to available, your VPC endpoint is ready for use.
- Choose the Policy tab to apply more restrictive access control to the VPC endpoint.
The following example policy limits VPC endpoint access to an IAM role used by a Lambda function in our deployment. You should apply the principle of least privilege when defining your own policy. For more information, see Controlling access to services with VPC endpoints.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "comprehend:DetectEntities",
                "comprehend:CreateDocumentClassifier"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::#########:role/ComprehendPrivateLink-LambdaExecutionRole"
                ]
            }
        }
    ]
}
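If you prefer to script the endpoint instead of using the console, the steps above correspond to the Amazon EC2 CreateVpcEndpoint API. The following is a minimal sketch that assumes a Boto3 ec2 client. The helper names are hypothetical, and the VPC, subnet, and security group IDs are values you supply.

```python
import json


def build_endpoint_request(region, vpc_id, subnet_ids, security_group_ids, policy=None):
    """Build keyword arguments for create_vpc_endpoint (interface endpoint for Comprehend)."""
    request = {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.comprehend",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        "PrivateDnsEnabled": True,  # lets the default Comprehend hostname resolve to the endpoint
    }
    if policy is not None:
        # The endpoint policy is passed as a JSON string.
        request["PolicyDocument"] = json.dumps(policy)
    return request


def create_comprehend_endpoint(ec2_client, **kwargs):
    """Create the endpoint; ec2_client is assumed to be boto3.client("ec2")."""
    response = ec2_client.create_vpc_endpoint(**build_endpoint_request(**kwargs))
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Passing the example policy above as the policy argument applies the same access restriction at creation time instead of editing the Policy tab afterward.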
Enforcing private connectivity with IAM
To allow or deny access to Amazon Comprehend based on the use of a VPC endpoint, we include an aws:SourceVpce condition in the IAM policy. The following example policy provides access specifically to the DetectEntities and CreateDocumentClassifier APIs only when the request uses your VPC endpoint. You can include additional Amazon Comprehend APIs in the “Action” section of the policy or use “comprehend:*” to include them all. You can attach this policy to an IAM role to enable compute resources hosted within your VPC to interact with Amazon Comprehend.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ComprehendEnforceVpce",
            "Effect": "Allow",
            "Action": [
                "comprehend:CreateDocumentClassifier",
                "comprehend:DetectEntities"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:SourceVpce": "vpce-xxxxxxxx"
                }
            }
        },
        {
            "Sid": "PassRole",
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::#########:role/ComprehendDataAccessRole"
        }
    ]
}
You should replace the VPC endpoint ID with the endpoint ID you created earlier. Permission to invoke the PassRole API is required for asynchronous operations in Amazon Comprehend like CreateDocumentClassifier and should be scoped to your specific data access role.
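Because the endpoint ID and role ARN differ for every deployment, you may want to generate this policy rather than edit it by hand. The following sketch rebuilds the example policy from those two values. build_vpce_policy is a hypothetical helper, and the ARN and endpoint ID in the usage lines are placeholders.

```python
import json


def build_vpce_policy(vpce_id, data_access_role_arn,
                      actions=("comprehend:CreateDocumentClassifier",
                               "comprehend:DetectEntities")):
    """Return the example IAM policy, scoped to one VPC endpoint and one data access role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ComprehendEnforceVpce",
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",
                # Requests that do not arrive through this endpoint are not allowed.
                "Condition": {"StringEquals": {"aws:SourceVpce": vpce_id}},
            },
            {
                "Sid": "PassRole",
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": data_access_role_arn,
            },
        ],
    }


# Placeholder values for illustration only.
policy = build_vpce_policy(
    "vpce-xxxxxxxx",
    "arn:aws:iam::111122223333:role/ComprehendDataAccessRole",
)
policy_json = json.dumps(policy, indent=4)
```

The resulting policy_json string can be pasted into the IAM console or supplied wherever a policy document is expected.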
Using Amazon Comprehend via AWS PrivateLink
To start using Amazon Comprehend with AWS PrivateLink, you perform the following high-level steps:
- Review the Lambda function for API testing.
- Create the DetectEntities test event.
- Train a custom classifier.
Reviewing the Lambda function
To review your Lambda function, on the Lambda console, choose the Lambda function that contains ComprehendPrivateLink in its name.
The VPC section of the Lambda console provides links to the various networking components automatically created for you during the CloudFormation deployment.
The function code includes a sample program that takes user input to invoke the specific Amazon Comprehend APIs supported by our example IAM policy.
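To give a sense of what such a function can look like, the following is a hypothetical sketch, not the code deployed by the CloudFormation template. It dispatches on the comprehend_api field used by the test events in this post. The environment variable names BUCKET_NAME and DATA_ACCESS_ROLE_ARN are our assumptions.

```python
import os


def build_comprehend_call(event, bucket, role_arn):
    """Map a test event to a (client method name, keyword arguments) pair."""
    api = event["comprehend_api"]
    if api == "DetectEntities":
        return "detect_entities", {
            "Text": event["text"],
            "LanguageCode": event["language_code"],
        }
    if api == "CreateDocumentClassifier":
        return "create_document_classifier", {
            "DocumentClassifierName": event["custom_classifier_name"],
            "DataAccessRoleArn": role_arn,
            "InputDataConfig": {
                "S3Uri": f"s3://{bucket}/{event['training_data_s3_key']}"
            },
            "LanguageCode": event["language_code"],
        }
    raise ValueError(f"Unsupported comprehend_api: {api}")


def lambda_handler(event, context):
    """Invoke the requested Amazon Comprehend API and return its response."""
    import boto3  # imported here so the helper above can be tested without the SDK

    method, kwargs = build_comprehend_call(
        event,
        os.environ["BUCKET_NAME"],            # hypothetical variable name
        os.environ["DATA_ACCESS_ROLE_ARN"],   # hypothetical variable name
    )
    client = boto3.client("comprehend")
    response = getattr(client, method)(**kwargs)
    response.pop("ResponseMetadata", None)
    return response
```

Because the function runs inside the VPC with private DNS enabled on the endpoint, the default client configuration is enough; no custom endpoint URL is needed in this sketch.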
Creating a test event
In this section, we create an event to detect entities within sample text using a pretrained model.
- From the Test drop-down menu, choose Create new test event.
- For Event name, enter a name (for example, DetectEntities).
- Replace the event JSON with the following code:
{ "comprehend_api": "DetectEntities", "language_code": "en", "text": "Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 by Jeff Bezos, allowing customers to buy everything from books to blenders." }
- Choose Save to store the test event.
- Choose Save to update the Lambda function.
- Choose Test to invoke the DetectEntities API.
The response should include results similar to the following code:
{
    "Entities": [
        {
            "Score": 0.9266431927680969,
            "Type": "ORGANIZATION",
            "Text": "Amazon.com, Inc.",
            "BeginOffset": 0,
            "EndOffset": 16
        },
        {
            "Score": 0.9952651262283325,
            "Type": "LOCATION",
            "Text": "Seattle, WA",
            "BeginOffset": 31,
            "EndOffset": 42
        },
        {
            "Score": 0.9998188018798828,
            "Type": "DATE",
            "Text": "July 5th, 1994",
            "BeginOffset": 59,
            "EndOffset": 73
        },
        {
            "Score": 0.9999810457229614,
            "Type": "PERSON",
            "Text": "Jeff Bezos",
            "BeginOffset": 77,
            "EndOffset": 87
        }
    ]
}
You can update the test event to identify entities from your own text.
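When you process the response in your own code, the BeginOffset and EndOffset values let you slice each entity back out of the input text. The following sketch groups the entities from the sample response by type and optionally filters by score. entities_by_type is a hypothetical helper, and the 0.95 threshold is an arbitrary value chosen for illustration.

```python
TEXT = (
    "Amazon.com, Inc. is located in Seattle, WA and was founded July 5th, 1994 "
    "by Jeff Bezos, allowing customers to buy everything from books to blenders."
)

# Entities copied from the sample response in this post.
ENTITIES = [
    {"Score": 0.9266431927680969, "Type": "ORGANIZATION",
     "Text": "Amazon.com, Inc.", "BeginOffset": 0, "EndOffset": 16},
    {"Score": 0.9952651262283325, "Type": "LOCATION",
     "Text": "Seattle, WA", "BeginOffset": 31, "EndOffset": 42},
    {"Score": 0.9998188018798828, "Type": "DATE",
     "Text": "July 5th, 1994", "BeginOffset": 59, "EndOffset": 73},
    {"Score": 0.9999810457229614, "Type": "PERSON",
     "Text": "Jeff Bezos", "BeginOffset": 77, "EndOffset": 87},
]


def entities_by_type(text, entities, min_score=0.0):
    """Group entity spans by type, slicing them from the text by offset."""
    grouped = {}
    for entity in entities:
        if entity["Score"] < min_score:
            continue
        span = text[entity["BeginOffset"]:entity["EndOffset"]]
        grouped.setdefault(entity["Type"], []).append(span)
    return grouped


all_entities = entities_by_type(TEXT, ENTITIES)
confident_entities = entities_by_type(TEXT, ENTITIES, min_score=0.95)
```

With the threshold applied, the ORGANIZATION entity is dropped because its score is below 0.95, while the other three remain.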
Training a custom classifier
We now demonstrate how to build a custom classifier. For training data, we use a version of the Yahoo answers corpus that is preprocessed into the format expected by Amazon Comprehend. This corpus, available on the AWS Open Data Registry, is cited in the paper Text Understanding from Scratch by Xiang Zhang and Yann LeCun. It is also used in the post Building a custom classifier using Amazon Comprehend.
- Retrieve the training data from Amazon S3.
- On the Amazon S3 console, choose the example S3 bucket created for you.
- Choose Upload and add the file you retrieved.
- Choose the uploaded object and note the Key.
- Return to the test function on the Lambda console.
- From the Test drop-down menu, choose Create new test event.
- For Event name, enter a name (for example, TrainCustomClassifier).
- Replace the event input with the following code:
{ "comprehend_api": "CreateDocumentClassifier", "custom_classifier_name": "custom-classifier-example", "language_code": "en", "training_data_s3_key": "comprehend-train.csv" }
- If you changed the default file name, update the training_data_s3_key to match.
- Choose Save to store the test event.
- Choose Save to update the Lambda function.
- Choose Test to invoke the CreateDocumentClassifier API.
The response should include results similar to the following code:
{
    "DocumentClassifierArn": "arn:aws:comprehend:us-east-2:0123456789:document-classifier/custom-classifier-example"
}
- On the Amazon Comprehend console, choose Custom classification to check the status of the document classifier training.
After approximately 20 minutes, the document classifier is trained and available for use.
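Instead of refreshing the console, you can poll the training status from code. The following is a minimal sketch that assumes a Boto3 comprehend client and its describe_document_classifier call. The helper name, the polling interval, and the set of statuses treated as terminal are our choices for illustration.

```python
import time

# Statuses after which training will not progress further.
TERMINAL_STATUSES = {"TRAINED", "IN_ERROR", "STOPPED"}


def wait_for_classifier(comprehend_client, classifier_arn,
                        poll_seconds=60, max_polls=60, sleep=time.sleep):
    """Poll until the classifier reaches a terminal status and return that status.

    comprehend_client is assumed to be boto3.client("comprehend").
    """
    for _ in range(max_polls):
        response = comprehend_client.describe_document_classifier(
            DocumentClassifierArn=classifier_arn
        )
        status = response["DocumentClassifierProperties"]["Status"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Classifier {classifier_arn} did not finish training in time")
```

If you run this from a role governed by the example IAM policy, remember to add comprehend:DescribeDocumentClassifier to the allowed actions; the policy in this post permits only DetectEntities and CreateDocumentClassifier.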
Cleaning up
To avoid incurring future charges, delete the resources you created during this walkthrough after concluding your testing.
- On the Amazon Comprehend console, delete the custom classifier.
- On the Amazon S3 console, empty the bucket created for you.
- If you launched the automated deployment, on the AWS CloudFormation console, delete the appropriate stack.
The deletion process takes approximately 10 minutes.
Conclusion
You have now successfully invoked Amazon Comprehend APIs using AWS PrivateLink. The IAM policies allow only requests that arrive through your VPC endpoint, which keeps this traffic off the public internet and further improves your security posture. You can extend this solution to securely test additional features like Amazon Comprehend custom entity recognition real-time endpoints.
All Amazon Comprehend API calls are now supported via AWS PrivateLink. This feature exists in all commercial Regions where AWS PrivateLink and Amazon Comprehend are available. To learn more about securing Amazon Comprehend, see Security in Amazon Comprehend.
About the Authors
Dave Williams is a Cloud Consultant for AWS Professional Services. He works with public sector customers to securely adopt AI/ML services. In his free time, he enjoys spending time with his family, traveling, and watching college football.
Adarsha Subick is a Cloud Consultant for AWS Professional Services based out of Virginia. He works with public sector customers to help solve their AI/ML-focused business problems. In his free time, he enjoys archery and hobby electronics.
Saman Zarandioon is a Sr. Software Development Engineer for Amazon Comprehend. He earned a PhD in Computer Science from Rutgers University.
How Marinus Analytics uses knowledge graphs powered by Amazon Neptune to combat human trafficking
Traffic Jam leverages machine learning technologies from Amazon Web Services to find patterns in ads posted by sexual traffickers on the internet every day.
Machine learning best practices in financial services
We recently published a new whitepaper, Machine Learning Best Practices in Financial Services, that outlines security and model governance considerations for financial institutions building machine learning (ML) workflows. The whitepaper discusses common security and compliance considerations and aims to accompany a hands-on demo and workshop that walks you through an end-to-end example. Although the whitepaper focuses on financial services considerations, much of the information around authentication and access management, data and model security, and ML operationalization (MLOps) best practices may be applicable to other regulated industries, such as healthcare.
A typical ML workflow, as shown in the following diagram, involves multiple stakeholders. To successfully govern and operationalize this workflow, you should collaborate across multiple teams, including business stakeholders, sysops administrators, data engineers, and software and devops engineers.
In the whitepaper, we discuss considerations for each team and also provide examples and illustrations of how you can use Amazon SageMaker and other AWS services to build, train, and deploy ML workloads. More specifically, based on feedback from customers running workloads in regulated environments, we cover the following topics:
- Provisioning a secure ML environment – This includes the following:
- Compute and network isolation – How to deploy Amazon SageMaker in a customer’s private network, with no internet connectivity.
- Authentication and authorization – How to authenticate users in a controlled fashion and authorize these users based on their AWS Identity and Access Management (IAM) permissions, with no multi-tenancy.
- Data protection – How to encrypt data in transit and at rest with customer-provided encryption keys.
- Auditability – How to audit who did what at any given point in time to help detect and protect against malicious activities.
- Establishing ML governance – This includes the following:
- Traceability – Methods to trace ML model lineage from data preparation, model development, and training iterations, and how to audit who did what at any given point in time.
- Explainability and interpretability – Methods that may help explain and interpret the trained model and obtain feature importance.
- Model monitoring – How to monitor your model in production to protect against data drift, and automatically react to rules that you define.
- Reproducibility – How to reproduce the ML model based on model lineage and the stored artifacts.
- Operationalizing ML workloads – This includes the following:
- Model development workload – How to build automated and manual review processes in the dev environment.
- Preproduction workload – How to build automated CI/CD pipelines using the AWS CodeStar suite and AWS Step Functions.
- Production and continuous monitoring workload – How to combine continuous deployment and automated model monitoring.
- Tracking and alerting – How to track model metrics (operational and statistical) and alert appropriate users if anomalies are detected.
Provisioning a secure ML environment
A well-governed and secure ML workflow begins with establishing a private and isolated compute and network environment. This may be especially important in regulated industries, particularly when personally identifiable information (PII) is used for model building or training. The Amazon Virtual Private Cloud (VPC) that hosts Amazon SageMaker and its associated components, such as Jupyter notebooks, training instances, and hosting instances, should be deployed in a private network with no internet connectivity.
Furthermore, you can associate these Amazon SageMaker resources with your VPC environment, which allows you to apply network-level controls, such as security groups, to govern access to Amazon SageMaker resources and to control the ingress and egress of data into and out of the environment. You can establish connectivity between Amazon SageMaker and other AWS services, such as Amazon Simple Storage Service (Amazon S3), using VPC endpoints or AWS PrivateLink. The following diagram illustrates a suggested reference architecture of a secure Amazon SageMaker deployment.
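As a rough illustration of the network isolation described above, the following Python sketch builds the arguments for a SageMaker CreateNotebookInstance request that places the notebook in a private subnet and turns off direct internet access. The function name, the default instance type, and all identifiers in the usage note are hypothetical placeholders; only the request field names come from the SageMaker API.

```python
from typing import Any, Dict, List


def build_private_notebook_request(
    name: str,
    role_arn: str,
    subnet_id: str,
    security_group_ids: List[str],
    instance_type: str = "ml.t3.medium",  # assumed default, adjust as needed
) -> Dict[str, Any]:
    """Build keyword arguments for a VPC-only SageMaker notebook instance."""
    if not subnet_id or not security_group_ids:
        raise ValueError("a subnet and at least one security group are required")
    return {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,
        # Attach the notebook to a private subnet and its security groups.
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        # Route all traffic through the VPC instead of a SageMaker-managed
        # internet path.
        "DirectInternetAccess": "Disabled",
        # Optional hardening: users cannot become root on the instance.
        "RootAccess": "Disabled",
    }
```

With boto3 installed and credentials configured, the resulting dictionary could be passed as `boto3.client("sagemaker").create_notebook_instance(**request)`. With direct internet access disabled, the notebook can reach Amazon S3 and other services only through the VPC endpoints you provision.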
The next step is to ensure that only authorized users can access the appropriate AWS services. IAM can help you create preventive controls for many aspects of your ML environment, including access to Amazon SageMaker resources, your data in Amazon S3, and API endpoints. You can access AWS services using a RESTful API, and every API call is authorized by IAM. You grant explicit permissions through IAM policy documents, which specify the principal (who), the actions (API calls), and the resources (such as Amazon S3 objects) that are allowed, as well as the conditions under which the access is granted. For a deeper dive into building secure environments for financial services as well as other well-architected pillars, also refer to this whitepaper.
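To make the four elements of a policy document concrete, the following Python sketch assembles a small Amazon S3 bucket policy with a principal, actions, resources, and a condition. It is an illustration only, not a policy taken from the whitepaper: the function name, bucket name, role ARN, and VPC endpoint ID are hypothetical, and the choice of the `aws:SourceVpce` condition key is an assumption that fits the private network setup described earlier.

```python
import json
from typing import Any, Dict


def build_bucket_policy(bucket: str, role_arn: str, vpc_endpoint_id: str) -> Dict[str, Any]:
    """Allow one role to read and write objects, but only via one VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRoleThroughVpcEndpoint",
                "Effect": "Allow",
                # Principal: who is making the request.
                "Principal": {"AWS": role_arn},
                # Actions: which API calls are permitted.
                "Action": ["s3:GetObject", "s3:PutObject"],
                # Resources: which objects the actions apply to.
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # Condition: access is granted only for requests that arrive
                # through the named VPC endpoint.
                "Condition": {"StringEquals": {"aws:SourceVpce": vpc_endpoint_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = build_bucket_policy(
        "example-ml-data",
        "arn:aws:iam::123456789012:role/ExampleDataScientist",
        "vpce-0example",
    )
    print(json.dumps(policy, indent=2))
```

An Allow statement like this grants access only under the stated condition, but it does not override other policies that grant broader access. Preventive controls in practice often pair it with an explicit Deny for requests that do not meet the condition.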
Because ML environments may contain sensitive data and intellectual property, the third consideration for a secure ML environment is data encryption. We recommend that you enable data encryption both at rest and in transit with your own encryption keys. The final consideration for a well-governed and secure ML environment is a robust and transparent audit trail that logs all access and changes to the data and models, such as a change in the model configuration or the hyperparameters. The whitepaper covers each of these areas in more detail.
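The encryption recommendation above can be sketched for a SageMaker training job as follows. The Python snippet builds the encryption-related portion of a CreateTrainingJob request and includes a small check that reports any control left unset. The helper names, the KMS key, and the default instance settings are hypothetical; a real request also needs fields such as the job name, algorithm specification, role, input data, and stopping condition, which are omitted here.

```python
from typing import Any, Dict, List


def build_encrypted_training_settings(
    kms_key_id: str,
    output_s3_uri: str,
    subnets: List[str],
    security_group_ids: List[str],
    instance_type: str = "ml.m5.xlarge",  # assumed default
) -> Dict[str, Any]:
    """Return the security-related fields of a CreateTrainingJob request."""
    return {
        # Encrypt model artifacts written to Amazon S3 with your own key.
        "OutputDataConfig": {"S3OutputPath": output_s3_uri, "KmsKeyId": kms_key_id},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
            # Encrypt the storage volume attached to the training instance.
            "VolumeKmsKeyId": kms_key_id,
        },
        # Run the job inside your VPC.
        "VpcConfig": {"Subnets": list(subnets), "SecurityGroupIds": list(security_group_ids)},
        # Block outbound network calls from the training container.
        "EnableNetworkIsolation": True,
        # Encrypt traffic between instances (relevant for distributed training).
        "EnableInterContainerTrafficEncryption": True,
    }


def missing_encryption_controls(settings: Dict[str, Any]) -> List[str]:
    """List the encryption controls that are absent or disabled."""
    missing = []
    if not settings.get("OutputDataConfig", {}).get("KmsKeyId"):
        missing.append("OutputDataConfig.KmsKeyId")
    if not settings.get("ResourceConfig", {}).get("VolumeKmsKeyId"):
        missing.append("ResourceConfig.VolumeKmsKeyId")
    if not settings.get("EnableInterContainerTrafficEncryption"):
        missing.append("EnableInterContainerTrafficEncryption")
    return missing
```

A check like `missing_encryption_controls` could run in a review step before a job is submitted, so that a request with no customer-provided key is caught early.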
To enable self-service provisioning and automation, administrators can use tools such as the AWS Service Catalog to create these secure environments in a repeatable manner for their data scientists. This way, data scientists can simply log in to a secure portal using AWS Single Sign-On, and create a secure Jupyter notebook environment provisioned for their use with appropriate security guardrails in place.
Establishing ML governance
In this section of the whitepaper, we discuss the considerations around ML governance, which includes four key aspects: traceability, explainability, real-time model monitoring, and reproducibility. The financial services industry has various compliance and regulatory obligations that may touch on these aspects of ML governance. You should review and understand those obligations with your legal counsel, compliance personnel, and other stakeholders involved in the ML process.
As an example, if Jane Smith is denied a bank loan, the lender may be required to explain how that decision was made in order to comply with regulatory requirements. If the financial services industry customer is using ML as part of the loan review process, the prediction made by the ML model may need to be interpreted or explained in order to meet these requirements. Generally, an ML model’s interpretability or explainability refers to people’s ability to understand and explain the processes that the model uses to arrive at its predictions. It is also important to note that many ML models make predictions of a likely answer, rather than the answer itself. Therefore, it may be appropriate to have human review of predictions made by ML models before any action is taken. The model may also need to be monitored, so that if the underlying data changes, the model is periodically retrained on new data. Finally, the ML model may need to be reproducible, such that if the steps leading to the model’s output are retraced, the model outputs don’t change.
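One way to decide when the underlying data has changed enough to warrant retraining is to compare the distribution of a feature in production against its distribution at training time. The following Python sketch uses the population stability index (PSI), a statistic commonly used for this purpose. The whitepaper does not prescribe this metric; the function names, the bin count, and the 0.2 threshold are assumptions chosen for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float], current: Sequence[float], bins: int = 10
) -> float:
    """Compare two samples of one feature using equal-width bins."""
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline must contain more than one distinct value")
    width = (hi - lo) / bins

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            index = min(max(index, 0), bins - 1)
            counts[index] += 1
        # A small floor avoids taking the logarithm of zero for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def needs_retraining(psi: float, threshold: float = 0.2) -> bool:
    """Flag drift when PSI exceeds a rule-of-thumb threshold (an assumption)."""
    return psi > threshold
```

A scheduled job could compute this value per feature and trigger a retraining workflow or alert a reviewer when `needs_retraining` returns `True`. The appropriate threshold depends on the model and should be agreed on with the stakeholders who own it.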
Operationalizing ML workloads
In the final section, we discuss some best practices around operationalizing ML workloads. We begin with a high-level discussion and then dive deeper into a specific architecture that uses AWS native tools and services. Beyond the deployment process itself, which in traditional software deployments is referred to as CI/CD (continuous integration/continuous deployment), deploying ML models into production in regulated industries may have additional implications from an implementation perspective.
The following diagram captures some of the high-level capabilities that an enterprise ML platform might need in order to address guidelines around governance, auditing, logging, and reporting:
- A data lake for managing raw data and associated metadata
- A feature store for managing ML features and associated metadata (mapping from raw data to generated features such as one-hot encodings or scaling transformations)
- A model and container registry containing trained model artifacts and associated metadata (such as hyperparameters, training times, and dependencies)
- A code repository (such as Git, AWS CodeCommit, or Artifactory) for maintaining and versioning source code
- A pipeline registry to version and maintain training and deployment pipelines
- Logging tools for maintaining access logs
- Production monitoring and performance logs
- Tools for auditing and reporting
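To show how the registry metadata listed above can support traceability and reproducibility, the following Python sketch defines a model record and derives a deterministic fingerprint from its lineage. The class, its fields, and the hashing scheme are hypothetical and are not part of any AWS service; they only illustrate the idea that identical inputs should map to the same identifier.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ModelRecord:
    """Lineage metadata stored alongside a trained model artifact."""

    model_name: str
    training_data_uri: str
    training_data_sha256: str  # hash of the exact training dataset
    code_commit: str           # commit ID from the code repository
    container_image: str       # training image from the container registry
    hyperparameters: Dict[str, str] = field(default_factory=dict)

    def fingerprint(self) -> str:
        """Hash a canonical JSON form so key order does not matter."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two training runs that share the same data hash, code commit, container image, and hyperparameters produce the same fingerprint, which gives auditors a simple way to confirm that a model was rebuilt from the recorded inputs. Any change to one of those inputs produces a different fingerprint.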
The following diagram illustrates a specific implementation that uses AWS native tools and services. Although several scheduling and orchestration tools are on the market, such as Airflow or Jenkins, for concreteness, we focus predominantly on Step Functions.
In the whitepaper, we dive deeper into each part of the preceding diagram, and more specifically into the following workloads:
- Model development
- Preproduction
- Production and continuous monitoring
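As a hedged sketch of how Step Functions might orchestrate such a pipeline, the following Python snippet builds a simplified state machine definition in Amazon States Language: train a model, evaluate it with an AWS Lambda function, and deploy only if a quality threshold is met. The state names, the Lambda function name, the AUC metric, and the 0.8 threshold are hypothetical, and the task parameters are deliberately incomplete. A deployable definition needs the full SageMaker request fields as well as model and endpoint configuration steps.

```python
import json
from typing import Any, Dict, List


def build_pipeline_definition(min_auc: float = 0.8) -> Dict[str, Any]:
    """Return a simplified train, evaluate, and deploy state machine."""
    return {
        "Comment": "Illustrative ML pipeline (parameters abbreviated)",
        "StartAt": "TrainModel",
        "States": {
            "TrainModel": {
                "Type": "Task",
                # The .sync suffix makes the state wait for the job to finish.
                "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
                "Parameters": {"TrainingJobName.$": "$.training_job_name"},
                "ResultPath": "$.training",
                "Next": "EvaluateModel",
            },
            "EvaluateModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::lambda:invoke",
                # "evaluate-model" is a hypothetical function returning {"auc": ...}.
                "Parameters": {"FunctionName": "evaluate-model", "Payload.$": "$"},
                "ResultPath": "$.evaluation",
                "Next": "CheckQuality",
            },
            "CheckQuality": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.evaluation.Payload.auc",
                        "NumericGreaterThanEquals": min_auc,
                        "Next": "DeployModel",
                    }
                ],
                "Default": "RejectModel",
            },
            "DeployModel": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sagemaker:createEndpoint",
                "Parameters": {
                    "EndpointName.$": "$.endpoint_name",
                    "EndpointConfigName.$": "$.endpoint_config_name",
                },
                "End": True,
            },
            "RejectModel": {
                "Type": "Fail",
                "Error": "ModelQualityBelowThreshold",
                "Cause": "Evaluation metric did not meet the minimum threshold",
            },
        },
    }


def undefined_targets(definition: Dict[str, Any]) -> List[str]:
    """Return transition targets that are not defined as states."""
    states = definition["States"]
    targets = [definition["StartAt"]]
    for state in states.values():
        targets.extend(state[key] for key in ("Next", "Default") if key in state)
        targets.extend(choice["Next"] for choice in state.get("Choices", []))
    return sorted({target for target in targets if target not in states})


if __name__ == "__main__":
    print(json.dumps(build_pipeline_definition(), indent=2))
```

Keeping the definition in code lets it be versioned in the pipeline registry described earlier, and a check such as `undefined_targets` can run in CI before the definition is deployed. In a regulated setting, a manual approval step could be added between the quality check and deployment.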
Summary
The Machine Learning Best Practices in Financial Services whitepaper is available here. Use it to learn how you can build secure and well-governed ML workflows, and feel free to reach out to the authors if you have any questions. As you progress on your journey, also refer to this whitepaper for a lens on the AWS well-architected principles applied to machine learning workloads. You can also use the video demo walkthrough and the following two workshops:
- Build a secure ML environment with SageMaker
- Provision a secure ML environment with SageMaker and run an end-to-end example
About the authors
Stefan Natu is a Sr. Machine Learning Specialist at Amazon Web Services. He is focused on helping financial services customers build and operationalize end-to-end machine learning solutions on AWS. His academic background is in theoretical physics, and in the past, he worked on a number of data science problems in retail and energy verticals. In his spare time, he enjoys reading machine learning blogs, traveling, playing the guitar, and exploring the food scene in New York City.
Kosti Vasilakakis is a Sr. Business Development Manager for Amazon SageMaker, the AWS fully managed service for end-to-end machine learning, and he focuses on helping financial services and technology companies achieve more with ML. He spearheads curated workshops, hands-on guidance sessions, and pre-packaged open-source solutions to ensure that customers build better ML models quicker and safer. Outside of work, he enjoys traveling the world, philosophizing, and playing tennis.
Alvin Huang is a Capital Markets Specialist for Worldwide Financial Services Business Development at Amazon Web Services with a focus on data lakes and analytics, and artificial intelligence and machine learning. Alvin has over 19 years of experience in the financial services industry, and prior to joining AWS, he was an Executive Director at J.P. Morgan Chase & Co, where he managed the North America and Latin America trade surveillance teams and led the development of global trade surveillance. Alvin also teaches a Quantitative Risk Management course at Rutgers University and serves on the Rutgers Mathematical Finance Master’s program (MSMF) Advisory Board.
David Ping is a Principal Machine Learning Architect and Sr. Manager of AI/ML Solutions Architecture at Amazon Web Services. He helps enterprise customers build and operate machine learning solutions on AWS. David enjoys hiking and following the latest machine learning advancements.