University of Minnesota professor and Amazon Scholar, together with coauthor, receives recognition for paper that proposes novel approach to algorithm that generates high-quality recommendations for e-commerce products at high speeds.
Incremental learning: Optimizing search relevance at scale using machine learning
Amazon Kendra is releasing incremental learning to automatically improve search relevance and make sure you can continuously find the information you’re looking for, particularly when search patterns and document trends change over time.
Data proliferation is real, and it’s growing. In fact, International Data Corporation (IDC) predicts that 80% of all data will be unstructured by 2025. However, mining data for accurate answers continues to be a challenge for many organizations. In an uncertain business environment, there is mounting pressure to find relevant information quickly and use it to sustain and enhance business performance.
Organizations need solutions that deliver accurate answers fast and evolve the process of knowledge discovery from being a painstaking chore that typically results in dead ends, into a delightful experience for customers and employees.
Amazon Kendra is an intelligent search service powered by machine learning (ML). Amazon Kendra has reimagined enterprise search for your websites and applications so employees and customers can easily find what they’re looking for, even when answers could be scattered across multiple locations and content repositories within your organization.
Intelligent search helps you consolidate unstructured data from across your business into a single, secure, and searchable index. However, data ingestion and data source consolidation is just one aspect of upgrading a conventional search solution to an ML-powered intelligent search solution.
A distinguishing aspect of intelligent search is its ability to deliver relevant answers without the tuning complexities typically needed for keyword-based engines. Amazon Kendra’s deep learning language models and algorithms deliver accuracy out of the box and automatically tune search relevance on a continuous basis.
Continuous improvement
A large part of this is incremental learning, which is now built into Amazon Kendra. Incremental learning creates a mechanism for Amazon Kendra to observe user activity, search patterns, and user interactions. Amazon Kendra then uses these fundamentally important data points to understand user preferences for various documents and answers so it can take action and optimize search results. For example, the following screenshot shows how users can rate how helpful a search result is.
By transparently capturing user queries along with their preferred answers and documents, Amazon Kendra can learn from these patterns and take actions to improve future search results. For example, if an employee performs a search “What is our company expense policy?” without being specific about what kind of expense policy they’re interested in, they may see a host of varying topics.
Their results could include “airfare policies,” “hotels and accommodation,” or “meals and entertainment policy,” and although each topic is technically related to company expense policy, the most commonly sought document may not necessarily be at the top of the list.
However, if it turns out that when employees typically ask that question, they’re searching for content related to “home-office reimbursements,” Amazon Kendra can learn from how users interact with results and adapt its models to re-rank information so that “home-office expense policy” content is promoted to the top of the search results page in future searches.
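Applications surface this feedback to Amazon Kendra through the query and feedback APIs. The following code is a minimal sketch using the AWS SDK for Python (Boto3); the index ID and the choice of which result the user clicked and found relevant are placeholders for illustration:

import datetime
import boto3

kendra = boto3.client("kendra")

index_id = "<YOUR INDEX ID>"

# Run a query; the response includes a QueryId used to correlate feedback
response = kendra.query(
    IndexId=index_id,
    QueryText="What is our company expense policy?"
)

# Suppose the user opened the first result and found it helpful
result_id = response["ResultItems"][0]["Id"]

kendra.submit_feedback(
    IndexId=index_id,
    QueryId=response["QueryId"],
    ClickFeedbackItems=[{
        "ResultId": result_id,
        "ClickTime": datetime.datetime.now()
    }],
    RelevanceFeedbackItems=[{
        "ResultId": result_id,
        "RelevanceValue": "RELEVANT"
    }]
)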
Scaling lessons learned
Incremental learning autonomously optimizes search results over time without the need to develop, train, and deploy ML models. It tunes future search results quickly in a way that’s data-driven and cost effective.
Consider the process of optimizing search accuracy in non-ML-powered solutions: it requires significant effort, machine learning skill, and ongoing maintenance, even when “ML” plugins are added on top of legacy engines.
As unstructured data continues to dominate and grow at exponential speeds within the enterprise, implementing an adaptive, intelligent, and nimble enterprise search solution becomes critical to keeping up with the pace of change.
Conclusion
To learn more about how organizations are using intelligent search to boost workforce productivity, accelerate research and development, and enhance customer experiences, download our ebook, 7 Reasons Why Your Organization Needs Intelligent Search. For more information about incremental learning, see Submitting feedback. To learn more about Amazon Kendra, visit the website or watch the video “What is Intelligent Search?”
About the Authors
Jean-Pierre Dodel leads product management for Amazon Kendra, a new ML-powered enterprise search service from AWS. He brings 15 years of Enterprise Search and ML solutions experience to the team, having worked at Autonomy, HP, and search startups for many years prior to joining Amazon 4 years ago. JP has led the Kendra team from its inception, defining vision, roadmaps, and delivering transformative semantic search capabilities to customers like Dow Jones, Liberty Mutual, 3M, and PwC.
Tom McMahon is a Product Marketing Manager on the AI Services team at AWS. He’s passionate about technology and storytelling and has spent time across a wide range of industries, including healthcare, retail, logistics, and eCommerce. In his spare time he enjoys spending time with family, music, playing golf, and exploring the amazing Pacific Northwest and its surrounds.
Getting started with the Amazon Kendra Google Drive connector
Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). To simplify the process of connecting data sources to your index, Amazon Kendra offers several native data source connectors to help get your documents easily ingested.
For many organizations, Google Drive is a core part of their productivity suite, and often contains important documents and presentations. In this post, we illustrate how you can use the Google Drive connector in Amazon Kendra to synchronize content between Google Drive and your Amazon Kendra index, making it searchable using Amazon Kendra’s intelligent search capabilities.
The Google Drive connector indexes documents stored in shared drives as well as documents stored in a user’s own drive (My Drive). By default, Amazon Kendra indexes all documents in your Google Drive, but it also provides the flexibility to exclude documents from the index based on certain criteria, including the ID of a shared drive, document owner, the MIME type of the document, or the document path.
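For example, the following snippet sketches how such exclusions can be expressed in the GoogleDriveConfiguration structure that the CreateDataSource API accepts (all values are illustrative placeholders; creating the data source itself is covered later in this post):

# Illustrative exclusion options for the Google Drive connector
google_drive_configuration = {
    'GoogleDriveConfiguration': {
        'SecretArn': '<YOUR-SECRET-ARN>',
        'ExcludeSharedDrives': ['<SHARED DRIVE ID>'],
        'ExcludeUserAccounts': ['owner@example.com'],
        'ExcludeMimeTypes': ['application/vnd.google-apps.spreadsheet'],
        'ExclusionPatterns': ['*archive*']
    }
}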
Prerequisites
The Amazon Kendra Google Drive connector supports Google Docs and Google Slides. We demonstrate how to search a Google Drive Workspace in Amazon Kendra using an AWS Whitepaper dataset.
First, we set up the necessary permissions within your Google Drive Workspace. We then illustrate how to create the Amazon Kendra Google Drive connector on the AWS Management Console, followed by creating the Amazon Kendra Google Drive connector via the (Python) API. Lastly, we perform some example search queries with Amazon Kendra after ingesting the AWS Whitepaper dataset.
Setting up an Amazon Kendra Google Drive connector includes the following steps:
- Setting up a name and tags
- Entering the credentials for your Google service account
- Setting up a sync schedule
- Configuring the index field mappings
Setting up the necessary permissions within your Google Drive Workspace includes the following steps:
- Creating a Google Drive service account if one doesn’t exist
- Configuring the Google Drive service account
- Enabling the Admin and Google Drive APIs
- Enabling the Google API scope
If you haven’t previously created a service account, see the section Creating a Google Drive service account in this post.
Creating a Google Drive data source on the Amazon Kendra console
Before you create your data source, you must create an Amazon Kendra index. For instructions, see the section Creating an Amazon Kendra index in Getting started with the Amazon Kendra SharePoint Online connector.
After you create your index, you can create a Google Drive data source.
- On the Amazon Kendra console, under Data management, choose Data sources.
- Choose Create data source.
- Under Google Drive, choose Add connector.
- For Data source name, enter a name (for example, MyGoogleDriveDataSource).
- Choose Next.
- In the Authentication section, you need information from the JSON document that was downloaded when you configured the service account. Make sure you include everything between the quotation marks (" ") for your private key.
The following screenshot shows what the JSON document looks like.
The following screenshot shows our configuration on the Authentication page.
- For IAM role, choose Create a new role to create a new AWS Identity and Access Management (IAM) role.
- For Role name, enter a name for your role.
- Choose Next.
- For Set sync scope, you can define which user accounts, shared drives, or file types to exclude. For this post, we don’t modify these settings.
- For Additional configuration, you can also include or exclude paths, files, or file types. For this post, I ingest everything I have on my Google Drive.
- In the Sync run schedule section, for Frequency, you can choose the frequency of data source synchronization—hourly, daily, weekly, monthly, custom, or on demand. For this post, I choose Run on demand.
- Choose Next.
- In the Field mapping section, you can define which file attributes you want to map into your index. For this post, I use the default field mapping.
The following table lists the available fields.
| Google Drive Property Name | Suggested Amazon Kendra Field Name |
| --- | --- |
| createdTime | _created_at |
| dataSize | gd_data_size |
| displayUrl | gd_source_url |
| fileExtension | _file_type |
| id | _document_id |
| mimeType | gd_mime_type |
| modifiedTime | _last_updated_at |
| name | _document_title |
| owner | gd_owner |
| version | gd_version |
The following screenshot shows our configuration.
- Choose Next.
- Review your settings and choose Create.
- After the data source is created, you can start the sync process by choosing Sync now.
Creating an Amazon Kendra Google Drive connector with Python
You can create a new Amazon Kendra index Google Drive connector and sync it by using the AWS SDK for Python (Boto3). Boto3 makes it easy to integrate your Python application, library, or script with AWS services, including Amazon Kendra.
IAM roles requirements and overview
To create an index using the AWS SDK, you need to have the policy AmazonKendraFullAccess attached to the role you’re using.
At a high level, Amazon Kendra requires the following:
- IAM roles for indexes – Needed to write to Amazon CloudWatch Logs.
- IAM roles for data sources – Needed when you use the CreateDataSource operation. These roles require a specific set of permissions depending on the connector you use. For our use case, the role needs permissions to access the following:
  - AWS Secrets Manager, where the Google Drive credentials are stored.
  - The AWS Key Management Service (AWS KMS) customer master key (CMK) that Secrets Manager uses to decrypt the credentials.
  - The BatchPutDocument and BatchDeleteDocument operations to update the index.
For more information, see IAM access roles for Amazon Kendra.
For this solution, you also need the following:
- An Amazon Kendra IAM role for CloudWatch
- An Amazon Kendra IAM role for the Google Drive connector
- Google Drive service account credentials stored on Secrets Manager
Creating an Amazon Kendra index
To create an index, use the following code:
import boto3
from botocore.exceptions import ClientError
import pprint
import time

kendra = boto3.client("kendra")

print("Creating an index")

description = "<YOUR INDEX DESCRIPTION>"
index_name = "<YOUR NEW INDEX NAME>"
role_arn = "<KENDRA ROLE WITH CLOUDWATCH PERMISSIONS>"

try:
    index_response = kendra.create_index(
        Description=description,
        Name=index_name,
        RoleArn=role_arn,
        Edition="DEVELOPER_EDITION",
        Tags=[
            {
                'Key': 'Project',
                'Value': 'Google Drive Test'
            }
        ]
    )

    pprint.pprint(index_response)

    index_id = index_response['Id']

    print("Wait for Kendra to create the index.")

    while True:
        # Get the index description
        index_description = kendra.describe_index(
            Id=index_id
        )
        # If the status is no longer CREATING, stop polling
        status = index_description["Status"]
        print(" Creating index. Status: " + status)
        if status != "CREATING":
            break
        time.sleep(60)

except ClientError as e:
    print("%s" % e)

print("Done creating index.")
While your index is being created, you get regular status updates (every 60 seconds, matching the time.sleep(60) call) until the process is complete. See the following code:
Creating an index
{'Id': '3311b507-bfef-4e2b-bde9-7c297b1fd13b',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '45',
                                      'content-type': 'application/x-amz-json-1.1',
                                      'date': 'Mon, 20 Jul 2020 19:58:19 GMT',
                                      'x-amzn-requestid': 'a148a4fc-7549-467e-b6ec-6f49512c1602'},
                      'HTTPStatusCode': 200,
                      'RequestId': 'a148a4fc-7549-467e-b6ec-6f49512c1602',
                      'RetryAttempts': 2}}
Wait for Kendra to create the index.
Creating index. Status: CREATING
Creating index. Status: CREATING
Creating index. Status: CREATING
Creating index. Status: CREATING
Creating index. Status: ACTIVE
Done creating index
When your index is ready, the response includes its ID (3311b507-bfef-4e2b-bde9-7c297b1fd13b in this example). Your index ID will be different from the one in this post.
Providing the Google Drive service account credentials
You also need GetSecretValue permission for your secret stored in Secrets Manager.
If you need to create a new secret in Secrets Manager to store the Google service account credentials, make sure the role you use has permissions to create a secret and tagging. See the following policy code:
{"Version": "2012-10-17","Statement": [{"Sid": "SecretsManagerWritePolicy","Effect": "Allow","Action": ["secretsmanager:UntagResource","secretsmanager:CreateSecret","secretsmanager:TagResource"],"Resource": "*"}]}
To create a secret on Secrets Manager, enter the following code:
secretsmanager = boto3.client('secretsmanager')

SecretName = "<YOUR_SECRETNAME>"
GoogleDriveCredentials = '{"clientAccount": "<YOUR SERVICE ACCOUNT EMAIL>", "adminAccount": "<YOUR GSUITE ADMINISTRATOR EMAIL>", "privateKey": "<YOUR SERVICE ACCOUNT PRIVATE KEY>"}'

try:
    create_secret_response = secretsmanager.create_secret(
        Name=SecretName,
        Description='Secret for a Google Drive data source connector',
        SecretString=GoogleDriveCredentials,
        Tags=[{'Key': 'Project', 'Value': 'Google Drive Test'}]
    )
except ClientError as e:
    print('%s' % e)

pprint.pprint(create_secret_response)
If everything goes well, you get a response with your secret’s ARN:
{'ARN': <YOUR_SECRET_ARN>,
'Name': 'YOUR_SECRETNAME',
'ResponseMetadata': {'HTTPHeaders': {'connection': 'keep-alive',
'content-length': '161',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 25 Nov 2020 14:23:54 GMT',
'x-amzn-requestid': 'a2f7af73-be54-4388-bc53-427b5f201b8f'},
'HTTPStatusCode': 200,
'RequestId': 'a2f7af73-be54-4388-bc53-427b5f201b8f',
'RetryAttempts': 0},
'VersionId': '90c1f8b7-6c26-4d42-ba4c-e1470b648c5c'}
Creating the Amazon Kendra Google Drive data source
Your Amazon Kendra index is up and running, and you have established the attributes that you want to map to your Google Drive documents’ attributes.
You now need an IAM role with kendra:BatchPutDocument and kendra:BatchDeleteDocument permissions. For more information, see IAM access roles for Amazon Kendra. We use the ARN of this IAM role when invoking the CreateDataSource API.
Make sure the role you use for your data source connector has a trust relationship with Amazon Kendra. See the following code:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "kendra.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
The following code is the policy structure used:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": [
                "arn:aws:secretsmanager:<REGION>:<YOUR ACCOUNT NUMBER>:secret:<YOUR-SECRET-ID>"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "kms:Decrypt"
            ],
            "Resource": [
                "arn:aws:kms:<REGION>:<YOUR ACCOUNT NUMBER>:key/<YOUR-KMS-KEY-ID>"
            ],
            "Condition": {
                "StringLike": {
                    "kms:ViaService": [
                        "secretsmanager.*.amazonaws.com"
                    ]
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "kendra:BatchPutDocument",
                "kendra:BatchDeleteDocument"
            ],
            "Resource": "arn:aws:kendra:<REGION>:<YOUR ACCOUNT NUMBER>:index/<YOUR-INDEX-ID>"
        }
    ]
}
The following code is my role’s ARN:
arn:aws:iam::<YOUR ACCOUNT NUMBER>:role/Kendra-Datasource
Following the least privilege principle, we only allow our role to put and delete documents in our index and read the credentials of the Google service account.
When creating a data source, you can specify the sync schedule, which indicates how often your index syncs with the data source we create. This schedule is defined on the Schedule key of our request. You can use schedule expressions for rules to define how often you want to sync your data source. For this use case, the ScheduleExpression is 'cron(0 11 * * ? *)', which sets the data source to sync every day at 11:00 AM.
I use the following code. Make sure you replace SecretArn, DSRoleArn, and IndexId with your own values.
import boto3
from botocore.exceptions import ClientError
import pprint
import time

kendra = boto3.client("kendra")

print('Create a Google Drive data source')

SecretArn = "<YOUR-SECRET-ARN>"
DSName = "<YOUR-DATASOURCE-NAME>"
IndexId = "<YOUR-INDEX-ID>"
DSRoleArn = "<YOUR-DATASOURCE-ROLE-ARN>"
ScheduleExpression = 'cron(0 11 * * ? *)'

try:
    datasource_response = kendra.create_data_source(
        Name=DSName,
        IndexId=IndexId,
        Type='GOOGLEDRIVE',
        Configuration={
            'GoogleDriveConfiguration': {
                'SecretArn': SecretArn,
            },
        },
        Description='My GoogleDrive Datasource',
        RoleArn=DSRoleArn,
        Schedule=ScheduleExpression,
        Tags=[
            {
                'Key': 'Project',
                'Value': 'GoogleDrive Test'
            }
        ]
    )

    pprint.pprint(datasource_response)

    print('Waiting for Kendra to create the DataSource.')

    datasource_id = datasource_response['Id']

    while True:
        # Get the data source description
        datasource_description = kendra.describe_data_source(
            Id=datasource_id,
            IndexId=IndexId
        )
        # If the status is no longer CREATING, stop polling
        status = datasource_description["Status"]
        print(" Creating data source. Status: " + status)
        if status != "CREATING":
            break
        time.sleep(60)

except ClientError as e:
    print('%s' % e)
You should get a response like the following code:
'ResponseMetadata': {'HTTPHeaders': {'content-length': '45',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 02 Dec 2020 19:03:17 GMT',
'x-amzn-requestid': '8d19fa35-adb6-41e2-92d6-0df2797707d8'},
'HTTPStatusCode': 200,
'RequestId': '8d19fa35-adb6-41e2-92d6-0df2797707d8',
'RetryAttempts': 0}}
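The data source in this example relies on the default field mappings. If you want to map Google Drive properties to index fields explicitly (mirroring the table from the console walkthrough), GoogleDriveConfiguration also accepts a FieldMappings list. The following snippet is a sketch; custom fields such as gd_owner must already exist in your index before you reference them:

configuration_with_mappings = {
    'GoogleDriveConfiguration': {
        'SecretArn': SecretArn,
        'FieldMappings': [
            # Map the Google Drive document name to the built-in title field
            {'DataSourceFieldName': 'name', 'IndexFieldName': '_document_title'},
            # Map the document owner to a custom index field
            {'DataSourceFieldName': 'owner', 'IndexFieldName': 'gd_owner'}
        ]
    }
}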
Syncing the data source
Even though you defined a schedule for syncing the data source, you can sync on demand by using start_data_source_sync_job:
DSId = "<YOUR DATA SOURCE ID>"
IndexId = "<YOUR INDEX ID>"

try:
    ds_sync_response = kendra.start_data_source_sync_job(
        Id=DSId,
        IndexId=IndexId
    )
except ClientError as e:
    print('%s' % e)

pprint.pprint(ds_sync_response)
You get a result similar to the following code:
{'ExecutionId': '99bdd945-fe1e-4401-a9d6-a0272ce2dae7',
'ResponseMetadata': {'HTTPHeaders': {'content-length': '54',
'content-type': 'application/x-amz-json-1.1',
'date': 'Wed, 02 Dec 2020 19:12:25 GMT',
'x-amzn-requestid': '68a05d7b-26bf-4821-ae43-1a491f4cf314'},
'HTTPStatusCode': 200,
'RequestId': '68a05d7b-26bf-4821-ae43-1a491f4cf314',
'RetryAttempts': 0}}
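You can also poll the sync job until it finishes. The following code is a minimal sketch that uses list_data_source_sync_jobs to look up the job you just started; the set of in-progress statuses checked here is an assumption, so adjust it to your needs:

import time

while True:
    sync_jobs = kendra.list_data_source_sync_jobs(
        Id=DSId,
        IndexId=IndexId
    )
    # Find the job we just started by matching its ExecutionId
    job = next(
        j for j in sync_jobs["History"]
        if j["ExecutionId"] == ds_sync_response["ExecutionId"]
    )
    print("Sync status: " + job["Status"])
    if job["Status"] not in ("SYNCING", "SYNCING_INDEXING"):
        break
    time.sleep(60)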
Testing
Now that you have ingested the AWS Whitepapers dataset into your Amazon Kendra index, you can test some queries. I submit each test query first into the built-in Google Drive search bar and then retry the search with Amazon Kendra.
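You can also send the same questions to your index programmatically. The following code is a minimal sketch using the Query API, reusing the kendra client and IndexId from the earlier steps:

query_response = kendra.query(
    IndexId=IndexId,
    QueryText="What AWS service has 11 9s of durability?"
)

# Print the top results with their titles and excerpts
for item in query_response["ResultItems"][:3]:
    print(item["Type"])
    print(item.get("DocumentTitle", {}).get("Text", ""))
    print(item.get("DocumentExcerpt", {}).get("Text", ""))
    print("---")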
The first query I test is “What AWS service has 11 9s of durability?” The following screenshot shows the Google Drive output.
The following screenshot shows the query results in Amazon Kendra.
The next query is “How many pillars compose the well architected framework?” The following screenshot shows the response from Google Drive.
The following screenshot shows the results from Amazon Kendra.
The third query is “How can I get volume discounts?” The following screenshot shows the response from Google Drive.
The following screenshot shows the query results in Amazon Kendra.
The fourth query is “How can you control access to an RDS instance?” The following screenshot shows the Google Drive response.
The following screenshot shows the query results in Amazon Kendra.
Now let’s try something else. Instead of natural language search, let’s try the keyword search “volume discounts.” The following screenshot shows the Google Drive response.
The following screenshot shows the Amazon Kendra response.
Conclusion
Helping customers and employees find relevant information quickly increases workforce productivity and enhances overall customer experiences. In this post, we outlined how you can set up an Amazon Kendra Google Drive connector with Google Workspace through either the Amazon Kendra console or via AWS API.
To learn more about the Amazon Kendra Google Drive connector, see Amazon Kendra Google data source documentation, or you can explore other Amazon Kendra data source connectors by visiting the Amazon Kendra connector library. To get started with Amazon Kendra, visit the Amazon Kendra Essentials+ workshop for an interactive walkthrough.
Appendix
If you haven’t previously created a service account, complete the steps in this section before creating your Google Drive data source.
Creating a Google Drive service account
To ingest your documents stored in Google Drive into your Amazon Kendra index, you need a Google Drive service account with sufficient permissions to access the documents stored within the Google Drive Workspace.
Follow these instructions:
- Log in to the Google Cloud Platform console with an account that has administrator privilege.
- On the menu, choose your project (for this post, MyFirstProject).
- Choose IAM & Admin and choose Service Accounts.
- Choose CREATE SERVICE ACCOUNT.
- Enter a service account name and description.
The service account ID, an email address, is generated automatically.
- Choose Create.
- Skip steps 2 (Grant this service account access to project) and 3 (Grant users access to this service account).
- Choose Done to continue.
Configuring the Google Drive service account
Now that you have created your service account, it’s time to configure it.
- Choose the service account name you created.
- Choose Edit.
- On the service account page, choose SHOW DOMAIN-WIDE DELEGATION to view the available options.
- Select Enable G Suite Domain-wide Delegation.
- For Product name for the consent screen, enter a name.
- In the Keys section, choose ADD KEY and choose Create new key.
- For Key type, select JSON.
- Choose Create.
A JSON file containing the service account email address and private key is downloaded to your computer.
- Choose CLOSE.
- On the service account details page, take note of the account’s unique ID, to use later.
Enabling the Admin and Google Drive APIs
You’re now ready to enable the Admin and Google Drive APIs.
- Choose APIs & Services and choose Library.
- Search for and choose Admin SDK.
- Choose Enable.
- Choose APIs & Services and choose Library.
- Search for and choose Google Drive API.
- Choose Enable.
Enabling Google API scopes
In this section, you configure the OAuth 2.0 scopes needed to access the Admin and Google Drive APIs required by the Amazon Kendra Google Drive connector.
- Log in to Google’s admin interface as your organization’s administrator user.
- Choose Security and choose API controls.
- Scroll down and choose MANAGE DOMAIN-WIDE DELEGATION in the Domain-wide delegation section.
- Choose Add new.
- For Client ID, enter the unique ID from your service account details.
- For OAuth scopes, enter the following code:
https://www.googleapis.com/auth/drive.readonly, https://www.googleapis.com/auth/drive.metadata.readonly, https://www.googleapis.com/auth/admin.directory.user.readonly, https://www.googleapis.com/auth/admin.directory.group.readonly
- Choose Authorize.
After you create a service account and configure it to use the Google API, you can create a Google Drive data source.
About the Authors
Juan Pablo Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.
David Shute is a Senior ML GTM Specialist at Amazon Web Services focused on Amazon Kendra. When not working, he enjoys hiking and walking on a beach.
How Thomson Reuters accelerated research and development of natural language processing solutions with Amazon SageMaker
This post is co-written by John Duprey and Filippo Pompili from Thomson Reuters.
Thomson Reuters (TR) is one of the world’s most trusted providers of answers, helping professionals make confident decisions and run better businesses. Teams of experts from TR bring together information, innovation, and confident insights to unravel complex situations, and their worldwide network of journalists and editors keeps customers up to speed on global developments. TR has over 150 years of rich, human-annotated data on law, tax, news, and other segments. TR’s data is the crown jewel of the business. It’s one of the aspects that distinguishes TR from its competitors.
In 2018, a team of research scientists from the Center for AI and Cognitive Computing at TR started an experimental project at the forefront of natural language understanding. The project is based on the latest scientific discoveries that brought wide disruptions in the field of machine reading comprehension (MRC) and aims to develop technologies that you can use to solve numerous tasks, including text classification and natural language question answering.
In this post, we discuss how TR used Amazon SageMaker to accelerate their research and development efforts, and did so with significant cost savings and flexibility. We explain how the team experimented with many variants of BERT to produce a powerful question-answering capability. Lastly, we describe TR’s Secure Content Workspace (SCW), which provided the team with easy and secure access to Amazon SageMaker resources and TR proprietary data.
Customer challenge
The research and development team at TR needed to iterate quickly and securely. Team members already had significant expertise developing question-answering solutions, both via dedicated feature engineering for shallow algorithms and with featureless neural-based solutions. They played a key role in developing the technology powering Westlaw Edge (legal) and Checkpoint Edge (tax), two well-received products from TR. These projects each required 15–18 months of intense research and development efforts and have reached remarkable performance levels. For MRC, the research team decided to experiment with BERT and several of its variants on two sets of TR’s data, one from the legal domain and another from the tax domain.
The legal training corpus was composed of tens of thousands of editorially reviewed questions. Each question was compared against several potential answers in the form of short, on-point, text summaries. These summaries were highly curated editorial material that was extracted from legal cases across many decades—resulting in a candidate training set of several hundred thousand question-answer (QA) pairs, drawn from tens of millions of text summaries. The tax corpus, comprised of more than 60,000 editorially curated documents on US federal tax law, contained thousands of questions and tens of thousands of QA pairs.
Model pretraining and fine-tuning against these datasets would be impossible without state-of-art compute power. Procuring these compute resources typically required a big upfront investment with long lead times. For research ideas that might or might not become a product, it was hard to justify such a significant cost for experimentation.
Why AWS and Amazon SageMaker?
TR chose Amazon SageMaker as the machine learning (ML) service for this project. Amazon SageMaker is a fully managed service to build, train, tune, and deploy ML models at scale. One of the key factors in TR’s decision to choose Amazon SageMaker was the benefit of a managed service with pay-as-you-go billing. Amazon SageMaker lets TR decide how many experiments to run, and helps control the cost of training. More importantly, when a training job completes, the team is no longer charged for the GPU instances they were using. This resulted in substantial cost savings compared to managing their own training resources, which would have resulted in low server utilization. The research team could spin up as many instances as required and let the framework take care of shutting down long-running experiments when they were done. This enabled rapid prototyping at scale.
In addition, Amazon SageMaker has a built-in capability to use managed Spot Instances, which reduced the cost of training in some cases by more than 50%. For some large natural language processing (NLP) experiments using models like BERT on vast proprietary datasets, training time is measured in days, if not weeks, and the hardware involved is expensive GPUs. A single experiment can cost a few thousand dollars. Managed Spot Training with Amazon SageMaker helped TR reduce training costs by 40–50% on average. In comparison to self-managed training, Amazon SageMaker also comes with a full set of built-in security capabilities. This saved the team countless hours of coding that would have been necessary on a self-managed ML infrastructure.
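As an illustration, enabling managed Spot Training with the SageMaker Python SDK only requires a few extra estimator parameters. The following sketch uses placeholder values; the image URI, role, instance type, and S3 paths are assumptions, not TR's actual configuration:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training image URI>",
    role="<SageMaker execution role ARN>",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://<bucket>/output/",
    use_spot_instances=True,          # use managed Spot capacity for training
    max_run=24 * 3600,                # cap on actual training time, in seconds
    max_wait=36 * 3600,               # total time including waiting for Spot capacity (>= max_run)
    checkpoint_s3_uri="s3://<bucket>/checkpoints/",  # lets interrupted jobs resume
)

estimator.fit({"train": "s3://<bucket>/train/"})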
After they launched the training jobs, TR could easily monitor them on the Amazon SageMaker console. The logging and hardware utilization metering facilities allowed the team to have a quick overview of their jobs’ status. For example, they could ensure the training loss was evolving as expected and see how well the allocated GPUs were utilized.
Amazon SageMaker provided TR easy access to state-of-the-art underlying GPU infrastructure without having to provision their own infrastructure or shoulder the burden of managing a set of servers, their security posture, and their patching levels. As faster and cheaper GPU instances become available going forward, TR can use them to reduce cost and training times with a simple configuration change to use the new type. On this project, the team was able to easily experiment with instances from the P2, P3, and G4 family based on their specific needs. AWS also gave TR a broad set of ML services, cost-effective pricing options, granular security controls, and technical support.
Solution overview
Customers operate in complex arenas that move society forward—law, tax, compliance, government, and media—and face increasing complexity as regulation and technology disrupts every industry. TR helps them reinvent the way they work. Using MRC, TR expects to offer natural language searches that outperform previous models that relied on manual feature engineering.
The BERT-based MRC models that the TR research team is developing run on text datasets exceeding several tens of GBs of compressed data. The deep learning frameworks of choice for TR are TensorFlow and PyTorch. The team uses GPU instances for time-consuming neural network training jobs, with runtimes ranging from tens of minutes to several days.
The MRC team has experimented with many variants of BERT, starting from the base model (12 layers of stacked transformer encoders and 12 attention heads, roughly 100 million parameters) and moving up to the large model (24 layers, 16 heads, and roughly 300 million parameters). The availability of V100 GPUs with 32 GB of memory was instrumental in training the largest model variants. The team formulated the question-answering problem as a binary classification task. Each QA pair is graded by a pool of subject matter experts (SMEs) assigning one of four different grades: A, C, D, and F, where A is for perfect answers and F for completely wrong answers. The grades of each QA pair are converted to numbers, averaged across graders, and binarized.
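To make that label-preparation step concrete, the following sketch shows one way such grades could be converted to numbers, averaged across graders, and binarized; the numeric mapping and the 0.5 threshold are illustrative assumptions, not TR's actual values:

import pandas as pd

# Hypothetical grade-to-score mapping (A best, F worst)
grade_to_score = {"A": 1.0, "C": 0.66, "D": 0.33, "F": 0.0}

grades = pd.DataFrame({
    "qa_pair_id": [1, 1, 1, 2, 2],
    "grade": ["A", "A", "C", "F", "D"],
})

# Average each QA pair's scores across graders, then binarize with a threshold
scores = (
    grades.assign(score=grades["grade"].map(grade_to_score))
    .groupby("qa_pair_id")["score"]
    .mean()
)
labels = (scores >= 0.5).astype(int)
print(labels)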
Because each question-answering system is domain-specific, the research team used transfer learning and domain-adaptation techniques to enable this capability across different sub-domains (for example, law isn’t a single domain). TR used Amazon SageMaker for both language model pretraining and fine-tuning of their BERT models. When compared to the available on-premises hardware, the Amazon SageMaker P3 instance shrunk the training time from many hours to less than 1 hour for fine-tuning jobs. The pretraining of BERT on the domain-specific corpus was reduced from an estimated several weeks to only a few days. Without the dramatic time savings and cost savings provided by Amazon SageMaker, the TR research team would likely not have completed the extensive experimentation required for this project. With Amazon SageMaker, they made breakthroughs that drove key improvements to their applications, enabling faster and more accurate searches by their users.
For inference, TR used the Amazon SageMaker batch transform function for model scoring on vast amounts of test samples. When testing of model performance was satisfactory, Amazon SageMaker managed hosting enabled real-time inference. TR is taking the results of the research and development effort and moving it to production, where they expect to use Amazon SageMaker endpoints to handle millions of requests per day on highly specialized professional domains.
Secure, easy, and continuous access to the vast amounts of proprietary data
Protecting TR’s intellectual property is very important to the long-term success of the business. Because of this, TR has clear, ever-evolving standards around security and ways of working in the cloud that must be followed to protect their assets.
This raises some key questions for TR’s scientists. How can they create an instance of an Amazon SageMaker notebook (or launch a training job) that’s secure and compliant with TR’s standards? How can a scientist get secure access to TR’s data within Amazon SageMaker? TR needed to ensure scientists could do this consistently, securely, and with minimal effort.
Enter Secure Content Workspaces. SCW is a web-based tool developed by TR’s research and development team and answers these questions. The following diagram shows SCW in the context of TR’s research effort described earlier.
SCW enables secure and controlled access to TR’s data. It also provisions services, like Amazon SageMaker, in ways that are compliant with TR’s standards. With the help of SCW, scientists can work in the cloud with peace of mind knowing they comply with security protocols. SCW lets them focus on what they’re good at—solving hard problems with artificial intelligence (AI).
Conclusion
Thomson Reuters is fully committed to the research and development of state-of-the-art AI capabilities to aid their customers’ work. The MRC research was the latest in these endeavors. Initial results indicate broad applications across TR’s product line—especially for natural language question answering. Whereas past solutions involved extensive feature engineering and complex systems, this new research shows simpler ML solutions are possible. The entire scientific community is very active in this space, and TR is proud to be a part of it.
This research would not have been possible without the significant computational power offered by GPUs and the ability to scale it on demand. The Amazon SageMaker suite of capabilities provided TR with the raw horsepower and necessary frameworks to build, train, and host models for testing. TR built SCW to support cloud-based research and development, like MRC. SCW sets up scientists’ working environment in the cloud and ensures compliance with all of TR’s security standards and recommendations. It made using tools like Amazon SageMaker with TR’s data safe.
Moving forward, the TR research team is looking at introducing a much wider range of AI/ML features based on these powerful deep learning architectures, using Amazon SageMaker and SCW. Examples of such advanced capabilities include on-the-fly answer generation, long text summarization, and fully interactive, conversational, question answering. These capabilities will enable a comprehensive assistive AI system that can guide users toward the best solution for all their information needs.
About the Authors
Mark Roy is a Machine Learning Specialist Solution Architect, helping customers on their journey to well-architected machine learning solutions at scale. In his spare time, Mark loves to play, coach, and follow basketball.
Qingwei Li is a Machine Learning Specialist at Amazon Web Services. He received his Ph.D. in Operations Research after he broke his advisor’s research grant account and failed to deliver the Nobel Prize he promised. Currently he helps customers in the financial services and insurance industry build machine learning solutions on AWS. In his spare time, he likes reading and teaching.
John Duprey is senior director of engineering for the Center for AI and Cognitive Computing (C3) at Thomson Reuters. John and the engineering team work alongside scientists and product technology teams to develop AI-based solutions to Thomson Reuters customers’ most challenging problems.
Filippo Pompili is Sr NLP Research Scientist at the Center for AI and Cognitive Computing (C3) at Thomson Reuters. Filippo has expertise in machine reading comprehension, information retrieval, and neural language modeling. He actively works on bringing state-of-the-art machine learning discoveries into Thomson Reuters’ most advanced products.
Using a test framework to design better experiences with Amazon Lex
Chatbots have become an increasingly important channel for businesses to service their customers. Chatbots provide 24/7 availability and can help customers interact with brands anywhere, anytime, and on any device. To be effective, chatbots must be built with good design, development, testing, and deployment practices. This post provides you with a framework that helps you automate the testing process and reduce the overall bot development cycle for Amazon Lex bots.
Amazon Lex is a service for building conversational interfaces into any application using voice and text. Conversations with Amazon Lex bots can vary from simple, single-turn Q&A to a complex, multi-turn dialog. During the design phase, the conversation designer creates scripts and conversation flow diagrams that encapsulate the different ways a conversation can flow for a particular use case. Establishing an easy-to-use testing interface allows bot designers to iterate and validate their ideas quickly without depending on engineers. During the development and testing phase, an automated test framework helps engineers avoid manual testing and be more productive.
The test framework described in this post empowers designers and engineers to test many conversations in a few minutes, identify where the predicted intents are wrong, and implement improvements. The insights provided by this process allow designers to quickly review intents that may be performing poorly, prioritize intents by importance, and modify the bot design to ensure minimal overlap between intents.
Solution architecture
The following diagram illustrates the architecture of our solution.
A test framework for chatbots can empower builders with the ability to upload test suites, run tests, and get test results comprised of accuracy information and test case level outcomes. The solution architecture provides you with the following capabilities:
- Test suites comprised of CSV files are uploaded to an Amazon Simple Storage Service (Amazon S3) bucket. These test suites adhere to a predefined format described later in this post.
- Tests are triggered using an Amazon API Gateway endpoint path /test/run, which runs the test cases against Amazon Lex and returns a test ID, confusion matrix, and summary metrics. The results are also stored in an Amazon DynamoDB table.
- Test details are retrieved from the DynamoDB table using another API path /test/details/{id}, which returns test case outcomes for the specified test ID.
Deploying the AWS CloudFormation template
You can deploy this architecture using the provided AWS CloudFormation template in us-east-1.
- Choose Launch Stack.
- Choose Next.
- For Name, enter a stack name.
- Choose Next.
- In the Capabilities and transforms section, select all three check boxes to provide acknowledgment to AWS CloudFormation to create AWS Identity and Access Management (IAM) resources and expand the template.
- Choose Create stack.
This process might take 5 minutes or more to complete. The stack creates the following resources:
- Two DynamoDB tables to store the testing results
- Four AWS Lambda functions
- An API Gateway endpoint that is called by the client application
API key and usage plan
After the CloudFormation template finishes deploying the infrastructure, you see the following values on the Outputs tab: ApiGWKey and LexTestResultAPI.
The LexTestResultAPI requires an API key. The AWS CloudFormation output ApiGWKey refers to the name of the API key. As of this writing, this API key is associated with a usage plan that allows 2,000 requests per month.
- On the stack Outputs tab, choose the link for ApiGWKey.
The API keys section of the API Gateway console opens.
- Choose Show next to the API key.
- Copy the API key to use when testing the API.
- You can manage the usage plan by following the instructions on Create, configure, and test usage plans with the API Gateway console.
- You can also add fine-grained authentication and authorization to your APIs. For more information about securing your APIs, see Controlling and managing access to a REST API in API Gateway.
Setting up your sample UI
You can build a user interface (UI) using your preferred technology stack to trigger the tests and view the results. This UI needs to be configured with the APIs created as part of running the CloudFormation template. You can also use the provided simple HTML file to follow along with this post, but we recommend building your own user interface for enterprise use.
- Download the sample UI project.
- In the index.html file, update the APIKey and APIUrl values with the values created by the CloudFormation template:

var APIKey = "<API Key>"
var APIUrl = "<API URL, eg: https://xxxxxx.execute-api.us-east-1.amazonaws.com/prod/>"
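If you prefer to trigger a test from a script rather than the sample UI, you can also call the API directly, passing the API key in the x-api-key header. The following sketch uses the Python requests library; the request body fields mirror the inputs the sample UI collects (bot name, alias, and the S3 URL of the test file) and are assumptions, not a documented schema:

import requests

api_url = "<API URL, eg: https://xxxxxx.execute-api.us-east-1.amazonaws.com/prod/>"
api_key = "<API Key>"

# Assumed payload fields based on the UI inputs described in this post
payload = {
    "botName": "<YOUR BOT NAME>",
    "botAlias": "<YOUR BOT ALIAS>",
    "testFileS3Url": "<S3 URL OF THE TEST CSV>"
}

response = requests.post(
    api_url + "test/run",
    json=payload,
    headers={"x-api-key": api_key}  # API Gateway API key header
)
print(response.status_code)
print(response.json())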
Testing the solution
To demonstrate the test framework, we’ve created a simple banking bot that includes several intents, such as:
BusinessHours
CancelOrder
CancelTransfer
Check balance
MyFallbackIntent
OrderChecks
TransferFunds
This application has purposefully been designed to have failures, either from overlapping intents or missing utterances. This illustrates how the test framework surfaces issues to fix.
Setting up your test
To set up your test, complete the following steps:
- Download the sample bot with conflicting intents and import it on the console.
- Build the bot and create an alias.
- Create a set of test data in a CSV file (you can use the sample test file to follow along in this post).
- Upload the file to a S3 bucket in your account.
Running a test
To run your test, complete the following steps:
- Open the index.html from the sample UI code in a web browser.
- Choose the sample bot you created.
- Choose the alias you created.
- Enter the Amazon S3 URL for the sample test file.
- Choose Run.
Examining your results
In a few moments, you see a response that has a confusion matrix and test results. The intents that Amazon Lex predicted for each utterance are across the horizontal axis. The vertical axis contains the intents from the ground truth data specified in our test file. The center diagonal from the top left cell to the bottom right cell indicates where intents in the ground truth dataset match the predicted intents.
Any values that fall outside of that center diagonal indicate areas of improvement in the bot design.
In this post, we discuss a few examples from the sample banking bot, which was purposefully designed to have issues.
In the test data CSV file, the first column has a ConversationID
label. Each set of utterances making up a conversation is grouped by ID number. Some conversations are a single turn, meaning the request can be satisfied by the bot without the bot asking the user for additional or clarifying slot information. For example, in our simple banking app, the user can ask about the hours of operation and receive an answer in a single turn. In our test data, we’ve included several utterances expected to trigger the BusinessHours
intent.
The confusion matrix shows that all utterances that should trigger the BusinessHours
intent did so. There are no additional values on the predicted axis aside from the BusinessHours
intent, which means this intent is working well. Under the confusion matrix, a more detailed view shows which utterances succeeded and which failed in the test conversations. Again, our single-turn conversations 1, 2, and 3 are each shown to have succeeded in the Result column.
A quick scan of the other intents indicates that not all were so successful. Let’s take a look at a multi-turn conversation that didn’t perform as well. In the confusion matrix, the TransferFunds
row shows that none of our actual utterances were predicted to trigger the TransferFunds
intent. That is a problem.
Conversation 15 is a multi-turn conversation intended to trigger the TransferFunds
intent. However, it’s shown to have failed. The utterance tested is “Move money from one account to another.” That seems like a reasonable thing for someone to say if they’d like to transfer money, but our model is mapping it to one of our other intents.
To fix the problem, return to the Amazon Lex console and open the TransferFunds
intent. There are only a few sample utterances and none of the utterances include the words “move” or “money.”
It’s no wonder that the bot didn’t know to map an utterance like “Move money from one account to another” to this intent. The best way to fix this is to include additional sample utterances to cover the various ways people may communicate that they want to transfer funds. The other area to look at is those intents that were mis-predicted as being appropriate. Make sure that the sample utterances used for those intents don’t conflict or overlap with utterances that should be directed to TransferFunds
.
In the following examples, the bot may be having trouble as indicated in our test output. Slot values are important pieces of information that help the bot fulfill a user’s request, so it’s important that they’re accurately identified. In the Test Conversations section of the test framework, the columns Slots
and Predicted Slots
should match, otherwise there’s an issue. In our sample bot, conversation 13 indicates that there was a mismatch.
Finally, the SessionAttr
and PredictedSessionAttr
columns should match. Otherwise, there may be an issue in the validation or fulfillment Lambda function that is preventing session attributes from being captured. The following screenshot shows conversation 9, in which the SessionAttr
column has a forced inaccuracy to demonstrate the mismatch. There is only one session attribute captured in the PredictedSessionAttr
column.
The following is the full test conversations matrix. As an exercise, you can try modifying the sample bot design to turn the failure results to successes.
Cleaning up
To remove all the resources created throughout this process and prevent additional costs, delete the CloudFormation stack you created. This removes all the resources the template created.
Conclusion
Having a test framework that enables chatbot owners to automatically run test cases that cover different conversation pathways is extremely useful for expediting the launch of a well-tested chatbot. This reduces the time that you have to put into testing a chatbot comprised of different intents and slots. This post provides an architecture pattern for implementing a test framework for chatbots built using Amazon Lex to get you started with an important capability that can accelerate the delivery of your conversational AI experiences. Start building your conversational AI experiences with Amazon Lex.
About the Authors
Shanthan Kesharaju is a Senior Architect at AWS who helps our customers with AI/ML strategy and architecture. Shanthan has over a decade of experience managing diverse teams in both product and engineering. He is an award winning product manager and has built top trending Alexa skills. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.
Marty Jiang is a Conversational AI Consultant with AWS Professional Services. Outside of work, he loves spending time outdoors with his family and exploring new technologies.
Claire Mitchell is a Conversational AI Design Consultant with AWS Professional Services. Occasionally, she spends time exploring speculative design practices, and finding patterns in bits and beats.
NeurIPS: Shipra Agrawal on the appeal of reinforcement learning
This Amazon Scholar’s work spans two of the most popular topics at the most popular AI conference: reinforcement learning and bandit problems.
Automated model refresh with streaming data
In today’s world, being able to quickly bring on-premises machine learning (ML) models to the cloud is an integral part of any cloud migration journey. This post provides a step-by-step guide for launching a solution that facilitates the migration journey for large-scale ML workflows. This solution was developed by the Amazon ML Solutions Lab for customers with streaming data applications (e.g., predictive maintenance, fleet management, autonomous driving). Some of the AWS services used in this solution include Amazon SageMaker, which is a fully managed service that provides every developer and data scientist with the ability to build, train, and deploy ML models quickly, and Amazon Kinesis, which helps with real-time data ingestion at scale.
Being able to automatically refresh ML models with new data can be of high value to any business when an ML model drifts. Amazon SageMaker Model Monitor continuously monitors the quality of Amazon SageMaker ML models in production. It enables you to set alerts for when deviations in the model quality occur. The solution presented in this post provides a model refresh architecture that is launched with one click via an AWS CloudFormation template and enables these capabilities on the fly. You can quickly connect your real-time streaming data via Kinesis, store the data on Amazon Redshift, schedule training and deployment of ML models using Amazon EventBridge, orchestrate jobs with AWS Step Functions, take advantage of AutoML capabilities during model training via AutoGluon, and get real-time inference from your frequently updated models. All this is available in a matter of a few minutes. The CloudFormation stack creates, configures, and connects the necessary AWS resources.
The rest of the post is structured as follows:
- Overview of the solution and how the services and architecture are set up
- Details of data ingestion, automated and scheduled model refresh, and real-time model inference modules
- Instructions on how to launch the solution on AWS via a CloudFormation template
- Cost aspects
- Cleanup instructions
Solution overview
The following diagram depicts the solution architecture, which contains three fully integrated modules:
- Data ingestion – Enables real-time data ingestion from either an IoT device or data uploaded by the user, and real-time data storage on a data lake. This functionality is specifically tailored for situations where there is a need for storing and organizing large amounts of real-time data on a data lake.
- Scheduled model refresh – Provides scheduling and orchestrating ML workflows with data that is stored on a data lake, as well as training and deployment using AutoML capabilities.
- Real-time model inference – Enables getting real-time predictions from the model that is trained and deployed in the previous step.
In the following sections, we provide details of the workflow and services used in each module.
Data ingestion
In this module, data is ingested from either an IoT device or sample data uploaded into an S3 bucket. The workflow is as follows:
- The streaming option via data upload is mainly used to test the streaming capability of the architecture. In this case, a user uploads a sample CSV data into an Amazon Simple Storage Service (Amazon S3) bucket. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance.
- Uploading the data triggers an AWS Lambda function. Lambda lets you run code without provisioning or managing servers. You pay only for the compute time you consume.
- When the Lambda function is triggered, it reads the data and sends it in streams to Amazon Kinesis Data Streams (a sketch of such a function follows this list). Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Alternatively, an external IoT device can also be connected directly to Kinesis Data Streams. The data can then be streamed via Kinesis Data Streams.
- The Kinesis streaming data is then automatically consumed by Amazon Kinesis Data Firehose. Kinesis Data Firehose loads streaming data into data lakes, data stores, and analytics services. It’s a fully managed service that automatically scales to match the throughput of your data and requires no ongoing administration. Data captured by this service can optionally be transformed and stored into an S3 bucket as an intermediate process.
- The stream of data in the S3 bucket is loaded into an Amazon Redshift cluster and stored in a database. Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster. Each cluster runs an Amazon Redshift engine and contains one or more databases.
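The following code is a minimal sketch of the kind of Lambda function described above: it reads the uploaded CSV from Amazon S3 and forwards each row to Kinesis Data Streams. The stream name is a placeholder, and the choice of partition key is an assumption for this example:

import csv
import json
import boto3

s3 = boto3.client("s3")
kinesis = boto3.client("kinesis")

STREAM_NAME = "<YOUR KINESIS DATA STREAM NAME>"

def lambda_handler(event, context):
    # The S3 event tells us which object was uploaded
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = csv.DictReader(body.splitlines())

    for row in rows:
        # One Kinesis record per CSV row, partitioned by a column such as the ticker symbol
        kinesis.put_record(
            StreamName=STREAM_NAME,
            Data=json.dumps(row),
            PartitionKey=row.get("ticker_symbol", "default")
        )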
Scheduled model refresh
In this module, you can schedule events using EventBridge, which is a serverless event bus that makes it easy to build event-driven applications. In this solution, we use EventBridge as a scheduler to regularly run the ML pipeline to refresh the model. This keeps the ML model up to date.
The architecture presented in this post uses Step Functions and Lambda functions to orchestrate the ML workflows from data querying to model training. Step Functions is a serverless function orchestration service that makes it easy to sequence Lambda functions and multiple AWS services into business-critical applications. It reduces the amount of code you have to write by providing visual workflows to enable fast translation of business requirements into technical requirements. Additionally, it manages the logic of your application by managing state, checkpoints, and restarts, as well as error handing capabilities such as try and catch, retry, and rollback. Basic primitives such as branching, parallel execution, and timeouts are also implemented to reduce repeated code.
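For reference, the following code sketches how such a schedule could be wired up with Boto3: an EventBridge rule on a fixed schedule that targets the Step Functions state machine. The ARNs and rule name are placeholders for your own deployment:

import boto3

events = boto3.client("events")

rule_name = "ml-pipeline-daily-refresh"
state_machine_arn = "arn:aws:states:us-east-1:<ACCOUNT-ID>:stateMachine:<STATE-MACHINE-NAME>"
events_role_arn = "arn:aws:iam::<ACCOUNT-ID>:role/<EVENTBRIDGE-INVOKE-STEPFUNCTIONS-ROLE>"

# Run the pipeline once a day; a cron expression such as cron(0 11 * * ? *) also works
events.put_rule(
    Name=rule_name,
    ScheduleExpression="rate(1 day)"
)

events.put_targets(
    Rule=rule_name,
    Targets=[{
        "Id": "ml-refresh-state-machine",
        "Arn": state_machine_arn,
        "RoleArn": events_role_arn
    }]
)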
In this architecture, the following subsequent steps are triggered within each state machine:
- AWS Batch to run queries to Amazon Redshift to ETL data – This architecture triggers and controls an AWS Batch job to run SQL queries on the data lake using Amazon Redshift. The results of the queries are stored on the specified S3 bucket. Amazon Redshift is Amazon’s data warehousing solution. With Amazon Redshift, you can query petabytes of structured and semi-structured data across your data warehouse, operational database, and your data lake using standard SQL.
- Data preprocessing using Amazon SageMaker – Amazon SageMaker Processing is a managed data preprocessing solution within Amazon SageMaker. It processes the raw extract, transform, and load (ETL) data and makes it ingestible by ML algorithms. It launches a processing container, pulls the query results from the S3 bucket, and runs a custom preprocessing script to perform data processing tasks such as feature engineering, data validation, train/test split, and more. The output is then stored on the specified S3 bucket.
- Training and deploying ML models using Amazon SageMaker – This step in the architecture launches an ML training job using the AutoGluon Tabular implementation available through AWS Marketplace to train on the processed and transformed data, and then store the model artifacts on Amazon S3. It then deploys the best model trained via an automatic ML approach on an Amazon SageMaker endpoint.
Amazon SageMaker lets you build, train, and deploy ML models quickly by removing the heavy lifting from each step of the process.
AutoGluon is an automatic ML toolkit that enables you to use automatic hyperparameter tuning, model selection, and data processing. AutoGluon Tabular is an extension to AutoGluon that allows for automatic ML capabilities on tabular data. It’s suitable for regression and classification tasks with tabular data containing text, categorical, and numeric features. Accuracy is automatically boosted via multi-layer stack ensembling, deep learning, and data-splitting (bagging) to curb over-fitting.
AWS Marketplace is a digital catalog with software listings from independent software vendors that make it easy to find, test, buy, and deploy software that runs on AWS. The AWS Marketplace implementation of AutoGluon Tabular allows us to treat the algorithm as an Amazon SageMaker built-in algorithm, which speeds up development time.
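The following is a minimal sketch of that pattern using the SageMaker Python SDK’s AlgorithmEstimator; the algorithm ARN comes from your AWS Marketplace subscription, and the training channel name, instance types, role ARN, and S3 paths shown here are assumptions.

from sagemaker.algorithm import AlgorithmEstimator

autogluon = AlgorithmEstimator(
    algorithm_arn='<autogluon-tabular-algorithm-arn>',  # from your AWS Marketplace subscription
    role='arn:aws:iam::<account-id>:role/<sagemaker-execution-role>',  # placeholder
    instance_count=1,
    instance_type='ml.m5.2xlarge',
    output_path='s3://model-refresh-output-bucket-<region>-<account-id>/model-refresh/model/',
)

# Train on the preprocessed data and deploy the resulting model to a real-time endpoint
autogluon.fit({'training': 's3://model-refresh-output-bucket-<region>-<account-id>/model-refresh/train/'})
predictor = autogluon.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')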
Real-time model inference
The inference module of this architecture launches a REST API using Amazon API Gateway with Lambda integration, allowing you to immediately get real-time inference on the deployed AutoGluon model. The Lambda function accepts user input via the REST API and API Gateway, converts the input, and communicates with the Amazon SageMaker endpoint to obtain predictions from the trained model.
API Gateway is a fully managed service that makes it easy to create, publish, maintain, monitor, and secure APIs at any scale. It also provides tools for creating and documenting web APIs that route HTTP requests to Lambda functions.
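A minimal sketch of such a Lambda handler follows; it assumes the request body is JSON with ticker_symbol, sector, and change fields and that the endpoint accepts a CSV payload, which may differ from the exact format used by the deployed InferenceLambdaFunction.

import json
import os
import boto3

runtime = boto3.client('sagemaker-runtime')

def lambda_handler(event, context):
    # API Gateway proxy integration passes the request body as a string
    body = json.loads(event['body'])
    payload = '{},{},{}'.format(body['ticker_symbol'], body['sector'], body['change'])

    response = runtime.invoke_endpoint(
        EndpointName=os.environ['ENDPOINT_NAME'],  # set via the Lambda environment variable
        ContentType='text/csv',
        Body=payload,
    )
    prediction = response['Body'].read().decode('utf-8')

    return {
        'statusCode': 200,
        'body': json.dumps({'predicted_price': prediction}),
    }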
The following diagram depicts the steps that are taken for an end-to-end run of the solution, from a task orchestration point of view. The visual workflow graph is available on the Step Functions console.
Launching the CloudFormation template
The following section explains the steps for launching this solution.
Before you get started, make sure you have the following:
- Access to an AWS account
- Permissions to create a CloudFormation stack
- Permissions to create an AWS Identity and Access Management (IAM) role and other AWS resources
Choose Launch Stack and follow the steps to create all the AWS resources to deploy the solution. This solution is deployed in the us-east-1 Region.
After a successful deployment, you can test the solution using sample data.
Testing the solution
To demonstrate the capabilities of the solution, we have provided an example implementation using stocks data. The dataset consists of around 150,000 observations from the most popular stocks being bought and sold, with the columns ticker_symbol, sector, change, and price. The ML task is regression, and the target column price is a continuous variable. The following example shows how to start streaming such data using the data ingestion module; how to schedule an automated ML training and deployment with the scheduled model refresh module; and how to predict a stock price by providing its ticker symbol, sector, and change information using the inference module.
Note: This post is for demonstration purposes only. It does not attempt to build a viable stock prediction model for real-world use. Nothing in this post should be construed as investment advice.
Before starting the testing process, you need to subscribe to AutoGluon on AWS Marketplace.
- Choose Continue to Subscribe.
- Choose Accept Offer.
- Choose Continue to configuration.
- For Software version, choose your software version.
- For Region, choose your Region.
- Choose View in Amazon SageMaker.
You’re redirected to the Amazon SageMaker console.
Data ingestion from streaming data
You can use an Amazon SageMaker notebook instance or an Amazon Elastic Compute Cloud (Amazon EC2) instance to run the following commands. For instructions in Amazon SageMaker, see Create a Notebook Instance. In the following code, replace <account-id> with your AWS account ID and <region> with your current Region, which for this post is us-east-1.
- Copy the data and other artifacts of the solution in the newly created input S3 bucket by running the following command:
aws s3 cp --recursive s3://aws-ml-blog/artifacts/Automated-model-refresh-with-streaming-data/ s3://model-refresh-input-bucket-<region>-<account-id>/model-refresh/
- On the Amazon S3 console, or using the AWS Command Line Interface (AWS CLI), modify the contents of the unload SQL script by replacing the <region> and <account-id> entries, and re-upload it to Amazon S3. The file is located at s3://model-refresh-input-bucket-<region>-<account-id>/model-refresh/sql/script_unload.sql. This script copies the contents of stock_table from the Amazon Redshift database into the newly created S3 bucket, which is then used to train and evaluate a model. See the following code:
unload ('SELECT * FROM stock_table')
to 's3://model-refresh-output-bucket-<region>-<account-id>/model-refresh/base-table/'
iam_role 'arn:aws:iam::<account-id>:role/RedshiftS3AccessRole'
HEADER;
- On the Amazon Redshift console, create a new table within the newly created Amazon Redshift cluster, model-refresh-cluster.
- Connect to the database dev within the cluster with the following credentials (you can change the password later because this is automatically created from the CloudFormation template):
Database: dev
Username: awsuser
Password: Password#123
- Create a table named stock_table within the database. You can also change the table name and schema later. You can run the following command in the query editor after connecting to the Amazon Redshift cluster:
CREATE TABLE stock_table ( ticker_symbol VARCHAR(65535), sector VARCHAR(65535), change FLOAT, price FLOAT );
- Copy the data from the input bucket to the newly created Amazon Redshift table. The IAM role is the role associated with the cluster that has at least Amazon S3 read access. Run the following query via the Amazon Redshift query editor:
copy stock_table from 's3://aws-ml-blog/artifacts/Automated-model-refresh-with-streaming-data/data/data_all.csv' iam_role 'arn:aws:iam::<account-id>:role/RedshiftS3AccessRole' FORMAT AS CSV IGNOREHEADER AS 1;
- Check that the data was copied to the database. When you run the following SQL, you should get 153,580 rows:
select count(*) from stock_table;
The next step is to test if the data streaming pipeline is working as expected.
- Copy a sample CSV file from the input bucket into the newly created output S3 bucket, model-refresh-output-bucket-<region>-<account-id>, using the AWS CLI. This sample CSV file contains only 10 observations for test purposes.
aws s3 cp s3://model-refresh-input-bucket-<region>-<account-id>/model-refresh/data/data_sample.csv s3://model-refresh-output-bucket-<region>-<account-id>/model-refresh/stream-input-data/
After it’s copied to the S3 bucket, the streaming functionality is triggered.
After 3–5 minutes, check if the streamed data is loaded into the Amazon Redshift table.
- On the Amazon Redshift console, run the following SQL statement to see the number of rows in the table. The count should have increased by 10 from the previous total.
select count(*) from stock_table;
Scheduling automated ML training and deployment
In this section, you schedule the automated model training and deployment.
- On the EventBridge console, choose Create Rule.
- Enter a name for the rule.
- For Define a pattern, choose Schedule.
- Define any frequency (such as 1 per day).
- Leave the event bus as its default.
- For the target, select Step Functions state machine.
- Choose model-refresh-state-machine.
This points the scheduler to the state machine that trains and deploys the model.
To configure the input, you need to pass the required parameters for the state machine.
- Enter the value of ExecutionInput from the CloudFormation stack outputs. The JSON looks like the following code:
{ "DBConnection": { "jobName": "etl-job" ... }
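If you want to run the pipeline immediately instead of waiting for the schedule, you can also start the state machine directly; the following is a minimal sketch with boto3, assuming you paste the full ExecutionInput value from the CloudFormation stack outputs and use the state machine ARN from your deployment.

import boto3

sfn = boto3.client('stepfunctions')

# Paste the ExecutionInput JSON from the CloudFormation stack outputs
execution_input = '{"DBConnection": {"jobName": "etl-job"}}'  # truncated example; use the full value

sfn.start_execution(
    stateMachineArn='arn:aws:states:us-east-1:<account-id>:stateMachine:model-refresh-state-machine',
    input=execution_input,
)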
You can track the progress of the model refresh by navigating to the Step Functions console and choosing the corresponding state machine. A sample of the visual workflow is shown in the Real-time model inference section of this post.
Inference from the deployed model
Let’s test the deployed model using Postman, which is an HTTP client for testing web services. Download the latest version.
During the previous step, an endpoint was created and is available for inference. You need to update the Lambda function for inference with this new endpoint name.
- On the Amazon SageMaker console, choose Endpoints.
- Copy the name of the endpoint you created.
- On the Lambda console, choose InferenceLambdaFunction.
- Under Environment variables, update the ENDPOINT_NAME variable to the new endpoint name.
When you deployed your API Gateway, it provided the invoke URL that looks like the following:
https://{restapi_id}.execute-api.us-east-1.amazonaws.com/predict/lambda
It follows the format:
https://{restapi_id}.execute-api.{region}.amazonaws.com/{stage_name}/{resource_name}
You can locate this link on the API Gateway console under Stages.
- Enter the invoke URL into Postman.
- Choose POST as the method.
- On the Body tab, enter the test data.
- Choose Send to see the returned result.
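If you prefer to test from a script instead of Postman, the following is a minimal sketch using the Python requests library; the JSON body shown is an assumed shape based on the dataset’s columns and example values, so match it to whatever format your inference Lambda function expects.

import requests

# Replace with the invoke URL from the API Gateway console
url = 'https://<restapi_id>.execute-api.us-east-1.amazonaws.com/predict/lambda'

# Assumed request shape: the features used to predict the stock price
payload = {'ticker_symbol': 'AMZN', 'sector': 'TECHNOLOGY', 'change': 0.5}

response = requests.post(url, json=payload)
print(response.status_code, response.text)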
Customizing the solution
In this post, we use Amazon Redshift, which achieves efficient storage and optimum query performance through a combination of massively parallel processing, columnar data storage, and targeted data compression encoding schemes. Another option for the ETL process is AWS Glue, which is a fully managed ETL service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. You can easily use AWS Glue instead of Amazon Redshift by replacing a state in the state machine with one for AWS Glue. For more information, see Manage AWS Glue Jobs with Step Functions.
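For example, the ETL state could be swapped for a Step Functions task that calls AWS Glue through its native service integration; the following fragment is a minimal sketch, and the job name and neighboring state name are hypothetical.

# A Step Functions state (expressed here as a Python dict) that runs an AWS Glue job
# synchronously in place of the AWS Batch/Amazon Redshift ETL state
glue_etl_state = {
    'RunGlueETL': {
        'Type': 'Task',
        'Resource': 'arn:aws:states:::glue:startJobRun.sync',
        'Parameters': {'JobName': '<your-glue-etl-job>'},  # hypothetical Glue job name
        'Next': 'PreprocessData',  # hypothetical next state
    }
}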
Similarly, you can add several components (such as an A/B testing module) to the state machine by editing the JSON text. You can also bring your own ML algorithm and use it for training instead of the AutoGluon automatic ML.
Service costs
Amazon S3, Lambda, Amazon SageMaker, Amazon API Gateway, and Step Functions are included in the AWS Free Tier, with charges for additional use. For more information, see each service's pricing page.
EventBridge is free for AWS service events, with charges for custom, third-party, and cross-account events. For more information, see Amazon EventBridge pricing.
Kinesis Data Firehose charges vary based on amount of data ingested, format conversion, and VPC delivery. Kinesis Data Streams charges vary based on throughput and number of payload units. For more information, see Amazon Kinesis Data Firehose pricing and Amazon Kinesis Data Streams pricing.
Amazon Redshift charges vary by the AWS Region and compute instance used. For more information, see Amazon Redshift pricing.
There is no additional charge for AWS Batch. You only pay for the AWS resources you create to store and run your batch jobs.
Cleaning up
To avoid recurring charges, delete the input and output S3 buckets (model-refresh-input-bucket-<region>-<account-id> and model-refresh-output-bucket-<region>-<account-id>).
After the buckets are successfully removed, delete the created CloudFormation stack. Deleting a CloudFormation stack deletes all the created resources.
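A minimal cleanup sketch with boto3 follows; the stack name is a placeholder, so use the name you chose when launching the CloudFormation template.

import boto3

s3 = boto3.resource('s3')

# Empty the input and output buckets before deleting them and the stack
for bucket_name in ['model-refresh-input-bucket-<region>-<account-id>',
                    'model-refresh-output-bucket-<region>-<account-id>']:
    s3.Bucket(bucket_name).objects.all().delete()

# Delete the CloudFormation stack, which removes the remaining resources
boto3.client('cloudformation').delete_stack(StackName='<your-stack-name>')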
Conclusion
This post demonstrated a solution that facilitates cloud adoption and migration of existing on-premises ML workflows for large-scale data. The solution is launched via a CloudFormation template and provides efficient ETL processes for capturing high-velocity streaming data, easy and automated ways to build and orchestrate ML models, and endpoints for real-time inference from the deployed model.
If you’d like help accelerating your use of ML in your products and processes, please contact the Amazon ML Solutions Lab.
About the Authors
Mehdi Noori is a Data Scientist at the Amazon ML Solutions Lab, where he works with customers across various verticals and helps them accelerate their cloud migration journey and solve their ML problems using state-of-the-art solutions and technologies.
Yohei Nakayama is a Deep Learning Architect at the Amazon Machine Learning Solutions Lab, where he works with customers across different verticals to accelerate their use of artificial intelligence and AWS cloud services to solve their business challenges. He is interested in applying ML/AI technologies to the space industry.
Tesfagabir Meharizghi is a Data Scientist at the Amazon ML Solutions Lab where he helps customers across different industries accelerate their use of machine learning and AWS cloud services to solve their business challenges.
Ninad Kulkarni is a data scientist in the Amazon Machine Learning Solutions Lab. He helps customers adopt ML and AI by building solutions to address their business problems. Most recently, he has built predictive models for sports and automotive customers.