Bringing your own custom container image to Amazon SageMaker Studio notebooks
Amazon SageMaker Studio is the first fully integrated development environment (IDE) for machine learning (ML). SageMaker Studio lets data scientists spin up Studio notebooks to explore data, build models, launch Amazon SageMaker training jobs, and deploy hosted endpoints. Studio notebooks come with a set of pre-built images, which consist of the Amazon SageMaker Python SDK and the latest version of the IPython runtime or kernel. With this new feature, you can bring your own custom images to Amazon SageMaker notebooks. These images are then available to all users authenticated into the domain. In this post, we share how to bring a custom container image to SageMaker Studio notebooks.
Developers and data scientists may require custom images for several different use cases:
- Access to specific or latest versions of popular ML frameworks such as TensorFlow, MXNet, PyTorch, or others.
- Bring custom code or algorithms developed locally to Studio notebooks for rapid iteration and model training.
- Access to data lakes or on-premises data stores via APIs, which requires admins to include the corresponding drivers within the image.
- Access to a backend runtime, also called kernel, other than IPython such as R, Julia, or others. You can also use the approach outlined in this post to install a custom kernel.
In large enterprises, ML platform administrators often need to ensure that any third-party packages and code are pre-approved by security teams for use, and not downloaded directly from the internet. A common workflow might be that the ML platform team approves a set of packages and frameworks for use, builds a custom container using these packages, tests the container for vulnerabilities, and pushes the approved image to a private container registry such as Amazon Elastic Container Registry (Amazon ECR). Now, ML platform teams can directly attach approved images to the Studio domain (see the following workflow diagram). You can simply select the approved custom image of your choice in Studio. You can then work with the custom image locally in your Studio notebook. With this release, a single Studio domain can contain up to 30 custom images, with the option to add a new version or delete images as needed.
We now walk through how you can bring a custom container image to SageMaker Studio notebooks using this feature. Although we demonstrate the default approach over the internet, we include details on how you can modify this to work in a private Amazon Virtual Private Cloud (Amazon VPC).
Prerequisites
Before getting started, you need to make sure you meet the following prerequisites:
- Have an AWS account.
- Ensure that the execution role you use to access Amazon SageMaker has the following AWS Identity and Access Management (IAM) permissions, which allow SageMaker Studio to create a repository in Amazon ECR with the prefix smstudio, and grant permissions to push and pull images from this repo. To use an existing repository, replace the Resource with the ARN of your repository. To build the container image, you can either use a local Docker client or create the image from SageMaker Studio directly, which we demonstrate here. To create a repository in Amazon ECR, SageMaker Studio uses AWS CodeBuild, and you also need to include the CodeBuild permissions shown below.

{
    "Effect": "Allow",
    "Action": [
        "ecr:CreateRepository",
        "ecr:BatchGetImage",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeImages",
        "ecr:DescribeRepositories",
        "ecr:UploadLayerPart",
        "ecr:ListImages",
        "ecr:InitiateLayerUpload",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:PutImage"
    ],
    "Resource": "arn:aws:ecr:*:*:repository/smstudio*"
},
{
    "Effect": "Allow",
    "Action": "ecr:GetAuthorizationToken",
    "Resource": "*"
}

{
    "Effect": "Allow",
    "Action": [
        "codebuild:DeleteProject",
        "codebuild:CreateProject",
        "codebuild:BatchGetBuilds",
        "codebuild:StartBuild"
    ],
    "Resource": "arn:aws:codebuild:*:*:project/sagemaker-studio*"
}
- Your SageMaker role should also have a trust policy with AWS CodeBuild as shown below. For more information, see Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "codebuild.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
- Install the AWS Command Line Interface (AWS CLI) on your local machine. For instructions, see Installing the AWS CLI.
- Have a SageMaker Studio domain. To create a domain, use the CreateDomain API or the create-domain CLI command.
If you wish to use your private VPC to securely bring your custom container, you also need the following:
- A VPC with a private subnet
- VPC endpoints for the following services:
- Amazon Simple Storage Service (Amazon S3)
- Amazon SageMaker
- Amazon ECR
- AWS Security Token Service (AWS STS)
- CodeBuild for building Docker containers
To set up these resources, see Securing Amazon SageMaker Studio connectivity using a private VPC and the associated GitHub repo.
Creating your Dockerfile
To demonstrate the common need from data scientists to experiment with the newest frameworks, we use the following Dockerfile, which uses the latest TensorFlow 2.3 version as the base image. You can replace this Dockerfile with a Dockerfile of your choice. Currently, SageMaker Studio supports a number of base images, such as Ubuntu, Amazon Linux 2, and others. The Dockerfile installs the IPython runtime required to run Jupyter notebooks, and installs the Amazon SageMaker Python SDK and boto3.
In addition to notebooks, data scientists and ML engineers often iterate and experiment on their local laptops using various popular IDEs such as Visual Studio Code or PyCharm. You may wish to bring these scripts to the cloud for scalable training or data processing. You can include these scripts as part of your Docker container so they’re visible in your local storage in SageMaker Studio. In the following code, we copy the train.py
script, which is a base script for training a simple deep learning model on the MNIST dataset. You may replace this script with your own scripts or packages containing your code.
FROM tensorflow/tensorflow:2.3.0

RUN apt-get update
RUN apt-get install -y git
RUN pip install --upgrade pip
RUN pip install ipykernel && \
    python -m ipykernel install --sys-prefix && \
    pip install --quiet --no-cache-dir \
    'boto3>1.0,<2.0' \
    'sagemaker>2.0,<3.0'

# Replace with your own custom scripts or packages
COPY train.py /root/train.py
import tensorflow as tf
import os
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1)
model.evaluate(x_test, y_test)
Instead of a custom script, you can also include other files, such as Python files that access client secrets and environment variables via AWS Secrets Manager or AWS Systems Manager Parameter Store, config files to enable connections with private PyPi repositories, or other package management tools. Although you can copy the script using the custom image, any ENTRYPOINT or CMD commands in your Dockerfile don’t run.
Setting up your installation folder
You need to create a folder on your local machine, and add the following files in that folder:
- The Dockerfile that you created in the previous step
- A file named app-image-config-input.json with the following content:

{
    "AppImageConfigName": "custom-tf2",
    "KernelGatewayImageConfig": {
        "KernelSpecs": [
            {
                "Name": "python3",
                "DisplayName": "Python 3"
            }
        ],
        "FileSystemConfig": {
            "MountPath": "/root/data",
            "DefaultUid": 0,
            "DefaultGid": 0
        }
    }
}
We set the backend kernel for this Dockerfile as an IPython kernel, and provide a mount path to Amazon Elastic File System (Amazon EFS). Amazon SageMaker recognizes kernels as defined by Jupyter. For example, for an R kernel, set Name in the preceding code to ir.
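For instance, a KernelSpecs entry for an R kernel might look like the following sketch (the display name shown here is our own illustration):

```json
"KernelSpecs": [
    {
        "Name": "ir",
        "DisplayName": "R"
    }
]
```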
- Create a file named default-user-settings.json with the following content. If you're adding multiple custom images, just add to the list of CustomImages.

{
    "DefaultUserSettings": {
        "KernelGatewayAppSettings": {
            "CustomImages": [
                {
                    "ImageName": "tf2kernel",
                    "AppImageConfigName": "custom-tf2"
                }
            ]
        }
    }
}
Creating and attaching the image to your Studio domain
If you have an existing domain, you simply need to update the domain with the new image. In this section, we demonstrate how existing Studio users can attach images. For instructions on onboarding a new user, see Onboard to Amazon SageMaker Studio Using IAM.
First, we use the SageMaker Studio Docker build CLI to build and push the Dockerfile to Amazon ECR. Note that you can use other methods to push containers to Amazon ECR, such as your local Docker client and the AWS CLI.
- Log in to Studio using your user profile.
- Upload your Dockerfile and any other code or dependencies you wish to copy into your container to your Studio domain.
- Navigate to the folder containing the Dockerfile.
- In a terminal window or notebook cell, install the CLI:
!pip install sagemaker-studio-image-build
- Export a variable called IMAGE_NAME, set it to the value you specified in default-user-settings.json, and run the build:
sm-docker build . --repository smstudio-custom:IMAGE_NAME
- If you wish to use a different repository, replace smstudio-custom in the preceding code with your repo name.
SageMaker Studio builds the Docker image for you and pushes the image to Amazon ECR in a repository named smstudio-custom, tagged with the appropriate image name. To customize this further, such as providing a detailed file path or other options, see Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks. For the pip command above to work in a private VPC environment, you need a route to the internet or access to this package in your private repository.
- In the installation folder from earlier, create a new file called create-and-update-image.sh:

ACCOUNT_ID=AWS ACCT ID # Replace with your AWS account ID
REGION=us-east-2 # Replace with your Region
DOMAINID=d-####### # Replace with your SageMaker Studio domain ID
IMAGE_NAME=tf2kernel # Replace with your image name
ROLE_ARN='The Execution Role ARN for the execution role you want to use'

# Create a SageMaker image with the image in Amazon ECR (modify the image name as required)
aws --region ${REGION} sagemaker create-image --image-name ${IMAGE_NAME} --role-arn ${ROLE_ARN}

aws --region ${REGION} sagemaker create-image-version --image-name ${IMAGE_NAME} --base-image "${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/smstudio-custom:${IMAGE_NAME}"

# Create the AppImageConfig for this image (modify AppImageConfigName and KernelSpecs in app-image-config-input.json as needed)
aws --region ${REGION} sagemaker create-app-image-config --cli-input-json file://app-image-config-input.json

# Update the domain, providing the image and AppImageConfig
aws --region ${REGION} sagemaker update-domain --domain-id ${DOMAINID} --cli-input-json file://default-user-settings.json
Refer to the AWS CLI documentation to read more about the arguments you can pass to the create-image API. To check the status, navigate to the Amazon SageMaker console and choose Amazon SageMaker Studio from the navigation pane.
Attaching images using the Studio UI
You can perform the final step of attaching the image to the Studio domain via the UI. In this case, the UI handles creating the SageMaker image and image version for you.
- On the Amazon SageMaker console, choose Amazon SageMaker Studio.
On the Control Panel page, you can see that the Studio domain was provisioned, along with any user profiles that you created.
- Choose Attach image.
- Select whether you wish to attach a new or pre-existing image.
- If you select Existing image, choose an image from the Amazon SageMaker image store.
- If you select New image, provide the Amazon ECR registry path for your Docker image. The path needs to be in the same Region as the Studio domain. The ECR repo also needs to be in the same account as your Studio domain, or cross-account permissions for Studio need to be enabled.
- Choose Next.
- For Image name, enter a name.
- For Image display name, enter a descriptive name.
- For Description, enter a description of the image.
- For IAM role, choose the IAM role required by Amazon SageMaker to attach Amazon ECR images to Amazon SageMaker images on your behalf.
- Additionally, you can tag your image.
- Choose Next.
- For Kernel name, enter Python 3.
- Choose Submit.
The green check box indicates that the image has been successfully attached to the domain.
The Amazon SageMaker image store automatically versions your images. You can select a pre-attached image and choose Detach to detach the image and all versions, or choose Attach image to attach a new version. There is no limit on the number of versions per image, and you can detach images at any time.
User experience with a custom image
Let’s now jump into the user experience for a Studio user.
- Log in to Studio using your user profile.
- To launch a new activity, choose Launcher.
- For Select a SageMaker image to launch your activity, choose tf2kernel.
- Choose the Notebook icon to open a new notebook with the custom kernel.
The notebook kernel takes a couple of minutes to spin up, and then you're ready to go!
Testing your custom container in the notebook
When the kernel is up and running, you can run code in the notebook. First, let's test that the version of TensorFlow specified in the Dockerfile is available for use. In the following screenshot, we can see that the notebook is using the tf2kernel we just launched.
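In the notebook you would simply print tf.__version__ and confirm it matches the base image. If you want to assert the version programmatically, a small stdlib-only comparison helper works; at_least below is our own illustration, not a SageMaker or TensorFlow API:

```python
def at_least(installed, required):
    """Return True if a dotted version string meets a minimum version."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) >= to_tuple(required)

# Inside the Studio notebook, the check itself would be:
#   import tensorflow as tf
#   assert at_least(tf.__version__, "2.3.0")
print(at_least("2.3.0", "2.3.0"))  # True
print(at_least("2.2.1", "2.3.0"))  # False
```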
Amazon SageMaker notebooks also display the local CPU and memory usage.
Next, let’s try out the custom training script directly in the notebook. Copy the training script into a notebook cell and run it. The script downloads the mnist dataset from the tf.keras.datasets utility, splits the data into training and test sets, defines a custom deep neural network algorithm, trains the algorithm on the training data, and tests the algorithm on the test dataset.
To experiment with the TensorFlow 2.3 framework, you may wish to test out newly released APIs, such as the newer feature preprocessing utilities in Keras. In the following screenshot, we import the keras.layers.experimental library released with TensorFlow 2.3, which contains newer APIs for data preprocessing. We load one of these APIs and re-run the script in the notebook.
Amazon SageMaker also dynamically modifies the CPU and memory usage as the code runs. By bringing your custom container and training scripts, this feature allows you to experiment with custom training scripts and algorithms directly in the Amazon SageMaker notebook. When you’re satisfied with the experimentation in the Studio notebook, you can start a training job.
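Moving from notebook experimentation to a managed training job typically means handing the same train.py to a SageMaker estimator. The sketch below shows the arguments involved, built as plain data so it runs without AWS credentials; the role ARN and instance type are placeholders, and the actual call (commented out) uses the SageMaker Python SDK:

```python
# Arguments that would be passed to sagemaker.tensorflow.TensorFlow(...).
# Role ARN and instance type below are placeholders for illustration.
estimator_args = {
    "entry_point": "train.py",
    "role": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    "instance_count": 1,
    "instance_type": "ml.m5.xlarge",
    "framework_version": "2.3.0",
    "py_version": "py37",
}

# With credentials and the SageMaker Python SDK installed, you would run:
#   from sagemaker.tensorflow import TensorFlow
#   TensorFlow(**estimator_args).fit()
print(estimator_args["framework_version"])
```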
What about the Python files or custom files you included with the Dockerfile using the COPY command? SageMaker Studio mounts the Amazon EFS file system at the path provided in app-image-config-input.json, which we set to /root/data. To prevent Studio from overwriting any custom files you want to include, the COPY command loads the train.py file into the path /root. To access this file, open a terminal or notebook and run the code:
! cat /root/train.py
You should see an output as shown in the screenshot below.
The train.py file is in the specified location.
Logging to CloudWatch
SageMaker Studio also publishes kernel metrics to Amazon CloudWatch, which you can use for troubleshooting. The metrics are captured under the /aws/sagemaker/studio namespace.
To access the logs, on the CloudWatch console, choose CloudWatch Logs. On the Log groups page, enter the namespace to see logs associated with the Jupyter server and the kernel gateway.
Detaching an image or version
You can detach an image or an image version from the domain if it’s no longer supported.
To detach an image and all versions, select the image from the Custom images attached to domain table and choose Detach.
You have the option to also delete the image and all versions, which doesn’t affect the image in Amazon ECR.
To detach an image version, choose the image. On the Image details page, select the image version (or multiple versions) from the Image versions attached to domain table and choose Detach. You see a similar warning and options as in the preceding flow.
Conclusion
SageMaker Studio enables you to collaborate, experiment, train, and deploy ML models in a streamlined manner. To do so, data scientists often require access to the newest ML frameworks, custom scripts, and packages from public and private code repositories and package management tools. You can now create custom images containing all the relevant code, and launch these using Studio notebooks. These images will be available to all users in the Studio domain. You can also use this feature to experiment with other popular languages and runtimes besides Python, such as R, Julia, and Scala. The sample files are available on the GitHub repo. For more information about this feature, see Bring your own SageMaker image.
About the Authors
Stefan Natu is a Sr. Machine Learning Specialist at AWS. He is focused on helping financial services customers build end-to-end machine learning solutions on AWS. In his spare time, he enjoys reading machine learning blogs, playing the guitar, and exploring the food scene in New York City.
Jaipreet Singh is a Senior Software Engineer on the Amazon SageMaker Studio team. He has been working on Amazon SageMaker since its inception in 2017 and has contributed to various Project Jupyter open-source projects. In his spare time, he enjoys hiking and skiing in the Pacific Northwest.
Huong Nguyen is a Sr. Product Manager at AWS. She is leading the user experience for SageMaker Studio. She has 13 years’ experience creating customer-obsessed and data-driven products for both enterprise and consumer spaces. In her spare time, she enjoys reading, being in nature, and spending time with her family.
Amazon Alexa’s new wake word research at Interspeech
Work aims to improve accuracy of models both on- and off-device.
Amazon Translate now enables you to mark content to not get translated
While performing machine translations, you may have situations where you wish to preserve specific sections of text from being translated, such as names, unique identifiers, or codes. We at the Amazon Translate team are excited to announce a tag modification that allows you to specify what text should not be translated. This feature is available in both the real-time TranslateText API and the asynchronous batch TextTranslation API. You can tag segments of text that you don’t want to translate in an HTML element. In this post, we walk through the step-by-step method to use this feature.
Using the translate-text operation in the AWS Command Line Interface
The following example shows you how to use the translate-text operation from the command line. This example is formatted for Unix, Linux, and macOS. For Windows, replace the backslash (\) Unix continuation character at the end of each line with a caret (^). At the command line, enter the following code:
aws translate translate-text \
--source-language-code "en" \
--target-language-code "es" \
--region us-west-2 \
--text "This can be translated to any language. <p translate=no>But do not translate this!</p>"
You can specify any type of HTML element to do so, for example, paragraph <p>, text section <span>, or block section <div>. When you run the command, you get the following output:
{
"TranslatedText": "Esto se puede traducir a cualquier idioma. <p translate=no>But do not translate this!</p>",
"SourceLanguageCode": "en",
"TargetLanguageCode": "es"
}
Using the span tag in Amazon Translate Console
In this example, we translate the following text from French to English:
Musée du Louvre, c’est ainsi que vous dites Musée du Louvre en français.
You don’t want to translate the first instance of “Musée du Louvre,” but you do want to translate the second instance to “Louvre Museum.” You can tag the first instance using a simple span tag:
<span translate=no>Musée du Louvre</span>, c'est ainsi que vous dites Musée du Louvre en français.
The following screenshot shows the output on the Amazon Translate console.
The following screenshot shows the output translated to Arabic.
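If you pre-process text programmatically before sending it to Amazon Translate, a small helper can wrap the phrases you want preserved. The helper name and behavior below are our own illustration, not part of the Translate API; only the commented-out boto3 call reflects the actual service interface:

```python
def mask_no_translate(text, phrases):
    """Wrap the first occurrence of each phrase in a translate=no span
    so Amazon Translate leaves it untouched."""
    for phrase in phrases:
        text = text.replace(phrase, f'<span translate=no>{phrase}</span>', 1)
    return text

source = ("Musée du Louvre, c'est ainsi que vous dites "
          "Musée du Louvre en français.")
masked = mask_no_translate(source, ["Musée du Louvre"])
print(masked)

# The masked string would then be passed as the Text parameter, for example:
#   import boto3
#   boto3.client('translate').translate_text(
#       Text=masked, SourceLanguageCode='fr', TargetLanguageCode='en')
```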
Conclusion
In this post, we showed you how to tag and specify text that should not be translated. For more information, see the Amazon Translate Developer Guide and Amazon Translate resources. If you’re new to Amazon Translate, try it out using our Free Tier, which offers 2 million characters per month for free for the first 12 months, starting from your first translation request.
About the Author
Watson G. Srivathsan is the Sr. Product Manager for Amazon Translate, AWS’s natural language processing service. On weekends you will find him exploring the outdoors in the Pacific Northwest.
Intelligently connect to customers using machine learning in the COVID-19 pandemic
The pandemic has changed how people interact, how we receive information, and how we get help. It has shifted much of what used to happen in-person to online. Many of our customers are using machine learning (ML) technology to facilitate that transition, from new remote cloud contact centers, to chatbots, to more personalized engagements online. Scale and speed are important in the pandemic—whether it’s processing grant applications or limiting call wait times for customers. ML tools like Amazon Lex and Amazon Connect are just a few of the solutions helping to power this change with speed, scale, and accuracy. In this post, we explore companies who have quickly pivoted to take advantage of AI capabilities to engage more effectively online and deliver immediate impact.
Chatbots connect governments and their citizens
GovChat is South Africa’s largest citizen engagement platform, connecting over 50 million citizens to 10,000 public representatives in the government. Information flowing to and from the government has gained a new level of urgency, and this connection between citizens and the government is critical in how we adjust and respond to the pandemic. GovChat exists to meet that demand—working directly with the South African government to facilitate the digitization of their COVID-19 social relief grants, help citizens find their closest COVID-19 testing facility, and enable educational institutions to reopen safely.
GovChat uses a chatbot powered by Amazon Lex, a managed AI service for building conversational interfaces into any application using voice and text. The chatbot, available on popular social media platforms such as WhatsApp and Facebook, provides seamless communication between the government and its citizens.
At the beginning of the pandemic, GovChat worked with the South African Social Security Agency to digitize, facilitate, and track applications for a COVID-19 social relief grant. The plan was to create a chatbot that could help citizens easily file and track their grant applications. GovChat needed to act quickly and provide an infrastructure that could rapidly scale to support the unprecedented demand for government aid. To provide speed of delivery and scalability while keeping costs down, GovChat turned to Amazon Lex for voice and text conversational interfaces and AWS Lambda, a serverless compute service. Within days, the chatbot was handling up to 14.2 million messages a day across social media platforms in South Africa regarding the social relief grant.
More recently, the South African Human Rights Commission (SAHRC) turned to GovChat to help gauge schools’ readiness to reopen safely. Parents, students, teachers, and community members can use their mobile devices to provide first-hand, real-time details of their school’s COVID-19 safety checks and readiness as contact learning is resumed, with special attention paid to children with disabilities. In GovChat’s engagements during the COVID-19 pandemic, they found that 28% of service requests at schools have been in relation to a disruption in access to water, which is critical for effective handwashing—a preventative component to fight the spread of the virus. With the real-time data provided by citizens via the chatbot, the government was able to better understand the challenges schools faced and identify areas of improvement. GovChat has processed over 250 million messages through their platform, playing an important role in enabling more effective and timely communications between citizens and their government.
ML helps power remote call centers
Organizations of all kinds have also experienced a rapid increase in call volume to their call centers—from local government, to retail, to telecommunications, to healthcare providers. Organizations have also had to quickly shift to a remote work environment in response to the pandemic. Origin Energy, one of Australia’s largest integrated energy companies serving over 4 million customer accounts, launched an Amazon Connect contact center in March as part of their customer experience transformation. Amazon Connect is an omnichannel cloud contact center with AI/ML capabilities that understands context and can transcribe conversations.
This transition to Amazon Connect accelerated Origin’s move to remote working during the COVID-19 pandemic. This allowed their agents to continue to serve their customers, while also providing increased self-service and automation options such as bill payments, account maintenance, and plan renewals to customers. They deployed new AI/ML capabilities, including neural text-to-speech through Amazon Polly. Since the March 2020 launch, they’ve observed an increase in call quality scores, improved customer satisfaction, and agent productivity—all while managing up to 1,200 calls at a time. They’re now looking to further leverage natural language understanding with Amazon Lex and automated quality management with built-in speech-to-text and sentiment analysis from Contact Lens for Amazon Connect. Amazon Connect has supported Origin in their efforts to respond rapidly to opportunities and customer feedback as they focus on continually improving their customer experience with affordable, reliable, and sustainable energy.
Conclusion
Organizations are employing creative strategies to engage their customers and provide a more seamless experience. This is a two-way street; not only can organizations more effectively distribute key information, but—more importantly—they can listen. They can hear the evolving needs of their customers and adjust in real time to meet them.
To learn about another way AWS is working toward solutions from the COVID-19 pandemic, check out the blog article Introducing the COVID-19 Simulator and Machine Learning Toolkit for Predicting COVID-19 Spread.
About the Author
Taha A. Kass-Hout, MD, MS, is director of machine learning and chief medical officer at Amazon Web Services (AWS). Taha received his medical training at Beth Israel Deaconess Medical Center, Harvard Medical School, and during his time there, was part of the BOAT clinical trial. He holds a doctor of medicine and master’s of science (bioinformatics) from the University of Texas Health Science Center at Houston.
Announcing the AWS DeepComposer Chartbusters challenge, Keep Calm and Model On
We are back with another AWS DeepComposer Chartbusters challenge, Keep Calm and Model On! This challenge launches today and is open for submissions throughout AWS re:Invent until January 31, 2021. In this challenge, you can experiment with our newly launched Transformers algorithm and generate an original piece of music. Chartbusters is a global monthly challenge where you can use AWS DeepComposer to create compositions on the console using generative AI techniques, compete to top the charts, and win prizes.
Participation is easy: in 5 simple steps, you can generate your original piece of music on the AWS DeepComposer console using the Transformers technique. You can then use Edit melody to add or remove notes.
How to Compete
To participate in Keep Calm and Model On, just do the following:
- Go to AWS DeepComposer Music Studio and choose a sample melody recommended for Transformers.
- Under Generative AI technique, choose Transformers. Then choose TransformerXLClassical for Model. You then have six advanced parameters that you can adjust, including Sampling technique, which defines how the next note for your input melody is chosen, and Track extension duration, which attempts to add additional seconds to your melody. After adjusting the values, choose Extend input melody.
- Use Edit melody to add or remove notes. You can also change the note duration and pitch. When finished, choose Apply changes. Repeat these steps until you’re satisfied with the generated music.
- When you’re happy with your composition, choose Download composition. You can choose to post-process your composition; however, one of the judging criteria is how close your final submission is to the track generated using AWS DeepComposer.
- In the navigation panel, choose Chartbusters, and on the Chartbusters page, choose Submit a composition. Then choose Import a post-processed audio track, upload your composition, provide a track name for your composition, and choose Submit.
AWS DeepComposer then submits your composition to the Keep Calm and Model On playlist on SoundCloud.
Congratulations!
You’ve successfully submitted your composition to the AWS DeepComposer Chartbusters challenge Keep Calm and Model On. Now you can invite your friends and family to listen to your creation on SoundCloud and join the fun by participating in the competition.
About the Author
Maryam Rezapoor is a Senior Product Manager with AWS AI Devices team. As a former biomedical researcher and entrepreneur, she finds her passion in working backward from customers’ needs to create new impactful solutions. Outside of work, she enjoys hiking, photography, and gardening.
Deploying reinforcement learning in production using Ray and Amazon SageMaker
Reinforcement learning (RL) is used to automate decision-making in a variety of domains, including games, autoscaling, finance, robotics, recommendations, and supply chain. Launched at AWS re:Invent 2018, Amazon SageMaker RL helps you quickly build, train, and deploy policies learned by RL. Ray is an open-source distributed execution framework that makes it easy to scale your Python applications. Amazon SageMaker RL uses the RLlib library that builds on the Ray framework to train RL policies.
This post walks you through the tools available in Ray and Amazon SageMaker RL that help you address challenges such as scale, security, iterative development, and operational cost when you use RL in production. For a primer on RL, see Amazon SageMaker RL – Managed Reinforcement Learning with Amazon SageMaker.
Use case
In this post, we take a simple supply chain use case, in which you’re deciding how many basketballs to buy for a store to meet customer demand. The agent decides how many basketballs to buy every day, and it takes 5 days to ship the product to the store. The use case is a multi-period variant of the classic newsvendor problem. Newspapers lose value in a single day, and therefore, each agent decision is independent. Because the basketball remains valuable as long as it’s in the store, the agent has to optimize its decisions over a sequence of steps. Customer demand has inherent uncertainty, so the agent needs to balance the trade-off between ordering too many basketballs that may not sell and incur storage cost versus buying too few, which can lead to unsatisfied customers. The objective of the agent is to maximize the sales while minimizing storage costs and customer dissatisfaction. We refer to the agent as the basketball agent in the rest of the post.
You need to address the following challenges to train and deploy the basketball agent in production:
- Formulate the problem. Determine the state, action, rewards, and state transition of the problem. Create a simulation environment that captures the problem formulation.
- Train the agent. Training with RL requires precise algorithm implementations because minor errors can lead to poor performance. Training can require millions of interactions between the agent and the environment before the policy converges. Therefore, distributed training becomes necessary to reduce training times. You need to make various choices while training the agent: picking the state representation, the reward function, the algorithm to use, the neural network architecture, and the hyperparameters of the algorithm. It becomes quickly overwhelming to navigate these options, experiment at scale, and finalize the policy to use in production.
- Deploy and monitor policy. After you train the policy and evaluate its performance offline, you can deploy it in production. When deploying the policy, you need to ensure the policy behaves as expected and scales to the workload in production. You can perform A/B testing, continually deploy improved versions of the model, and look out for anomalous behavior. Developing and maintaining the deployment infrastructure in a secure, scalable, and cost-effective manner can be an onerous task.
Amazon SageMaker RL and Ray take care of the undifferentiated heavy lifting required to deploy RL at scale, which helps you focus on the problem at hand. You can provision resources with a single click, use algorithm implementations that efficiently utilize the provisioned resources, and track, visualize, debug, and replicate your experiments. You can deploy the model with a managed microservice that autoscales and logs model actions. The rest of the post walks you through these steps with the basketball agent as our use case.
Formulating the problem
For the basketball problem, our state includes the expected customer demand, the current inventory in the store, the inventory on the way, and the cost of purchasing and storing the basketballs. The action is the number of basketballs to order. The reward is the net profit with a penalty for missed demand. This post includes code for a simulator that encodes the problem formulation using the de-facto standard Gym API. We assume a Poisson demand profile. You can use historical customer demand data in the simulator to capture real-world characteristics.
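The following sketch shows the shape such a simulator could take. It mirrors the Gym API (`reset()` and `step()`) without depending on the gym package, and the state transition, costs, and Poisson sampler are simplified placeholders rather than the simulator shipped with the post:

```python
import math
import random

def sample_poisson(lam):
    # Knuth's method for sampling a Poisson random variate
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

class BasketballEnv:
    """Multi-period newsvendor environment following the Gym interface:
    reset() -> observation, step(action) -> (observation, reward, done, info)."""

    def __init__(self, mean_demand=5.0, lead_time=5, price=5.0, cost=2.0,
                 holding_cost=0.5, missed_penalty=1.0, horizon=40):
        self.mean_demand, self.lead_time, self.horizon = mean_demand, lead_time, horizon
        self.price, self.cost = price, cost
        self.holding_cost, self.missed_penalty = holding_cost, missed_penalty

    def reset(self):
        self.t = 0
        self.inventory = 0
        self.in_transit = [0] * self.lead_time  # orders still being shipped
        return self._observation()

    def step(self, action):
        # The order placed today arrives after lead_time days
        self.in_transit.append(int(action))
        self.inventory += self.in_transit.pop(0)
        demand = sample_poisson(self.mean_demand)
        sold = min(demand, self.inventory)
        self.inventory -= sold
        # Net profit, minus holding cost and a penalty for missed demand
        reward = (self.price * sold - self.cost * action
                  - self.holding_cost * self.inventory
                  - self.missed_penalty * (demand - sold))
        self.t += 1
        return self._observation(), reward, self.t >= self.horizon, {}

    def _observation(self):
        # expected demand, on-hand inventory, and inventory on the way
        return ([self.mean_demand, float(self.inventory)]
                + [float(x) for x in self.in_transit])
```

A registered Gym environment with the same `reset`/`step` contract is what RLlib consumes during training.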
You can create a simulator in different domains, such as financial portfolio management, autoscaling, and multi-player gaming. Amazon SageMaker RL notebooks provide examples of simulators with custom packages and datasets.
Training your agent
You can start training your agent with RL using state-of-the-art algorithms available in Ray. The algorithms have been tested on well-known published benchmarks, and you can customize them to specify your loss function, neural network architecture, logging metrics, and more. You can choose between the TensorFlow and PyTorch deep learning frameworks.
Amazon SageMaker RL makes it easy to get started with Ray through a managed training experience. You can launch experiments on secure, up-to-date instances with pre-built Ray containers using familiar Jupyter notebooks. You pay for storage and instances based on your usage, with no minimum fees or upfront commitments, which keeps costs down.
The following code shows the configuration for training the basketball agent using the proximal policy optimization (PPO) algorithm with a single instance:
def get_experiment_config(self):
    return {
        "training": {
            "env": "Basketball-v1",
            "run": "PPO",
            "config": {
                "lr": 0.0001,
                "num_workers": (self.num_cpus - 1),
                "num_gpus": self.num_gpus,
            },
        }
    }
To train a policy with Amazon SageMaker RL, you start a training job. You can save up to 90% on your training cost by using managed spot training, which uses Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances instead of On-Demand Instances. Just enable train_use_spot_instances and set train_max_wait. Amazon SageMaker restarts your training jobs if a Spot Instance is interrupted, and you can configure managed spot training jobs to use periodically saved checkpoints. For more information about using Spot Instances, see Managed Spot Training: Save Up to 90% On Your Amazon SageMaker Training Jobs. The following code shows how you can launch the training job using Amazon SageMaker RL APIs:
estimator = RLEstimator(base_job_name='basketball',
                        entry_point="train_basketball.py",
                        image_name=ray_tf_image,
                        train_instance_type='ml.m5.large',
                        train_instance_count=1,
                        train_use_spot_instances=True,  # use Spot Instances
                        train_max_wait=7200,  # seconds
                        checkpoint_s3_uri=checkpoint_s3_uri,  # S3 location for checkpoint syncing
                        hyperparameters={
                            # necessary for syncing between Spot Instances
                            "rl.training.upload_dir": checkpoint_s3_uri,
                        },
                        ...)
estimator.fit()
Amazon SageMaker RL saves the metadata associated with each training job, such as the instance used, the source code, and the metrics logged. The print logs are saved in Amazon CloudWatch, the training outputs are saved in Amazon Simple Storage Service (Amazon S3), and you can replicate each training job with a single click. The Amazon SageMaker RL training job console visualizes the instance resource use and training metrics, such as episode reward and policy loss.
The following example visualization shows the mean episode rewards, policy entropy, and policy loss over the training time. As the agent learns to take better actions, its rewards improve. The entropy indicates the randomness of the actions taken by the agent. Initially, the agent takes random actions to explore the state space, and the entropy is high. As the agent improves, its randomness and entropy reduce. The policy loss indicates the value of the loss function used by the RL algorithm to update the policy neural network; for the PPO algorithm we use, it should remain close to zero during training.
Ray also creates a TensorBoard with training metrics and saves them to Amazon S3. The following visualizations show the same metrics in TensorBoard.
Ray is designed for distributed runs, so it can efficiently use all the resources available in an instance: CPU, GPU, and memory. You can scale the training further to a multi-instance cluster by incrementing train_instance_count in the preceding API call. Amazon SageMaker RL creates the cluster for you, and Ray uses the cluster resources to train the agent rapidly.
You can choose to create heterogeneous clusters with multiple instance types (for more information, see the following GitHub repo). For more information about distributed RL training, see Scaling your AI-powered Battlesnake with distributed reinforcement learning in Amazon SageMaker.
You can scale your experiments by creating training jobs with different configurations of state representation, reward function, RL algorithms, hyperparameters, and more. Amazon SageMaker RL helps you organize, track, compare, and evaluate your training jobs with Amazon SageMaker Experiments. You can search, sort by performance metrics, and track the lineage of a policy when you deploy in production. The following code shows an example of experiments with multiple learning rates for training a policy, and sorting by the mean episode rewards:
# define a SageMaker Experiment
rl_exp = Experiment.create(experiment_name="learning_rate_exp", ...)

# define first trial
first_trial = Trial.create(trial_name="lr-1e-3",
                           experiment_name=rl_exp.experiment_name, ...)
estimator_1 = RLEstimator(...)
estimator_1.fit(experiment_config={"TrialName": first_trial.trial_name, ...})

# define second trial
second_trial = Trial.create(trial_name="lr-1e-4",
                            experiment_name=rl_exp.experiment_name, ...)
estimator_2 = RLEstimator(...)
estimator_2.fit(experiment_config={"TrialName": second_trial.trial_name, ...})

# define third trial
third_trial = Trial.create(trial_name="lr-1e-5",
                           experiment_name=rl_exp.experiment_name, ...)
estimator_3 = RLEstimator(...)
estimator_3.fit(experiment_config={"TrialName": third_trial.trial_name, ...})

# get trials sorted by mean episode rewards
trial_component_analytics = ExperimentAnalytics(
    experiment_name=rl_exp.experiment_name,
    sort_by="metrics.episode_reward_mean.Avg",
    sort_order="Descending", ...).dataframe()
The following screenshot shows the output.
The following screenshot shows that we saved 60% of the training cost by using a Spot Instance.
Deploying and monitoring the policy
After you train the RL policy, you can export the learned policy for evaluation and deployment. You can evaluate the learned policy against realistic scenarios expected in production, and ensure its behavior matches the expectation from domain expertise. You can deploy the policy in an Amazon SageMaker endpoint, to edge devices using AWS IoT Greengrass (see Training the Amazon SageMaker object detection model and running it on AWS IoT Greengrass – Part 3 of 3: Deploying to the edge), or natively in your production system.
Amazon SageMaker endpoints are fully managed. You deploy the policy with a single API call, and the required instances and load balancer are created behind a secure HTTP endpoint. The Amazon SageMaker endpoint autoscales the resources so that latency and throughput requirements are met with changing traffic patterns while incurring minimal cost.
With an Amazon SageMaker endpoint, you can check the policy performance in the production environment by A/B testing the policy against the existing model in production. You can log the decisions taken by the policy in Amazon S3 and check for anomalous behavior using Amazon SageMaker Model Monitor. You can use the resulting dataset to train an improved policy. If you have multiple policies, each for a different brand of basketball sold in the store, you can save on costs by deploying all the models behind a multi-model endpoint.
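A/B testing on an Amazon SageMaker endpoint is configured through production variants with traffic weights; each variant receives a share of requests proportional to its weight. The sketch below builds the arguments for create_endpoint_config; the variant names, weights, and instance type are illustrative:

```python
def ab_endpoint_config(config_name, current_model, candidate_model,
                       current_weight=9.0, candidate_weight=1.0,
                       instance_type="ml.c5.xlarge"):
    """Endpoint config that routes traffic between two model variants in
    proportion to their weights (here a 90%/10% split)."""
    return {
        "EndpointConfigName": config_name,
        "ProductionVariants": [
            {"VariantName": "current", "ModelName": current_model,
             "InitialInstanceCount": 1, "InstanceType": instance_type,
             "InitialVariantWeight": current_weight},
            {"VariantName": "candidate", "ModelName": candidate_model,
             "InitialInstanceCount": 1, "InstanceType": instance_type,
             "InitialVariantWeight": candidate_weight},
        ],
    }

def traffic_share(config, variant_name):
    # Fraction of requests a variant receives: its weight over the total
    variants = config["ProductionVariants"]
    total = sum(v["InitialVariantWeight"] for v in variants)
    return next(v["InitialVariantWeight"] for v in variants
                if v["VariantName"] == variant_name) / total

config = ab_endpoint_config("basketball-ab-test", "policy-v1", "policy-v2")
# boto3.client("sagemaker").create_endpoint_config(**config) would create it
```

Shifting traffic toward the new policy is then just a matter of updating the weights.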
The following code shows how to extract the policy from the RLEstimator and deploy it to an Amazon SageMaker endpoint. The endpoint is configured to save all the model inferences using the Model Monitor feature.
endpoint_name = 'basketball-demo-model-monitor-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
prefix = 'sagemaker/basketball-demo-model-monitor'
data_capture_prefix = '{}/datacapture'.format(prefix)
s3_capture_upload_path = 's3://{}/{}'.format(s3_bucket, data_capture_prefix)

model = Model(model_data='s3://{}/{}/output/model.tar.gz'.format(s3_bucket, job_name),
              framework_version='2.1.0',
              role=role)

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri=s3_capture_upload_path)

predictor = model.deploy(initial_instance_count=1,
                         instance_type="ml.c5.xlarge",
                         endpoint_name=endpoint_name,
                         data_capture_config=data_capture_config)

result = predictor.predict({"inputs": ...})
You can verify the configurations on the console. The following screenshot shows the data capture settings.
The following screenshot shows the model’s production variants.
When the endpoint is up, you can quickly trace back to the model trained under the hood. The following code demonstrates how to retrieve job-specific information (such as TrainingJobName, TrainingJobStatus, and TrainingTimeInSeconds) with a few API calls:
#first get the endpoint config for the relevant endpoint
endpoint_config = sm.describe_endpoint_config(EndpointConfigName=endpoint_name)
#now get the model name for the model deployed at the endpoint.
model_name = endpoint_config['ProductionVariants'][0]['ModelName']
#now look up the S3 URI of the model artifacts
model = sm.describe_model(ModelName=model_name)
modelURI = model['PrimaryContainer']['ModelDataUrl']
#search for the training job that created the model artifacts at above S3 URI location
search_params={
"MaxResults": 1,
"Resource": "TrainingJob",
"SearchExpression": {
"Filters": [
{
"Name": "ModelArtifacts.S3ModelArtifacts",
"Operator": "Equals",
"Value": modelURI
}]}
}
results = sm.search(**search_params)
# trace lineage of the underlying training job
results['Results'][0]['TrainingJob'].keys()
The following screenshot shows the output.
When you invoke the endpoint, the request payload, response payload, and additional metadata are saved in the Amazon S3 location that you specified in DataCaptureConfig. You should expect to see different files from different time periods, organized based on the hour when the invocation occurred. The format of the Amazon S3 path is s3://{destination-bucket-prefix}/{endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl.
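As a sketch of that layout, the following helper assembles the key for a capture file from its parts; the endpoint name, variant name, and filename are illustrative:

```python
from datetime import datetime

def capture_s3_key(endpoint_name, variant_name, timestamp, filename):
    """S3 key suffix where Model Monitor places a capture file, following
    the {endpoint-name}/{variant-name}/yyyy/mm/dd/hh/filename.jsonl layout."""
    return "/".join([endpoint_name, variant_name,
                     timestamp.strftime("%Y/%m/%d/%H"), filename])

key = capture_s3_key("basketball-demo", "AllTraffic",
                     datetime(2020, 9, 30, 0, 47), "00-47-14-123.jsonl")
# -> 'basketball-demo/AllTraffic/2020/09/30/00/00-47-14-123.jsonl'
```

Knowing the layout makes it easy to list only the hours you care about when pulling captures for offline analysis.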
The HTTP request and response payload is saved in Amazon S3, where the JSON file is sorted by date. The following screenshot shows the view on the Amazon S3 console.
The following code is a line from the JSON file. With all the captured data, you can closely monitor the endpoint status and perform evaluation when necessary.
{'captureData': {
    'endpointInput': {
        'data': {"inputs": {
                     "observations":
                         [[1307.4744873046875, 737.0364990234375,
                           2065.304931640625, 988.8933715820312,
                           357.6395568847656, 41.90699768066406,
                           60.84299850463867, 4.65033483505249,
                           5.944803237915039, 64.77123260498047]],
                     "prev_action": [0],
                     "is_training": false,
                     "prev_reward": -1,
                     "seq_lens": -1}},
        'encoding': 'JSON',
        'mode': 'INPUT',
        'observedContentType': 'application/json'},
    'endpointOutput': {
        'data': {
            "outputs": {
                "action_logp": [0.862621188],
                "action_prob": [2.36936307],
                "actions_0": [[-0.267252982]],
                "vf_preds": [0.00718466379],
                "action_dist_inputs": [[-0.364359707, -2.08935]]
            }
        },
        'encoding': 'JSON',
        'mode': 'OUTPUT',
        'observedContentType': 'application/json'}},
 'eventMetadata': {
     'eventId': '0ad69e2f-c1b1-47e4-8334-47750c3cd504',
     'inferenceTime': '2020-09-30T00:47:14Z'
 },
 'eventVersion': '0'}
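Given records of this shape, a short helper can recover the observation the endpoint saw and the action it returned, which is useful for offline evaluation. This is a sketch assuming the record structure shown above; the field names follow the sample record:

```python
def parse_capture_record(record):
    """Extract the model input observations and output action from one
    Model Monitor capture record (structure as in the sample above)."""
    inputs = record["captureData"]["endpointInput"]["data"]["inputs"]
    outputs = record["captureData"]["endpointOutput"]["data"]["outputs"]
    return {
        "observations": inputs["observations"],
        "action": outputs["actions_0"],
        "inference_time": record["eventMetadata"]["inferenceTime"],
    }

# A trimmed-down record in the captured shape
record = {
    "captureData": {
        "endpointInput": {"data": {"inputs": {
            "observations": [[1307.47, 737.03]], "prev_action": [0]}}},
        "endpointOutput": {"data": {"outputs": {
            "actions_0": [[-0.267252982]], "vf_preds": [0.00718466379]}}},
    },
    "eventMetadata": {"inferenceTime": "2020-09-30T00:47:14Z"},
}
parsed = parse_capture_record(record)
```

Accumulating these parsed pairs over time gives you the dataset mentioned earlier for training an improved policy.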
Conclusion
With Ray and Amazon SageMaker RL, you can get started on reinforcement learning quickly and scale to production workloads. The total cost of ownership of Amazon SageMaker over a 3-year horizon is reduced by over 54% compared to other cloud options, and developers can be up to 10 times more productive.
The post just scratches the surface of what you can do with Amazon SageMaker RL. Give it a try and please send us feedback, either in the Amazon SageMaker Discussion Forum or through your usual AWS contacts.
About the Author
Bharathan Balaji is a Research Scientist in AWS and his research interests lie in reinforcement learning systems and applications. He contributed to the launch of Amazon SageMaker RL and AWS DeepRacer. He likes to play badminton, cricket and board games during his spare time.
Anna Luo is an Applied Scientist in AWS. She obtained her Ph.D. in Statistics from UC Santa Barbara. Her interests lie in large-scale reinforcement learning algorithms and distributed computing. Her current personal goal is to master snowboarding.
Adding custom data sources to Amazon Kendra
Amazon Kendra is a highly accurate and easy-to-use intelligent search service powered by machine learning (ML). Amazon Kendra provides native connectors for popular data sources like Amazon Simple Storage Service (Amazon S3), SharePoint, ServiceNow, OneDrive, Salesforce, and Confluence so you can easily add data from different content repositories and file systems into a centralized location. This enables you to use Kendra’s natural language search capabilities to quickly find the most relevant answers to your questions.
However, many organizations store relevant information in the form of unstructured data on company intranets or within file systems on corporate networks that are inaccessible to Amazon Kendra.
You can now use the custom data source feature in Amazon Kendra to upload content to your Amazon Kendra index from a wider range of data sources. When you select a connector type, the custom data source feature gives complete control over how documents are selected and indexed, and provides visibility and metrics on which content associated with a data source has been added, modified, or deleted.
In this post, we describe how to use a simple web connector to scrape content from unauthenticated webpages, capture attributes, and ingest this content into an Amazon Kendra index using the custom data source feature. This enables you to ingest your content directly to the index using the BatchPutDocument API, and allows you to keep track of the ingestion through Amazon CloudWatch log streams and through the metrics from the data sync operation.
Setting up a web connector
To use the custom data source connector in Amazon Kendra, you need to create an application that scrapes the documents in your repository and builds a list of documents. You ingest those documents into your Amazon Kendra index by using the BatchPutDocument operation. To delete documents, you provide a list of document IDs and use the BatchDeleteDocument operation. To modify a document (for example, because it was updated), ingest it again with the same document ID; the document with the matching ID is replaced in your index.
For this post, we scrape HTML content from AWS FAQs for 11 AI/ML services:
- Amazon CodeGuru
- Amazon Comprehend
- Amazon Forecast
- Amazon Kendra
- Amazon Lex
- Amazon Personalize
- Amazon Polly
- Amazon Rekognition
- Amazon SageMaker
- Amazon Transcribe
- Amazon Translate
We use the BeautifulSoup and requests libraries to scrape the content from the AWS FAQ website. The script first gets the content of an AWS FAQ page through the get_soup_from_url function. Based on the presence of certain CSS classes, it locates question and answer pairs, and for each URL it creates a text file to be later ingested into Amazon Kendra.
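The post's script uses BeautifulSoup and requests; as a dependency-free illustration of the same idea, the sketch below pairs up questions and answers with the standard library's html.parser, keying off hypothetical faq-question / faq-answer CSS classes (the real page's class names differ):

```python
from html.parser import HTMLParser

class FAQParser(HTMLParser):
    """Collect (question, answer) pairs from elements tagged with the
    hypothetical CSS classes 'faq-question' and 'faq-answer'."""

    def __init__(self):
        super().__init__()
        self._mode = None
        self.questions, self.answers = [], []

    def handle_starttag(self, tag, attrs):
        # Decide whether the upcoming text is a question or an answer
        cls = dict(attrs).get("class") or ""
        if "faq-question" in cls:
            self._mode = "q"
        elif "faq-answer" in cls:
            self._mode = "a"

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._mode == "q":
            self.questions.append(text)
        elif self._mode == "a":
            self.answers.append(text)
        self._mode = None

    def pairs(self):
        return list(zip(self.questions, self.answers))

parser = FAQParser()
parser.feed('<div class="faq-question">What is Amazon Kendra?</div>'
            '<div class="faq-answer">An intelligent search service.</div>')
```

Each (question, answer) pair would then be written to its own text file for ingestion.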
The solution in this post is for demonstration purposes only. We recommend running similar scripts only on your own websites after consulting with the team that manages them, or making sure you follow the terms of service of the website you’re trying to scrape.
The following screenshot shows a sample of the script.
The following screenshot shows the results of a sample run.
The ScrapedFAQS.zip file contains the scraped documents.
Creating a custom data source
To ingest documents through the custom data source, you need to first create a data source. The assumption is you already have an Amazon Kendra index in your account. If you don’t, you can create a new index.
Amazon Kendra has two provisioning editions: the Amazon Kendra Developer Edition, recommended for building proof of concepts (POCs), and the Amazon Kendra Enterprise Edition, which provides multi-AZ deployment, making it ideal for production. Amazon Kendra connectors work with both editions.
To create your custom data source, complete the following steps:
- On your index, choose Add data sources.
- For Custom data source connector, choose Add connector.
- For Data source name, enter a name (for example, MyCustomConnector).
- Review the information in the Next steps section.
- Choose Add data source.
Syncing documents using the custom data source
Now that your connector is set up, you can ingest documents into Amazon Kendra using the BatchPutDocument API and get metrics to track the status of the ingestion. The metrics require an ExecutionId, so before running your BatchPutDocument operation, you need to start a data source sync job. When the data sync is complete, you stop the data source sync job.
For this post, you use the latest version of the AWS SDK for Python (Boto3) and ingest 10 documents with the IDs 0–9.
Extract the .zip file containing the scraped content by using any standard file decompression utility. You should have 11 files on your local file system. In a real use case, these files are likely on a shared file server in your data center. When you create a custom data source, you have complete control over how the documents for the index are selected. Amazon Kendra only provides metric information that you can use to monitor the performance of your data source.
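The sync code that follows calls a get_docs helper to assemble the document list for BatchPutDocument. A minimal version could look like the following; here it takes an explicit mapping of document IDs to text (an assumption for illustration; the post reads the extracted files from disk) and attaches the reserved _data_source_id and _data_source_sync_job_execution_id attributes so the ingestion is credited to the sync job:

```python
def get_docs(data_source_id, job_execution_id, files):
    """Build the Documents list for kendra.batch_put_document().
    `files` maps a document ID to its text content."""
    docs = []
    for doc_id, text in files.items():
        docs.append({
            "Id": doc_id,
            "Blob": text.encode("utf-8"),
            "ContentType": "PLAIN_TEXT",
            "Attributes": [
                # Reserved attributes that tie the document to the
                # custom data source and the current sync job
                {"Key": "_data_source_id",
                 "Value": {"StringValue": data_source_id}},
                {"Key": "_data_source_sync_job_execution_id",
                 "Value": {"StringValue": job_execution_id}},
            ],
        })
    return docs
```

Without these attributes, the documents are still indexed, but the sync job metrics won't reflect them.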
To sync your documents, enter the following code:
import boto3

# Index ID
index_id = <YOUR-INDEX-ID>
# Data source ID
data_source_id = <YOUR-DATASOURCE-ID>

kendra = boto3.client('kendra')

# Start a data source sync job
result = kendra.start_data_source_sync_job(
    Id = data_source_id,
    IndexId = index_id
)
print("Start data source sync operation: ")
print(result)

# Obtain the job execution ID from the result
job_execution_id = result['ExecutionId']
print("Job execution ID: " + job_execution_id)

# Start ingesting documents
try:
    # Part of the workflow requires you to have a list with your
    # documents ready for ingestion
    docs = get_docs(data_source_id, job_execution_id)
    # Batch put the documents
    result = kendra.batch_put_document(
        IndexId = index_id,
        Documents = docs
    )
    print("Response from batch_put_document:")
    print(result)
finally:
    # Stop the data source sync job
    result = kendra.stop_data_source_sync_job(
        Id = data_source_id,
        IndexId = index_id
    )
    print("Stop data source sync operation:")
    print(result)
If everything goes well, you see output similar to the following:
Start data source sync operation:
{
'ExecutionId': 'a5ac1ba0-b480-46e3-a718-5fffa5006f1a',
'ResponseMetadata': {
'RequestId': 'a24a2600-0570-4520-8956-d58c8b1ef01c',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': 'a24a2600-0570-4520-8956-d58c8b1ef01c',
'content-type': 'application/x-amz-json-1.1',
'content-length': '54',
'date': 'Mon, 12 Oct 2020 19:55:11 GMT'
},
'RetryAttempts': 0
}
}
Job execution ID: a5ac1ba0-b480-46e3-a718-5fffa5006f1a
Response from batch_put_document:
{
'FailedDocuments': [],
'ResponseMetadata': {
'RequestId': 'fcda5fed-c55c-490b-9867-b45a3eb6a780',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': 'fcda5fed-c55c-490b-9867-b45a3eb6a780',
'content-type': 'application/x-amz-json-1.1',
'content-length': '22',
'date': 'Mon, 12 Oct 2020 19:55:12 GMT'
},
'RetryAttempts': 0
}
}
Stop data source sync operation:
{
'ResponseMetadata': {
'RequestId': '249a382a-7170-49d1-855d-879b5a6f2954',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': '249a382a-7170-49d1-855d-879b5a6f2954',
'content-type': 'application/x-amz-json-1.1',
'content-length': '0',
'date': 'Mon, 12 Oct 2020 19:55:12 GMT'
},
'RetryAttempts': 0
}
}
Allow some time for the sync job to finish, because document ingestion can continue as an asynchronous process after the data source sync process has stopped. The status on the Amazon Kendra console should change from Syncing-indexing to Succeeded when all the documents have been ingested successfully. You can then confirm the count of documents that were ingested successfully and the metrics of the operation on the Amazon Kendra console.
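You can also poll the same information programmatically with kendra.list_data_source_sync_jobs(Id=data_source_id, IndexId=index_id). The sketch below summarizes one entry of the returned History list; the metric field names follow the Kendra API, but treat the exact response shape as an assumption to verify against the current documentation:

```python
def sync_job_summary(history_item):
    """Condense one entry from list_data_source_sync_jobs()['History']
    into the fields you'd watch while a sync job runs."""
    metrics = history_item.get("Metrics", {})
    return {
        "status": history_item["Status"],
        "added": metrics.get("DocumentsAdded", "0"),
        "deleted": metrics.get("DocumentsDeleted", "0"),
        "failed": metrics.get("DocumentsFailed", "0"),
    }

# Example shaped like a list_data_source_sync_jobs History entry
item = {"ExecutionId": "a5ac1ba0", "Status": "SUCCEEDED",
        "Metrics": {"DocumentsAdded": "10", "DocumentsDeleted": "0",
                    "DocumentsFailed": "0"}}
summary = sync_job_summary(item)
```

Polling until the status leaves the syncing states gives you a programmatic equivalent of watching the console.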
Deleting documents from a custom data source
In this section, you explore how to remove documents from your index. You can use the same DataSourceSync job that you used for ingesting the documents. This is useful if you have a changelog of the documents you’re syncing with your Amazon Kendra index, and during your sync job you want to delete documents from your index and also ingest new ones. You can do this by starting the sync job, performing the BatchDeleteDocument operation, performing the BatchPutDocument operation, and stopping the sync job.
For this post, we use a separate data source sync job to remove the documents with IDs 6, 7, and 8. See the following code:
import boto3

# Index ID
index_id = <YOUR-INDEX-ID>
# Data source ID
data_source_id = <YOUR-DATASOURCE-ID>

kendra = boto3.client('kendra')

# Start a data source sync job
result = kendra.start_data_source_sync_job(
    Id = data_source_id,
    IndexId = index_id
)
print("Start data source sync operation: ")
print(result)

job_execution_id = result['ExecutionId']
print("Job execution ID: " + job_execution_id)

try:
    # Add the document IDs you would like to delete
    delete_docs = ["6", "7", "8"]
    # Start the batch delete operation
    result = kendra.batch_delete_document(
        IndexId = index_id,
        DocumentIdList = delete_docs,
        DataSourceSyncJobMetricTarget = {
            "DataSourceSyncJobId": job_execution_id,
            "DataSourceId": data_source_id
        }
    )
    print("Response from batch_delete_document:")
    print(result)
finally:
    # Stop the data source sync job
    result = kendra.stop_data_source_sync_job(
        Id = data_source_id,
        IndexId = index_id
    )
    print("Stop data source sync operation:")
    print(result)
When the process is complete, you see a message similar to the following:
Start data source sync operation:
{
'ExecutionId': '6979977e-0d91-45e9-b69e-19b179cc3bdf',
'ResponseMetadata': {
'RequestId': '677c5ab8-b5e0-4b55-8520-6aa838b8696e',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': '677c5ab8-b5e0-4b55-8520-6aa838b8696e',
'content-type': 'application/x-amz-json-1.1',
'content-length': '54',
'date': 'Mon, 12 Oct 2020 20:25:42 GMT'
},
'RetryAttempts': 0
}
}
Job execution ID: 6979977e-0d91-45e9-b69e-19b179cc3bdf
Response from batch_delete_document:
{
'FailedDocuments': [],
'ResponseMetadata': {
'RequestId': 'e647bac8-becd-4e2f-a089-84255a5d715d',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': 'e647bac8-becd-4e2f-a089-84255a5d715d',
'content-type': 'application/x-amz-json-1.1',
'content-length': '22',
'date': 'Mon, 12 Oct 2020 20:25:43 GMT'
},
'RetryAttempts': 0
}
}
Stop data source sync operation:
{
'ResponseMetadata': {
'RequestId': '58626ede-d535-43dc-abf8-797a5637fc86',
'HTTPStatusCode': 200,
'HTTPHeaders': {
'x-amzn-requestid': '58626ede-d535-43dc-abf8-797a5637fc86',
'content-type': 'application/x-amz-json-1.1',
'content-length': '0',
'date': 'Mon, 12 Oct 2020 20:25:43 GMT'
},
'RetryAttempts': 0
}
}
On the Amazon Kendra console, you can see the operation details.
Running queries
In this section, we show results from queries using the documents you ingested into your index.
The following screenshot shows results for the query “what is deep learning?”
The following screenshot shows results for the query “how do I try amazon rekognition?”
The following screenshot shows results for the query “what is vga resolution?”
Conclusion
In this post, we demonstrated how you can use the custom data source feature in Amazon Kendra to ingest documents from a custom data source into an Amazon Kendra index. We used a sample web connector to scrape content from AWS FAQs and stored it in a local file system. Then we outlined the steps you can follow to ingest those scraped documents into your Kendra index. We also detailed how to use CloudWatch metrics to check the status of an ingestion job, and ran a few natural language search queries to get relevant results from the ingested content.
We hope this post helps you take advantage of the intelligent search capabilities of Amazon Kendra to find accurate answers from your enterprise content. For more information about Amazon Kendra, watch AWS re:Invent 2019 – Keynote with Andy Jassy on YouTube.
About the Authors
Tapodipta Ghosh is a Senior Architect. He leads the Content And Knowledge Engineering Machine Learning team that focuses on building models related to AWS Technical Content. He also helps our customers with AI/ML strategy and implementation using our AI Language services like Kendra.
Juan Pablo Bustos is an AI Services Specialist Solutions Architect at Amazon Web Services, based in Dallas, TX. Outside of work, he loves spending time writing and playing music as well as trying random restaurants with his family.
Explaining Amazon SageMaker Autopilot models with SHAP
Machine learning (ML) models have long been considered black boxes because predictions from these models are hard to interpret. However, recently, several frameworks aiming at explaining ML models were proposed. Model interpretation can be divided into local and global explanations. A local explanation considers a single sample and answers questions like “Why does the model predict that Customer A will stop using the product?” or “Why did the ML system refuse John Doe a loan?” Another interesting question is “What should John Doe change in order to get the loan approved?” In contrast, global explanations aim at explaining the model itself and answer questions like “Which features are important for prediction?” You can use local explanations to derive global explanations by averaging many samples. For further reading on interpretable ML, see the excellent book Interpretable Machine Learning by Christoph Molnar.
In this post, we demonstrate using the popular model interpretation framework SHAP for both local and global interpretation.
SHAP
SHAP is a game-theoretic framework inspired by Shapley values that provides local explanations for any model. SHAP has gained popularity in recent years, probably due to its strong theoretical basis. The SHAP package contains several algorithms that, given a sample and a model, derive the SHAP value for each of the model’s input features. The SHAP value of a feature represents its contribution to the model’s prediction.
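To make "contribution to the prediction" concrete, here is the exact Shapley computation on a toy model, with features outside a coalition replaced by a baseline value. This brute-force version is exponential in the number of features; SHAP's KernelExplainer approximates the same quantity efficiently:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley value of each feature of x for model f, replacing
    features outside the coalition with the baseline (brute force)."""
    n = len(x)

    def value(subset):
        # Model output with only the features in `subset` taken from x
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley formula
            w = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                s = set(coalition)
                phi[i] += w * (value(s | {i}) - value(s))
    return phi

# Toy linear model: Shapley values recover each feature's contribution
linear = lambda z: 2 * z[0] + 3 * z[1]
phi = shapley_values(linear, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

By construction, the values sum to f(x) minus f(baseline), which is what makes per-feature attributions add up to the prediction.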
To explain models built by Amazon SageMaker Autopilot, we use SHAP’s KernelExplainer, which is a black box explainer. KernelExplainer is robust and can explain any model, so it can handle the complex feature processing of Amazon SageMaker Autopilot. KernelExplainer only requires that the model support an inference function that, given a sample, returns the model’s prediction for that sample. The prediction is the predicted value for regression and the class probability for classification.
SHAP includes several other explainers, such as TreeExplainer and DeepExplainer, which are specific to decision forests and neural networks, respectively. These are not black box explainers and require knowledge of the model structure and trained parameters. TreeExplainer and DeepExplainer are limited and, as of this writing, can’t support any feature processing.
Creating a notebook instance
You can run the example code provided in this post. We recommend running it on an Amazon SageMaker instance of type ml.m5.xlarge or larger to reduce running time. To launch the notebook with the example code using Amazon SageMaker Studio, complete the following steps:
- Launch an Amazon SageMaker Studio instance.
- Open a terminal and clone the GitHub repo:
git clone https://github.com/awslabs/amazon-sagemaker-examples.git
- Open the notebook autopilot/model-explainability/explaining_customer_churn_model.ipynb.
- Use the Python 3 (Data Science) kernel.
Setting up the required packages
In this post, we start with a model built by Amazon SageMaker Autopilot, which was already trained on a binary classification task. See the following code:
import boto3
import pandas as pd
import sagemaker
from sagemaker import AutoML
from datetime import datetime
import numpy as np
region = boto3.Session().region_name
session = sagemaker.Session()
For instructions on creating and training an Amazon SageMaker Autopilot model, see Customer Churn Prediction with Amazon SageMaker Autopilot.
Install SHAP with the following code:
!conda install -c conda-forge -y shap
import shap
from shap import KernelExplainer
from shap import sample
from scipy.special import expit
Initialize the plugin to make the plots interactive:
shap.initjs()
Creating an inference endpoint
Create an inference endpoint for the trained model built by Amazon SageMaker Autopilot. See the following code:
autopilot_job_name = '<your_automl_job_name_here>'
autopilot_job = AutoML.attach(autopilot_job_name, sagemaker_session=session)
ep_name = 'sagemaker-automl-' + datetime.now().strftime('%Y-%m-%d-%H-%M-%S')
For the classification response to work with SHAP, we need the probability scores. You can achieve this by providing a list of keys for the response content. The order of the keys dictates the content order in the response. This parameter isn’t needed for regression.
inference_response_keys = ['predicted_label', 'probability']
Create the inference endpoint:
autopilot_job.deploy(initial_instance_count=1, instance_type='ml.m5.2xlarge', inference_response_keys=inference_response_keys, endpoint_name=ep_name)
You can skip this step if an endpoint with the argument inference_response_keys set to ['predicted_label', 'probability'] was already created.
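If you're unsure whether such an endpoint already exists, a check along these lines can help. This is a minimal sketch, not part of the original notebook; is_in_service operates on the plain response dict from describe_endpoint, so the helper itself needs no AWS access.

```python
def is_in_service(desc):
    """Return True if a DescribeEndpoint response reports an InService endpoint."""
    return desc.get('EndpointStatus') == 'InService'

def endpoint_is_ready(endpoint_name):
    """Query SageMaker for the endpoint's status; False if it doesn't exist."""
    import boto3  # imported lazily so is_in_service stays dependency-free
    from botocore.exceptions import ClientError
    sm = boto3.client('sagemaker')
    try:
        return is_in_service(sm.describe_endpoint(EndpointName=endpoint_name))
    except ClientError:
        return False
```

Calling endpoint_is_ready(ep_name) before deploying avoids paying for a duplicate endpoint.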
Wrapping the Amazon SageMaker Autopilot endpoint with an estimator class
For ease of use, we wrap the inference endpoint with a custom estimator class. Two inference functions are provided: predict, which returns the numeric prediction value and is used for regression, and predict_proba, which returns the class probabilities and is used for classification. See the following code:
from sagemaker.predictor import RealTimePredictor
from sagemaker.content_types import CONTENT_TYPE_CSV

class AutomlEstimator:
    def __init__(self, endpoint, sagemaker_session):
        self.predictor = RealTimePredictor(
            endpoint=endpoint,
            sagemaker_session=sagemaker_session,
            content_type=CONTENT_TYPE_CSV,
            accept=CONTENT_TYPE_CSV
        )

    def get_automl_response(self, x):
        if x.__class__.__name__ == 'ndarray':
            payload = ""
            for row in x:
                payload = payload + ','.join(map(str, row)) + '\n'
        else:
            payload = x.to_csv(sep=',', header=False, index=False)
        return self.predictor.predict(payload).decode('utf-8')

    # Prediction function for regression
    def predict(self, x):
        response = self.get_automl_response(x)
        # Return the first column from the response, containing the numeric
        # prediction value (or label in case of classification)
        return np.array([row.split(',')[0] for row in response.split('\n')[:-1]])

    # Prediction function for classification
    def predict_proba(self, x):
        # Return the probability score from the Autopilot endpoint response
        response = self.get_automl_response(x)
        response = np.array([row.split(',')[1] for row in response.split('\n')[:-1]])
        return response.astype(float)
Create an instance of AutomlEstimator:
automl_estimator = AutomlEstimator(endpoint=ep_name, sagemaker_session=session)
Data
In this notebook, we use the same dataset as used in the Customer Churn Prediction with Amazon SageMaker Autopilot GitHub repo. Follow the notebook in the GitHub repo to download the dataset if it was not previously downloaded.
Background data
KernelExplainer requires a sample of the data to be used as background data. KernelExplainer uses this data to simulate a feature being missing by replacing the feature's value with a random value from the background. We use shap.sample to sample 50 rows from the dataset to be used as background data. Using more samples as background data produces more accurate results, but the runtime increases. The clustering algorithms provided in SHAP only support numeric data; as an alternative, you can use a vector of zeros as background data and still produce reasonable results.
Choosing background data is challenging. For more information, see AI Explanations Whitepaper and Runtime considerations.
churn_data = pd.read_csv('../Data sets/churn.txt')
data_without_target = churn_data.drop(columns=['Churn?'])
background_data = sample(data_without_target, 50)
Setting up KernelExplainer
Next, we create the KernelExplainer. Because it's a black box explainer, KernelExplainer only requires a handle to the predict (or predict_proba) function and doesn't need any other information about the model. For classification, it's recommended to derive feature importance scores in the log-odds space because additivity is a more natural assumption there, so we use logit. For regression, use identity. See the following code:
problem_type = automl_job.describe_auto_ml_job(job_name=automl_job_name)['ResolvedAttributes']['ProblemType']
link = "identity" if problem_type == 'Regression' else "logit"
The handle to predict_proba is passed to KernelExplainer because KernelSHAP requires the class probability:
explainer = KernelExplainer(automl_estimator.predict_proba, background_data, link=link)
By analyzing the background data, KernelExplainer provides us with explainer.expected_value, which is the model prediction with all features missing. Considering a customer for whom we have no data at all (all features are missing), this should theoretically be the model prediction. Because expected_value is given in the log-odds space, we convert it back to probability using expit, which is the inverse function of logit:
print('expected value =', expit(explainer.expected_value))
expected value = 0.21051377184689046
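To see why expit recovers a probability, note that it inverts the logit (log-odds) transform; a quick round trip with SciPy confirms this (a minimal sketch, not part of the original notebook):

```python
import numpy as np
from scipy.special import expit, logit

p = 0.21051377184689046      # expected value in probability space
log_odds = logit(p)          # logit(p) = log(p / (1 - p))
recovered = expit(log_odds)  # expit(x) = 1 / (1 + exp(-x))
assert np.isclose(recovered, p)
```

The same conversion applies to any SHAP output produced with link="logit".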
Local explanation with KernelExplainer
We use KernelExplainer to explain the prediction of a single sample, the first sample in the dataset. See the following code:
# Get the first sample
x = data_without_target.iloc[0:1]
ManagedEndpoint automatically deletes the endpoint after calculating the SHAP values. To disable auto delete, use ManagedEndpoint(ep_name, auto_delete=False):
from managed_endpoint import ManagedEndpoint
with ManagedEndpoint(ep_name) as mep:
shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='aic')
The SHAP package includes many visualization tools. The following force_plot code provides a visualization for the SHAP values of a single sample. Because shap_values are provided in the log-odds space, we convert them back to the probability space by using the logit link:
shap.force_plot(explainer.expected_value, shap_values, x, link=link)
The following visualization is the result.
From this plot, we learn that the most influential feature is VMail Message, which pushes the probability down by about 7%. VMail Message = 25 makes the probability 7% lower compared to the notion of that feature being missing. SHAP values don't provide information about how increasing or decreasing VMail Message would affect the prediction.
In many use cases, we're interested only in the most influential features. By setting l1_reg='num_features(5)', SHAP provides non-zero scores for only the five most influential features:
with ManagedEndpoint(ep_name) as mep:
shap_values = explainer.shap_values(x, nsamples='auto', l1_reg='num_features(5)')
shap.force_plot(explainer.expected_value, shap_values, x, link=link)
The following visualization is the result.
KernelExplainer computation cost
KernelExplainer's computation cost is dominated by the inference calls. To estimate SHAP values for a single sample, KernelExplainer calls the inference function twice: first with the sample unaugmented, and then with many randomly augmented instances of the sample. The number of augmented instances in our use case is 50 (the number of samples in the background data) * 2,088 (nsamples='auto') = 104,400. So, for this use case, the cost of running KernelExplainer for a single sample is roughly the cost of 104,400 inference calls.
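The 2,088 figure comes from SHAP's heuristic for nsamples='auto', which resolves to 2 * M + 2048 for M features (here, M = 20 after dropping the target column). The arithmetic can be sketched as:

```python
n_features = 20       # churn dataset columns after dropping 'Churn?'
n_background = 50     # rows sampled with shap.sample
nsamples_auto = 2 * n_features + 2048  # SHAP's 'auto' heuristic for nsamples
inference_calls = n_background * nsamples_auto
print(inference_calls)  # 104400
```

Reducing either the background size or nsamples lowers the cost proportionally, at some loss of accuracy.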
Global explanation with KernelExplainer
Next, we use KernelExplainer to provide insight about the model as a whole. We do this by running KernelExplainer locally on 50 samples and aggregating the results:
X = sample(data_without_target, 50)
with ManagedEndpoint(ep_name) as mep:
shap_values = explainer.shap_values(X, nsamples='auto', l1_reg='aic')
You can use force_plot to visualize SHAP values for many samples simultaneously; force_plot then rotates the plot of each sample by 90 degrees and stacks the plots horizontally. See the following code:
shap.force_plot(explainer.expected_value, shap_values, X, link=link)
The resulting plot is interactive (in the notebook) and can be manually analyzed.
summary_plot is another visualization tool; it displays the mean absolute value of the SHAP values for each feature using a bar plot. Currently, summary_plot doesn't support link functions, so the SHAP values are presented in the log-odds space (not the probability space). See the following code:
shap.summary_plot(shap_values, X, plot_type="bar")
The following graph shows the results.
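If you need the underlying numbers rather than the plot, the same per-feature importances can be computed directly as the mean absolute SHAP value per column. A minimal sketch with made-up toy values (in the notebook, the real array comes from explainer.shap_values):

```python
import numpy as np

# Toy SHAP values: 3 samples x 2 features (hypothetical numbers)
shap_values = np.array([[ 0.5, -1.0],
                        [-0.5,  2.0],
                        [ 0.0, -3.0]])
mean_abs = np.abs(shap_values).mean(axis=0)  # what summary_plot's bars show
ranking = np.argsort(mean_abs)[::-1]         # most influential feature first
```

Here feature 1 dominates (mean absolute value 2.0 versus 1/3), matching what the bar plot would display.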
Conclusion
In this post, we demonstrated how to use KernelSHAP to explain models created by Amazon SageMaker Autopilot, both locally and globally. KernelExplainer is a robust black box explainer that only requires the model to support an inference function that, given a sample, returns the model's prediction for that sample. We provided this inference function by wrapping the Amazon SageMaker Autopilot inference endpoint with a custom estimator class.
For more information about Amazon SageMaker Autopilot, see Amazon SageMaker Autopilot.
To explore related features of Amazon SageMaker, see the following:
- ML Explainability with Amazon SageMaker Debugger
- Detecting and analyzing incorrect model predictions with Amazon SageMaker Model Monitor and Debugger
- Explaining Credit Decisions with Amazon SageMaker
About the Authors
Yotam Elor is a Senior Applied Scientist at AWS. He works on Amazon SageMaker Autopilot, AWS's AutoML solution.
Somnath Sarkar is a Software Engineer on the Amazon SageMaker Autopilot team. He enjoys machine learning in general, with a focus on scalable and distributed systems.
Creating an intelligent ticket routing solution using Slack, Amazon AppFlow, and Amazon Comprehend
Support tickets, customer feedback forms, user surveys, product feedback, and forum posts are some of the documents that businesses collect from their customers and employees. The applications used to collect these case documents typically include incident management systems, social media channels, customer forums, and email. Routing these cases quickly and accurately to support groups best suited to handle them speeds up resolution times and increases customer satisfaction.
In traditional incident management systems internal to a business, a case is assigned to a support group either by the employee during case creation or by a centralized support group that routes tickets to specialized groups after creation. Both scenarios have drawbacks. In the first, the employee opening the case must be aware of the various support groups and their functions; choosing the right support group adds cognitive overload for the employee. There is a chance of human error in both scenarios, which results in re-routing cases and thereby increases resolution times. These repetitive tasks also decrease employee productivity.
Enterprises use business communication platforms like Slack to facilitate conversations between employees. This post provides a solution that simplifies reporting incidents through Slack and routes them to the right support groups. You can use this solution to set up a Slack channel in which employees can report many types of support issues. Individual support groups have their own private Slack channels.
Amazon AppFlow provides a no-code solution to transfer data from Slack channels into AWS securely. You can use Amazon Comprehend custom classification to classify the case documents into support groups automatically. Upon classification, you post the message in the respective private support channel by using Slack Application Programming Interface (API) integration. Depending on the incident management system, you can automate the ticket creation process using APIs.
When you combine Amazon AppFlow with Amazon Comprehend, you can implement an accurate, intelligent routing solution that eliminates the need to create and assign tickets to support groups manually. You can increase productivity by focusing on higher-priority tasks.
Solution overview
For our use case, we use the fictitious company AnyCorp Software Inc, whose programmers use a primary Slack channel to ask technical questions about four different topics. The programmer gets a reply with a ticket number that they can refer to in future communication. The question is intelligently routed to one of five private channels: one dedicated to each topic-specific support group, plus one for all other issues. The following diagram illustrates this architecture.
The solution to building this intelligent ticket routing solution comprises four main components:
- Communication platform – A Slack application with a primary support channel for employees to report issues, four private channels (one for each support group), and one private channel for all other issues.
- Data transfer – A flow in Amazon AppFlow securely transfers data from the primary support channel in Slack to an Amazon Simple Storage Service (Amazon S3) bucket, scheduled to run every 1 minute.
- Document classification – A multi-class custom document classifier in Amazon Comprehend uses ground truth data comprising issue descriptions and their corresponding support group labels. You also create an endpoint for this custom classification model.
- Routing controller – An AWS Lambda function is triggered when Amazon AppFlow puts new incidents into the S3 bucket. For every incident received, the function calls the Amazon Comprehend custom classification model endpoint, which returns a label for the support group best suited to address the incident. After receiving the label from Amazon Comprehend, the function uses the Slack API to reply to the original thread in the primary support channel. The reply contains a ticket number and the name of the support group that will address the issue. Simultaneously, the function posts the issue to the private channel associated with that support group.
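Inside the routing controller, the label-to-channel decision reduces to a dictionary lookup. The following is a minimal sketch, not the post's actual Lambda code; the JSON string mirrors the CategoryChannelMap CloudFormation parameter described later, and the fallback behavior for unknown labels is an assumption:

```python
import json

# JSON string in the same shape as the CategoryChannelMap stack parameter
CATEGORY_CHANNEL_MAP = json.loads(
    '{"GROOVY": "groovy-issues", "HADOOP": "hadoop-issues",'
    ' "MAVEN": "maven-issues", "LOG4J2": "log4j2-issues",'
    ' "OTHERS": "other-issues"}'
)

def route_to_channel(label):
    """Map a Comprehend label to its support channel; unknown labels
    fall back to the catch-all channel."""
    return CATEGORY_CHANNEL_MAP.get(label, CATEGORY_CHANNEL_MAP['OTHERS'])
```

For example, route_to_channel('MAVEN') returns 'maven-issues', and any label outside the map lands in 'other-issues'.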
Dataset
For Amazon Comprehend to classify a document into one of the named categories, it needs to train on a dataset with known inputs and outputs. For our use case, we use the Jira Social Repository dataset hosted on GitHub. The dataset comprises issues extracted from the Jira Issue Tracking System of four popular open-source ecosystems: the Apache Software Foundation, Spring, JBoss, and CodeHaus communities. We used the Apache Software Foundation issues, filtered four categories (GROOVY, HADOOP, MAVEN, and LOG4J2), and created a CSV file for training purposes.
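The filtering step can be sketched as follows. The column names here are hypothetical placeholders (the real Jira dataset's schema may differ); what matters is that Comprehend multi-class training expects a two-column CSV with the label first, the document second, and no header:

```python
import pandas as pd

# Hypothetical schema for illustration; adjust to the real dataset's columns.
issues = pd.DataFrame({
    'project':     ['GROOVY', 'HADOOP', 'SPARK', 'MAVEN', 'LOG4J2'],
    'description': ['closure bug', 'hdfs error', 'shuffle slow',
                    'pom parse fail', 'appender leak'],
})
wanted = ['GROOVY', 'HADOOP', 'MAVEN', 'LOG4J2']
train = issues[issues['project'].isin(wanted)][['project', 'description']]
train.to_csv('train.csv', header=False, index=False)  # label,document per line
```

Rows outside the four categories (SPARK here) are dropped before training.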
- Download the data.zip.
- On the Amazon S3 console, choose Create bucket.
- For Bucket name, enter [YOUR_COMPANY]-comprehend-issue-classifier.
- Choose Create.
- Unzip the train-data.zip file and upload all the files in the folder to the [YOUR_COMPANY]-comprehend-issue-classifier bucket.
- Create another bucket named [YOUR_COMPANY]-comprehend-issue-classifier-output.
We store the output of the custom classification model training in this bucket.
Your [YOUR_COMPANY]-comprehend-issue-classifier bucket should look like the following screenshot.
Deploying Amazon Comprehend
To deploy Amazon Comprehend, complete the following steps:
- On the Amazon Comprehend console, under Customization, choose Custom classification.
- Choose Train classifier.
- For Name, enter comprehend-issue-classifier.
- For Classifier mode, select Using Multi-class mode.
Because our dataset has multiple classes and only one class per line, we use the multi-class mode.
- For S3 location, enter s3://[YOUR_COMPANY]-comprehend-issue-classifier.
- For Output data, choose Browse S3.
- Find the bucket you created in the previous step and choose the s3://[YOUR_COMPANY]-comprehend-issue-classifier-output folder.
- For IAM role, select Create an IAM role.
- For Permissions to access, choose Input and output (if specified) S3 bucket.
- For Name suffix, enter comprehend-issue-classifier.
- Choose Train classifier.
The process can take up to 30 minutes to complete.
- When the training is complete and the status shows as Trained, choose comprehend-issue-classifier.
- In the Endpoints section, choose Create endpoint.
- For Endpoint name, enter comprehend-issue-classifier-endpoint.
- For Inference units, enter 1.
- Choose Create endpoint.
- When the endpoint is created, copy its ARN from the Endpoint details section to use later in the Lambda function.
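Once the endpoint is up, you can exercise it with comprehend.classify_document(Text=..., EndpointArn=...) from boto3, which returns a list of classes with confidence scores. Selecting the winning label with a threshold (mirroring the ComprehendClassificationScoreThreshold parameter used later) might look like this sketch; the catch-all label OTHERS and the helper itself are illustrative assumptions, not the post's actual code:

```python
def top_class(response, threshold=0.75):
    """Pick the highest-scoring class from a ClassifyDocument response;
    fall back to 'OTHERS' when no class clears the threshold."""
    classes = response.get('Classes', [])
    if not classes:
        return 'OTHERS'
    best = max(classes, key=lambda c: c['Score'])
    return best['Name'] if best['Score'] >= threshold else 'OTHERS'

# Example response shape (scores are made up)
sample = {'Classes': [{'Name': 'MAVEN', 'Score': 0.93},
                      {'Name': 'HADOOP', 'Score': 0.05}]}
print(top_class(sample))  # MAVEN
```

Thresholding like this keeps low-confidence classifications out of the specialized channels.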
Creating a Slack app
In this section, we create a Slack app to connect with Amazon AppFlow for our intelligent ticket routing solution. For more information, see Creating, managing, and building apps.
- Sign in to the Slack workspace where you'd like to create the ticket routing solution, or create a new workspace.
- Create a Slack app named TicketResolver.
- After you create the app, in the navigation pane, under Features, choose OAuth & Permissions.
- For Redirect URLs, enter https://console.aws.amazon.com/appflow/oauth.
- For User Token Scopes, add the following:
  - channels:history
  - channels:read
  - chat:write
  - groups:history
  - groups:read
  - im:history
  - im:read
  - mpim:history
  - mpim:read
  - users:read
- In the navigation pane, under Settings, choose Basic Information.
- Expand Install your app to your workspace.
- Choose Install App to Workspace.
- Follow the instructions to install the app to your workspace.
- Create a Slack channel named testing-slack-integration. This channel is your primary channel for reporting issues.
- Create five additional private channels: groovy-issues, hadoop-issues, maven-issues, log4j2-issues, and other-issues. These are used by the support groups designated to handle the specific issues.
- In your channel, choose Connect an app.
- Connect the TicketResolver app you created.
Deploying the AWS CloudFormation template
You can deploy this architecture using the provided AWS CloudFormation template in us-east-1.
- Choose Launch Stack:
- Provide a stack name.
- Provide the following parameters:
  - CategoryChannelMap, a mapping between Amazon Comprehend categories and your Slack channels in string format; for example, '{ "GROOVY":"groovy-issues", "HADOOP":"hadoop-issues", "MAVEN":"maven-issues", "LOG4J2":"log4j-issues", "OTHERS":"other-issues" }'
  - ComprehendClassificationScoreThreshold, which you can leave at the default value of 0.75
  - ComprehendEndpointArn, the ARN of the endpoint created in the previous step, which looks like arn:aws:comprehend:{YOUR_REGION}:{YOUR_ACCOUNT_ID}:document-classifier-endpoint/comprehend-issue-classifier-endpoint
  - Region, the Region where your AWS resources are provisioned (the default is us-east-1)
  - SlackOAuthAccessToken, the OAuth access token from your Slack API page, in the OAuth Tokens & Redirect URLs section
  - SlackClientID, found in the App Credentials section of your Slack app home page as Client ID
  - SlackClientSecret, found in the App Credentials section of your Slack app home page as Client Secret
  - SlackWorkspaceInstanceURL, which you can find by choosing the down arrow next to the workspace name
  - SlackChannelID, the channel ID of the testing-slack-integration channel
  - LambdaCodeBucket, the name of the bucket where your Lambda code is stored. The default is intelligent-routing-lambda-code, the public bucket containing the Lambda deployment package. If your AWS account is in us-east-1, no change is needed. For other Regions, download the Lambda deployment package, create an S3 bucket in your AWS account, upload the package, and change the parameter value to your bucket name.
  - LambdaCodeKey, the file name of the zipped Lambda code. The default is lambda.zip, the name of the deployment package in the public bucket. Revise this to your file name if you had to download and upload the Lambda deployment package to your own bucket.
- Choose Next.
- In the Capabilities and transforms section, select all three check boxes to acknowledge that AWS CloudFormation may create AWS Identity and Access Management (IAM) resources and expand the template.
- Choose Create stack.
This process might take 15 minutes or more to complete, and creates the following resources:
- IAM roles for the Lambda function to use
- A Lambda function to integrate Slack with Amazon Comprehend to categorize issues typed by Slack users
- An Amazon AppFlow Slack connection for the flow to use
- An Amazon AppFlow flow to securely connect the Slack app with AWS services
Activating the Amazon AppFlow flow
You can create a flow on the Amazon AppFlow console.
- On the Amazon AppFlow console, choose View flows.
- Choose Activate flow.
Your SlackAppFlow flow is now active and runs every 1 minute to gather incremental data from the testing-slack-integration Slack channel.
Testing your integration
You can test the end-to-end integration by typing some issues related to your channels in the testing-slack-integration channel and waiting about 1 minute for the Amazon AppFlow connection to transfer data to the S3 bucket. This triggers the Lambda function, which runs the Amazon Comprehend analysis, returns a category, and finally responds in the testing-slack-integration channel and the channel of the corresponding category with a randomly generated ticket number.
For example, in the following screenshot, we enter a Maven-related issue in the testing-slack-integration channel.
You see a reply from the TicketResolver app added to your original message in the testing-slack-integration channel.
You also see a Slack message posted in the channel of the corresponding category.
Cleaning up
To avoid incurring future charges, delete all the resources you created as part of this post:
- Amazon Comprehend endpoint comprehend-issue-classifier-endpoint
- Amazon Comprehend classifier comprehend-issue-classifier
- Slack app TicketResolver
- Slack channels testing-slack-integration, groovy-issues, hadoop-issues, maven-issues, log4j2-issues, and other-issues
- S3 bucket [YOUR_COMPANY]-comprehend-issue-classifier-output
- S3 bucket [YOUR_COMPANY]-comprehend-issue-classifier
- CloudFormation stack (this removes all the resources the CloudFormation template created)
Conclusion
In this post, you learned how to use Amazon Comprehend, Amazon AppFlow, and Slack to create an intelligent issue-routing solution. For more information about securely transferring data from software-as-a-service (SaaS) applications like Salesforce, Marketo, Slack, and ServiceNow to AWS, see Get Started with Amazon AppFlow. For more information about Amazon Comprehend custom classification models, see Custom Classification. You can also explore other Amazon Comprehend features and get inspiration from other AWS blog posts about using Amazon Comprehend beyond classification.
About the Author
Shanthan Kesharaju is a Senior Architect who helps customers with AI/ML strategy and architecture. He is an award-winning product manager and has built top trending Alexa skills. Shanthan has an MBA in Marketing from Duke University and an MS in Management Information Systems from Oklahoma State University.
So Young Yoon is a Conversation A.I. Architect at AWS Professional Services, where she works with customers across multiple industries to develop specialized conversational assistants that help those customers provide their users faster and more accurate information through natural language. Soyoung has an M.S. and a B.S. in Electrical and Computer Engineering from Carnegie Mellon University.