Create synthetic data for computer vision pipelines on AWS
Collecting and annotating image data is one of the most resource-intensive tasks on any computer vision project. It can take months at a time to fully collect, analyze, and experiment with image streams at the level you need in order to compete in the current marketplace. Even after you’ve successfully collected data, you still have a constant stream of annotation errors, poorly framed images, small amounts of meaningful data in a sea of unwanted captures, and more. These major bottlenecks are why synthetic data creation needs to be in the toolkit of every modern engineer. By creating 3D representations of the objects we want to model, we can rapidly prototype algorithms while concurrently collecting live data.
In this post, I walk you through an example of using the open-source animation library Blender to build an end-to-end synthetic data pipeline, using chicken nuggets as an example. The following image is an illustration of the data generated in this blog post.
What is Blender?
Blender is an open-source 3D graphics software primarily used in animation, 3D printing, and virtual reality. It has an extremely comprehensive rigging, animation, and simulation suite that allows the creation of 3D worlds for nearly any computer vision use case. It also has an extremely active support community where most, if not all, user errors are solved.
Set up your local environment
We install two versions of Blender: one on a local machine with access to a GUI, and the other on an Amazon Elastic Compute Cloud (Amazon EC2) P2 instance.
Install Blender and ZPY
Install Blender from the Blender website.
Then complete the following steps:
- Run the following commands:
- Copy the necessary Python headers into the Blender version of Python so that you can use other non-Blender libraries:
- Override your Blender version and force installs so that the Blender-provided Python works:
- Download `zpy` and install from source:
- Change the NumPy version to `>=1.19.4` and `scikit-image>=0.18.1` to make the install on `3.10.2` possible and so you don’t get any overwrites:
- To ensure compatibility with Blender 3.2, go into `zpy/render.py` and comment out the following two lines (for more information, refer to Blender 3.0 Failure #54):
- Next, install the `zpy` library:
- Download the add-ons version of `zpy` from the GitHub repo so you can actively run your instance:
- Save a file called `enable_zpy_addon.py` in your `/home` directory and run the enablement command, because you don’t have a GUI to activate it:
If `zpy-addon` doesn’t install (for whatever reason), you can install it via the GUI.
- In Blender, on the Edit menu, choose Preferences.
- Choose Add-ons in the navigation pane and activate `zpy`.
You should see a page open in the GUI, and you’ll be able to choose ZPY. This will confirm that Blender is loaded.
AliceVision and Meshroom
Install AliceVision and Meshroom from their respective GitHub repos:
FFmpeg
Your system should have `ffmpeg`, but if it doesn’t, you’ll need to download it.
Instant Meshes
You can either compile the library yourself or download the available pre-compiled binaries (which is what I did) for Instant Meshes.
Set up your AWS environment
Now we set up the AWS environment on an EC2 instance. We repeat the steps from the previous section, but only for Blender and `zpy`.
- On the Amazon EC2 console, choose Launch instances.
- Choose your AMI.
There are a few options from here. We can either choose a standard Ubuntu image, pick a GPU instance, and then manually install the drivers and get everything set up, or we can take the easy route and start with a preconfigured Deep Learning AMI and only worry about installing Blender. For this post, I use the second option, and choose the latest version of the Deep Learning AMI for Ubuntu (Deep Learning AMI (Ubuntu 18.04) Version 61.0).
- For Instance type, choose p2.xlarge.
- If you don’t have a key pair, create a new one or choose an existing one.
- For this post, use the default settings for network and storage.
- Choose Launch instances.
- Choose Connect and find the instructions to log in to your instance over SSH on the SSH client tab.
- Connect with SSH:
ssh -i "your-pem" ubuntu@IPADDRESS.YOUR-REGION.compute.amazonaws.com
Once you’ve connected to your instance, follow the same installation steps from the previous section to install Blender and `zpy`.
Data collection: 3D scanning our nugget
For this step, I use an iPhone to record a 360-degree video at a fairly slow pace around my nugget. I stuck a chicken nugget onto a toothpick and taped the toothpick to my countertop, and simply rotated my camera around the nugget to get as many angles as I could. The faster you film, the less likely you are to get good images to work with, depending on the shutter speed.
After I finished filming, I sent the video to my email and extracted the video to a local drive. From there, I used `ffmpeg` to chop the video into frames to make Meshroom ingestion much easier:
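The exact command isn’t reproduced here; as a reference, a minimal frame-extraction call (wrapped in Python for consistency with the rest of the pipeline) might look like the following, where the input file name, output folder, and frame rate are illustrative assumptions:

```python
import os
import subprocess

# Assumed input file, output folder, and frame rate -- adjust to your own recording.
video_path = "nugget_video.mov"
output_dir = "nugget_images"
os.makedirs(output_dir, exist_ok=True)

# Extract roughly 2 frames per second; sharp, well-spaced frames work better in
# Meshroom than dumping every single frame of the video.
subprocess.run(
    ["ffmpeg", "-i", video_path, "-r", "2",
     os.path.join(output_dir, "frame_%04d.jpg")],
    check=True,
)
```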
Open Meshroom and use the GUI to drag the `nugget_images` folder to the pane on the left. From there, choose Start and wait a few hours (or less), depending on the length of the video and whether you have a CUDA-enabled machine.
You should see something like the following screenshot when it’s almost complete.
Data collection: Blender manipulation
When our Meshroom reconstruction is complete, complete the following steps:
- Open the Blender GUI and on the File menu, choose Import, then Wavefront (.obj), and select the textured file created by Meshroom.
The file should be saved in `path/to/MeshroomCache/Texturing/uuid-string/texturedMesh.obj`.
- Load the file and observe the monstrosity that is your 3D object.
Here is where it gets a bit tricky.
Here is where it gets a bit tricky. - Scroll to the top right side and choose the Wireframe icon in Viewport Shading.
- Select your object on the right viewport and make sure it’s highlighted, scroll over to the main layout viewport, and either press Tab or manually choose Edit Mode.
- Next, maneuver the viewport so that you can see your object with as little as possible behind it. You’ll have to do this a few times to really get it correct.
- Click and drag a bounding box over the object so that only the nugget is highlighted.
- After it’s highlighted like in the following screenshot, we separate our nugget from the 3D mass by left-clicking, choosing Separate, and then Selection.
We now move over to the right, where we should see two textured objects: `texturedMesh` and `texturedMesh.001`.
- Our new object should be `texturedMesh.001`, so we choose `texturedMesh` and choose Delete to remove the unwanted mass.
- Choose the object (`texturedMesh.001`) on the right, move to our viewer, and choose the object, Set Origin, and Origin to Center of Mass.
Now, if we want, we can move our object to the center of the viewport (or simply leave it where it is) and view it in all its glory. Notice the large black hole where we didn’t get good film coverage! We’re going to need to correct for this.
To clean our object of any pixel impurities, we export our object to an .obj file. Make sure to choose Selection Only when exporting.
Data collection: Clean up with Instant Meshes
Now we have two problems: our image has a pixel gap created by our poor filming that we need to clean up, and our image is incredibly dense (which will make generating images extremely time-consuming). To tackle both issues, we need to use software called Instant Meshes to extrapolate our pixel surface to cover the black hole and also to shrink the total object to a smaller, less dense size.
- Open Instant Meshes and load our recently saved `nugget.obj` file.
- Under Orientation field, choose Solve.
- Under Position field, choose Solve.
Here’s where it gets interesting. If you explore your object and notice that the criss-cross lines of the Position solver look disjointed, you can choose the comb icon under Orientation field and redraw the lines properly.
- Choose Solve for both Orientation field and Position field.
- If everything looks good, export the mesh, name it something like `nugget_refined.obj`, and save it to disk.
Data collection: Shake and bake!
Because our low-poly mesh doesn’t have any image texture associated with it and our high-poly mesh does, we either need to bake the high-poly texture onto the low-poly mesh, or create a new texture and assign it to our object. For the sake of simplicity, we’re going to create an image texture from scratch and apply that to our nugget.
I used Google image search for nuggets and other fried things in order to get a high-res image of the surface of a fried object. I found a super high-res image of a fried cheese curd and made a new image full of the fried texture.
With this image, I’m ready to complete the following steps:
- Open Blender and load the new `nugget_refined.obj` the same way you loaded your initial object: on the File menu, choose Import, Wavefront (.obj), and choose the `nugget_refined.obj` file.
- Next, go to the Shading tab.
At the bottom you should notice two boxes with the titles Principled BSDF and Material Output.
- On the Add menu, choose Texture and Image Texture.
An Image Texture box should appear.
- Choose Open Image and load your fried texture image.
- Drag your mouse between Color in the Image Texture box and Base Color in the Principled BSDF box.
Now your nugget should be good to go!
Data collection: Create Blender environment variables
Now that we have our base nugget object, we need to create a few collections and environment variables to help us in our process.
- Left-click on the right-hand scene area and choose New Collection.
- Create the following collections: BACKGROUND, NUGGET, and SPAWNED.
- Drag the nugget to the NUGGET collection and rename it nugget_base.
Data collection: Create a plane
We’re going to create a background object from which our nuggets will be generated when we’re rendering images. In a real-world use case, this plane is where our nuggets are placed, such as a tray or bin.
- On the Add menu, choose Mesh and then Plane.
From here, we move to the right side of the page and find the orange box (Object Properties).
- In the Transform pane, for XYZ Euler, set X to 46.968, Y to 46.968, and Z to 1.0.
- For both Location and Rotation, set X, Y, and Z to 0.
Data collection: Set the camera and axis
Next, we’re going to set our cameras up correctly so that we can generate images.
- On the Add menu, choose Empty and Plain Axes.
- Name the object Main Axis.
- Make sure our axis is 0 for all the variables (so it’s directly in the center).
- If you have a camera already created, drag that camera to under Main Axis.
- Choose Item and Transform.
- For Location, set X to 0, Y to 0, and Z to 100.
Data collection: Here comes the sun
Next, we add a Sun object.
- On the Add menu, choose Light and Sun.
The location of this object doesn’t necessarily matter as long as it’s centered somewhere over the plane object we’ve set.
- Choose the green lightbulb icon in the bottom right pane (Object Data Properties) and set the strength to 5.0.
- Repeat the same procedure to add a Light object and put it in a random spot over the plane.
Data collection: Download random backgrounds
To inject randomness into our images, we download as many random textures from texture.ninja as we can (for example, bricks). Download them to a folder within your workspace called `random_textures`. I downloaded about 50.
Generate images
Now we get to the fun stuff: generating images.
Image generation pipeline: Object3D and DensityController
Let’s start with some code definitions:
We first define a basic container class with some important properties. This class mainly exists to allow us to create a BVH tree (a way to represent our nugget object in 3D space), where we’ll need to use the `BVHTree.overlap` method to see if two independently generated nugget objects are overlapping in our 3D space. More on this later.
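The original class definition isn’t reproduced here, but a minimal sketch of the idea, built on Blender’s `bmesh` and `mathutils.bvhtree` modules, could look like the following; the class name matches the post, while the method and attribute names are assumptions for illustration:

```python
import bmesh
import bpy
from mathutils.bvhtree import BVHTree


class Object3D:
    """Minimal container for a Blender object and its world-space BVH tree (illustrative sketch)."""

    def __init__(self, obj: bpy.types.Object):
        self.obj = obj
        bm = bmesh.new()
        bm.from_mesh(obj.data)
        # Move the mesh into world space so overlap tests between objects are meaningful.
        bm.transform(obj.matrix_world)
        self.bvh = BVHTree.FromBMesh(bm)
        bm.free()

    def overlaps(self, other: "Object3D") -> bool:
        # BVHTree.overlap returns a list of overlapping triangle index pairs;
        # an empty list means the two meshes don't intersect.
        return len(self.bvh.overlap(other.bvh)) > 0
```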
The second piece of code is our density controller. This serves as a way to bound ourselves to the rules of reality and not the 3D world. For example, in the 3D Blender world, objects in Blender can exist inside each other; however, unless someone is performing some strange science on our chicken nuggets, we want to make sure no two nuggets are overlapping by a degree that makes it visually unrealistic.
We use our `Plane` object to spawn a set of bounded invisible cubes that can be queried at any given time to see if the space is occupied or not.
See the following code:
In the following snippet, we select the nugget and create a bounding cube around that nugget. This cube represents the size of a single pseudo-voxel of our pseudo-kdtree object. We need to use the `bpy.context.view_layer.update()` function because when this code is run from inside a function or script rather than the Blender GUI, it seems that the `view_layer` isn’t automatically updated.
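The snippet itself isn’t included here; as a rough stand-in, the bounding-cube step might look like the following in Blender’s Python API, with the object names assumed:

```python
import bpy

nugget = bpy.data.objects["nugget_base"]  # assumed object name

# Make sure dimensions reflect any pending transforms before we read them.
bpy.context.view_layer.update()

# Add a cube and size it to the nugget's bounding dimensions: one pseudo-voxel.
bpy.ops.mesh.primitive_cube_add(location=nugget.location)
cube = bpy.context.active_object
cube.name = "nugget_cube"
cube.dimensions = nugget.dimensions
bpy.context.view_layer.update()
```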
Next, we slightly update our cube object so that its length and width are square, as opposed to the natural size of the nugget it was created from:
Now we use our updated cube object to create a plane that can volumetrically hold `num_objects` nuggets:
We take our plane object and create a giant cube of the same length and width as our plane, with the height of our nugget cube, CUBE1:
From here, we want to create voxels from our cube. We take the number of cubes we would need to fit `num_objects` and then cut them from our cube object. We look for the upward-facing mesh face of our cube, and then pick that face to make our cuts. See the following code:
Lastly, we calculate the center of the top-face of each cut we’ve made from our big cube and create actual cubes from those cuts. Each of these newly created cubes represents a single piece of space to spawn or move nuggets around our plane. See the following code:
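The original snippet isn’t included here. The post’s implementation makes the cuts on the big cube itself; a simpler sketch that produces the same end result (a grid of spawn-cell cubes over the plane) is shown below, with the function and parameter names assumed:

```python
import math

import bpy


def make_spawn_cubes(plane, cell_size, cell_height, num_objects):
    """Create a grid of 'voxel' cubes over the plane, one per spawn slot (sketch)."""
    cells_per_side = math.ceil(math.sqrt(num_objects))
    start_x = plane.location.x - (cells_per_side * cell_size) / 2
    start_y = plane.location.y - (cells_per_side * cell_size) / 2
    cubes = []
    for i in range(cells_per_side):
        for j in range(cells_per_side):
            x = start_x + (i + 0.5) * cell_size
            y = start_y + (j + 0.5) * cell_size
            # Each cube's center sits half a cell height above the plane.
            bpy.ops.mesh.primitive_cube_add(location=(x, y, cell_height / 2))
            cube = bpy.context.active_object
            cube.dimensions = (cell_size, cell_size, cell_height)
            cube.name = f"kdtree_cube_{i}_{j}"
            cubes.append(cube)
    return cubes
```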
Next, we develop an algorithm that understands which cubes are occupied at any given time, finds which objects overlap with each other, and moves overlapping objects separately into unoccupied space. We won’t be able to get rid of all overlaps entirely, but we can make it look real enough.
See the following code:
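The original code isn’t reproduced here; as a simplified sketch of that relocation logic (assuming the `Object3D` wrapper sketched earlier and a list of spawn-cell cubes; occupancy bookkeeping is omitted for brevity):

```python
import random

import bpy


def resolve_overlaps(nuggets, spawn_cubes, max_attempts=10):
    """Move overlapping nuggets into other cells until few overlaps remain (sketch)."""
    for i, nugget in enumerate(nuggets):
        for _ in range(max_attempts):
            if not any(nugget.overlaps(other) for other in nuggets if other is not nugget):
                break
            # Relocate to a random spawn cell and rebuild the BVH tree so the
            # next overlap test sees the nugget in its new position.
            target = random.choice(spawn_cubes)
            nugget.obj.location.x = target.location.x
            nugget.obj.location.y = target.location.y
            bpy.context.view_layer.update()
            nuggets[i] = nugget = Object3D(nugget.obj)
    return nuggets
```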
Image generation pipeline: Cool runnings
In this section, we break down what our `run` function is doing.
We initialize our `DensityController` and create something called a saver using the `ImageSaver` from `zpy`. This allows us to seamlessly save our rendered images to any location of our choosing. We then add our `nugget` category (and if we had more categories, we would add them here). See the following code:
Next, we need to make a source object from which we spawn copy nuggets; in this case, it’s the `nugget_base` that we created:
Now that we have our base nugget, we’re going to save the world poses (locations) of all the other objects so that after each rendering run, we can use these saved poses to reinitialize a render. We also move our base nugget completely out of the way so that the kdtree doesn’t sense a space being occupied. Finally, we initialize our kdtree-cube objects. See the following code:
The following code collects our downloaded backgrounds from texture.ninja, where they’ll be randomly projected onto our plane:
Here is where the magic begins. We first regenerate our kdtree-cubes for this run so that we can start fresh:
We use our density controller to generate a random spawn point for our nugget, create a copy of `nugget_base`, and move the copy to the randomly generated spawn point:
Next, we randomly jitter the size of the nugget, the mesh of the nugget, and the scale of the nugget so that no two nuggets look the same:
We turn our nugget copy into an `Object3D` object where we use the BVH tree functionality to see if our plane intersects or overlaps any face or vertices on our nugget copy. If we find an overlap with the plane, we simply move the nugget upwards on its Z axis. See the following code:
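That snippet isn’t included here; a hedged sketch of the check, reusing the `Object3D` wrapper from earlier (the variable names and step size are assumptions):

```python
import bpy

# Assumes `nugget_copy` is the spawned Blender object and `plane_obj` is the background plane.
plane3d = Object3D(plane_obj)
nugget3d = Object3D(nugget_copy)

# While the nugget's mesh intersects the plane, nudge it upward and re-test.
while nugget3d.overlaps(plane3d):
    nugget_copy.location.z += 0.05  # step size is an arbitrary choice for the sketch
    bpy.context.view_layer.update()
    nugget3d = Object3D(nugget_copy)  # rebuild the BVH tree at the new position
```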
Now that all nuggets are created, we use our `DensityController` to move nuggets around so that we have a minimum number of overlaps, and those that do overlap aren’t hideous looking:
In the following code, we restore the `Camera` and `Main Axis` poses and randomly select how far the camera is from the `Plane` object:
We decide how randomly we want the camera to travel along the `Main Axis`. Depending on whether we want it to be mainly overhead or whether we care very much about the angle from which it sees the board, we can adjust the `top_down_mostly` parameter based on how well our training model is picking up the signal of “What even is a nugget anyway?”
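One possible way to express that bias (not the original script; the function and parameter names are assumptions):

```python
import math
import random

import bpy


def jitter_camera(main_axis, top_down_mostly=True):
    """Randomly rotate the Main Axis empty that the camera is parented to (sketch)."""
    # A small tilt keeps the view mostly overhead; a large one allows oblique angles.
    max_tilt = math.radians(15 if top_down_mostly else 70)
    main_axis.rotation_euler = (
        random.uniform(-max_tilt, max_tilt),   # tilt around X
        random.uniform(-max_tilt, max_tilt),   # tilt around Y
        random.uniform(0, 2 * math.pi),        # free spin around Z
    )
    bpy.context.view_layer.update()
```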
In the following code, we do the same thing with the `Sun` object, and randomly pick a texture for the `Plane` object:
Finally, we hide all our objects that we don’t want to be rendered: the `nugget_base` and our entire cube structure:
Lastly, we use `zpy` to render our scene, save our images, and then save our annotations. For this post, I made some small changes to the `zpy` annotation library for my specific use case (annotation per image instead of one file per project), but you shouldn’t have to for the purposes of this post.
Voila!
Run the headless creation script
Now that we have our saved Blender file, our created nugget, and all the supporting information, let’s zip our working directory and either `scp` it to our GPU machine or upload it via Amazon Simple Storage Service (Amazon S3) or another service:
Log in to your EC2 instance and decompress your working_blender folder:
Now we create our data in all its glory:
The script should run for 500 images, and the data is saved in `/path/to/working_blender_dir/nugget_data`.
The following code shows a single annotation created with our dataset:
Conclusion
In this post, I demonstrated how to use the open-source animation library Blender to build an end-to-end synthetic data pipeline.
There are a ton of cool things you can do in Blender and AWS; hopefully this demo can help you on your next data-starved project!
References
- Easily Clean Your 3D Scans (blender)
- Instant Meshes: A free quad-based autoretopology program
- How to 3D Scan an Object for Synthetic Data
- Generate synthetic data with Blender and Python
About the Author
Matt Krzus is a Sr. Data Scientist at Amazon Web Services in the AWS Professional Services group.
Enable CI/CD of multi-Region Amazon SageMaker endpoints
Amazon SageMaker and SageMaker inference endpoints provide a capability of training and deploying your AI and machine learning (ML) workloads. With inference endpoints, you can deploy your models for real-time or batch inference. The endpoints support various types of ML models hosted using AWS Deep Learning Containers or your own containers with custom AI/ML algorithms. When you launch SageMaker inference endpoints with multiple instances, SageMaker distributes the instances across multiple Availability Zones (in a single Region) for high availability.
In some cases, however, to ensure lowest possible latency for customers in diverse geographical areas, you may require deploying inference endpoints in multiple Regions. Multi-Regional deployment of SageMaker endpoints and other related application and infrastructure components can also be part of a disaster recovery strategy for your mission-critical workloads aimed at mitigating the risk of a Regional failure.
SageMaker Projects implements a set of pre-built MLOps templates that can help manage endpoint deployments. In this post, we show how you can extend an MLOps SageMaker Projects pipeline to enable multi-Regional deployment of your AI/ML inference endpoints.
Solution overview
SageMaker Projects deploys both training and deployment MLOPs pipelines; you can use these to train a model and deploy it using an inference endpoint. To reduce complexity and cost of a multi-Region solution, we assume that you train the model in a single Region and deploy inference endpoints in two or more Regions.
This post presents a solution that slightly modifies a SageMaker project template to support multi-Region deployment. To better illustrate the changes, the following figure displays both a standard MLOps pipeline created automatically by SageMaker (Steps 1-5) as well as changes required to extend it to a secondary Region (Steps 6-11).
The SageMaker Projects template automatically deploys a boilerplate MLOps solution, which includes the following components:
- Amazon EventBridge monitors AWS CodeCommit repositories for changes and starts a run of AWS CodePipeline if a code commit is detected.
- If there is a code change, AWS CodeBuild orchestrates the model training using SageMaker training jobs.
- After the training job is complete, the SageMaker model registry registers and catalogs the trained model.
- To prepare for the deployment stage, CodeBuild extends the default AWS CloudFormation template configuration files with parameters of an approved model from the model registry.
- Finally, CodePipeline runs the CloudFormation templates to deploy the approved model to the staging and production inference endpoints.
The following additional steps modify the MLOps Projects template to enable the AI/ML model deployment in the secondary Region:
- A replica of the Amazon Simple Storage Service (Amazon S3) bucket in the primary Region storing model artifacts is required in the secondary Region.
- The CodePipeline template is extended with more stages to run a cross-Region deployment of the approved model.
- As part of the cross-Region deployment process, the CodePipeline template uses a new CloudFormation template to deploy the inference endpoint in a secondary Region. The CloudFormation template deploys the model from the model artifacts from the S3 replica bucket created in Step 6.
- Steps 9–11 (optional): Create resources in Amazon Route 53, Amazon API Gateway, and AWS Lambda to route application traffic to inference endpoints in the secondary Region.
Prerequisites
Create a SageMaker project in your primary Region (us-east-2 in this post). Complete the steps in Building, automating, managing, and scaling ML workflows using Amazon SageMaker Pipelines until the section Modifying the sample code for a custom use case.
Update your pipeline in CodePipeline
In this section, we discuss how to add manual CodePipeline approval and cross-Region model deployment stages to your existing pipeline created for you by SageMaker.
- On the CodePipeline console in your primary Region, find and select the pipeline containing your project name and ending with deploy. This pipeline has already been created for you by SageMaker Projects. You modify this pipeline to add AI/ML endpoint deployment stages for the secondary Region.
- Choose Edit.
- Choose Add stage.
- For Stage name, enter `SecondaryRegionDeployment`.
- Choose Add stage.
- In the `SecondaryRegionDeployment` stage, choose Add action group.
In this action group, you add a manual approval step for model deployment in the secondary Region.
- For Action name, enter `ManualApprovaltoDeploytoSecondaryRegion`.
- For Action provider, choose Manual approval.
- Leave all other settings at their defaults and choose Done.
- In the `SecondaryRegionDeployment` stage, choose Add action group (after `ManualApprovaltoDeploytoSecondaryRegion`).
In this action group, you add a cross-Region AWS CloudFormation deployment step. You specify the names of build artifacts that you create later in this post.
- For Action name, enter `DeploytoSecondaryRegion`.
- For Action provider, choose AWS CloudFormation.
- For Region, enter your secondary Region name (for example, `us-west-2`).
- For Input artifacts, enter `BuildArtifact`.
- For ActionMode, enter `CreateorUpdateStack`.
- For StackName, enter `DeploytoSecondaryRegion`.
- Under Template, for Artifact Name, select `BuildArtifact`.
- Under Template, for File Name, enter `template-export-secondary-region.yml`.
- Turn Use Configuration File on.
- Under Template, for Artifact Name, select `BuildArtifact`.
- Under Template, for File Name, enter `secondary-region-config-export.json`.
- Under Capabilities, choose `CAPABILITY_NAMED_IAM`.
- For Role, choose `AmazonSageMakerServiceCatalogProductsUseRole` created by SageMaker Projects.
- Choose Done.
- Choose Save.
- If a Save pipeline changes dialog appears, choose Save again.
Modify IAM role
We need to add permissions to the AWS Identity and Access Management (IAM) role `AmazonSageMakerServiceCatalogProductsUseRole` created by AWS Service Catalog to enable CodePipeline and S3 bucket access for cross-Region deployment.
- On the IAM console, choose Roles in the navigation pane.
- Search for and select `AmazonSageMakerServiceCatalogProductsUseRole`.
- Choose the IAM policy under Policy name: `AmazonSageMakerServiceCatalogProductsUseRole-XXXXXXXXX`.
. - Choose Edit Policy and then JSON.
- Modify the AWS CloudFormation permissions to allow CodePipeline to sync the S3 bucket in the secondary Region. You can replace the existing IAM policy with the updated one from the following GitHub repo (see lines 16–18, 198, and 213).
- Choose Review policy.
- Choose Save changes.
Add the deployment template for the secondary Region
To spin up an inference endpoint in the secondary Region, the `SecondaryRegionDeployment` stage needs a CloudFormation template (`endpoint-config-template-secondary-region.yml`) and a configuration file (`secondary-region-config.json`).
The CloudFormation template is configured entirely through parameters; you can further modify it to fit your needs. Similarly, you can use the config file to define the parameters for the endpoint launch configuration, such as the instance type and instance count:
To add these files to your project, download them from the provided links and upload them to Amazon SageMaker Studio in the primary Region. In Studio, choose File Browser and then the folder containing your project name and ending with `modeldeploy`.
Upload these files to the deployment repository’s root folder by choosing the upload icon. Make sure the files are located in the root folder as shown in the following screenshot.
Modify the build Python file
Next, we need to adjust the deployment `build.py` file to enable SageMaker endpoint deployment in the secondary Region, so that it does the following (a rough sketch of this logic appears after the list):
- Retrieve the location of model artifacts and Amazon Elastic Container Registry (Amazon ECR) URI for the model image in the secondary Region
- Prepare a parameter file that is used to pass the model-specific arguments to the CloudFormation template that deploys the model in the secondary Region
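The downloaded `build.py` is the authoritative version; the following is only a rough sketch of those two tasks, with the CloudFormation parameter names and instance settings assumed for illustration:

```python
import json

import sagemaker


def prepare_secondary_region_config(model_data_url, framework, model_version,
                                    secondary_region, replica_bucket, output_path):
    """Write a CloudFormation parameter file for the secondary-Region endpoint (sketch)."""
    # 1. Resolve the built-in container image URI for the secondary Region.
    image_uri = sagemaker.image_uris.retrieve(
        framework=framework, region=secondary_region, version=model_version
    )

    # 2. Point the model artifact at the S3 replica bucket in the secondary Region.
    key = model_data_url.split("/", 3)[-1]  # strip the s3://<primary-bucket>/ prefix
    replica_model_data = f"s3://{replica_bucket}/{key}"

    params = {
        "Parameters": {
            "ModelImageUri": image_uri,          # parameter names are assumptions
            "ModelDataUrl": replica_model_data,
            "EndpointInstanceType": "ml.m5.large",
            "EndpointInstanceCount": "1",
        }
    }
    with open(output_path, "w") as f:
        json.dump(params, f, indent=2)
```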
You can download the updated `build.py` file and replace the existing one in your folder. In Studio, choose File Browser and then the folder containing your project name and ending with `modeldeploy`. Locate the `build.py` file and replace it with the one you downloaded.
The CloudFormation template uses the model artifacts stored in an S3 bucket and the Amazon ECR image path to deploy the inference endpoint in the secondary Region. This is different from the deployment from the model registry in the primary Region, because you don’t need to have a model registry in the secondary Region.
Modify the buildspec file
`buildspec.yml` contains instructions run by CodeBuild. We modify this file to do the following:
- Install the SageMaker Python library needed to support the code run
- Pass through the `--secondary-region` and model-specific parameters to `build.py`
- Add the S3 bucket content sync from the primary to secondary Regions
- Export the secondary Region CloudFormation template and associated parameter file as artifacts of the CodeBuild step
Open the `buildspec.yml` file from the model deploy folder and make the highlighted modifications as shown in the following screenshot.
Alternatively, you can download the following `buildspec.yml` file to replace the default file.
Add CodeBuild environment variables
In this step, you add configuration parameters required for CodeBuild to create the model deployment configuration files in the secondary Region.
- On the CodeBuild console in the primary Region, find the project containing your project name and ending with deploy. This project has already been created for you by SageMaker Projects.
- Choose the project and on the Edit menu, choose Environment.
- In the Advanced configuration section, deselect Allow AWS CodeBuild to modify this service role so it can be used with this build project.
- Add the following environment variables, defining the names of the additional CloudFormation templates, secondary Region, and model-specific parameters:
- `EXPORT_TEMPLATE_NAME_SECONDARY_REGION` – For Value, enter `template-export-secondary-region.yml` and for Type, choose PlainText.
- `EXPORT_TEMPLATE_SECONDARY_REGION_CONFIG` – For Value, enter `secondary-region-config-export.json` and for Type, choose PlainText.
- `AWS_SECONDARY_REGION` – For Value, enter `us-west-2` and for Type, choose PlainText.
- `FRAMEWORK` – For Value, enter `xgboost` (replace with your framework) and for Type, choose PlainText.
- `MODEL_VERSION` – For Value, enter `1.0-1` (replace with your model version) and for Type, choose PlainText.
- Copy the value of `ARTIFACT_BUCKET` into Notepad or another text editor. You need this value in the next step.
- Choose Update environment.
You need the values you specified for model training for `FRAMEWORK` and `MODEL_VERSION`. For example, to find these values for the Abalone model used in the MLOps boilerplate deployment, open Studio and on the File Browser menu, open the folder with your project name and ending with `modelbuild`. Navigate to `pipelines/abalone` and open the `pipeline.py` file. Search for `sagemaker.image_uris.retrieve` and copy the relevant values.
Create an S3 replica bucket in the secondary Region
We need to create an S3 bucket to hold the model artifacts in the secondary Region. SageMaker uses this bucket to get the latest version of model to spin up an inference endpoint. You only need to do this one time. CodeBuild automatically syncs the content of the bucket in the primary Region to the replication bucket with each pipeline run.
- On the Amazon S3 console, choose Create bucket.
- For Bucket name, enter the value of `ARTIFACT_BUCKET` copied in the previous step and append `-replica` to the end (for example, `sagemaker-project-X-XXXXXXXX-replica`).
- For AWS Region, enter your secondary Region (`us-west-2`).
- Leave all other values at their default and choose Create bucket.
Approve a model for deployment
The deployment stage of the pipeline requires an approved model to start. This is required for the deployment in the primary Region.
- In Studio (primary Region), choose SageMaker resources in the navigation pane.
- For Select the resource to view, choose Model registry.
- Choose model group name starting with your project name.
- In the right pane, check the model version, stage and status.
- If the status shows pending, choose the model version and then choose Update status.
- Change status to Approved, then choose Update status.
Deploy and verify the changes
All the changes required for multi-Region deployment of your SageMaker inference endpoint are now complete and you can start the deployment process.
- In Studio, save all the files you edited, choose Git, and choose the repository containing your project name and ending with deploy.
- Choose the plus sign to make changes.
- Under Changed, add `build.py` and `buildspec.yml`.
- Under Untracked, add `endpoint-config-template-secondary-region.yml` and `secondary-region-config.json`.
. - Enter a comment in the Summary field and choose Commit.
- Push the changes to the repository by choosing Push.
Pushing these changes to the CodeCommit repository triggers a new pipeline run, because an EventBridge event monitors for pushed commits. After a few moments, you can monitor the run by navigating to the pipeline on the CodePipeline console.
Make sure to provide manual approval for deployment to production and the secondary Region.
You can verify that the secondary Region endpoint is created on the SageMaker console, by choosing Dashboard in the navigation pane and confirming the endpoint status in Recent activity.
Add API Gateway and Route 53 (Optional)
You can optionally follow the instructions in Call an Amazon SageMaker model endpoint using Amazon API Gateway and AWS Lambda to expose the SageMaker inference endpoint in the secondary Region as an API using API Gateway and Lambda.
Clean up
To delete the SageMaker project, see Delete an MLOps Project using Amazon SageMaker Studio. To ensure the secondary inference endpoint is destroyed, go to the AWS CloudFormation console and delete the related stacks in your primary and secondary Regions; this destroys the SageMaker inference endpoints.
Conclusion
In this post, we showed how an MLOps specialist can modify a preconfigured MLOps template for their own multi-Region deployment use case, such as deploying workloads in multiple geographies or as part of implementing a multi-Regional disaster recovery strategy. With this deployment approach, you don’t need to configure services in the secondary Region and can reuse the CodePipeline and CodeBuild setups in the primary Region for cross-Regional deployment. Additionally, you can save on costs by continuing the training of your models in the primary Region while utilizing SageMaker inference in multiple Regions to scale your AI/ML deployment globally.
Please let us know your feedback in the comments section.
About the Authors
Mehran Najafi, PhD, is a Senior Solutions Architect for AWS focused on AI/ML and SaaS solutions at Scale.
Steven Alyekhin is a Senior Solutions Architect for AWS focused on MLOps at Scale.
Detect fraudulent transactions using machine learning with Amazon SageMaker
Businesses can lose billions of dollars each year due to malicious users and fraudulent transactions. As more and more business operations move online, fraud and abuses in online systems are also on the rise. To combat online fraud, many businesses have been using rule-based fraud detection systems.
However, traditional fraud detection systems rely on a set of rules and filters hand-crafted by human specialists. The filters can often be brittle and the rules may not capture the full spectrum of fraudulent signals. Furthermore, while fraudulent behaviors are ever-evolving, the static nature of predefined rules and filters makes it difficult to maintain and improve traditional fraud detection systems effectively.
In this post, we show you how to build a dynamic, self-improving, and maintainable credit card fraud detection system with machine learning (ML) using Amazon SageMaker.
Alternatively, if you’re looking for a fully managed service to build customized fraud detection models without writing code, we recommend checking out Amazon Fraud Detector. Amazon Fraud Detector enables customers with no ML experience to automate building fraud detection models customized for their data, leveraging more than 20 years of fraud detection expertise from AWS and Amazon.com.
Solution overview
This solution builds the core of a credit card fraud detection system using SageMaker. We start by training an unsupervised anomaly detection model using the algorithm Random Cut Forest (RCF). Then we train two supervised classification models using the algorithm XGBoost, one as a baseline model and the other for making predictions, using different strategies to address the extreme class imbalance in data. Lastly, we train an optimal XGBoost model with hyperparameter optimization (HPO) to further improve the model performance.
For the sample dataset, we use the public, anonymized credit card transactions dataset that was originally released as part of a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles). In the walkthrough, we also discuss how you can customize the solution to use your own data.
The outputs of the solution are as follows:
- An unsupervised SageMaker RCF model. The model outputs an anomaly score for each transaction. A low score value indicates that the transaction is considered normal (non-fraudulent). A high value indicates that the transaction is fraudulent. The definitions of low and high depend on the application, but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.
- A supervised SageMaker XGBoost model trained using its built-in weighting schema to address the highly unbalanced data issue.
- A supervised SageMaker XGBoost model trained using the Synthetic Minority Over-sampling Technique (SMOTE).
- A trained SageMaker XGBoost model with HPO.
- Predictions of the probability for each transaction being fraudulent. If the estimated probability of a transaction is over a threshold, it’s classified as fraudulent.
To demonstrate how you can use this solution in your existing business infrastructures, we also include an example of making REST API calls to the deployed model endpoint, using AWS Lambda to trigger both the RCF and XGBoost models.
The following diagram illustrates the solution architecture.
Prerequisites
To try out the solution in your own account, make sure that you have the following in place:
- You need an AWS account to use this solution. If you don’t have an account, you can sign up for one.
- The solution outlined in this post is part of Amazon SageMaker JumpStart. To run this SageMaker JumpStart 1P Solution and have the infrastructure deploy to your AWS account, you need to create an active Amazon SageMaker Studio instance (see Onboard to Amazon SageMaker Domain).
When the Studio instance is ready, you can launch Studio and access JumpStart. JumpStart solutions are not available in SageMaker notebook instances, and you can’t access them through SageMaker APIs or the AWS Command Line Interface (AWS CLI).
Launch the solution
To launch the solution, complete the following steps:
- Open JumpStart by using the JumpStart launcher in the Get Started section or by choosing the JumpStart icon in the left sidebar.
- Under Solutions, choose Detect Malicious Users and Transactions to open the solution in another Studio tab.
- On the solution tab, choose Launch to launch the solution.
The solution resources are provisioned and another tab opens showing the deployment progress. When the deployment is finished, an Open Notebook button appears. - Choose Open Notebook to open the solution notebook in Studio.
Investigate and process the data
The default dataset contains only numerical features, because the original features have been transformed using Principal Component Analysis (PCA) to protect user privacy. As a result, the dataset contains 28 PCA components, V1–V28, and two features that haven’t been transformed, Amount and Time. Amount refers to the transaction amount, and Time is the seconds elapsed between any transaction in the data and the first transaction.
The Class column corresponds to whether or not a transaction is fraudulent.
We can see that the majority is non-fraudulent, because out of the total 284,807 examples, only 492 (0.173%) are fraudulent. This is a case of extreme class imbalance, which is common in fraud detection scenarios.
We then prepare our data for loading and training. We split the data into a train set and a test set, using the former to train and the latter to evaluate the performance of our model. It’s important to split the data before applying any techniques to alleviate the class imbalance. Otherwise, we might leak information from the test set into the train set and hurt the model’s performance.
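For reference, a minimal split along these lines with scikit-learn, assuming the transactions are loaded into a pandas DataFrame named `data`:

```python
from sklearn.model_selection import train_test_split

# Split before any resampling so no synthetic or duplicated rows leak into the test set.
X = data.drop(columns=["Class"])  # `data` is the loaded transactions DataFrame (assumed)
y = data["Class"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```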
If you want to bring in your own training data, make sure that it’s tabular data in CSV format, upload the data to an Amazon Simple Storage Service (Amazon S3) bucket, and edit the S3 object path in the notebook code.
If your data includes categorical columns with non-numerical values, you need to one-hot encode these values (using, for example, sklearn’s OneHotEncoder) because the XGBoost algorithm only supports numerical data.
Train an unsupervised Random Cut Forest model
In a fraud detection scenario, we commonly have very few labeled examples, and labeling fraud can take a lot of time and effort. Therefore, we also want to extract information from the unlabeled data at hand. We do this using an anomaly detection algorithm, taking advantage of the high data imbalance that is common in fraud detection datasets.
Anomaly detection is a form of unsupervised learning where we try to identify anomalous examples based solely on their feature characteristics. Random Cut Forest is a state-of-the-art anomaly detection algorithm that is both accurate and scalable. With each data example, RCF associates an anomaly score.
We use the SageMaker built-in RCF algorithm to train an anomaly detection model on our training dataset, then make predictions on our test dataset.
First, we examine and plot the predicted anomaly scores for positive (fraudulent) and negative (non-fraudulent) examples separately, because the numbers of positive and negative examples differ significantly. We expect the positive (fraudulent) examples to have relatively high anomaly scores, and the negative (non-fraudulent) ones to have low anomaly scores. From the histograms, we can see the following patterns:
- Almost half of the positive examples (left histogram) have anomaly scores higher than 0.9, whereas most of the negative examples (right histogram) have anomaly scores lower than 0.85.
- The unsupervised learning algorithm RCF has limitations to identify fraudulent and non-fraudulent examples accurately. This is because no label information is used. We address this issue by collecting label information and using a supervised learning algorithm in later steps.
Then, we assume a more real-world scenario where we classify each test example as either positive (fraudulent) or negative (non-fraudulent) based on its anomaly score. We plot the score histogram for all test examples as follows, choosing a cutoff score of 1.0 (based on the pattern shown in the histogram) for classification. Specifically, if an example’s anomaly score is less than or equal to 1.0, it’s classified as negative (non-fraudulent). Otherwise, the example is classified as positive (fraudulent).
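That thresholding step amounts to a one-liner; here `scores` is an assumed variable holding the RCF anomaly scores for the test set:

```python
import numpy as np

cutoff = 1.0
# 1 = predicted fraudulent, 0 = predicted non-fraudulent
y_pred = (np.asarray(scores) > cutoff).astype(int)
```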
Lastly, we compare the classification result with the ground truth labels and compute the evaluation metrics. Because our dataset is imbalanced, we use the evaluation metrics balanced accuracy, Cohen’s Kappa score, F1 score, and ROC AUC, because they take into account the frequency of each class in the data. For all of these metrics, a larger value indicates a better predictive performance. Note that in this step we can’t compute the ROC AUC yet, because there is no estimated probability for positive and negative classes from the RCF model on each example. We compute this metric in later steps using supervised learning algorithms.
| Metric | RCF |
| --- | --- |
| Balanced accuracy | 0.560023 |
| Cohen’s Kappa | 0.003917 |
| F1 | 0.007082 |
| ROC AUC | – |
From this step, we can see that the unsupervised model can already achieve some separation between the classes, with higher anomaly scores correlated with fraudulent examples.
Train an XGBoost model with the built-in weighting schema
After we’ve gathered an adequate amount of labeled training data, we can use a supervised learning algorithm to discover relationships between the features and the classes. We choose the XGBoost algorithm because it has a proven track record, is highly scalable, and can deal with missing data. We need to handle the data imbalance this time, otherwise the majority class (the non-fraudulent, or negative examples) will dominate the learning.
We train and deploy our first supervised model using the SageMaker built-in XGBoost algorithm container. This is our baseline model. To handle the data imbalance, we use the hyperparameter `scale_pos_weight`, which scales the weights of the positive class examples against the negative class examples. Because the dataset is highly skewed, we set this hyperparameter to a conservative value: `sqrt(num_nonfraud/num_fraud)`.
We train and deploy the model as follows (a minimal sketch using the SageMaker Python SDK appears after this list):
- Retrieve the SageMaker XGBoost container URI.
- Set the hyperparameters we want to use for the model training, including the one we mentioned that handles data imbalance, `scale_pos_weight`.
- Deploy the trained XGBoost model to a SageMaker managed endpoint.
- Evaluate this baseline model with our test dataset.
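The solution notebook contains the full code; the following is only a condensed sketch of those steps using the SageMaker Python SDK, with hyperparameter values, instance types, and the S3 input path assumed for illustration:

```python
import numpy as np
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()
region = session.boto_region_name

# 1. Retrieve the built-in XGBoost container URI.
container = sagemaker.image_uris.retrieve("xgboost", region, version="1.5-1")

# 2. Set hyperparameters, including scale_pos_weight to counter the class imbalance.
scale_pos_weight = np.sqrt(num_nonfraud / num_fraud)  # assumes precomputed class counts
estimator = Estimator(
    image_uri=container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    objective="binary:logistic",
    num_round=100,
    scale_pos_weight=scale_pos_weight,
)

# 3. Train on the CSV training data staged in S3 (the path is an assumption).
estimator.fit({"train": TrainingInput(train_s3_uri, content_type="text/csv")})

# 4. Deploy the trained model to a SageMaker managed real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```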
Then we evaluate our model with the same four metrics as mentioned in the last step. This time we can also calculate the ROC AUC metric.
| Metric | RCF | XGBoost |
| --- | --- | --- |
| Balanced accuracy | 0.560023 | 0.847685 |
| Cohen’s Kappa | 0.003917 | 0.743801 |
| F1 | 0.007082 | 0.744186 |
| ROC AUC | – | 0.983515 |
We can see that a supervised learning method, XGBoost with the weighting schema (using the hyperparameter `scale_pos_weight`), achieves significantly better performance than the unsupervised learning method RCF. There is still room to improve the performance, however. In particular, raising the Cohen’s Kappa score above 0.8 would be generally very favorable.
Apart from single-value metrics, it’s also useful to look at metrics that indicate performance per class. For example, the confusion matrix, per-class precision, recall, and F1-score can provide more information about our model’s performance.
| Class | Precision | Recall | F1-score | Support |
| --- | --- | --- | --- | --- |
| non-fraud | 1.00 | 1.00 | 1.00 | 28435 |
| fraud | 0.80 | 0.70 | 0.74 | 46 |
Keep sending test traffic to the endpoint via Lambda
To demonstrate how to use our models in a production system, we built a REST API with Amazon API Gateway and a Lambda function. Client applications send HTTP inference requests to the REST API, which triggers the Lambda function, which in turn invokes the RCF and XGBoost model endpoints and returns the predictions from the models. You can read the Lambda function code and monitor the invocations on the Lambda console.
We also created a Python script that makes HTTP inference requests to the REST API, with our test data as input data. To see how this was done, check the `generate_endpoint_traffic.py` file in the solution’s source code. The prediction outputs are logged to an S3 bucket through an Amazon Kinesis Data Firehose delivery stream. You can find the destination S3 bucket name on the Kinesis Data Firehose console, and check the prediction results in the S3 bucket.
Train an XGBoost model with the over-sampling technique SMOTE
Now that we have a baseline model using XGBoost, we can see if sampling techniques that are designed specifically for imbalanced problems can improve the performance of the model. We use the Synthetic Minority Over-sampling Technique (SMOTE), which oversamples the minority class by interpolating new data points between existing ones.
The steps are as follows:
- Use SMOTE to oversample the minority class (the fraudulent class) of our train dataset (see the sketch after this list). SMOTE oversamples the minority class from about 0.17% to 50%. Note that this is a case of extreme oversampling of the minority class. An alternative would be to use a smaller resampling ratio, such as having one minority class sample for every `sqrt(non_fraud/fraud)` majority sample, or using more advanced resampling techniques. For more over-sampling options, refer to Compare over-sampling samplers.
- Define the hyperparameters for training the second XGBoost so that `scale_pos_weight` is removed and the other hyperparameters remain the same as when training the baseline XGBoost model. We don’t need to handle data imbalance with this hyperparameter anymore, because we’ve already done that with SMOTE.
- Train the second XGBoost model with the new hyperparameters on the SMOTE processed train dataset.
- Deploy the new XGBoost model to a SageMaker managed endpoint.
- Evaluate the new model with the test dataset.
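The oversampling step can be reproduced with the imbalanced-learn library; a minimal sketch, assuming the `X_train`/`y_train` split from earlier (as pandas objects):

```python
from imblearn.over_sampling import SMOTE

# Oversample only the training data; the test set stays untouched.
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(y_train.value_counts())        # heavily imbalanced before
print(y_train_smote.value_counts())  # roughly 50/50 after
```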
When evaluating the new model, we can see that with SMOTE, XGBoost achieves a better performance on balanced accuracy, but not on Cohen’s Kappa and F1 scores. The reason for this is that SMOTE has oversampled the fraud class so much that it’s increased its overlap in feature space with the non-fraud cases. Because Cohen’s Kappa gives more weight to false positives than balanced accuracy does, the metric drops significantly, as does the precision and F1 score for fraud cases.
| Metric | RCF | XGBoost | XGBoost SMOTE |
| --- | --- | --- | --- |
| Balanced accuracy | 0.560023 | 0.847685 | 0.912657 |
| Cohen’s Kappa | 0.003917 | 0.743801 | 0.716463 |
| F1 | 0.007082 | 0.744186 | 0.716981 |
| ROC AUC | – | 0.983515 | 0.967497 |
However, we can bring back the balance between metrics by adjusting the classification threshold. So far, we’ve been using 0.5 as the threshold to label whether or not a data point is fraudulent. After experimenting with different thresholds from 0.1 to 0.9, we can see that Cohen’s Kappa keeps increasing along with the threshold, without a significant loss in balanced accuracy.
This adds a useful calibration to our model. We can use a low threshold if not missing any fraudulent cases (false negatives) is our priority, or we can increase the threshold to minimize the number of false positives.
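A small sketch of that threshold sweep with scikit-learn metrics, where `y_proba` is an assumed variable holding the model’s predicted fraud probabilities on the test set:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, cohen_kappa_score, f1_score

for threshold in np.arange(0.1, 1.0, 0.1):
    y_pred = (y_proba > threshold).astype(int)
    print(
        f"threshold={threshold:.1f} "
        f"balanced_acc={balanced_accuracy_score(y_test, y_pred):.4f} "
        f"kappa={cohen_kappa_score(y_test, y_pred):.4f} "
        f"f1={f1_score(y_test, y_pred):.4f}"
    )
```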
Train an optimal XGBoost model with HPO
In this step, we demonstrate how to improve model performance by training our third XGBoost model with hyperparameter optimization. When building complex ML systems, manually exploring all possible combinations of hyperparameter values is impractical. The HPO feature in SageMaker can accelerate your productivity by trying many variations of a model on your behalf. It automatically looks for the best model by focusing on the most promising combinations of hyperparameter values within the ranges that you specify.
The HPO process needs a validation dataset, so we first further split our training data into training and validation datasets using stratified sampling. To tackle the data imbalance problem, we use XGBoost’s weighting schema again, setting the `scale_pos_weight` hyperparameter to `sqrt(num_nonfraud/num_fraud)`.
We create an XGBoost estimator using the SageMaker built-in XGBoost algorithm container, and specify the objective evaluation metric and the hyperparameter ranges within which we’d like to experiment. With these we then create a HyperparameterTuner and kick off the HPO tuning job, which trains multiple models in parallel, looking for optimal hyperparameter combinations.
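A condensed sketch of that tuner setup with the SageMaker Python SDK; the hyperparameter ranges, job counts, and objective metric shown here are illustrative choices, not the solution’s exact configuration:

```python
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

# `estimator` is an XGBoost estimator configured as before, with scale_pos_weight set;
# `train_input` and `validation_input` are TrainingInput objects (assumed).
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.1, 0.5),
    "max_depth": IntegerParameter(3, 10),
    "min_child_weight": ContinuousParameter(1, 10),
    "subsample": ContinuousParameter(0.5, 1.0),
}

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name="validation:auc",  # a metric the built-in XGBoost emits
    objective_type="Maximize",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=10,
    max_parallel_jobs=2,
)

tuner.fit({"train": train_input, "validation": validation_input})

# Deploy the best model found by the tuning job.
best_predictor = tuner.best_estimator().deploy(
    initial_instance_count=1, instance_type="ml.m5.large"
)
```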
When the tuning job is complete, we can see its analytics report and inspect each model’s hyperparameters, training job information, and its performance against the objective evaluation metric.
Then we deploy the best model and evaluate it with our test dataset.
Evaluate and compare all model performance on the same test data
Now we have the evaluation results from all four models: RCF, XGBoost baseline, XGBoost with SMOTE, and XGBoost with HPO. Let’s compare their performance.
| Metric | RCF | XGBoost | XGBoost with SMOTE | XGBoost with HPO |
| --- | --- | --- | --- | --- |
| Balanced accuracy | 0.560023 | 0.847685 | 0.912657 | 0.902156 |
| Cohen’s Kappa | 0.003917 | 0.743801 | 0.716463 | 0.880778 |
| F1 | 0.007082 | 0.744186 | 0.716981 | 0.880952 |
| ROC AUC | – | 0.983515 | 0.967497 | 0.981564 |
We can see that XGBoost with HPO achieves even better performance than that with the SMOTE method. In particular, the Cohen’s Kappa and F1 scores are over 0.8, indicating an optimal model performance.
Clean up
When you’re finished with this solution, make sure that you delete all unwanted AWS resources to avoid incurring unintended charges. In the Delete solution section on your solution tab, choose Delete all resources to delete resources automatically created when launching this solution.
Alternatively, you can use AWS CloudFormation to delete all standard resources automatically created by the solution and notebook. To use this approach, on the AWS CloudFormation console, find the CloudFormation stack whose description contains fraud-detection-using-machine-learning, and delete it. This is a parent stack, and choosing to delete this stack will automatically delete the nested stacks.
With either approach, you still need to manually delete any extra resources that you may have created in this notebook. Some examples include extra S3 buckets (in addition to the solution’s default bucket), extra SageMaker endpoints (using a custom name), and extra Amazon Elastic Container Registry (Amazon ECR) repositories.
Conclusion
In this post, we showed you how to build the core of a dynamic, self-improving, and maintainable credit card fraud detection system using ML with SageMaker. We built, trained, and deployed an unsupervised RCF anomaly detection model, a supervised XGBoost model as the baseline, another supervised XGBoost model with SMOTE to tackle the data imbalance problem, and a final XGBoost model optimized with HPO. We discussed how to handle data imbalance and use your own data in the solution. We also included an example REST API implementation with API Gateway and Lambda to demonstrate how to use the system in your existing business infrastructure.
To try it out yourself, open SageMaker Studio and launch the JumpStart solution. To learn more about the solution, check out its GitHub repository.
About the Authors
Xiaoli Shen is a Solutions Architect and Machine Learning Technical Field Community (TFC) member at Amazon Web Services. She’s focused on helping customers architecting on the cloud and leveraging AWS services to derive business value. Prior to joining AWS, she was a tech lead and senior full-stack engineer building data-intensive distributed systems on the cloud.
Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.
Vedant Jain is a Sr. AI/ML Specialist Solutions Architect, helping customers derive value out of the Machine Learning ecosystem at AWS. Prior to joining AWS, Vedant has held ML/Data Science Specialty positions at various companies such as Databricks, Hortonworks (now Cloudera) & JP Morgan Chase. Outside of his work, Vedant is passionate about making music, using Science to lead a meaningful life & exploring delicious vegetarian cuisine from around the world.
Implement RStudio on your AWS environment and access your data lake using AWS Lake Formation permissions
R is a popular analytic programming language used by data scientists and analysts to perform data processing, conduct statistical analyses, create data visualizations, and build machine learning (ML) models. RStudio, the integrated development environment for R, provides open-source tools and enterprise-ready professional software for teams to develop and share their work across their organization. Building, securing, scaling, and maintaining RStudio yourself is, however, tedious and cumbersome.
Implementing the RStudio environment in AWS provides elasticity and scalability that you don’t have when deploying on premises, eliminating the need to manage that infrastructure. You can select the desired compute and memory based on processing requirements and can also scale up or down to work with analytical and ML workloads of different sizes without an upfront investment. This lets you quickly experiment with new data sources and code, and roll out new analytics processes and ML models to the rest of the organization. You can also seamlessly integrate your data lake resources to make them available to developers and data scientists, and secure the data by using row-level and column-level access controls from AWS Lake Formation.
This post presents two ways to easily deploy and run RStudio on AWS to access data stored in a data lake:
- Fully managed on Amazon SageMaker
- RStudio on Amazon SageMaker is a managed service option that allows you to avoid having to manage the underlying infrastructure for your RStudio environment. You can easily bring your own RStudio Workbench license using AWS License Manager.
- You can also use RStudio on Amazon SageMaker’s integration with AWS Identity and Access Management or AWS IAM Identity Center (successor to AWS Single Sign-On) to implement user-level security access controls. As we will see later in this post, you can secure your data lake by using row-level and column-level access controls from AWS Lake Formation.
- RStudio on Amazon SageMaker enables you to dynamically choose an instance with desired compute and memory from a wide array of ML instances available on SageMaker.
- Self-hosted on Amazon Elastic Compute Cloud (Amazon EC2)
- You can choose to deploy the open-source version of RStudio using an EC2 hosted approach that we also describe in this post. The self-hosted option requires the administrator to create an EC2 instance and install RStudio manually or by using an AWS CloudFormation template. There is also less flexibility for implementing user-access controls with this option, because all users have the same access level in this type of implementation.
RStudio on Amazon SageMaker
You can launch RStudio Workbench with a simple click from SageMaker. With SageMaker, customers don’t have to bear the operational overhead of building, installing, securing, scaling, and maintaining RStudio; they don’t have to pay for the continuously running RStudio Server (if they are using a t3.medium instance); and they only pay for RSession compute when they use it. RStudio users have the flexibility to dynamically scale compute by switching instances on the fly. Running RStudio on SageMaker requires an administrator to establish a SageMaker domain and associated user profiles. You also need an appropriate RStudio license.
Within SageMaker, you can grant access at the RStudio administrator and RStudio user level, with differing permissions. Only user profiles granted one of these two roles can access RStudio in SageMaker. For more information about administrator tasks for setting up RStudio on SageMaker, refer to Get started with RStudio on Amazon SageMaker. That post also shows the process of selecting EC2 instances for each session, and how the administrator can restrict EC2 instance options for RStudio users.
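As a hedged illustration of the administrator side, the following boto3 sketch grants a domain user the RStudio user role; the domain ID, profile name, and execution role ARN are placeholders, and the referenced post remains the authoritative setup guide.

```python
import boto3

sm = boto3.client("sagemaker")

# Create a user profile that can open RStudio sessions (all identifiers are placeholders).
sm.create_user_profile(
    DomainId="d-xxxxxxxxxxxx",
    UserProfileName="rstudiouser-limitedaccess",
    UserSettings={
        "ExecutionRole": "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-LimitedAccess",
        "RStudioServerProAppSettings": {
            "AccessStatus": "ENABLED",
            "UserGroup": "R_STUDIO_USER",  # use R_STUDIO_ADMIN for RStudio administrators
        },
    },
)
```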
Use Lake Formation row-level and column-level security access
In addition to allowing your team to launch RStudio sessions on SageMaker, you can also secure the data lake by using row-level and column-level access controls from Lake Formation. For more information, refer to Effective data lakes using AWS Lake Formation, Part 4: Implementing cell-level and row-level security.
Through Lake Formation security controls, you can make sure that each person has the right access to the data in the data lake. Consider the following two user profiles in the SageMaker domain, each with a different execution role:
| User Profile | Execution Role |
| --- | --- |
| `rstudiouser-fullaccess` | `AmazonSageMaker-ExecutionRole-FullAccess` |
| `rstudiouser-limitedaccess` | `AmazonSageMaker-ExecutionRole-LimitedAccess` |
The following screenshot shows the `rstudiouser-limitedaccess` profile details.
The following screenshot shows the `rstudiouser-fullaccess` profile details.
The dataset used for this post is a COVID-19 public dataset. The following screenshot shows an example of the data:
After you create the user profile and assign it to the appropriate role, you can access Lake Formation to crawl the data with AWS Glue, create the metadata and table, and grant access to the table data. For the `AmazonSageMaker-ExecutionRole-FullAccess` role, you grant access to all of the columns in the table, and for `AmazonSageMaker-ExecutionRole-LimitedAccess`, you grant access using the data filter `USA_Filter`. We use this filter to provide row-level and cell-level column permissions (see the Resource column in the following screenshot).
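For readers who prefer to script these grants rather than use the console, a hedged boto3 sketch of the filter and grant might look like the following; the account ID, database name, and table name are placeholders, not the post’s actual values.

```python
import boto3

lf = boto3.client("lakeformation")

# Create a data cells filter that restricts the table to USA rows and a subset of columns.
lf.create_data_cells_filter(
    TableData={
        "TableCatalogId": "111122223333",
        "DatabaseName": "covid_db",
        "TableName": "covid_data",
        "Name": "USA_Filter",
        "RowFilter": {"FilterExpression": "iso_code = 'USA'"},
        "ColumnNames": ["continent", "date", "total_cases", "total_deaths",
                        "new_cases", "new_deaths", "iso_code"],
    }
)

# Grant SELECT on the filtered view of the table to the limited-access execution role.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier":
               "arn:aws:iam::111122223333:role/AmazonSageMaker-ExecutionRole-LimitedAccess"},
    Resource={"DataCellsFilter": {
        "TableCatalogId": "111122223333",
        "DatabaseName": "covid_db",
        "TableName": "covid_data",
        "Name": "USA_Filter",
    }},
    Permissions=["SELECT"],
)
```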
As shown in the following screenshot, the second role has limited access. Users associated with this role can only access the `continent`, `date`, `total_cases`, `total_deaths`, `new_cases`, `new_deaths`, and `iso_code` columns.
![Fig6: AWS Lake Formation Column-level permissions for AmazonSageMaker-ExecutionRole-Limited Access role](https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2022/10/14/ML-8288-image011.png)
Fig6: AWS Lake Formation Column-level permissions for AmazonSageMaker-ExecutionRole-Limited Access role
With role permissions attached to each user profile, we can see how Lake Formation enforces the appropriate row-level and column-level permissions. You can open the RStudio Workbench from the Launch app drop-down menu in the created user list, and choose RStudio.
In the following screenshot, we launch the app as the `rstudiouser-limitedaccess` user.
You can see the RStudio Workbench home page and a list of sessions, projects, and published content.
Choose a session name to start the session in SageMaker. Install Paws (see guidance later in this post) so that you can access the appropriate AWS services. Now you can run a query to pull all of the fields from the dataset via Amazon Athena, using the command `SELECT * FROM "databasename.tablename"`, and store the query output in an Amazon Simple Storage Service (Amazon S3) bucket.
The following screenshot shows the output files in the S3 bucket.
The following screenshot shows the data in these output files using Amazon S3 Select.
Only USA data and the `continent`, `date`, `total_cases`, `total_deaths`, `new_cases`, `new_deaths`, and `iso_code` columns are shown in the result for the `rstudiouser-limitedaccess` user.
Let’s repeat the same steps for the `rstudiouser-fullaccess` user.
You can see the RStudio Workbench home page and a list of sessions, projects, and published content.
Let’s run the same query `SELECT * FROM "databasename.tablename"` using Athena.
The following screenshot shows the output files in the S3 bucket.
The following screenshot shows the data in these output files using Amazon S3 Select.
As shown in this example, the `rstudiouser-fullaccess` user has access to all the columns and rows in the dataset.
Self-hosted on Amazon EC2
If you want to start experimenting with RStudio’s open-source version on AWS, you can install RStudio on an EC2 instance. The CloudFormation template provided in this post provisions the EC2 instance and installs RStudio using the user data script. You can run the template multiple times to provision multiple RStudio instances as needed, and you can use it in any AWS Region. After you deploy the CloudFormation template, it provides you with a URL to access RStudio from a web browser. Amazon EC2 enables you to scale up or down to handle changes in data size and the compute capacity needed to run your analytics.
Create a key pair for secure access
AWS uses public-key cryptography to secure the login information for your EC2 instance. You specify the name of the key pair in the `KeyPair` parameter when you launch the CloudFormation template. Then you can use the same key to log in to the provisioned EC2 instance later if needed.
Before you run the CloudFormation template, make sure that you have the Amazon EC2 key pair in the AWS account that you’re planning to use. If not, then refer to Create a key pair using Amazon EC2 for instructions to create one.
Launch the CloudFormation template
Sign in to the CloudFormation console in the `us-east-1` Region and choose Launch Stack.
You must enter several parameters into the CloudFormation template:
- InitialUser and InitialPassword – The user name and password that you use to log in to the RStudio session. The default values are `rstudio` and `Rstudio@123`, respectively.
- InstanceType – The EC2 instance type on which to deploy the RStudio server. The template currently accepts all instances in the t2, m4, c4, r4, g2, p2, and g3 instance families, and can incorporate other instance families easily. The default value is t2.micro.
- KeyPair – The key pair you use to log in to the EC2 instance.
- VpcId and SubnetId – The Amazon Virtual Private Cloud (Amazon VPC) and subnet in which to launch the instance.
After you enter these parameters, deploy the CloudFormation template. When it’s complete, the following resources are available:
- An EC2 instance with RStudio installed on it.
- An IAM role with necessary permissions to connect to other AWS services.
- A security group with rules to open up port 8787 for the RStudio Server.
Log in to RStudio
Now you’re ready to use RStudio! Go to the Outputs tab for the CloudFormation stack and copy the RStudio URL value (it’s in the format `http://ec2-XX-XX-XXX-XX.compute-1.amazonaws.com:8787/`). Enter that URL in a web browser. This opens your RStudio session, which you can log in to using the same user name and password that you provided while running the CloudFormation template.
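If you prefer to fetch the URL programmatically instead of from the console, a small boto3 sketch like the following would work; the stack name and the output key (`RStudioURL`) are assumptions about how the template names its output.

```python
import boto3

# Read the RStudio URL from the CloudFormation stack outputs (stack name and key are placeholders).
cf = boto3.client("cloudformation", region_name="us-east-1")
outputs = cf.describe_stacks(StackName="rstudio-ec2-stack")["Stacks"][0]["Outputs"]
rstudio_url = next(o["OutputValue"] for o in outputs if o["OutputKey"] == "RStudioURL")
print(rstudio_url)
```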
Access AWS services from RStudio
After you access the RStudio session, you should install the R package for AWS (Paws). This lets you connect to many AWS services, including the services and resources in your data lake. To install Paws, run `install.packages("paws")` in the R console.
To use an AWS service, create a client and access the service’s operations from that client. When accessing AWS APIs, you must provide your credentials and Region. Paws searches for the credentials and Region using the AWS authentication chain:
- Explicitly provided access key, secret key, session token, profile, or Region
- R environment variables
- Operating system environment variables
- AWS shared credentials and configuration files in `.aws/credentials` and `.aws/config`
- Container IAM role
- Instance IAM role
Because you’re running on an EC2 instance with an attached IAM role, Paws automatically uses your IAM role credentials to authenticate AWS API requests.
For production environments, we recommend using the scalable RStudio solution outlined in this blog post.
Conclusion
You learned how to deploy your RStudio environment in AWS. We demonstrated the advantages of using RStudio on Amazon SageMaker and how you can get started. You also learned how to quickly begin experimenting with the open-source version of RStudio using a self-hosted installation using Amazon EC2. We also demonstrated how to integrate RStudio into your data lake architectures and implement fine-grained access control on a data lake table using the row-level and cell-level security feature of Lake Formation.
In our next post, we will demonstrate how to containerize R scripts and run them using AWS Lambda.
About the authors
Venkata Kampana is a Senior Solutions Architect in the AWS Health and Human Services team and is based in Sacramento, CA. In that role, he helps public sector customers achieve their mission objectives with well-architected solutions on AWS.
Dr. Dawn Heisey-Grove is the public health analytics leader for Amazon Web Services’ state and local government team. In this role, she’s responsible for helping state and local public health agencies think creatively about how to solve their analytics challenges and achieve their long-term goals. She’s spent her career finding new ways to use existing or new data to support public health surveillance and research.
Design patterns for serial inference on Amazon SageMaker
As machine learning (ML) goes mainstream and gains wider adoption, ML-powered applications are becoming increasingly common to solve a range of complex business problems. The solution to these complex business problems often requires using multiple ML models. These models can be sequentially combined to perform various tasks, such as preprocessing, data transformation, model selection, inference generation, inference consolidation, and post-processing. Organizations need flexible options to orchestrate these complex ML workflows. Serial inference pipelines are one such design pattern to arrange these workflows into a series of steps, with each step enriching or further processing the output generated by the previous steps and passing the output to the next step in the pipeline.
Additionally, these serial inference pipelines should provide the following:
- Flexible and customized implementation (dependencies, algorithms, business logic, and so on)
- Repeatable and consistent for production implementation
- Reduced undifferentiated heavy lifting through minimized infrastructure management
In this post, we look at some common use cases for serial inference pipelines and walk through some implementation options for each of these use cases using Amazon SageMaker. We also discuss considerations for each of these implementation options.
The following table summarizes the different serial inference use cases, their primary implementation considerations, and the recommended implementation options, each of which is discussed in this post.
| Use Case | Use Case Description | Primary Considerations | Overall Implementation Complexity | Recommended Implementation Options | Sample Code Artifacts and Notebooks |
| --- | --- | --- | --- | --- | --- |
| Serial inference pipeline (with preprocessing and postprocessing steps included) | Inference pipeline needs to preprocess incoming data before invoking a trained model for generating inferences, and then postprocess generated inferences, so that they can be easily consumed by downstream applications | Ease of implementation | Low | Inference container using the SageMaker Inference Toolkit | Deploy a Trained PyTorch Model |
| Serial inference pipeline (with preprocessing and postprocessing steps included) | Inference pipeline needs to preprocess incoming data before invoking a trained model for generating inferences, and then postprocess generated inferences, so that they can be easily consumed by downstream applications | Decoupling, simplified deployment, and upgrades | Medium | SageMaker inference pipeline | Inference Pipeline with Custom Containers and xgBoost |
| Serial model ensemble | Inference pipeline needs to host and arrange multiple models sequentially, so that each model enhances the inference generated by the previous one, before generating the final inference | Decoupling, simplified deployment and upgrades, flexibility in model framework selection | Medium | SageMaker inference pipeline | Inference Pipeline with Scikit-learn and Linear Learner |
| Serial inference pipeline (with targeted model invocation from a group) | Inference pipeline needs to invoke a specific customized model from a group of deployed models, based on request characteristics or for cost-optimization, in addition to preprocessing and postprocessing tasks | Cost-optimization and customization | High | SageMaker inference pipeline with multi-model endpoints (MMEs) | Amazon SageMaker Multi-Model Endpoints using Linear Learner |
In the following sections, we discuss each use case in more detail.
Serial inference pipeline using inference containers
Serial inference pipeline use cases have requirements to preprocess incoming data before invoking a pre-trained ML model for generating inferences. Additionally, in some cases, the generated inferences may need to be processed further, so that they can be easily consumed by downstream applications. This is a common scenario for use cases where a streaming data source needs to be processed in real time before a model can be fitted on it. However, this use case can manifest for batch inference as well.
SageMaker provides an option to customize inference containers and use them to build a serial inference pipeline. Inference containers use the SageMaker Inference Toolkit and are built on SageMaker Multi Model Server (MMS), which provides a flexible mechanism to serve ML models. The following diagram illustrates a reference pattern of how to implement a serial inference pipeline using inference containers.
SageMaker MMS expects a Python script that implements the following functions to load the model, preprocess input data, get predictions from the model, and postprocess the output data:
- input_fn() – Responsible for deserializing and preprocessing the input data
- model_fn() – Responsible for loading the trained model from artifacts in Amazon Simple Storage Service (Amazon S3)
- predict_fn() – Responsible for generating inferences from the model
- output_fn() – Responsible for serializing and postprocessing the output data (inferences)
For detailed steps to customize an inference container, refer to Adapting Your Own Inference Container.
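As an illustration, a minimal handler script following this contract might look like the sketch below. This is not the post’s actual code; the model format (joblib) and the JSON payload shape are assumptions.

```python
# inference.py -- minimal sketch of a handler script for the SageMaker Inference Toolkit.
import json
import os

import joblib
import numpy as np


def model_fn(model_dir):
    # Load the trained model artifact that SageMaker extracted from Amazon S3 into model_dir.
    return joblib.load(os.path.join(model_dir, "model.joblib"))


def input_fn(request_body, request_content_type):
    # Deserialize and preprocess the incoming request (JSON payload assumed).
    if request_content_type != "application/json":
        raise ValueError(f"Unsupported content type: {request_content_type}")
    payload = json.loads(request_body)
    return np.array(payload["instances"], dtype=float)


def predict_fn(input_data, model):
    # Generate inferences from the loaded model.
    return model.predict(input_data)


def output_fn(prediction, response_content_type):
    # Serialize and postprocess the inferences for downstream applications.
    return json.dumps({"predictions": prediction.tolist()})
```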
Inference containers are an ideal design pattern for serial inference pipeline use cases with the following primary considerations:
- High cohesion – The processing logic and corresponding model drive single business functionality and need to be co-located
- Low overall latency – The elapsed time between when an inference request is made and response is received
In a serial inference pipeline, the processing logic and model are encapsulated within the same container, so most of the invocation calls remain within the container. This helps reduce the overall number of hops, resulting in better overall latency and responsiveness of the pipeline.
Also, for use cases where ease of implementation is an important criterion, inference containers can help because the various processing steps of the pipeline are co-located within the same container.
Serial inference pipeline using a SageMaker inference pipeline
Another variation of the serial inference pipeline use case requires clearer decoupling between the various steps in the pipeline (such as data preprocessing, inference generation, data postprocessing, and formatting and serialization). This could be due to a variety of reasons:
- Decoupling – Various steps of the pipeline have a clearly defined purpose and need to be run on separate containers due to the underlying dependencies involved. This also helps keep the pipeline well structured.
- Frameworks – Various steps of the pipeline use specific fit-for-purpose frameworks (such as scikit or Spark ML) and therefore need to be run on separate containers.
- Resource Isolation – Various steps of the pipeline have varying resource consumption requirements and therefore need to be run on separate containers for more flexibility and control.
Furthermore, for slightly more complex serial inference pipelines, multiple steps may be involved to process a request and generate an inference. Therefore, from an operational standpoint, it may be beneficial to host these steps on separate containers for better functional isolation, and facilitate easier upgrades and enhancements (change one step without impacting other models or processing steps).
If your use case aligns with some of these considerations, a SageMaker inference pipeline provides an easy and flexible option to build a serial inference pipeline. The following diagram illustrates a reference pattern of how to implement a serial inference pipeline using multiple steps hosted on dedicated containers using a SageMaker inference pipeline.
A SageMaker inference pipeline consists of a linear sequence of 2–15 containers that process requests for inferences on data. The inference pipeline provides the option to use pre-trained SageMaker built-in algorithms or custom algorithms packaged in Docker containers. The containers are hosted on the same underlying instance, which helps reduce the overall latency and minimize cost.
The following code snippet shows how multiple processing steps and models can be combined to create a serial inference pipeline.
We start by building and specifying Spark ML and XGBoost-based models that we intend to use as part of the pipeline:
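The post’s original code listing isn’t reproduced here; the following SageMaker Python SDK sketch shows what this step could look like. The S3 artifact paths and the XGBoost version are placeholders.

```python
import sagemaker
from sagemaker.model import Model
from sagemaker.sparkml.model import SparkMLModel

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Spark ML preprocessing model serialized with MLeap and stored in Amazon S3 (placeholder path).
sparkml_model = SparkMLModel(
    model_data="s3://<bucket>/sparkml/mleap_model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Previously trained XGBoost model; the container image is resolved from the built-in registry.
xgb_image = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.3-1")
xgb_model = Model(
    image_uri=xgb_image,
    model_data="s3://<bucket>/xgboost/model.tar.gz",
    role=role,
    sagemaker_session=session,
)
```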
The models are then arranged sequentially within the pipeline model definition:
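Continuing the sketch, the two models are chained in the order they should run (the pipeline name is a placeholder):

```python
from sagemaker.pipeline import PipelineModel

# The containers run in the order listed: Spark ML preprocessing first, then XGBoost inference.
pipeline_model = PipelineModel(
    name="serial-inference-pipeline",
    role=role,
    models=[sparkml_model, xgb_model],
    sagemaker_session=session,
)
```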
The inference pipeline is then deployed behind an endpoint for real-time inference by specifying the type and number of host ML instances:
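The deployment step might then look like this; the instance type, count, and endpoint name are illustrative choices, not recommendations from the post.

```python
# Deploy the assembled pipeline behind a single real-time endpoint.
predictor = pipeline_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.xlarge",
    endpoint_name="serial-inference-endpoint",
)
```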
The entire assembled inference pipeline can be considered a SageMaker model that you can use to make either real-time predictions or process batch transforms directly, without any external preprocessing. Within an inference pipeline model, SageMaker handles invocations as a sequence of HTTP requests originating from an external application. The first container in the pipeline handles the initial request, performs some processing, and then dispatches the intermediate response as a request to the second container in the pipeline. This happens for each container in the pipeline, and finally returns the final response to the calling client application.
SageMaker inference pipelines are fully managed. When the pipeline is deployed, SageMaker installs and runs all the defined containers on each of the Amazon Elastic Compute Cloud (Amazon EC2) instances provisioned as part of the endpoint or batch transform job. Furthermore, because the containers are co-located and hosted on the same EC2 instance, the overall pipeline latency is reduced.
Serial model ensemble using a SageMaker inference pipeline
An ensemble model is an approach in ML where multiple ML models are combined and used as part of the inference process to generate final inferences. The motivations for ensemble models could include improving accuracy, reducing model sensitivity to specific input features, and reducing single model bias, among others. In this post, we focus on the use cases related to a serial model ensemble, where multiple ML models are sequentially combined as part of a serial inference pipeline.
Let’s consider a specific example related to a serial model ensemble where we need to group a user’s uploaded images based on certain themes or topics. This pipeline could consist of three ML models:
- Model 1 – Accepts an image as input and evaluates image quality based on image resolution, orientation, and more. This model then attempts to upscale the image quality and sends the processed images that meet a certain quality threshold to the next model (Model 2).
- Model 2 – Accepts images validated through Model 1 and performs image recognition to identify objects, places, people, text, and other custom actions and concepts in images. The output from Model 2 that contains identified objects is sent to Model 3.
- Model 3 – Accepts the output from Model 2 and performs natural language processing (NLP) tasks such as topic modeling for grouping images together based on themes. For example, images could be grouped based on location or people identified. The output (groupings) is sent back to the client application.
The following diagram illustrates a reference pattern of how to implement multiple ML models hosted on a serial model ensemble using a SageMaker inference pipeline.
As discussed earlier, the SageMaker inference pipeline is managed, which enables you to focus on the ML model selection and development, while reducing the undifferentiated heavy lifting associated with building the serial ensemble pipeline.
Additionally, some of the considerations discussed earlier around decoupling, algorithm and framework choice for model development, and deployment are relevant here as well. For instance, because each model is hosted on a separate container, you have flexibility in selecting the ML framework that best fits each model and your overall use case. Furthermore, from a decoupling and operational standpoint, you can continue to upgrade or modify individual steps much more easily, without affecting other models.
The SageMaker inference pipeline is also integrated with the SageMaker model registry for model cataloging, versioning, metadata management, and governed deployment to production environments to support consistent operational best practices. The SageMaker inference pipeline is also integrated with Amazon CloudWatch to enable monitoring the multi-container models in inference pipelines. You can also get visibility into real-time metrics to better understand invocations and latency for each container in the pipeline, which helps with troubleshooting and resource optimization.
Serial inference pipeline (with targeted model invocation from a group) using a SageMaker inference pipeline
SageMaker multi-model endpoints (MMEs) provide a cost-effective solution to deploy a large number of ML models behind a single endpoint. The motivations for using multi-model endpoints could include invocating a specific customized model based on request characteristics (such as origin, geographic location, user personalization, and so on) or simply hosting multiple models behind the same endpoint to achieve cost-optimization.
When you deploy multiple models on a single multi-model enabled endpoint, all models share the compute resources and the model serving container. The SageMaker inference pipeline can be deployed on an MME, where one of the containers in the pipeline can dynamically serve requests based on the specific model being invoked. From a pipeline perspective, the models have identical preprocessing requirements and expect the same feature set, but are trained to align to a specific behavior. The following diagram illustrates a reference pattern of how this integrated pipeline would work.
With MMEs, the inference request that originates from the client application should specify the target model that needs to be invoked. The first container in the pipeline handles the initial request, performs some processing, and then dispatches the intermediate response as a request to the second container in the pipeline, which hosts multiple models. Based on the target model specified in the inference request, the model is invoked to generate an inference. The generated inference is sent to the next container in the pipeline for further processing. This happens for each subsequent container in the pipeline, and finally SageMaker returns the final response to the calling client application.
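For reference, a hedged sketch of such a client-side invocation with boto3 is shown below; the endpoint name, payload, and model artifact name are placeholders.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Invoke a multi-model endpoint, naming the specific model artifact that should serve this request.
response = runtime.invoke_endpoint(
    EndpointName="serial-pipeline-mme-endpoint",
    ContentType="text/csv",
    TargetModel="customer-segment-42.tar.gz",  # relative path of the model artifact in Amazon S3
    Body=b"0.5,1.2,3.4",
)
print(response["Body"].read().decode("utf-8"))
```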
Multiple model artifacts are persisted in an S3 bucket. When a specific model is invoked, SageMaker dynamically loads it onto the container hosting the endpoint. If the model is already loaded in the container’s memory, invocation is faster because SageMaker doesn’t need to download the model from Amazon S3. If instance memory utilization is high and a new model is invoked and therefore needs to be loaded, unused models are unloaded from memory. The unloaded models remain in the instance’s storage volume, however, and can be loaded into the container’s memory later again, without being downloaded from the S3 bucket again.
One of the key considerations while using MMEs is to understand model invocation latency behavior. As discussed earlier, models are dynamically loaded into the container’s memory of the instance hosting the endpoint when invoked. Therefore, the model invocation may take longer when it’s invoked for the first time. When the model is already in the instance container’s memory, the subsequent invocations are faster. If an instance memory utilization is high and a new model needs to be loaded, unused models are unloaded. If the instance’s storage volume is full, unused models are deleted from the storage volume. SageMaker fully manages the loading and unloading of the models, without you having to take any specific actions. However, it’s important to understand this behavior because it has implications on the model invocation latency and therefore overall end-to-end latency.
Pipeline hosting options
SageMaker provides multiple instance type options to choose from for deploying ML models and building out inference pipelines, based on your use case, throughput, and cost requirements. For example, you can choose CPU or GPU optimized instances to build serial inference pipelines, on a single container or across multiple containers. However, some workloads require the flexibility to run models on both CPU-based and GPU-based instances within the same pipeline.
You can now use NVIDIA Triton Inference Server to serve models for inference on SageMaker for heterogeneous compute requirements. Check out Deploy fast and scalable AI with NVIDIA Triton Inference Server in Amazon SageMaker for additional details.
Conclusion
As organizations discover and build new solutions powered by ML, the tools required for orchestrating these pipelines should be flexible enough to support the needs of a given use case, while simplifying and reducing ongoing operational overhead. SageMaker provides multiple options to design and build these serial inference workflows, based on your requirements.
We look forward to hearing from you about what use cases you’re building using serial inference pipelines. If you have questions or feedback, please share them in the comments.
About the authors
Rahul Sharma is a Senior Solutions Architect at AWS Data Lab, helping AWS customers design and build AI/ML solutions. Prior to joining AWS, Rahul has spent several years in the finance and insurance sector, helping customers build data and analytical platforms.
Anand Prakash is a Senior Solutions Architect at AWS Data Lab. Anand focuses on helping customers design and build AI/ML, data analytics, and database solutions to accelerate their path to production.
Dhawal Patel is a Principal Machine Learning Architect at AWS. He has worked with organizations ranging from large enterprises to mid-sized startups on problems related to distributed computing and artificial intelligence. He focuses on deep learning, including the NLP and computer vision domains. He helps customers achieve high-performance model inference on SageMaker.
Saurabh Trikande is a Senior Product Manager for Amazon SageMaker Inference. He is passionate about working with customers and making machine learning more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch and spending time with his family.
Train a time series forecasting model faster with Amazon SageMaker Canvas Quick build
Today, Amazon SageMaker Canvas introduces the ability to use the Quick build feature with time series forecasting use cases. This allows you to train models and generate the associated explainability scores in under 20 minutes, at which point you can generate predictions on new, unseen data. Quick build training enables faster experimentation to understand how well the model fits to the data and what columns are driving the prediction, and allows business analysts to run experiments with varied datasets so they can select the best-performing model.
Canvas expands access to machine learning (ML) by providing business analysts with a visual point-and-click interface that allows you to generate accurate ML predictions on your own—without requiring any ML experience or having to write a single line of code.
In this post, we showcase how to train a time series forecasting model faster with Quick build training in Canvas.
Solution overview
Until today, training a time series forecasting model took up to 4 hours via the standard build method. Although that approach has the benefit of prioritizing accuracy over training time, it frequently led to long training times, which in turn prevented the fast experimentation that business analysts across all sorts of organizations usually seek. Starting today, Canvas lets you use the Quick build feature to train a time series forecasting model, adding to the use cases for which it was already available (binary and multi-class classification and numerical regression). Now you can train a model and get explainability information in under 20 minutes, with everything in place to start generating inferences.
To use the Quick build feature for time series forecasting ML use cases, all you need to do is upload your dataset to Canvas, configure the training parameters (such as target column), and then choose Quick build instead of Standard build (which was the only available option for this type of ML use case before today). Note that quick build is only available for datasets with fewer than 50,000 rows.
Let’s walk through a scenario of applying the Quick build feature to a real-world ML use case involving time series data and getting actionable results.
Create a Quick build in Canvas
Anyone who has worked with ML, even without relevant experience or expertise, knows that the end result is only as good as the training dataset. No matter how good a fit the algorithm you use to train the model is, the end result reflects the quality of inference on unseen data, and it won’t be satisfactory if the training data isn’t indicative of the given use case, is biased, or has frequent missing values.
For the purposes of this post, we use a sample synthetic dataset that contains demand and pricing information for various items over a given time period, specified with a timestamp (a date field in the CSV file). The dataset is available on GitHub. The following screenshot shows the first ten rows.
Solving a business problem using no-code ML with Canvas is a four-step process: import the dataset, build the ML model, check its performance, and then use the model to generate predictions (also known as inference in ML terminology). If you’re new to Canvas, a prompt walking you through the process appears. Feel free to spend a couple of minutes with the in-app tutorial if you want, otherwise you can choose Skip for now. There’s also a dedicated Getting Started guide you can follow to immerse yourself fully in the service if you want a more detailed introduction.
We start by uploading the dataset. Complete the following steps:
- On the Datasets page, choose Import Data.
- Upload data from your local disk or other sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, or Snowflake, to load the sample dataset. The `product_demand.csv` file now shows in the list of datasets.
- Open `product_demand.csv` and choose Create a model to start the model creation process.
You’re redirected to the Build tab of the Canvas app to start the next step of the Canvas workflow.
- First, we select the target variable, the value that we’re trying to predict as a function of the other variables available in the dataset. In our case, that’s the `demand` variable.
Canvas automatically infers that this is a time series forecasting problem.
For Canvas to solve the time series forecasting use case, we need to set up a couple of configuration options.
- Specify which column uniquely identifies the items in the dataset, where the timestamps are stored, and the horizon of predictions (how many months into the future we want to look at).
- Additionally, we can provide a holiday schedule, which can be helpful in some use cases that benefit from having this information, such as retail or supply chain use cases.
- Choose Save.
Choosing the right prediction horizon is of paramount importance for a good time series forecasting use case. The greater the value, the further into the future the prediction extends; however, it’s less likely to be accurate due to the probabilistic nature of the generated forecast. A higher value also means a longer time to train, as well as more resources needed for both training and inference. Finally, it’s best practice to have historical data points covering at least 3–5 times the forecast horizon. If you want to predict 6 months into the future (as in our example), you should have at least 18 months’ worth of historical data, up to 30 months.
- After you save these configurations, choose Quick build.
Canvas launches an in-memory AutoML process that trains multiple time series forecasting models with different hyperparameters. In less than 20 minutes (depending on the dataset), Canvas will output the best model performance in the form of five metrics.
Let’s dive deep into the advanced metrics for time series forecasts in Canvas, and how we can make sense of them:
- Average weighted quantile loss (wQL) – Evaluates the forecast by averaging the accuracy at the P10, P50, and P90 quantiles. A lower value indicates a more accurate model.
- Weighted absolute percent error (WAPE) – The sum of the absolute error normalized by the sum of the absolute target, which measures the overall deviation of forecasted values from observed values. A lower value indicates a more accurate model, where WAPE = 0 is a model with no errors.
- Root mean square error (RMSE) – The square root of the average squared errors. A lower RMSE indicates a more accurate model, where RMSE = 0 is a model with no errors.
- Mean absolute percent error (MAPE) – The percentage error (percent difference of the mean forecasted value versus the actual value) averaged over all time points. A lower value indicates a more accurate model, where MAPE = 0 is a model with no errors.
- Mean absolute scaled error (MASE) – The mean absolute error of the forecast normalized by the mean absolute error of a simple baseline forecasting method. A lower value indicates a more accurate model, where MASE < 1 is estimated to be better than the baseline and MASE > 1 is estimated to be worse than the baseline.
For more information about advanced metrics, refer to Use advanced metrics in your analyses.
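If it helps to see the arithmetic behind these definitions, the following small NumPy sketch computes WAPE, RMSE, MAPE, and MASE for a made-up forecast. Canvas computes these metrics for you; this is purely illustrative, and the values below are invented.

```python
import numpy as np

# Made-up actuals and forecasts for a single item, one value per time step.
actual = np.array([100.0, 120.0, 90.0, 110.0, 130.0])
forecast = np.array([95.0, 125.0, 100.0, 105.0, 128.0])

wape = np.sum(np.abs(actual - forecast)) / np.sum(np.abs(actual))
rmse = np.sqrt(np.mean((actual - forecast) ** 2))
mape = np.mean(np.abs((actual - forecast) / actual))
# MASE normalizes by the in-sample error of a naive "repeat the previous value" baseline.
mase = np.mean(np.abs(actual - forecast)) / np.mean(np.abs(np.diff(actual)))

print(f"WAPE={wape:.3f} RMSE={rmse:.3f} MAPE={mape:.3f} MASE={mase:.3f}")
```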
Built-in explainability is part of the value proposition of Canvas, because it provides information about column impact on the Analyze tab. In this use case, we can see that price has a great impact on the value of demand. This makes sense because a very low price would increase demand by a large margin.
Predictions and what-if scenarios
After we’ve analyzed the performance of our model, we can use it to generate predictions and test what-if scenarios.
- On the Predict tab, choose Single item.
- Choose an item (for this example, `item_002`).
The following screenshot shows the forecast for `item_002`.
We can expect an increase in demand in the coming months. Canvas also provides a probabilistic threshold around the expected forecast, so we can decide whether to take the upper bound of the prediction (with the risk of over-allocation) or the lower bound (risking under-allocation). Use these values with caution, and apply your domain knowledge to determine the best prediction for your business.
Canvas also supports what-if scenarios, which make it possible to see how changing values in the dataset affects the overall forecast for a single item, directly on the forecast plot. For the purposes of this post, we simulate a 2-month campaign in which we introduce a 50% discount, cutting the price from $120 to $60.
- Choose What if scenario.
- Choose the values you want to change (for this example, November and December).
- Choose Generate prediction.
We can see that the changed price introduces a spike in demand for the product during the months impacted by the discount campaign, after which demand slowly returns to the values expected from the previous forecast.
As a final test, we can determine the impact of definitively changing the price of a product.
- Choose Try new what-if scenario.
- Select Bulk edit all values.
- For New Value, enter 70.
- Choose Generate prediction.
This is a lower price than the initial $100–$120 range, so we expect a sharp increase in product demand. This is confirmed by the forecast, as shown in the following screenshot.
Clean up
To avoid incurring future session charges, log out of SageMaker Canvas.
Conclusion
In this post, we walked you through the Quick build feature for time series forecasting models and the updated metrics analysis view. Both are available as of today in all Regions where Canvas is available. For more information, refer to Build a model and Use advanced metrics in your analyses.
To learn more about Canvas, refer to these links:
- Enable intelligent decision-making with Amazon SageMaker Canvas and Amazon QuickSight
- Provision and manage ML environments with Amazon SageMaker Canvas using AWS CDK and AWS Service Catalog
- Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas
To learn more about other use cases that you can solve with Canvas, check out the following posts:
- Predict customer churn with no-code machine learning using Amazon SageMaker Canvas
- Reinventing retail with no-code machine learning: Sales forecasting using Amazon SageMaker Canvas
- Predict types of machine failures with no-code machine learning using Amazon SageMaker Canvas
- Predict shipment ETA with no-code machine learning using Amazon SageMaker Canvas
Start experimenting with Canvas today, and build your time series forecasting models in under 20 minutes, using the 2-month Free Tier that Canvas offers.
About the Authors
Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.
Nikiforos Botis is a Solutions Architect at AWS, looking after the public sector of Greece and Cyprus, and is a member of the AWS AI/ML technical community. He enjoys working with customers on architecting their applications in a resilient, scalable, secure, and cost-optimized way.
Use Amazon SageMaker Canvas for exploratory data analysis
Exploratory data analysis (EDA) is a common task performed by business analysts to discover patterns, understand relationships, validate assumptions, and identify anomalies in their data. In machine learning (ML), it’s important to first understand the data and its relationships before getting into model building. Traditional ML development cycles can sometimes take months and require advanced data science and ML engineering skills, whereas no-code ML solutions can help companies accelerate the delivery of ML solutions to days or even hours.
Amazon SageMaker Canvas is a no-code ML tool that helps business analysts generate accurate ML predictions without having to write code or have any ML experience. Canvas provides an easy-to-use visual interface to load, cleanse, and transform datasets, followed by building ML models and generating accurate predictions.
In this post, we walk through how to perform EDA to gain a better understanding of your data before building your ML model, thanks to Canvas’ built-in advanced visualizations. These visualizations help you analyze the relationships between features in your datasets and comprehend your data better. This is done intuitively, with the ability to interact with the data and discover insights that may go unnoticed with ad hoc querying. They can be created quickly through the ‘Data visualizer’ within Canvas prior to building and training ML models.
Solution overview
These visualizations add to the range of capabilities for data preparation and exploration already offered by Canvas, including the ability to correct missing values and replace outliers; filter, join, and modify datasets; and extract specific time values from timestamps. To learn more about how Canvas can help you cleanse, transform, and prepare your dataset, check out Prepare data with advanced transformations.
For our use case, we look at why customers churn in any business and illustrate how EDA can help from an analyst’s viewpoint. The dataset we use in this post is a synthetic dataset from a telecommunications mobile phone carrier for customer churn prediction that you can download (churn.csv); alternatively, you can bring your own dataset to experiment with. For instructions on importing your own dataset, refer to Importing data in Amazon SageMaker Canvas.
Prerequisites
Follow the instructions in Prerequisites for setting up Amazon SageMaker Canvas before you proceed further.
Import your dataset to Canvas
To import the sample dataset to Canvas, complete the following steps:
- Log in to Canvas as a business user. First, we upload the dataset mentioned previously from our local computer to Canvas. If you want to use other sources, such as Amazon Redshift, refer to Connect to an external data source.
- Choose Import.
- Choose Upload, then choose Select files from your computer.
- Select your dataset (churn.csv) and choose Import data.
- Select the dataset and choose Create model.
- For Model name, enter a name (for this post, we have given the name Churn prediction).
- Choose Create.
As soon as you select your dataset, you’re presented with an overview that outlines the data types, missing values, mismatched values, unique values, and the mean or mode values of the respective columns.
From an EDA perspective, you can observe there are no missing or mismatched values in the dataset. As a business analyst, you may want to get an initial insight into the model build even before starting the data exploration, to identify how the model will perform and what factors are contributing to the model’s performance. Canvas gives you the ability to get insights from your data before you build a model by first previewing the model.
- Before you do any data exploration, choose Preview model.
- Select the column to predict (churn). Canvas automatically detects that this is a two-category prediction.
- Choose Preview model. SageMaker Canvas uses a subset of your data to build a model quickly to check if your data is ready to generate an accurate prediction. Using this sample model, you can understand the current model accuracy and the relative impact of each column on predictions.
The following screenshot shows our preview.
The model preview indicates that the model predicts the correct target (churn?) 95.6% of the time. You can also see the initial column impact (influence each column has on the target column). Let’s do some data exploration, visualization, and transformation, and then proceed to build a model.
Data exploration
Canvas already provides some common basic visualizations, such as data distribution in a grid view on the Build tab. These are great for getting a high-level overview of the data, understanding how the data is distributed, and getting a summary overview of the dataset.
As a business analyst, you may need to get high-level insights on how the data is distributed as well as how the distribution reflects against the target column (churn) to easily understand the data relationship before building the model. You can now choose Grid view to get an overview of the data distribution.
The following screenshot shows the overview of the distribution of the dataset.
We can make the following observations:
- Phone takes on too many unique values to be of any practical use. We know phone is a customer ID and don’t want to build a model that might consider specific customers, but rather learn in a more general sense what could lead to churn. You can remove this variable.
- Most of the numeric features are nicely distributed, following a Gaussian bell curve. In ML, you want the data to be distributed normally because variables that exhibit a normal distribution can be forecasted with higher accuracy.
Let’s go deeper and check out the advanced visualizations available in Canvas.
Data visualization
As business analysts, you want to see if there are relationships between data elements, and how they’re related to churn. With Canvas, you can explore and visualize your data, which helps you gain advanced insights into your data before building your ML models. You can visualize using scatter plots, bar charts, and box plots, which can help you understand your data and discover the relationships between features that could affect the model accuracy.
To start creating your visualizations, complete the following steps:
- On the Build tab of the Canvas app, choose Data visualizer.
A key accelerator of visualization in Canvas is the Data visualizer. Let’s change the sample size to get a better perspective.
- Choose number of rows next to Visualization sample.
- Use the slider to select your desired sample size.
- Choose Update to confirm the change to your sample size.
You may want to change the sample size based on your dataset. In some cases, you may have a few hundred to a few thousand rows where you can select the entire dataset. In some cases, you may have several thousand rows, in which case you may select a few hundred or a few thousand rows based on your use case.
A scatter plot shows the relationship between two quantitative variables measured for the same individuals. In our case, it’s important to understand the relationship between values to check for correlation.
Because we have Calls, Mins, and Charge, we will plot the correlation between them for Day, Evening, and Night.
First, let’s create a scatter plot between Day Charge vs. Day Mins.
We can observe that as Day Mins increases, Day Charge also increases.
The same applies for evening calls.
Night calls also have the same pattern.
Because mins and charge seem to increase linearly, you can observe that they have a high correlation with one another. Including these feature pairs in some ML algorithms can take additional storage and reduce the speed of training, and having similar information in more than one column might cause the model to overemphasize those impacts and introduce undesired bias. Let’s remove one feature from each of the highly correlated pairs: Day Charge from the pair with Day Mins, Night Charge from the pair with Night Mins, and Intl Charge from the pair with Intl Mins.
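If you want to double-check these correlations outside Canvas, a quick pandas sketch like the following would do it; the column names are assumptions based on the telco churn dataset used in this post.

```python
import pandas as pd

# Column names are assumptions; adjust them to match your copy of churn.csv.
df = pd.read_csv("churn.csv")
pairs = [("Day Mins", "Day Charge"), ("Eve Mins", "Eve Charge"),
         ("Night Mins", "Night Charge"), ("Intl Mins", "Intl Charge")]
for mins_col, charge_col in pairs:
    print(f"{mins_col} vs {charge_col}: correlation = {df[mins_col].corr(df[charge_col]):.3f}")
```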
Data balance and variation
A bar chart is a plot of a categorical variable on the x-axis against a numerical variable on the y-axis, used to explore the relationship between the two variables. Let’s create a bar chart to see how the calls are distributed across our target column Churn for True and False. Choose Bar chart and drag and drop day calls and churn to the y-axis and x-axis, respectively.
Now, let’s create the same bar chart for evening calls vs. churn.
Next, let’s create a bar chart for night calls vs. churn.
It looks like there is a difference in behavior between customers who have churned and those that didn’t.
Box plots are useful because they show differences in behavior of data by class (churn or not). Because we’re going to predict churn (target column), let’s create a box plot of some features against our target column to infer descriptive statistics on the dataset such as mean, max, min, median, and outliers.
Choose Box plot and drag and drop Day mins and Churn to the y-axis and x-axis, respectively.
You can also try the same approach to other columns against our target column (churn).
Let’s now create a box plot of day mins against customer service calls to understand how the customer service calls spans across day mins value. You can see that customer service calls don’t have a dependency or correlation on the day mins value.
From our observations, we can determine that the dataset is fairly balanced. We want the data to be evenly distributed across true and false values so that the model isn’t biased towards one value.
Transformations
Based on our observations, we drop the Phone column because it’s just an account number, and the Day Charge, Eve Charge, and Night Charge columns because they contain information that overlaps with the mins columns, but we can run a preview again to confirm.
After the data analysis and transformation, let’s preview the model again.
You can observe that the model’s estimated accuracy changed from 95.6% to 93.6% (this could vary), but the column impact (feature importance) for specific columns has changed considerably, which improves the speed of training as well as the columns’ influence on the prediction as we move to the next steps of model building. Our dataset doesn’t require additional transformation, but if you needed to, you could take advantage of ML data transforms to clean, transform, and prepare your data for model building.
Build the model
You can now proceed to build a model and analyze results. For more information, refer to Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.
Clean up
To avoid incurring future session charges, log out of Canvas.
Conclusion
In this post, we showed how you can use Canvas visualization capabilities for EDA to better understand your data before model building, create accurate ML models, and generate predictions using a no-code, visual, point-and-click interface.
- For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts.
- To learn more about visualization in Canvas, refer to Explore your data using visualization techniques.
- For more information about model training and inference in Canvas, refer to Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.
- To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas.
About the Authors
Rajakumar Sampathkumar is a Principal Technical Account Manager at AWS, providing customers guidance on business-technology alignment and supporting the reinvention of their cloud operation models and processes. He is passionate about cloud and machine learning. Raj is also a machine learning specialist and works with AWS customers to design, deploy, and manage their AWS workloads and architectures.
Rahul Nabera is a Data Analytics Consultant in AWS Professional Services. His current work focuses on enabling customers to build their data and machine learning workloads on AWS. In his spare time, he enjoys playing cricket and volleyball.
Raviteja Yelamanchili is an Enterprise Solutions Architect with Amazon Web Services based in New York. He works with large financial services enterprise customers to design and deploy highly secure, scalable, reliable, and cost-effective applications on the cloud. He brings more than 11 years of risk management, technology consulting, data analytics, and machine learning experience. When he is not helping customers, he enjoys traveling and playing PS5.
reMARS revisited: Net zero carbon goal and Amazon’s fulfillment network
Examining the opportunities for reducing energy consumption in robotics and automation across Amazon’s fulfillment center network.Read More