Run machine learning enablement events at scale using AWS DeepRacer multi-user account mode

This post was co-written by Marius Cealera, Senior Partner Solutions Architect at AWS, Zdenko Estok, Cloud Architect at Accenture, and Selimcan Sakar, Cloud Architect at Accenture.

Machine learning (ML) is a high-stakes business priority, with companies spending $306 billion on ML applications in the past 3 years. According to Accenture, companies that scale ML across a business can achieve nearly triple the return on their investments. But too many companies aren’t achieving the value they expected. Scaling ML effectively for the long term requires the professionalization of the industry and the democratization of ML literacy across the enterprise, which in turn calls for more accessible ML training that speaks to a larger number of people with diverse backgrounds.

This post shows how companies can introduce hundreds of employees to ML concepts by easily running AWS DeepRacer events at scale.

Run AWS DeepRacer events at scale

AWS DeepRacer is a simple and fun way to get started with reinforcement learning (RL), an ML technique where an agent, such as a physical or virtual AWS DeepRacer vehicle, discovers the optimal actions to take in a given environment. You can get started with RL quickly with hands-on tutorials that guide you through the basics of training RL models and testing them in an exciting, autonomous car racing experience.

“We found the user-friendly nature of DeepRacer allowed our enablement sessions to reach parts of our organizations that are usually less inclined to participate in AI/ML events,” says Zdenko Estok, a Cloud Architect at Accenture. “Our post-event statistics indicate that up to 75% of all participants in DeepRacer events are new to AI/ML and 50% are new to AWS.”

Until recently, organizations hosting private AWS DeepRacer events had to create and assign AWS accounts to every event participant. This often meant securing and monitoring usage across hundreds or even thousands of AWS accounts. The setup and participant onboarding were cumbersome and time-consuming, often limiting the size of the event. With AWS DeepRacer multi-user account management, event organizers can provide hundreds of participants access to AWS DeepRacer using a single AWS account, simplifying event management and improving the participant experience.

Build a solution around AWS DeepRacer multi-user account management

You can use AWS DeepRacer multi-user account management to set usage quotas on training hours, monitor spending on training and storage, enable and disable training, and view and manage models for every event participant. In addition, when combined with an enterprise identity provider (IdP), AWS DeepRacer multi-user account management provides a quick and frictionless onboarding experience for event participants. The following diagram explains what such a setup looks like.

 Solution diagram showing AWS IAM Identity Center being used to provide access to the AWS DeepRacer console

The solution assumes access to an AWS account.

To set up your account with AWS DeepRacer admin permissions for multi-user, follow the steps in Set up your account with AWS DeepRacer admin permissions for multi-user to attach the AWS Identity and Access Management (IAM) AWS DeepRacer Administrator policy, AWSDeepRacerAccountAdminAccess, to the user, group, or role used to administer the event. Next, navigate to the AWS DeepRacer console and activate multi-user account mode.

By activating multi-user account mode, you enable participants to train models on the AWS DeepRacer console, with all training and storage charges billed to the administrator’s AWS account. By default, a sponsoring account in multi-user mode is limited to 100 concurrent training jobs, 100 concurrent evaluation jobs, 1,000 cars, and 50 private leaderboards, shared among all sponsored profiles. You can increase these limits by contacting Customer Service.

This setup also relies on using an enterprise IdP with AWS IAM Identity Center (Successor to AWS Single Sign-On) enabled. For information on setting up IAM Identity Center with an IdP, see Enable IAM Identity Center and Connect to your external identity provider. Note that different IdPs may require slightly different setup steps. Consult your IdP’s documentation for more details.

The solution depicted here works as follows:

  1. Event participants are directed to a dedicated event portal. This can be a simple webpage where participants can enter their enterprise email address in a basic HTML form and choose Register. Registered participants can use this portal to access the AWS DeepRacer console. You can further personalize this page to gather additional user data (such as the user’s AWS DeepRacer profile or their level of AI and ML knowledge) or to add event marketing and training materials.
  2. The event portal registration form calls a custom API endpoint that stores email addresses in Amazon DynamoDB through AWS AppSync. For more information, refer to Attaching a Data Source for a sample CloudFormation template on setting up AWS AppSync with DynamoDB and calling the API from a browser client.
  3. For every new registration, an Amazon DynamoDB Streams event triggers an AWS Lambda function that calls the IdP’s API (in this case, the Azure Active Directory API) to add the participant’s identity to a dedicated event group that was previously set up with IAM Identity Center. The IAM Identity Center permission set controls the level of access racers have in the AWS account. At a minimum, this permission set should include the AWSDeepRacerDefaultMultiUserAccess managed policy. For more information, refer to Permission sets and AWS DeepRacer managed policies.
  4. If the IdP call is successful, the same Lambda function sends an email notification using Amazon Pinpoint, informing the participant that the registration was successful and providing the AWS Management Console access URL generated in IAM Identity Center (a sketch of such a function follows this list). For more information, refer to Send email by using the Amazon Pinpoint API.
  5. When racers choose this link, they’re asked to authenticate with their enterprise credentials, unless their current browser session is already authenticated. After authentication, racers are redirected to the AWS DeepRacer console where they can start training AWS DeepRacer models and submit them to virtual races.
  6. Event administrators use the AWS DeepRacer console to create and manage races. Race URLs can be shared with the racers through a Lambda-generated email, either as part of the initial registration flow or as a separate notification. Event administrators can monitor and limit usage directly on the AWS DeepRacer console, including estimated spending and model training hours. Administrators can also pause racer sponsorship and delete models.
  7. Finally, administrators can disable multi-user account mode after the event ends and remove participant access to the AWS account, either by removing the users from IAM Identity Center or by disabling the setup in the external IdP.
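
The following is a minimal Python sketch of the Lambda function described in steps 3 and 4, assuming Azure Active Directory as the IdP. The environment variable names (EVENT_GROUP_ID, PINPOINT_APP_ID, SSO_PORTAL_URL), the email attribute name, and the token and user-lookup helpers are illustrative placeholders rather than part of the original solution:

import json
import os
import urllib.request

import boto3

pinpoint = boto3.client("pinpoint")


def acquire_graph_token() -> str:
    # Placeholder: obtain an app-only Microsoft Graph token through your
    # tenant's OAuth client credentials flow (for example, with MSAL).
    raise NotImplementedError


def lookup_object_id(email: str) -> str:
    # Placeholder: resolve the participant's Azure AD object ID, typically
    # with a GET on https://graph.microsoft.com/v1.0/users/{email}.
    raise NotImplementedError


def add_user_to_event_group(object_id: str, token: str) -> None:
    # POST /groups/{id}/members/$ref adds a member to an Azure AD group
    body = json.dumps({
        "@odata.id": f"https://graph.microsoft.com/v1.0/directoryObjects/{object_id}"
    }).encode()
    req = urllib.request.Request(
        f"https://graph.microsoft.com/v1.0/groups/{os.environ['EVENT_GROUP_ID']}/members/$ref",
        data=body,
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)


def handler(event, context):
    # DynamoDB Streams delivers one record per new registration
    for record in event["Records"]:
        if record["eventName"] != "INSERT":
            continue
        email = record["dynamodb"]["NewImage"]["email"]["S"]
        add_user_to_event_group(lookup_object_id(email), acquire_graph_token())
        # Step 4: email the participant the IAM Identity Center access URL
        pinpoint.send_messages(
            ApplicationId=os.environ["PINPOINT_APP_ID"],
            MessageRequest={
                "Addresses": {email: {"ChannelType": "EMAIL"}},
                "MessageConfiguration": {"EmailMessage": {"SimpleEmail": {
                    "Subject": {"Data": "Your AWS DeepRacer event access"},
                    "HtmlPart": {"Data": "Sign in at: " + os.environ["SSO_PORTAL_URL"]},
                }}},
            },
        )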

Conclusion

AWS DeepRacer events are a great way to raise interest and increase ML knowledge across all pillars and levels of an organization. This post explains how you can couple AWS DeepRacer multi-user account mode with IAM Identity Center and an enterprise IdP to run AWS DeepRacer events at scale with minimum administrative effort, while ensuring a great participant experience.

The solution presented in this post was developed and used by Accenture to run the world’s largest private AWS DeepRacer event in 2021, with more than 2,000 racers. By working with the Accenture AWS Business Group (AABG), a strategic collaboration by Accenture and AWS, you can learn from the cultures, resources, technical expertise, and industry knowledge of two leading innovators, helping you accelerate the pace of innovation to deliver disruptive products and services. Connect with our team at accentureaws@amazon.com to engage with a network of specialists steeped in industry knowledge and skilled in strategic AWS services in areas ranging from big data to cloud native to ML.


About the authors

Marius Cealera is a senior partner solutions architect at AWS. He works closely with the Accenture AWS Business Group (AABG) to develop and implement innovative cloud solutions. When not working, he enjoys being with his family, biking and trekking in and around Luxembourg.

Zdenko Estok works as a cloud architect and DevOps engineer at Accenture. He works with AABG to develop and implement innovative cloud solutions, and specializes in Infrastructure as Code and Cloud Security. Zdenko likes to bike to the office and enjoys pleasant walks in nature.

Selimcan “Can” Sakar is a cloud first developer and solution architect at Accenture Germany with a focus on emerging technologies such as AI/ML, IoT, and Blockchain. Can suffers from Gear Acquisition Syndrome (aka G.A.S.) and likes to pursue new instruments, bikes, and sim-racing equipment in his free time.

Read More

Enable intelligent decision-making with Amazon SageMaker Canvas and Amazon QuickSight

Every company, regardless of its size, wants to deliver the best products and services to its customers. To achieve this, companies want to understand industry trends and customer behavior, and optimize internal processes and data analyses on a routine basis. This is a crucial component of a company’s success.

A very prominent part of the analyst role includes business metrics visualization (like sales revenue) and prediction of future events (like increase in demand) to make data-driven business decisions. To approach this first challenge, you can use Amazon QuickSight, a cloud-scale business intelligence (BI) service that provides easy-to-understand insights and gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. For the second task, you can use Amazon SageMaker Canvas, a cloud service that expands access to machine learning (ML) by providing business analysts with a visual point-and-click interface that allows you to generate accurate ML predictions on your own.

When looking at these metrics, business analysts often identify patterns in customer behavior, in order to determine whether the company risks losing the customer. This problem is called customer churn, and ML models have a proven track record of predicting such customers with high accuracy (for an example, see Elula’s AI Solutions Help Banks Improve Customer Retention).

Building ML models can be a tricky process because it requires an expert team to manage the data preparation and ML model training. However, with Canvas, you can do that without any special knowledge and with zero lines of code. For more information, check out Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.

In this post, we show you how to visualize the predictions generated from Canvas in a QuickSight dashboard, enabling intelligent decision-making via ML.

Overview of solution

In the post Predict customer churn with no-code machine learning using Amazon SageMaker Canvas, we assumed the role of a business analyst in the marketing department of a mobile phone operator, and we successfully created an ML model to identify customers with potential risk of churn. Thanks to the predictions generated by our model, we now want to analyze the potential financial outcome so we can make data-driven business decisions about promotions for these clients and regions.

The architecture that will help us achieve this is shown in the following diagram.

The workflow steps are as follows:

  1. Upload a new dataset with the current customer population into Canvas.
  2. Run a batch prediction and download the results.
  3. Upload the files into QuickSight to create or update visualizations.

You can perform these steps in Canvas without writing a single line of code. For the full list of supported data sources, refer to Importing data in Amazon SageMaker Canvas.

Prerequisites

For this walkthrough, make sure that the following prerequisites are met:

  • You have an AWS account with access to Amazon SageMaker Canvas and Amazon QuickSight.
  • You have trained a customer churn model in Canvas and have the original churn.csv dataset, as described in Predict customer churn with no-code machine learning using Amazon SageMaker Canvas.

Use the customer churn model

After you complete the prerequisites, you should have a model trained on historical data in Canvas, ready to be used with new customer data to predict customer churn, which you can then use in QuickSight.

  1. Create a new file churn-no-labels.csv by randomly selecting 1,500 lines from the original dataset churn.csv and removing the Churn? column.

We use this new dataset to generate predictions.
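
If you prefer to script this step, the following is a minimal pandas sketch; it assumes churn.csv is in your working directory, and the seed value is arbitrary:

import pandas as pd

# Sample 1,500 rows and drop the label column to create the inference file
df = pd.read_csv("churn.csv")
df.sample(n=1500, random_state=42) \
  .drop(columns=["Churn?"]) \
  .to_csv("churn-no-labels.csv", index=False)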

We complete the next steps in Canvas. You can open Canvas via the AWS Management Console, or via the SSO application provided by your cloud administrator. If you’re not sure how to access Canvas, refer to Getting started with using Amazon SageMaker Canvas.

  2. On the Canvas console, choose Datasets in the navigation pane.
  3. Choose Import.

  4. Choose Upload and choose the churn-no-labels.csv file that you created.
  5. Choose Import data.

The data import process time depends on the size of the file. In our case, it should be around 10 seconds. When it’s complete, we can see the dataset is in Ready status.

  6. To preview the first 100 rows of the dataset, choose the options menu (three dots) and choose Preview.

  7. Choose Models in the navigation pane, then choose the churn model you created as part of the prerequisites.

  8. On the Predict tab, choose Select dataset.

  9. Select the churn-no-labels.csv dataset, then choose Generate predictions.

Inference time depends on model complexity and dataset size; in our case, it takes around 10 seconds. When the job is finished, its status changes to Ready and we can download the results.

  10. Choose the options menu (three dots), then choose Download and Download all values.

Optionally, we can take a quick look at the results by choosing Preview. The first two columns are predictions from the model.

We have successfully used our model to predict churn risk for our current customer population. Now we’re ready to visualize business metrics based on our predictions.

Import data to QuickSight

As we discussed previously, business analysts require predictions to be visualized together with business metrics in order to make data-driven business decisions. To do that, we use QuickSight, which provides easy-to-understand insights and gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. With QuickSight, we can build visualizations like graphs and charts in seconds with a simple drag-and-drop interface. In this post, we build several visualizations to better understand business risks and how we could manage them, such as where we should launch new marketing campaigns.

To get started, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

QuickSight supports many data sources. In this post, we use a local file, the one we previously generated in Canvas, as our source data.

  3. Choose Upload a file.

  4. Choose the recently downloaded file with predictions.

QuickSight uploads and analyzes the file.

  5. Check that everything is as expected in the preview, then choose Next.

  6. Choose Visualize.

The data is now successfully imported and we’re ready to analyze it.

Create a dashboard with business metrics of churn predictions

It’s time to analyze our data and make a clear and easy-to-use dashboard that recaps all the information necessary for data-driven business decisions. This type of dashboard is an important tool in the arsenal of a business analyst.

The following is an example dashboard that can help identify and act on the risk of customer churn.

On this dashboard, we visualize several important business metrics:

  • Customers likely to churn – The left donut chart represents the number and percentage of users with over a 50% risk of churning. This chart helps us quickly understand the size of a potential problem.
  • Potential revenue loss – The top middle donut chart represents the amount of revenue loss from users with over a 50% risk of churning. This chart helps us quickly understand the size of the potential revenue loss from churn. It also suggests we could lose several above-average customers, because the percentage of potential revenue lost is bigger than the percentage of users at risk of churning.
  • Potential revenue loss by state – The top right horizontal bar chart represents the size of revenue lost versus revenue from customers not at risk of churning. This visual could help us understand which state is the most important for us from a marketing campaign perspective.
  • Details about customers at risk of churning – The bottom left table contains details about all our customers. This table could be helpful if we want to quickly look at the details of several customers with and without churn risk.

Customers likely to churn

We start by building a chart with customers at risk of churning.

  1. Under Fields list, choose the Churn? attribute.

QuickSight automatically builds a visualization.

Although the bar plot is a common visualization to understand data distribution, we prefer to use a donut chart. We can change this visual by changing its properties.

  2. Choose the donut chart icon under Visual types.
  3. Double-click the current name and change it to Customers likely to churn.

  4. To customize other visual effects (remove legend, add values, change font size), choose the pencil icon and make your changes.

As shown in the following screenshot, we increased the area of the donut, as well as added some extra information in the labels.

Potential revenue loss

Another important metric to consider when calculating the business impact of customer churn is potential revenue loss. This is an important metric because it helps us understand the business impact from customers at risk of churning. In the telecom industry, for example, we could have many inactive clients who have a high risk of churn but generate zero revenue. This chart can help us understand whether we’re in such a situation. To add this metric to our dashboard, we create a custom calculated field by providing the mathematical formula for computing potential revenue loss, then visualize it as another donut chart.

  1. On the Add menu, choose Add calculated field.

  2. Name the field Total charges.
  3. Enter the formula {Day Charge}+{Eve Charge}+{Intl Charge}+{Night Charge}.
  4. Choose Save.

  5. On the Add menu, choose Add visual.

  6. Under Visual types, choose the donut chart icon.
  7. Under Fields list, drag Churn? to Group/Color.
  8. Drag Total charges to Value.
  9. On the Value menu, choose Show as and choose Currency.
  10. Choose the pencil icon to customize other visual effects (remove legend, add values, change font size).

At this moment, our dashboard has two visualizations.

We can already observe that in total we could lose 18% (270) of customers, which equals 24% ($6,280) of revenue. Let’s explore further by analyzing potential revenue loss at the state level.

Potential revenue loss by state

To visualize potential revenue loss by state, let’s add a horizontal bar graph.

  1. On the Add menu, choose Add visual.

  2. Under Visual types, choose the horizontal bar chart icon.
  3. Under Fields list, drag Churn? to Group/Color.
  4. Drag Total charges to Value.
  5. On the Value menu, choose Show as and Currency.
  6. Drag State to Y axis.
  7. Choose the pencil icon to customize other visual effects (remove legend, add values, change font size).

  8. We can also sort our new visual by choosing Total charges at the bottom and choosing Descending.

This visual could help us understand which state is the most important from a marketing campaign perspective. For example, in Hawaii, we could potentially lose half our revenue ($253,000) while in Washington, this value is less than 10% ($52,000). We can also see that in Arizona, we risk losing almost every customer.

Details about customers at risk of churning

Let’s build a table with details about customers at risk of churning.

  1. On the Add menu, choose Add visual.

  2. Under Visual types, choose the table icon.
  3. Under Fields list, drag Phone, State, Int’l Plan, Vmail Plan, Churn?, and Account Length to Group by.
  4. Drag probability to Value.
  5. On the Value menu, choose Show as and Percent.

Customize your dashboard

QuickSight offers several options to customize your dashboard, such as the following.

  1. To add a name, on the Add menu, choose Add title.

  2. Enter a title (for this post, we rename our dashboard Churn analysis).

  3. To resize your visuals, choose the bottom right corner of the chart and drag to the desired size.
  4. To move a visual, choose the top center of the chart and drag it to a new location.
  5. To change the theme, choose Themes in the navigation pane.
  6. Choose your new theme (for example, Midnight), and choose Apply.

Publish your dashboard

A dashboard is a read-only snapshot of an analysis that you can share with other QuickSight users for reporting purposes. Your dashboard preserves the configuration of the analysis at the time you publish it, including such things as filtering, parameters, controls, and sort order. The data used for the analysis isn’t captured as part of the dashboard. When you view the dashboard, it reflects the current data in the datasets used by the analysis.

To publish your dashboard, complete the following steps:

  1. On the Share menu, choose Publish dashboard.

  2. Enter a name for your dashboard.
  3. Choose Publish dashboard.

Congratulations, you have successfully created a churn analysis dashboard.

Update your dashboard with a new prediction

As the model evolves and we generate new data from the business, we might need to update this dashboard with new information. Complete the following steps:

  1. Create a new file churn-no-labels-updated.csv by randomly selecting another 1,500 lines from the original dataset churn.csv and removing the Churn? column.

We use this new dataset to generate new predictions.

  2. Repeat the steps from the Use the customer churn model section of this post to get predictions for the new dataset, and download the new file.
  3. On the QuickSight console, choose Datasets in the navigation pane.
  4. Choose the dataset we created.

  5. Choose Edit dataset.

  6. On the drop-down menu, choose Update file.

  7. Choose Upload file.

  8. Choose the recently downloaded file with the predictions.
  9. Review the preview, then choose Confirm file update.

After the “File updated successfully” message appears, we can see that the file name has also changed.

  10. Choose Save & publish.

  11. When the “Saved and published successfully” message appears, you can go back to the main menu by choosing the QuickSight logo in the upper left corner.

  12. Choose Dashboards in the navigation pane and choose the dashboard we created before.

You should see your dashboard with the updated values.

We have just updated our QuickSight dashboard with the most recent predictions from Canvas.

Clean up

To avoid future charges, log out from Canvas.

Conclusion

In this post, we used an ML model from Canvas to predict customers at risk of churning and built a dashboard with insightful visualizations to help us make data-driven business decisions. We did so without writing a single line of code, thanks to user-friendly interfaces and clear visualizations. This enables business analysts to be agile in building ML models, and to perform analyses and extract insights independently of data science teams.

To learn more about using Canvas, see Build, Share, Deploy: how business analysts and data scientists achieve faster time-to-market using no-code ML and Amazon SageMaker Canvas. For more information about creating ML models with a no-code solution, see Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts. To learn more about the latest QuickSight features and best practices, see AWS Big Data Blog.


About the Author

Aleksandr Patrushev is AI/ML Specialist Solutions Architect at AWS, based in Luxembourg. He is passionate about the cloud and machine learning, and the way they could change the world. Outside work, he enjoys hiking, sports, and spending time with his family.

Davide Gallitelli is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in Brussels and works closely with customers throughout Benelux. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.

Read More

Amazon SageMaker Autopilot is up to eight times faster with new ensemble training mode powered by AutoGluon

Amazon SageMaker Autopilot has added a new training mode that supports model ensembling powered by AutoGluon. Ensemble training mode in Autopilot trains several base models and combines their predictions using model stacking. For datasets less than 100 MB, ensemble training mode builds machine learning (ML) models with high accuracy quickly—up to eight times faster than hyperparameter optimization (HPO) training mode with 250 trials, and up to 5.8 times faster than HPO training mode with 100 trials. It supports a wide range of algorithms, including LightGBM, CatBoost, XGBoost, Random Forest, Extra Trees, linear models, and neural networks based on PyTorch and FastAI.

How AutoGluon builds ensemble models

AutoGluon-Tabular (AGT) is a popular open-source AutoML framework that trains highly accurate ML models on tabular datasets. Unlike existing AutoML frameworks, which primarily focus on model and hyperparameter selection, AGT succeeds by ensembling multiple models and stacking them in multiple layers. The default behavior of AGT can be summarized as follows: given a dataset, AGT trains various base models, ranging from off-the-shelf boosted trees to customized neural networks, on the dataset. The predictions from the base models are used as features to build a stacking model, which learns the appropriate weight of each base model. With these learned weights, the stacking model then combines the base models’ predictions and returns the combined predictions as the final set of predictions.
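
To make this behavior concrete, the following is a minimal standalone AutoGluon-Tabular sketch (run outside of Autopilot); the file name and label column are placeholders:

from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("train.csv")

# fit() trains the base models (boosted trees, neural networks, and so on)
# and stacks weighted ensembles on top of their out-of-fold predictions
predictor = TabularPredictor(label="target").fit(
    train,
    presets="best_quality",  # enables bagging and multi-layer stack ensembling
)

# The leaderboard lists the base models alongside the WeightedEnsemble stackers
print(predictor.leaderboard(silent=True))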

How Autopilot’s ensemble training mode works

Different datasets have characteristics that are suitable for different algorithms. Given a dataset with unknown characteristics, it’s difficult to know beforehand which algorithms will work best on a dataset. With this in mind, data scientists using AGT often create multiple custom configurations with a subset of algorithms and parameters. They run these configurations on a given dataset to find the best configuration in terms of performance and inference latency.

Autopilot is a low-code ML product that automatically builds the best ML models for your data. In the new ensemble training mode, Autopilot selects an optimal set of AGT configurations and runs multiple trials to return the best model. These trials are run in parallel to evaluate if AGT’s performance can be further improved, in terms of objective metrics or inference latency.

Results observed using OpenML benchmarks

To evaluate the performance improvements, we used OpenML benchmark datasets with sizes varying from 0.5–100 MB and ran 10 AGT trials with different combinations of algorithms and hyperparameter configurations. The tests compared ensemble training mode to HPO mode with 250 trials and HPO mode with 100 trials. The following table compares the overall Autopilot experiment runtime (in minutes) between the two training modes for various dataset sizes.

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Runtime Improvement over HPO 250 | Runtime Improvement over HPO 100
< 1 MB | 121.5 mins | 88.0 mins | 15.0 mins | 8.1x | 5.9x
1–10 MB | 136.1 mins | 76.5 mins | 25.8 mins | 5.3x | 3.0x
10–100 MB | 152.7 mins | 103.1 mins | 60.9 mins | 2.5x | 1.7x

To compare performance, we use accuracy for multiclass classification problems, the F1 score for binary classification problems, and R2 for regression problems. The gains in objective metrics are shown in the following tables. We observed that ensemble training mode performed better than HPO training mode (both 100 and 250 trials).
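
For reference, accuracy is simply the fraction of correct predictions, and the other two metrics have the following standard (not Autopilot-specific) definitions:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}, \qquad R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$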

Note that the ensemble mode shows consistent improvement over HPO mode with 250 trials irrespective of dataset size and problem types.

The following table compares accuracy for multi-class classification problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.759 | 0.761 | 0.771 | 1.46%
1–5 MB | 0.941 | 0.935 | 0.957 | 1.64%
5–10 MB | 0.639 | 0.633 | 0.671 | 4.92%
10–50 MB | 0.998 | 0.999 | 0.999 | 0.11%
51–100 MB | 0.853 | 0.852 | 0.875 | 2.56%

The following table compares F1 scores for binary classification problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.801 | 0.807 | 0.826 | 3.14%
1–5 MB | 0.59 | 0.587 | 0.629 | 6.60%
5–10 MB | 0.886 | 0.889 | 0.898 | 1.32%
10–50 MB | 0.731 | 0.736 | 0.754 | 3.12%
51–100 MB | 0.503 | 0.493 | 0.541 | 7.58%

The following table compares R2 for regression problems (higher is better).

Dataset Size | HPO Mode (250 trials) | HPO Mode (100 trials) | Ensemble Mode (10 trials) | Percentage Improvement over HPO 250
< 1 MB | 0.717 | 0.718 | 0.716 | 0%
1–5 MB | 0.803 | 0.803 | 0.817 | 2%
5–10 MB | 0.590 | 0.586 | 0.614 | 4%
10–50 MB | 0.686 | 0.688 | 0.684 | 0%
51–100 MB | 0.623 | 0.626 | 0.631 | 1%

In the next sections, we show how to use the new ensemble training mode in Autopilot to analyze datasets and easily build high-quality ML models.

Dataset overview

We use the Titanic dataset to predict if a given passenger survived or not. This is a binary classification problem. We focus on creating an Autopilot experiment using the new ensemble training mode and compare the results of F1 score and overall runtime with an Autopilot experiment using HPO training mode (100 trials).

Column Name | Description
PassengerId | Identification number
Survived | Survival
Pclass | Ticket class
Name | Passenger name
Sex | Sex
Age | Age in years
SibSp | Number of siblings or spouses aboard the Titanic
Parch | Number of parents or children aboard the Titanic
Ticket | Ticket number
Fare | Passenger fare
Cabin | Cabin number
Embarked | Port of embarkation

The dataset has 890 rows and 12 columns. It contains demographic information about the passengers (age, sex, ticket class, and so on) and the Survived (yes/no) target column.

Prerequisites

Complete the following prerequisite steps:

  1. Ensure that you have an AWS account, secure access to log in to the account via the AWS Management Console, and AWS Identity and Access Management (IAM) permissions to use Amazon SageMaker and Amazon Simple Storage Service (Amazon S3) resources.
  2. Download the Titanic dataset and upload it to an S3 bucket in your account.
  3. Onboard to a SageMaker domain and access Amazon SageMaker Studio to use Autopilot. For instructions, refer to Onboard to Amazon SageMaker Domain. If you’re using an existing Studio domain, upgrade to the latest version of Studio to use the new ensemble training mode.

Create an Autopilot experiment with ensemble training mode

When the dataset is ready, you can initialize an Autopilot experiment in Studio. For full instructions, refer to Create an Amazon SageMaker Autopilot experiment. Create an Autopilot experiment by providing an experiment name, the data input, and specifying the target data to predict in the Experiment and data details section. Optionally, you can specify the data split ratio and auto creation of the Amazon S3 output location.

For our use case, we provide an experiment name, input Amazon S3 location, and choose Survived as the target. We keep the auto split enabled and override the default output Amazon S3 location.

Next, we specify the training method in the Training method section. You can either let Autopilot select the training mode automatically using Auto based on the dataset size, or select the training mode manually for either ensembling or HPO. The details on each option are as follows:

  • Auto – Autopilot automatically chooses either ensembling or HPO mode based on your dataset size. If your dataset is larger than 100 MB, Autopilot chooses HPO, otherwise it chooses ensembling.
  • Ensembling – Autopilot uses AutoGluon’s ensembling technique to train several base models and combines their predictions using model stacking into an optimal predictive model.
  • Hyperparameter optimization – Autopilot finds the best version of a model by tuning hyperparameters using the Bayesian Optimization technique and running training jobs on your dataset. HPO selects the algorithms most relevant to your dataset and picks the best range of hyperparameters to tune the models.

For our use case, we select Ensembling as our training mode.

After this, we proceed to the Deployment and advanced settings section. Here, we deselect the Auto deploy option. Under Advanced settings, you can specify the type of ML problem that you want to solve. If nothing is provided, Autopilot automatically determines the model based on the data you provide. Because ours is a binary classification problem, we choose Binary classification as our problem type and F1 as our objective metric.

Finally, we review our selections and choose Create experiment.

At this point, it’s safe to leave Studio and return later to check on the result, which you can find on the Experiments menu.

The following screenshot shows the final results of our titanic-ens ensemble training mode Autopilot job.

You can see the multiple trials attempted by Autopilot in ensemble training mode. Each trial returns the best model from the pool of individual model runs and stacking ensemble model runs.

To explain this a little further, let’s assume Trial 1 considered all eight supported algorithms and used stacking level 2. It will internally create the individual models for each algorithm as well as the weighted ensemble models with stack Level 0, Level 1, and Level 2. However, the output of Trial 1 will be the best model from the pool of models created.

Similarly, let’s assume Trial 2 picked up only tree-based boosting algorithms. In this case, Trial 2 will internally create three individual models, one for each of the three algorithms, as well as the weighted ensemble models, and return the best model from its run.

The final model returned by a trial may or may not be a weighted ensemble model, but the majority of trials will most likely return their best weighted ensemble model. Finally, based on the selected objective metric, the best model among all 10 trials is identified.

In the preceding example, our best model was the one with highest F1 score (our objective metric). Several other useful metrics, including accuracy, balanced accuracy, precision, and recall are also shown. In our environment, the end-to-end runtime for this Autopilot experiment was 10 minutes.

Create an Autopilot experiment with HPO training mode

Now let’s perform all of the aforementioned steps to create a second Autopilot experiment with the HPO training method (default 100 trials). Apart from the training method selection, which is now Hyperparameter optimization, everything else stays the same. In HPO mode, you can specify the number of trials by setting Max candidates under Advanced settings for Runtime, but we recommend leaving this at the default. If you don’t provide a value for Max candidates, Autopilot runs 100 HPO trials. In our environment, the end-to-end runtime for this Autopilot experiment was 2 hours.
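
You can also create either experiment programmatically. The following is a minimal boto3 sketch of choosing the training mode through the CreateAutoMLJob API; the job name, S3 paths, and role ARN are placeholders:

import boto3

sm = boto3.client("sagemaker")

sm.create_auto_ml_job(
    AutoMLJobName="titanic-ens",
    InputDataConfig=[{
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/titanic/train/",
        }},
        "TargetAttributeName": "Survived",
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/titanic/output/"},
    ProblemType="BinaryClassification",
    AutoMLJobObjective={"MetricName": "F1"},
    # Mode accepts AUTO, ENSEMBLING, or HYPERPARAMETER_TUNING
    AutoMLJobConfig={"Mode": "ENSEMBLING"},
    RoleArn="arn:aws:iam::111122223333:role/MySageMakerExecutionRole",
)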

Runtime and performance metric comparison

We see that for our dataset (under 1 MB), not only did ensemble training mode run 12 times faster than HPO training mode (120 minutes to 10 minutes), but it also produced improved F1 scores and other performance metrics.

Training Mode | F1 Score | Accuracy | Balanced Accuracy | AUC | Precision | Recall | Log Loss | Runtime
Ensemble mode – WeightedEnsemble | 0.844 | 0.878 | 0.865 | 0.89 | 0.912 | 0.785 | 0.394 | 10 mins
HPO mode – XGBoost | 0.784 | 0.843 | 0.824 | 0.867 | 0.831 | 0.743 | 0.428 | 120 mins

Inference

Now that we have a winner model, we can either deploy it to an endpoint for real-time inferencing or use batch transforms to make predictions on the unlabeled dataset we downloaded earlier.

Summary

You can run your Autopilot experiments faster without any impact on performance with the new ensemble training mode for datasets less than 100 MB. To get started, create a SageMaker Autopilot experiment on the Studio console and select Ensembling as your training mode, or let Autopilot infer the training mode automatically based on the dataset size. You can refer to the CreateAutoMLJob API reference guide for API updates, and upgrade to the latest version of Studio to use the new ensemble training mode. For more information on this feature, see Model support, metrics, and validation with Amazon SageMaker Autopilot; to learn more about Autopilot, visit the product page.


About the authors

Janisha Anand is a Senior Product Manager in the SageMaker Low/No Code ML team, which includes SageMaker Autopilot. She enjoys coffee, staying active, and spending time with her family.

Saket Sathe is a Senior Applied Scientist in the SageMaker Autopilot team. He is passionate about building the next generation of machine learning algorithms and systems. Aside from work, he loves to read, cook, slurp ramen, and play badminton.

Abhishek Singh is a Software Engineer for the Autopilot team in AWS. He has 8+ years experience as a software developer, and is passionate about building scalable software solutions that solve customer problems. In his free time, Abhishek likes to stay active by going on hikes or getting involved in pick up soccer games.

Vadim Omeltchenko is a Sr. AI/ML Solutions Architect who is passionate about helping AWS customers innovate in the cloud. His prior IT experience was predominantly on the ground.

Read More

Configure a custom Amazon S3 query output location and data retention policy for Amazon Athena data sources in Amazon SageMaker Data Wrangler

Amazon SageMaker Data Wrangler reduces the time that it takes to aggregate and prepare data for machine learning (ML) from weeks to minutes in Amazon SageMaker Studio, the first fully integrated development environment (IDE) for ML. With Data Wrangler, you can simplify the process of data preparation and feature engineering, and complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization, from a single visual interface. You can import data from multiple data sources such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Snowflake, and 26 federated query data sources supported by Amazon Athena.

Starting today, when importing data from Athena data sources, you can configure the S3 query output location and the data retention period that Data Wrangler uses, controlling where and for how long Athena stores the intermediary data. In this post, we walk you through this new feature.

Solution overview

Athena is an interactive query service that makes it easy to browse the AWS Glue Data Catalog, and analyze data in Amazon S3 and 26 federated query data sources using standard SQL. When you use Athena to import data, you can use Data Wrangler’s default S3 location for the Athena query output, or specify an Athena workgroup to enforce a custom S3 location. Previously, you had to implement cleanup workflows to remove this intermediary data, or manually set up S3 lifecycle configuration to control storage cost and meet your organization’s data security requirements. This is a big operational overhead, and not scalable.

Data Wrangler now supports custom S3 locations and data retention periods for your Athena query output. With this new feature, you can change the Athena query output location to a custom S3 bucket. You now have a default data retention policy of 5 days for the Athena query output, and you can change this to meet your organization’s data security requirements. Based on the retention period, the Athena query output in the S3 bucket gets cleaned up automatically. After you import the data, you can perform exploratory data analysis on this dataset and store the clean data back to Amazon S3.

The following diagram illustrates this architecture.

For our use case, we use a sample bank dataset to walk through the solution. The workflow consists of the following steps:

  1. Download the sample dataset and upload it to an S3 bucket.
  2. Set up an AWS Glue crawler to crawl the schema and store the metadata schema in the AWS Glue Data Catalog.
  3. Use Athena to access the Data Catalog to query data from the S3 bucket.
  4. Create a new Data Wrangler flow to connect to Athena.
  5. When creating the connection, set the retention TTL for the dataset.
  6. Use this connection in the workflow and store the clean data in another S3 bucket.

For simplicity, we assume that you have already set up the Athena environment (steps 1–3). We detail the subsequent steps in this post.

Prerequisites

To set up the Athena environment, refer to the User Guide for step-by-step instructions, and complete steps 1–3 as outlined in the previous section.

Import your data from Athena to Data Wrangler

To import your data, complete the following steps:

  1. On the Studio console, choose the Resources icon in the navigation pane.
  2. Choose Data Wrangler on the drop-down menu.
  3. Choose New flow.
  4. On the Import tab, choose Amazon Athena.

    A detail page opens where you can connect to Athena and write a SQL query to import from the database.
  5. Enter a name for your connection.
  6. Expand Advanced configuration.
    When connecting to Athena, Data Wrangler uses Amazon S3 to stage the queried data. By default, this data is staged at the S3 location s3://sagemaker-{region}-{account_id}/athena/ with a retention period of 5 days.
  7. For Amazon S3 location of query results, enter your S3 location.
  8. Select Data retention period and set the data retention period (for this post, 1 day).
    If you deselect this option, the data will persist indefinitely. Behind the scenes, Data Wrangler attaches an S3 lifecycle configuration policy to that S3 location to clean it up automatically. See the following example policy:

     "Rules": [
            {
                "Expiration": {
                    "Days": 1
                },
                "ID": "sm-data-wrangler-retention-policy-xxxxxxx",
                "Filter": {
                    "Prefix": "athena/test"
                },
                "Status": "Enabled"
            }
        ]

    Your SageMaker execution role needs the s3:GetLifecycleConfiguration and s3:PutLifecycleConfiguration permissions to correctly apply the lifecycle configuration policies. Without these permissions, you get error messages when you try to import the data. (A boto3 sketch of applying an equivalent rule yourself follows this procedure.)

    The following error message is an example of missing the GetLifecycleConfiguration permission.

    The following error message is an example of missing the PutLifecycleConfiguration permission.

  9. Optionally, for Workgroup, you can specify an Athena workgroup.
    An Athena workgroup isolates users, teams, applications, or workloads into groups, each with its own permissions and configuration settings. When you specify a workgroup, Data Wrangler inherits the workgroup settings defined in Athena. For example, if a workgroup has an S3 location defined to store query results and Override client-side settings enabled, you can’t edit the S3 query result location. By default, Data Wrangler also saves the Athena connection for you. This is displayed as a new Athena tile on the Import tab. You can always reopen that connection to query and bring different data into Data Wrangler.
  10. Deselect Save connection if you don’t want to save the connection.
  11. To configure the Athena connection, choose None for Sampling to import the entire dataset.

    For large datasets, Data Wrangler allows you to import a subset of your data to build out your transformation workflow, and only process the entire dataset when you’re ready. This speeds up the iteration cycle and saves processing time and cost. To learn more about the different data sampling options available, visit Amazon SageMaker Data Wrangler now supports random sampling and stratified sampling.
  12. For Data catalog, choose AwsDataCatalog.
  13. For Database, choose your database.

    Data Wrangler displays the available tables. You can choose each table to check the schema and preview the data.
  14. Enter the following code in the query field:
    Select *
    From bank_additional_full

  15. Choose Run to preview the data.
  16. If everything looks good, choose Import.
  17. Enter a dataset name and choose Add to import the data into your Data Wrangler workspace.
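
As mentioned earlier, Data Wrangler manages the lifecycle rule for you. If you wanted to apply an equivalent rule to a bucket yourself, a minimal boto3 sketch looks like the following; the bucket name is a placeholder, and the rule mirrors the example policy shown previously:

import boto3

s3 = boto3.client("s3")

# Expire Athena query output under the given prefix after 1 day
s3.put_bucket_lifecycle_configuration(
    Bucket="my-athena-results-bucket",
    LifecycleConfiguration={"Rules": [{
        "ID": "sm-data-wrangler-retention-policy-example",
        "Filter": {"Prefix": "athena/test"},
        "Expiration": {"Days": 1},
        "Status": "Enabled",
    }]},
)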

Analyze and process data with Data Wrangler

After you load the data in to Data Wrangler, you can do exploratory data analysis (EDA) and prepare the data for machine learning.

  1. Choose the plus sign next to the bank-data dataset in the data flow, and choose Add analysis.
    Data Wrangler provides built-in analyses, including a Data Quality and Insights Report, data correlation, a pre-training bias report, a summary of your dataset, and visualizations (such as histograms and scatter plots). Additionally, you can create your own custom visualization.
  2. For Analysis type, choose Data Quality and Insights Report.
    This automatically generates visualizations, analyses to identify data quality issues, and recommendations for the right transformations required for your dataset.
  3. For Target column, choose Y.
  4. Because this is a classification problem statement, for Problem type, select Classification.
  5. Choose Create.

    Data Wrangler creates a detailed report on your dataset. You can also download the report to your local machine.
  6. For data preparation, choose the plus sign next to the bank-data dataset in the data flow, and choose Add transform.
  7. Choose Add step to start building your transformations.

At the time of this writing, Data Wrangler provides over 300 built-in transformations. You can also write your own transformations using Pandas or PySpark.

You can now start building your transforms and analyses based on your business requirements.

Clean up

To avoid ongoing costs, delete the Data Wrangler resources using the steps below when you’re finished.

  1. Choose the Running Instances and Kernels icon.
  2. Under RUNNING APPS, choose the shutdown icon next to the sagemaker-data-wrangler-1.0 app.
  3. Choose Shut down all to confirm.

Conclusion

In this post, we provided an overview of customizing your S3 location and enabling S3 lifecycle configurations for importing data from Athena to Data Wrangler. With this feature, you can store intermediary data in a secured S3 location, and automatically remove the data copy after the retention period to reduce the risk of unauthorized access to data. We encourage you to try out this new feature. Happy building!

To learn more about Athena and SageMaker, visit the Athena User Guide and Amazon SageMaker Documentation.


About the authors

Meenakshisundaram Thandavarayan is a Senior AI/ML specialist with AWS. He helps hi-tech strategic accounts on their AI and ML journey. He is very passionate about data-driven AI.

Harish Rajagopalan is a Senior Solutions Architect at Amazon Web Services. Harish works with enterprise customers and helps them with their cloud journey.

James Wu is a Senior AI/ML Specialist Solution Architect at AWS, helping customers design and build AI/ML solutions. James’s work covers a wide range of ML use cases, with a primary interest in computer vision, deep learning, and scaling ML across the enterprise. Prior to joining AWS, James was an architect, developer, and technology leader for over 10 years, including 6 years in engineering and 4 years in the marketing and advertising industries.

Read More

Use RStudio on Amazon SageMaker to create regulatory submissions for the life sciences industry

Pharmaceutical companies seeking approval from regulatory agencies such as the US Food & Drug Administration (FDA) or the Japanese Pharmaceuticals and Medical Devices Agency (PMDA) to sell their drugs on the market must submit evidence to prove that their drug is safe and effective for its intended use. A team of physicians, statisticians, chemists, pharmacologists, and other clinical scientists review the clinical trial submission data and proposed labeling. If the review establishes that there is sufficient statistical evidence to prove that the health benefits of the drug outweigh the risks, the drug is approved for sale.

The clinical trial submission package consists of tabulated data, analysis data, trial metadata, and statistical reports comprising statistical tables, listings, and figures. In the case of the US FDA, the electronic common technical document (eCTD) is the standard format for submitting applications, amendments, supplements, and reports to the FDA’s Center for Biologics Evaluation and Research (CBER) and Center for Drug Evaluation and Research (CDER). For the FDA and Japanese PMDA, it’s a regulatory requirement to submit tabulated data in the CDISC Study Data Tabulation Model (SDTM), analysis data in the CDISC Analysis Data Model (ADaM), and trial metadata in CDISC Define-XML (based on the Operational Data Model (ODM)).

In this post, we demonstrate how we can use RStudio on Amazon SageMaker to create such regulatory submission deliverables. This post describes the clinical trial submission process, how we can ingest clinical trial research data, tabulate and analyze the data, and then create statistical reports—summary tables, data listings, and figures (TLF). This method can enable pharmaceutical customers to seamlessly connect to clinical data stored in their AWS environment, process it using R, and help accelerate the clinical trial research process.

Drug development process

The drug development process can broadly be divided into five major steps, as illustrated in the following figure.

Drug Development Process

It takes on average 10–15 years and approximately USD $1–3 billion for one drug to receive a successful approval out of around 10,000 potential molecules. During the early phases of research (the drug discovery phase), promising drug candidates are identified, which move further to preclinical research. During the preclinical phase, researchers try to find out the toxicity of the drug by performing in vitro experiments in the lab and in vivo experiments on animals. After preclinical testing, drugs move on to the clinical trial research phase, where they must be tested on humans to ascertain their safety and efficacy. The researchers design clinical trials and detail the study plan in the clinical trial protocol. They define the different clinical research phases—from small Phase 1 studies to determine drug safety and dosage, to larger Phase 2 trials to determine drug efficacy and side effects, to even larger Phase 3 and 4 trials to determine drug efficacy and safety and to monitor adverse reactions. After successful human clinical trials, the drug sponsor files a New Drug Application (NDA) to market the drug. The regulatory agencies review all the data, work with the sponsor on prescription labeling information, and approve the drug. After the drug’s approval, the regulatory agencies review post-market safety reports to ensure the complete product’s safety.

In 1997, the Clinical Data Interchange Standards Consortium (CDISC), a global, non-profit organization comprising pharmaceutical companies, CROs, biotech, academic institutions, healthcare providers, and government agencies, was started as a volunteer group. CDISC has published data standards to streamline the flow of data from collection through submission, and facilitated data interchange between partners and providers. CDISC has published the following standards:

  • CDASH (Clinical Data Acquisition Standards Harmonization) – Standards for collected data
  • SDTM (Study Data Tabulation Model) – Standards for submitting tabulated data
  • ADaM (Analysis Data Model) – Standards for analysis data
  • SEND (Standard for Exchange of Nonclinical Data) – Standards for nonclinical data
  • PRM (Protocol Representation Model) – Standards for protocol

These standards can help trained reviewers analyze data more effectively and quickly using standard tools, thereby reducing drug approval times. It’s a regulatory requirement from the US FDA and Japanese PMDA to submit all tabulated data using the SDTM format.

R for clinical trial research submissions

SAS and R are two of the most widely used statistical analysis software packages in the pharmaceutical industry. When development of the SDTM standards was started by CDISC, SAS was in almost universal use in the pharmaceutical industry and at the FDA. However, R is gaining tremendous popularity nowadays because it’s open source, and new packages and libraries are continuously added. Students primarily use R during their academics and research, and they take this familiarity with R to their jobs. R also offers support for emerging technologies such as advanced deep learning integrations.

Cloud providers such as AWS have now become the platform of choice for pharmaceutical customers to host their infrastructure. AWS also provides managed services such as SageMaker, which makes it effortless to create, train, and deploy machine learning (ML) models in the cloud. SageMaker also allows access to the RStudio IDE from anywhere via a web browser. This post details how statistical programmers and biostatisticians can ingest their clinical data into the R environment, how R code can be run, and how results are stored. We provide snippets of code that allow clinical trial data scientists to ingest XPT files into the R environment, create R data frames for SDTM and ADaM, and finally create TLF that can be stored in an Amazon Simple Storage Service (Amazon S3) object storage bucket.

RStudio on SageMaker

On November 2, 2021, AWS in collaboration with RStudio PBC announced the general availability of RStudio on SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to SageMaker in just a few simple steps. To learn more about this exciting collaboration, check out Announcing RStudio on Amazon SageMaker.

Along with the RStudio Workbench, the RStudio suite for R developers also offers RStudio Connect and RStudio Package Manager. RStudio Connect is designed to allow data scientists to publish insights, dashboards, and web applications. It makes it easy to share ML and data science insights from data scientists’ complicated work and put it in the hands of decision-makers. RStudio Connect also makes hosting and managing content simple and scalable for wide consumption.

Solution overview

In the following sections, we discuss how we can import raw data from a remote repository or S3 bucket in RStudio on SageMaker. It’s also possible to connect directly to Amazon Relational Database Service (Amazon RDS) and data warehouses like Amazon Redshift (see Connecting R with Amazon Redshift) directly from RStudio; however, this is outside the scope of this post. After data has been ingested from a couple of different sources, we process it and create R data frames for a table. Then we convert the table data frame into an RTF file and store the results back in an S3 bucket. These outputs can then potentially be used for regulatory submission purposes, provided the R packages used in the post have been validated for use for regulatory submissions by the customer.

Set up RStudio on SageMaker

For instructions on setting up RStudio on SageMaker in your environment, refer to Get started with RStudio on SageMaker. Make sure that the execution role of RStudio on SageMaker has access to download and upload data to the S3 bucket in which data is stored. To learn more about how to manage R packages and publish your analysis using RStudio on SageMaker, refer to Announcing Fully Managed RStudio on SageMaker for Data Scientists.

Ingest data into RStudio

In this step, we ingest data from various sources to make it available for our R session. We import data in SAS XPT format; however, the process is similar if you want to ingest data in other formats. One of the advantages of using RStudio on SageMaker is that if the source data is stored in your AWS accounts, then SageMaker can natively access the data using AWS Identity and Access Management (IAM) roles.

Access data stored in a remote repository

In this step, we import SDTM data from the FDA's GitHub repository. We create a local directory called data in the RStudio environment to store the data and download demographics data (dm.xpt) from the remote repository. In this context, the local directory refers to a directory created on your private Amazon Elastic File System (Amazon EFS) storage that is attached by default to your R session environment. See the following code:

######################################################
# Step 1.1 – Ingest Data from Remote Data Repository #
######################################################

# Remote data path and file name
raw_data_url = "https://github.com/FDA/PKView/raw/master/Installation%20Package/OCP/data/clinical/DRUG000/0000/m5/datasets/test001/tabulations/sdtm"
raw_data_name = "dm.xpt"

# Create a local directory to store the downloaded files
dir.create("data")
local_file_location <- paste0(getwd(), "/data/")

# Join the repository URL and the file name to download the XPT file
download.file(paste0(raw_data_url, "/", raw_data_name),
              paste0(local_file_location, raw_data_name))

When this step is complete, you can see dm.xpt being downloaded by navigating to Files, data, dm.xpt.

Access data stored in Amazon S3

In this step, we download data stored in an S3 bucket in our account. We have copied contents from the FDA’s GitHub repository to the S3 bucket named aws-sagemaker-rstudio for this example. See the following code:

#####################################################
# Step 1.2 - Ingest Data from S3 Bucket             #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()

s3_bucket = "aws-sagemaker-rstudio"
s3_key = "DRUG000/test001/tabulations/sdtm/pp.xpt"

session$download_data(local_file_location, s3_bucket, s3_key)

When the step is complete, you can see pp.xpt being downloaded by navigating to Files, data, pp.xpt.

Process XPT data

Now that we have the SAS XPT files available in the R environment, we need to convert them into R data frames and process them. We use the haven library to read the XPT files. We merge the CDISC SDTM datasets dm and pp to create the ADPP dataset. Then we create a summary statistics table using the ADPP data frame. The summary table is then exported in RTF format.

First, XPT files are read using the read_xpt function of the haven library. Then an analysis dataset is created using the sqldf function of the sqldf library. See the following code:

########################################################
# Step 2.1 - Read XPT files. Create Analysis dataset.  #
########################################################

library(haven)
library(sqldf)


# Read XPT Files, convert them to R data frame
dm = read_xpt("data/dm.xpt")
pp = read_xpt("data/pp.xpt")

# Create ADaM dataset
adpp = sqldf("select a.USUBJID
                    ,a.PPCAT as ACAT
                    ,a.PPTESTCD
                    ,a.PPTEST
                    ,a.PPDTC
                    ,a.PPSTRESN as AVAL
                    ,a.VISIT as AVISIT
                    ,a.VISITNUM as AVISITN
                    ,b.sex
                from pp a 
           left join dm b 
                  on a.usubjid = b.usubjid
             ")

Then, an output data frame is created using functions from the Tplyr and dplyr libraries:

########################################################
# Step 2.2 - Create output table                       #
########################################################

library(Tplyr)
library(dplyr)

t = tplyr_table(adpp, SEX) %>% 
  add_layer(
    group_desc(AVAL, by = "Area under the concentration-time curve", where= PPTESTCD=="AUC") %>% 
      set_format_strings(
        "n"        = f_str("xx", n),
        "Mean (SD)"= f_str("xx.x (xx.xx)", mean, sd),
        "Median"   = f_str("xx.x", median),
        "Q1, Q3"   = f_str("xx, xx", q1, q3),
        "Min, Max" = f_str("xx, xx", min, max),
        "Missing"  = f_str("xx", missing)
      )
  )  %>% 
  build()

output = t %>% 
  rename(Variable = row_label1,Statistic = row_label2,Female =var1_F, Male = var1_M) %>% 
  select(Variable,Statistic,Female, Male)

The output data frame is then stored as an RTF file in the output folder in the RStudio environment:

#####################################################
# Step 3 - Save the Results as RTF                  #
#####################################################
library(rtf)

dir.create("output")
rtf = RTF("output/tab_adpp.rtf")  
addHeader(rtf,title="Section 1 - Tables", subtitle="This Section contains all tables")
addParagraph(rtf, "Table 1 - Pharmacokinetic Parameters by Sex:\n")
addTable(rtf, output)
done(rtf)

Upload outputs to Amazon S3

After the output has been generated, we put the data back in an S3 bucket. We can achieve this by creating a SageMaker session again, if a session isn’t active already, and uploading the contents of the output folder to an S3 bucket using the session$upload_data function:

#####################################################
# Step 4 - Upload outputs to S3                     #
#####################################################
library("reticulate")

SageMaker = import('sagemaker')
session <- SageMaker$Session()
s3_bucket = "aws-sagemaker-rstudio"
output_location = "output/"
s3_folder_name = "output"
session$upload_data(output_location, s3_bucket, s3_folder_name)

With these steps, we have ingested data, processed it, and uploaded the results to be made available for submission to regulatory authorities.

Clean up

To avoid incurring unintended costs, quit your current session when you're done. Choose the power icon in the top-right corner of the page. This automatically stops the underlying instance, so you stop incurring compute charges.

Challenges

This post has outlined steps for ingesting raw data stored in an S3 bucket or in a remote repository. However, there are many other sources of raw data for a clinical trial, primarily eCRF (electronic case report form) data stored in EDC (electronic data capture) systems such as Oracle Clinical, Medidata Rave, OpenClinica, or Snowflake; lab data; data from eCOA (electronic clinical outcome assessment) and ePRO (electronic patient-reported outcomes) tools; real-world data from apps and medical devices; and electronic health records (EHRs) from hospitals. Significant preprocessing is involved before this data can be made usable for regulatory submissions. Building connectors to various data sources and collecting them in a centralized data repository (CDR) or a clinical data lake, while maintaining proper access controls, poses significant challenges.

Another key challenge to overcome is that of regulatory compliance. The computer system used for creating regulatory submission outputs must be compliant with appropriate regulations, such as 21 CFR Part 11, HIPAA, GDPR, or any other GxP requirements or ICH guidelines. This translates to working in a validated and qualified environment with controls for access, security, backup, and auditability in place. This also means that any R packages that are used to create regulatory submission outputs must be validated before use.

Conclusion

In this post, we saw that some of the key deliverables for an eCTD submission are CDISC SDTM and ADaM datasets and TLF. This post outlined the steps needed to create these regulatory submission deliverables by first ingesting data from a couple of sources into RStudio on SageMaker. We then saw how to process the ingested data in XPT format; convert it into R data frames to create SDTM, ADaM, and TLF; and finally upload the results to an S3 bucket.

We hope that with the broad ideas laid out in this post, statistical programmers and biostatisticians can easily visualize the end-to-end process of loading clinical trial research data into RStudio on SageMaker, processing and analyzing it, and use these learnings to define a custom workflow suited to their regulatory submissions.

Can you think of other applications of RStudio that would help researchers, statisticians, and R programmers make their lives easier? We would love to hear about your ideas! And if you have any questions, please share them in the comments section.

About the authors

Rohit Banga is a Global Clinical Development Industry Specialist based out of London, UK. He is a biostatistician by training and helps Healthcare and LifeScience customers deploy innovative clinical development solutions on AWS. He is passionate about how data science, AI/ML, and emerging technologies can be used to solve real business problems within the Healthcare and LifeScience industry. In his spare time, Rohit enjoys skiing, BBQing, and spending time with family and friends.

Georgios Schinas is a Specialist Solutions Architect for AI/ML in the EMEA region. He is based in London and works closely with customers in UK and Ireland. Georgios helps customers design and deploy machine learning applications in production on AWS with a particular interest in MLOps practices and enabling customers to perform machine learning at scale. In his spare time, he enjoys traveling, cooking and spending time with friends and family.

Read More

Churn prediction using Amazon SageMaker built-in tabular algorithms LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular

Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. These algorithms and models can be used for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, and text.

Customer churn is a problem faced by a wide range of companies, from telecommunications to banking, where customers are typically lost to competitors. It’s in a company’s best interest to retain existing customers rather than acquire new customers because it usually costs significantly more to attract new customers. Mobile operators have historical records in which customers continued using the service or ultimately ended up churning. We can use this historical information of a mobile operator’s churn to train an ML model. After training this model, we can pass the profile information of an arbitrary customer (the same profile information that we used to train the model) to the model, and have it predict whether this customer is going to churn or not.

In this post, we train and deploy four recently released SageMaker algorithms—LightGBM, CatBoost, TabTransformer, and AutoGluon-Tabular—on a churn prediction dataset. We use SageMaker Automatic Model Tuning (a tool for hyperparameter optimization) to find the best hyperparameters for each model, and compare their performance on a holdout test dataset to select the optimal one.

You can also use this solution as a template to search over a collection of state-of-the-art tabular algorithms and use hyperparameter optimization to find the best overall model. You can easily replace the example dataset with your own to solve real business problems you’re interested in. If you want to jump straight into the SageMaker SDK code we go through in this post, you can refer to the following sample Jupyter notebook.

Benefits of SageMaker built-in algorithms

When selecting an algorithm for your particular type of problem and data, using a SageMaker built-in algorithm is the easiest option, because doing so comes with the following major benefits:

  • Low coding – The built-in algorithms require little coding to start running experiments. The only inputs you need to provide are the data, hyperparameters, and compute resources. This allows you to run experiments more quickly, with less overhead for tracking results and code changes.
  • Efficient and scalable algorithm implementations – The built-in algorithms come with parallelization across multiple compute instances and GPU support right out of the box for all applicable algorithms. If you have a lot of data with which to train your model, most built-in algorithms can easily scale to meet the demand. Even if you already have a pre-trained model, it may still be easier to use its counterpart in SageMaker and input the hyperparameters you already know rather than port it over and write a training script yourself.
  • Transparency – You’re the owner of the resulting model artifacts. You can take that model and deploy it on SageMaker for several different inference patterns (check out all the available deployment types) and easy endpoint scaling and management, or you can deploy it wherever else you need it.

Data visualization and preprocessing

First, we gather our customer churn dataset. It's a relatively small dataset with 5,000 records, where each record uses 21 attributes to describe the profile of a customer of an unknown US mobile operator. The attributes range from the US state where the customer resides, to the number of calls they placed to customer service, to the cost they are billed for daytime calls. We're trying to predict whether the customer will churn or not, which is a binary classification problem. The following is what a subset of those features looks like, with the label as the last column.

The following are some insights into each column, specifically the summary statistics and histograms of selected features.

We then preprocess the data, split it into training, validation, and test sets, and upload the data to Amazon Simple Storage Service (Amazon S3).
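The full preprocessing code is in the sample notebook; the following is a minimal sketch of that flow, in which the column names, split ratios, and S3 prefix are illustrative assumptions rather than the notebook's exact choices:

import numpy as np
import pandas as pd
import sagemaker

# Load the raw churn data (hypothetical local copy) and encode it
churn = pd.read_csv("churn.csv")
churn["target"] = (churn["Churn?"] == "True.").astype(int)  # assumed label column
features = pd.get_dummies(churn.drop(columns=["Churn?", "Phone"]))  # drop label and an ID-like column
data = pd.concat([churn["target"], features], axis=1)  # target first for the tabular algorithms

# Shuffle, then split 70/20/10 into train, validation, and test sets
train, validation, test = np.split(
    data.sample(frac=1, random_state=42),
    [int(0.7 * len(data)), int(0.9 * len(data))],
)

# Write the splits locally and upload them to Amazon S3
session = sagemaker.Session()
for name, df in [("train", train), ("validation", validation), ("test", test)]:
    df.to_csv(f"{name}.csv", index=False, header=False)
    session.upload_data(f"{name}.csv", key_prefix=f"churn-prediction/{name}")  # assumed prefix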

Automatic model tuning of tabular algorithms

Hyperparameters control how our underlying algorithms operate and influence the performance of the model. Those hyperparameters can be the number of layers, learning rate, weight decay rate, and dropout for neural network-based models, or the number of leaves, iterations, and maximum tree depth for tree ensemble models. To select the best model, we apply SageMaker automatic model tuning to each of the four trained SageMaker tabular algorithms. You need only select the hyperparameters to tune and a range for each parameter to explore. For more information about automatic model tuning, refer to Amazon SageMaker Automatic Model Tuning: Using Machine Learning for Machine Learning or Amazon SageMaker automatic model tuning: Scalable gradient-free optimization.

Let’s see how this works in practice.

LightGBM

We start by running automatic model tuning with LightGBM, and adapt that process to the other algorithms. As is explained in the post Amazon SageMaker JumpStart models and algorithms now available via API, the following artifacts are required to train a pre-built algorithm via the SageMaker SDK:

  • Its framework-specific container image, containing all the required dependencies for training and inference
  • The training and inference scripts for the selected model or algorithm

We first retrieve these artifacts, which depend on the model_id (lightgbm-classification-model in this case) and version:

from sagemaker import image_uris, model_uris, script_uris
train_model_id, train_model_version, train_scope = "lightgbm-classification-model", "*", "training"
training_instance_type = "ml.m5.4xlarge"

# Retrieve the docker image
train_image_uri = image_uris.retrieve(region=None,
                                      framework=None,
                                      model_id=train_model_id,
                                      model_version=train_model_version,
                                      image_scope=train_scope,
                                      instance_type=training_instance_type,
                                      )                                      
# Retrieve the training script
train_source_uri = script_uris.retrieve(model_id=train_model_id,
                                        model_version=train_model_version,
                                        script_scope=train_scope
                                        )
# Retrieve the pre-trained model tarball (in the case of tabular modeling, it is a dummy file)
train_model_uri = model_uris.retrieve(model_id=train_model_id,
                                      model_version=train_model_version,
                                      model_scope=train_scope)

We then get the default hyperparameters for LightGBM, set some of them to selected fixed values such as number of boosting rounds and evaluation metric on the validation data, and define the value ranges we want to search over for others. We use the SageMaker parameters ContinuousParameter and IntegerParameter for this:

from sagemaker import hyperparameters
from sagemaker.tuner import ContinuousParameter, IntegerParameter, HyperparameterTuner

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(model_id=train_model_id,
                                                   model_version=train_model_version
                                                   )
# [Optional] Override default hyperparameters with custom values
hyperparameters["num_boost_round"] = "500"
hyperparameters["metric"] = "auc"

# Define search ranges for other hyperparameters
hyperparameter_ranges_lgb = {
    "learning_rate": ContinuousParameter(1e-4, 1, scaling_type="Logarithmic"),
    "num_boost_round": IntegerParameter(2, 30),
    "num_leaves": IntegerParameter(10, 50),
    "feature_fraction": ContinuousParameter(0, 1),
    "bagging_fraction": ContinuousParameter(0, 1),
    "bagging_freq": IntegerParameter(1, 10),
    "max_depth": IntegerParameter(5, 30),
    "min_data_in_leaf": IntegerParameter(5, 50),
}

Finally, we create a SageMaker Estimator, feed it into a HyperparameterTuner, and start the hyperparameter tuning job with tuner.fit():

from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperParameterTuner

# Create SageMaker Estimator instance
tabular_estimator = Estimator(
    role=aws_role,
    image_uri=train_image_uri,
    source_dir=train_source_uri,
    model_uri=train_model_uri,
    entry_point="transfer_learning.py",
    instance_count=1,
    instance_type=training_instance_type,
    max_run=360000,
    hyperparameters=hyperparameters,
)

tuner = HyperparameterTuner(
            tabular_estimator,
            "auc",
            hyperparameter_ranges_lgb,
            [{"Name": "auc", "Regex": "auc: ([0-9\.]+)"}],
            max_jobs=10,
            max_parallel_jobs=5,
            objective_type="Maximize",
            base_tuning_job_name="some-name",  # job names allow only letters, numbers, and hyphens
        )

tuner.fit({"training": training_dataset_s3_path}, logs=True)

The max_jobs parameter defines how many training jobs will be run in total as part of the automatic model tuning job, and max_parallel_jobs defines how many concurrent training jobs should be started. We also set the objective type to "Maximize" so that the tuner maximizes the model's AUC (area under the curve). To dive deeper into the available parameters exposed by HyperparameterTuner, refer to HyperparameterTuner.
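When the tuning job finishes, you can also inspect every training job it launched and its final objective value directly from the SDK. The following is a short sketch using the tuner's analytics helper (the variable names follow the preceding snippets):

# One row per training job, with its hyperparameter values and final AUC
results_df = tuner.analytics().dataframe()

# Show the best-performing configurations first
print(results_df.sort_values("FinalObjectiveValue", ascending=False).head())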

Check out the sample notebook to see how we proceed to deploy and evaluate this model on the test set.
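As a rough sketch of that deployment flow (the instance type and payload handling here are assumptions; the notebook remains the reference):

from sagemaker.serializers import CSVSerializer

# Deploy the best model found by the tuning job to a real-time endpoint
predictor = tuner.deploy(initial_instance_count=1, instance_type="ml.m5.large")
predictor.serializer = CSVSerializer()

# Send test features (without the label column) as CSV for inference
response = predictor.predict(test_features)  # test_features: assumed array of test rows

# Delete the endpoint when finished to avoid ongoing charges
predictor.delete_endpoint()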

CatBoost

The process for hyperparameter tuning on the CatBoost algorithm is the same as before, although we need to retrieve model artifacts under the ID catboost-classification-model and change the range selection of hyperparameters:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparameters["iterations"] = "500"
hyperparameters["eval_metric"] = "AUC"

# Define search ranges for other hyperparameters
hyperparameter_ranges_cat = {
    "learning_rate": ContinuousParameter(0.00001, 0.1, scaling_type="Logarithmic"),
    "iterations": IntegerParameter(50, 1000),
    "depth": IntegerParameter(1, 10),
    "l2_leaf_reg": IntegerParameter(1, 10),
    "random_strength": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
}

TabTransformer

The process for hyperparameter tuning on the TabTransformer model is the same as before, although we need to retrieve model artifacts under the ID pytorch-tabtransformerclassification-model and change the range selection of hyperparameters.

We also change the training instance_type to ml.p3.2xlarge. TabTransformer is a model recently developed by Amazon research, which brings the power of deep learning to tabular data using Transformer models. To train this model efficiently, we need a GPU-backed instance. For more information, refer to Bringing the power of deep learning to data in tables.

from sagemaker import hyperparameters
from sagemaker.tuner import CategoricalParameter

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)
# [Optional] Override default hyperparameters with custom values
hyperparameters["n_epochs"] = 40  # The same hyperparameter is named as "iterations" for CatBoost
hyperparameters["patience"] = 10

# Define search ranges for other hyperparameters
hyperparameter_ranges_tab = {
    "learning_rate": ContinuousParameter(0.001, 0.01, scaling_type="Auto"),
    "batch_size": CategoricalParameter([64, 128, 256, 512]),
    "attn_dropout": ContinuousParameter(0.0, 0.8, scaling_type="Auto"),
    "mlp_dropout": ContinuousParameter(0.0, 0.8, scaling_type="Auto"),
    "input_dim": CategoricalParameter(["16", "32", "64", "128", "256"]),
    "frac_shared_embed": ContinuousParameter(0.0, 0.5, scaling_type="Auto"),
}

AutoGluon-Tabular

In the case of AutoGluon, we don’t run hyperparameter tuning. This is by design, because AutoGluon focuses on ensembling multiple models with sane choices of hyperparameters and stacking them in multiple layers. This ends up being more performant than training one model with the perfect selection of hyperparameters and is also computationally cheaper. For details, check out AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data.

Therefore, we switch the model_id to autogluon-classification-ensemble, and only fix the evaluation metric hyperparameter to our desired AUC score:

from sagemaker import hyperparameters

# Retrieve the default hyper-parameters for fine-tuning the model
hyperparameters = hyperparameters.retrieve_default(
    model_id=train_model_id, model_version=train_model_version
)

hyperparameters["eval_metric"] = "roc_auc"

Instead of calling tuner.fit(), we call estimator.fit() to start a single training job.
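In other words, after constructing the estimator exactly as in the LightGBM section (with the artifacts retrieved for this model ID), the training call is simply the following:

# Launch a single training job; no HyperparameterTuner is involved
tabular_estimator.fit({"training": training_dataset_s3_path}, logs=True)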

Benchmarking the trained models

After we deploy all four models, we send the full test set to each endpoint for prediction and calculate accuracy, F1, and AUC metrics for each (see code in the sample notebook). We present the results in the following table, with an important disclaimer: results and relative performance between these models will depend on the dataset you use for training. These results are representative, and even though the tendency for certain algorithms to perform better is based on relevant factors (for example, AutoGluon intelligently ensembles the predictions of both LightGBM and CatBoost models behind the scenes), the balance in performance might change given a different data distribution.

| Metric | LightGBM with Automatic Model Tuning | CatBoost with Automatic Model Tuning | TabTransformer with Automatic Model Tuning | AutoGluon-Tabular |
|---|---|---|---|---|
| Accuracy | 0.8977 | 0.9622 | 0.9511 | 0.98 |
| F1 | 0.8986 | 0.9624 | 0.9517 | 0.98 |
| AUC | 0.9629 | 0.9907 | 0.989 | 0.9979 |
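The metrics themselves are standard classification scores. As a brief sketch of how each row in the table can be computed from an endpoint's predictions (y_true and y_prob are placeholders for the test labels and the model's churn probabilities):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Placeholders; in practice these come from the test set and the endpoint responses
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.80, 0.65, 0.30, 0.90])

y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5
print("Accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))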

Conclusion

In this post, we trained four different SageMaker built-in algorithms to solve the customer churn prediction problem with low coding effort. We used SageMaker automatic model tuning to find the best hyperparameters to train these algorithms with, and compared their performance on a selected churn prediction dataset. You can use the related sample notebook as a template, replacing the dataset with your own to solve your desired tabular data-based problem.

Make sure to try these algorithms on SageMaker, and check out sample notebooks on how to use other built-in algorithms available on GitHub.


About the authors

Dr. Xin Huang is an Applied Scientist for Amazon SageMaker JumpStart and Amazon SageMaker built-in algorithms. He focuses on developing scalable machine learning algorithms. His research interests are in the area of natural language processing, explainable deep learning on tabular data, and robust analysis of non-parametric space-time clustering. He has published many papers in ACL, ICDM, KDD conferences, and Royal Statistical Society: Series A journal.

João Moura is an AI/ML Specialist Solutions Architect at Amazon Web Services. He is mostly focused on NLP use-cases and helping customers optimize Deep Learning model training and deployment. He is also an active proponent of low-code ML solutions and ML-specialized hardware.

Read More

Parallel data processing with RStudio on Amazon SageMaker

Last year, we announced the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench integrated development environment (IDE) in the cloud. You can quickly launch the familiar RStudio IDE, and dial up and down the underlying compute resources without interrupting your work, making it easy to build machine learning (ML) and analytics solutions in R at scale.

With ever-increasing data volumes being generated, datasets used for ML and statistical analysis are growing in tandem. This brings the challenges of increased development time and compute infrastructure management. To solve these challenges, data scientists have looked to implement parallel data processing techniques. Parallel data processing, or data parallelization, takes large existing datasets and distributes them across multiple processors or nodes to operate on the data simultaneously. This can allow for faster processing of larger datasets, along with optimized compute usage. This can help ML practitioners create reusable patterns for dataset generation, and also help reduce compute infrastructure load and cost.

Solution overview

Within Amazon SageMaker, many customers use SageMaker Processing to help implement parallel data processing. With SageMaker Processing, you can use a simplified, managed experience on SageMaker to run your data processing workloads, such as feature engineering, data validation, model evaluation, and model interpretation. This brings many benefits because there’s no long-running infrastructure to manage—processing instances spin down when jobs are complete, environments can be standardized via containers, data within Amazon Simple Storage Service (Amazon S3) is natively distributed across instances, and infrastructure settings are flexible in terms of memory, compute, and storage.

SageMaker Processing offers options for how to distribute data. For parallel data processing, you must use the ShardedByS3Key option for the S3DataDistributionType. When this parameter is selected, SageMaker Processing takes the provided n instances and distributes 1/n of the objects from the input data source to each instance. For example, if two instances are provided with four data objects, each instance receives two objects.
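In the SageMaker Python SDK (which the R code later in this post calls through reticulate), this distribution type is set on the processing input. The following is a minimal sketch; the S3 path is a placeholder:

from sagemaker.processing import ProcessingInput

# With ShardedByS3Key, each of the n instances receives 1/n of the input objects
sharded_input = ProcessingInput(
    source="s3://my-bucket/input-prefix/",       # assumed input location
    destination="/opt/ml/processing/input",
    s3_data_type="S3Prefix",
    s3_data_distribution_type="ShardedByS3Key",
)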

SageMaker Processing requires three components to run processing jobs:

  • A container image that has your code and dependencies to run your data processing workloads
  • A path to an input data source within Amazon S3
  • A path to an output data source within Amazon S3

The process is depicted in the following diagram.

In this post, we show you how to use RStudio on SageMaker to interface with a series of SageMaker Processing jobs to create a parallel data processing pipeline using the R programming language.

The solution consists of the following steps:

  1. Set up the RStudio project.
  2. Build and register the processing container image.
  3. Run the two-step processing pipeline:
    1. The first step takes multiple data files and processes them across a series of processing jobs.
    2. The second step concatenates the output files and splits them into train, test, and validation datasets.

Prerequisites

Complete the following prerequisites:

  1. Set up the RStudio on SageMaker Workbench. For more information, refer to Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists.
  2. Create a user with RStudio on SageMaker with appropriate access permissions.

Set up the RStudio project

To set up the RStudio project, complete the following steps:

  1. Navigate to your Amazon SageMaker Studio control panel on the SageMaker console.
  2. Launch your app in the RStudio environment.
  3. Start a new RStudio session.
  4. For Session Name, enter a name.
  5. For Instance Type and Image, use the default settings.
  6. Choose Start Session.
  7. Navigate into the session.
  8. Choose New Project, Version Control, and then select Git.
  9. For Repository URL, enter https://github.com/aws-samples/aws-parallel-data-processing-r.git
  10. Leave the remaining options as default and choose Create Project.

You can navigate to the aws-parallel-data-processing-R directory on the Files tab to view the repository. The repository contains the following files:

  • Container_Build.rmd
  • /dataset

    • bank-additional-full-data1.csv
    • bank-additional-full-data2.csv
    • bank-additional-full-data3.csv
    • bank-additional-full-data4.csv
  • /docker

    • Dockerfile-Processing
  • Parallel_Data_Processing.rmd
  • /preprocessing

    • filter.R
    • process.R

Build the container

In this step, we build our processing container image and push it to Amazon Elastic Container Registry (Amazon ECR). Complete the following steps:

  1. Navigate to the Container_Build.rmd file.
  2. Install the SageMaker Studio Image Build CLI by running the following cell. Make sure you have the required permissions prior to completing this step; this is a CLI designed to push and register container images within Studio.
    pip install sagemaker-studio-image-build

  3. Run the next cell to build and register our processing container:
    /home/sagemaker-user/.local/bin/sm-docker build . --file ./docker/Dockerfile-Processing --repository sagemaker-rstudio-parallel-processing:1.0

After the job has successfully run, you receive an output that looks like the following:

Image URI: <Account_Number>.dkr.ecr.<Region>.amazonaws.com/sagemaker-rstudio-parallel-processing:1.0

Run the processing pipeline

After you build the container, navigate to the Parallel_Data_Processing.rmd file. This file contains a series of steps that help us create our parallel data processing pipeline using SageMaker Processing. The following diagram depicts the steps of the pipeline that we complete.

Start by running the package import step. Import the required RStudio packages along with the SageMaker SDK:

suppressWarnings(library(dplyr))
suppressWarnings(library(reticulate))
suppressWarnings(library(readr))
path_to_python <- system('which python', intern = TRUE)

use_python(path_to_python)
sagemaker <- import('sagemaker')

Now set up your SageMaker execution role and environment details:

role = sagemaker$get_execution_role()
session = sagemaker$Session()
bucket = session$default_bucket()
account_id <- session$account_id()
region <- session$boto_region_name
local_path <- dirname(rstudioapi::getSourceEditorContext()$path)

Initialize the container that we built and registered in the earlier step:

container_uri <- paste(account_id, "dkr.ecr", region, "amazonaws.com/sagemaker-rstudio-parallel-processing:1.0", sep=".")
print(container_uri)

From here we dive into each of the processing steps in more detail.

Upload the dataset

For our example, we use the Bank Marketing dataset from UCI. We have already split the dataset into multiple smaller files. Run the following code to upload the files to Amazon S3:

local_dataset_path <- paste0(local_path,"/dataset/")

dataset_files <- list.files(path=local_dataset_path, pattern="\\.csv$", full.names=TRUE)
for (file in dataset_files){
  session$upload_data(file, bucket=bucket, key_prefix="sagemaker-rstudio-example/split")
}

input_s3_split_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/split")

After the files are uploaded, move to the next step.

Perform parallel data processing

In this step, we take the data files and perform feature engineering to filter out certain columns. This job is distributed across a series of processing instances (for our example, we use two).

We use the filter.R file to process the data, and configure the job as follows:

filter_processor <- sagemaker$processing$ScriptProcessor(command=list("Rscript"),
                                                        image_uri=container_uri,
                                                        role=role,
                                                        instance_count=2L,
                                                        instance_type="ml.m5.large")

output_s3_filter_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/filtered")
s3_filter_input <- sagemaker$processing$ProcessingInput(source=input_s3_split_location,
                                                        destination="/opt/ml/processing/input",
                                                        s3_data_distribution_type="ShardedByS3Key",
                                                        s3_data_type="S3Prefix")
s3_filter_output <- sagemaker$processing$ProcessingOutput(output_name="bank-additional-full-filtered",
                                                         destination=output_s3_filter_location,
                                                         source="/opt/ml/processing/output")

filtering_step <- sagemaker$workflow$steps$ProcessingStep(name="FilterProcessingStep",
                                                      code=paste0(local_path, "/preprocessing/filter.R"),
                                                      processor=filter_processor,
                                                      inputs=list(s3_filter_input),
                                                      outputs=list(s3_filter_output))

As mentioned earlier, when running a parallel data processing job, you must adjust the input parameters to specify how the data is sharded and its data type. Therefore, we set the data type to S3Prefix and the distribution type to ShardedByS3Key:

s3_data_distribution_type="ShardedByS3Key",
s3_data_type="S3Prefix")

After you insert these parameters, SageMaker Processing will equally distribute the data across the number of instances selected.

Adjust the parameters as necessary, and then run the cell to instantiate the job.

Generate training, test, and validation datasets

In this step, we take the processed data files, combine them, and split them into test, train, and validation datasets. This allows us to use the data for building our model.

We use the process.R file to process the data, and configure the job as follows:

script_processor <- sagemaker$processing$ScriptProcessor(command=list("Rscript"),
                                                         image_uri=container_uri,
                                                         role=role,
                                                         instance_count=1L,
                                                         instance_type="ml.m5.large")

output_s3_processed_location <- paste0("s3://", bucket, "/sagemaker-rstudio-example/processed")
s3_processed_input <- sagemaker$processing$ProcessingInput(source=output_s3_filter_location,
                                                         destination="/opt/ml/processing/input",
                                                         s3_data_type="S3Prefix")
s3_processed_output <- sagemaker$processing$ProcessingOutput(output_name="bank-additional-full-processed",
                                                         destination=output_s3_processed_location,
                                                         source="/opt/ml/processing/output")

processing_step <- sagemaker$workflow$steps$ProcessingStep(name="ProcessingStep",
                                                      code=paste0(local_path, "/preprocessing/process.R"),
                                                      processor=script_processor,
                                                      inputs=list(s3_processed_input),
                                                      outputs=list(s3_processed_output),
                                                      depends_on=list(filtering_step))

Adjust the parameters as necessary, and then run the cell to instantiate the job.

Run the pipeline

After all the steps are instantiated, start the processing pipeline to run each step by running the following cell:

pipeline = sagemaker$workflow$pipeline$Pipeline(
  name="BankAdditionalPipelineUsingR",
  steps=list(filtering_step, processing_step)
)

upserted <- pipeline$upsert(role_arn=role)
execution <- pipeline$start()

execution$describe()
execution$wait()

The time each of these jobs takes will vary based on the instance size and count selected.

Navigate to the SageMaker console to see all your processing jobs.

We start with the filtering job, as shown in the following screenshot.

When that’s complete, the pipeline moves to the data processing job.

When both jobs are complete, navigate to your S3 bucket. Look within the sagemaker-rstudio-example folder, under processed. You can see the files for the train, test, and validation datasets.

Conclusion

With the increased amount of data required to build more and more sophisticated models, we need to change our approach to how we process data. Parallel data processing is an efficient method for accelerating dataset generation, and when coupled with modern cloud environments and tooling such as RStudio on SageMaker and SageMaker Processing, it can remove much of the undifferentiated heavy lifting of infrastructure management, boilerplate code generation, and environment management. In this post, we walked through how you can implement parallel data processing within RStudio on SageMaker. We encourage you to try it out by cloning the GitHub repository, and if you have suggestions on how to make the experience better, please submit an issue or a pull request.

To learn more about the features and services used in this solution, refer to RStudio on Amazon SageMaker and Amazon SageMaker Processing.


About the authors

Raj Pathak is a Solutions Architect and Technical advisor to Fortune 50 and Mid-Sized FSI (Banking, Insurance, Capital Markets) customers across Canada and the United States. Raj specializes in Machine Learning with applications in Document Extraction, Contact Center Transformation and Computer Vision.

Jake Wen is a Solutions Architect at AWS with passion for ML training and Natural Language Processing. Jake helps Small Medium Business customers with design and thought leadership to build and deploy applications at scale. Outside of work, he enjoys hiking.

Aditi Rajnish is a first-year software engineering student at University of Waterloo. Her interests include computer vision, natural language processing, and edge computing. She is also passionate about community-based STEM outreach and advocacy. In her spare time, she can be found rock climbing, playing the piano, or learning how to bake the perfect scone.

Sean Morgan is an AI/ML Solutions Architect at AWS. He has experience in the semiconductor and academic research fields, and uses his experience to help customers reach their goals on AWS. In his free time, Sean is an active open-source contributor and maintainer, and is the special interest group lead for TensorFlow Add-ons.

Paul Wu is a Solutions Architect working in AWS’ Greenfield Business in Texas. His areas of expertise include containers and migrations.
