As organizations grow in size and scale, the complexities of running workloads increase, and the need to develop and operationalize processes and workflows becomes critical. Therefore, organizations have adopted technology best practices, including microservice architecture, MLOps, DevOps, and more, to improve delivery time, reduce defects, and increase employee productivity. This post introduces a best practice for managing custom code within your Amazon SageMaker Data Wrangler workflow.
Data Wrangler is a low-code tool that facilitates data analysis, preprocessing, and visualization. It contains over 300 built-in data transformation steps to aid with feature engineering, normalization, and cleansing to transform your data without having to write any code.
In addition to the built-in transforms, Data Wrangler contains a custom code editor that allows you to implement custom code written in Python, PySpark, or SparkSQL.
When you use Data Wrangler custom transform steps to implement your own functions, you should follow best practices for developing and deploying code in Data Wrangler flows.
This post shows how you can use code stored in AWS CodeCommit in the Data Wrangler custom transform step. This provides you with additional benefits, including:
- Improve productivity and collaboration across personnel and teams
- Version your custom code
- Modify your Data Wrangler custom transform step without having to log in to Amazon SageMaker Studio to use Data Wrangler
- Reference parameter files in your custom transform step
- Scan code in CodeCommit using Amazon CodeGuru or any third-party application for security vulnerabilities before using it in Data Wrangler flows
Solution overview
This post demonstrates how to build a Data Wrangler flow file with a custom transform step. Instead of hardcoding the custom function into your custom transform step, you pull a script containing the function from CodeCommit, load it, and call the loaded function in your custom transform step.
For this post, we use the bank-full.csv data from the University of California, Irvine Machine Learning Repository to demonstrate these functionalities. The data is related to the direct marketing campaigns of a banking institution. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed (yes) or not subscribed (no).
The following diagram illustrates this solution.
The workflow is as follows:
- Create a Data Wrangler flow file and import the dataset from Amazon Simple Storage Service (Amazon S3).
- Create a series of Data Wrangler transformation steps:
- A custom transform step to implement custom code stored in CodeCommit.
- Two built-in transform steps.
We keep the transformation steps to a minimum so as not to detract from the aim of this post, which is focused on the custom transform step. For more information about available transformation steps and implementation, refer to Transform Data and the Data Wrangler blog.
- In the custom transform step, write code to pull the script and configuration file from CodeCommit, load the script as a Python module, and call a function in the script. The function takes a configuration file as an argument.
- Run a Data Wrangler job and set Amazon S3 as the destination.
Destination options also include Amazon SageMaker Feature Store.
Prerequisites
As a prerequisite, we set up the CodeCommit repository, Data Wrangler flow, and CodeCommit permissions.
Create a CodeCommit repository
For this post, we use an AWS CloudFormation template to set up a CodeCommit repository and copy the required files into this repository. Complete the following steps:
- Choose Launch Stack:
- Select the Region where you want to create the CodeCommit repository.
- Enter a name for Stack name.
- Enter a name for the repository to be created for RepoName.
- Choose Create stack.
AWS CloudFormation takes a few seconds to provision your CodeCommit repository. After the CREATE_COMPLETE status appears, navigate to the CodeCommit console to see your newly created repository.
Set up Data Wrangler
Download the bank.zip dataset from the University of California, Irvine Machine Learning Repository. Then, extract the contents of bank.zip and upload bank-full.csv to Amazon S3.
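If you prefer to stage the dataset programmatically instead of through the console, the following sketch downloads the archive, extracts bank-full.csv, and uploads it to Amazon S3. The bucket name and key are placeholders, and you should verify the download URL against the UCI repository page.

```python
# Hypothetical helper: download bank.zip, extract bank-full.csv, and upload it to Amazon S3.
# Replace the bucket name and key, and verify the download URL before running.
import io
import urllib.request
import zipfile

import boto3

DATA_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
BUCKET = "your-bucket-name"  # placeholder
KEY = "data/bank-full.csv"   # placeholder

# Download the archive into memory
with urllib.request.urlopen(DATA_URL) as response:
    archive_bytes = response.read()

# Extract bank-full.csv from the archive
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    csv_bytes = archive.read("bank-full.csv")

# Upload the CSV to Amazon S3
boto3.client("s3").put_object(Bucket=BUCKET, Key=KEY, Body=csv_bytes)
print(f"Uploaded s3://{BUCKET}/{KEY}")
```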
To create a Data Wrangler flow file and import the bank-full.csv dataset from Amazon S3, complete the following steps:
- Onboard to SageMaker Studio using the quick start for users new to Studio.
- Select your SageMaker domain and user profile and on the Launch menu, choose Studio.
- On the Studio console, on the File menu, choose New, then choose Data Wrangler Flow.
- Choose Amazon S3 for Data sources.
- Navigate to the S3 bucket containing the file and choose the bank-full.csv file.
A Preview Error will be thrown.
- In the Details pane to the right, change Delimiter to SEMICOLON.
A preview of the dataset will be displayed in the result window.
- In the Details pane, on the Sampling drop-down menu, choose None.
This is a relatively small dataset, so you don’t need to sample.
- Choose Import.
Configure CodeCommit permissions
You need to provide Studio with permission to access CodeCommit. We use a CloudFormation template to provision an AWS Identity and Access Management (IAM) policy that gives your Studio role permission to access CodeCommit. Complete the following steps:
- Choose Launch Stack:
- Select the Region you are working in.
- Enter a name for Stack name.
- Enter your Studio domain ID for SageMakerDomainID. The domain information is available on the SageMaker console Domains page, as shown in the following screenshot.
- Enter your Studio domain user profile name for SageMakerUserProfileName. You can view your user profile name by navigating into your Studio domain. If you have multiple user profiles in your Studio domain, enter the name for the user profile used to launch Studio.
- Select the acknowledgement box.
The IAM resources used by this CloudFormation template provide the minimum permissions to successfully create the IAM policy attached to your Studio role for CodeCommit access.
- Choose Create stack.
Transformation steps
Next, we add transformations to process the data.
Custom transform step
In this post, we calculate the variance inflation factor (VIF) for each numerical feature and drop features that exceed a VIF threshold. We do this in the custom transform step because Data Wrangler doesn’t have a built-in transform for this task as of this writing.
However, we don’t hardcode this VIF function. Instead, we pull this function from the CodeCommit repository into the custom transform step. Then we run the function on the dataset.
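The CloudFormation stack you launched earlier already copied the script and its parameter.json configuration file into your repository, so you don’t need to author them yourself. Purely as an illustration of the idea (the function name, signature, and configuration key below are assumptions, not the exact contents of the repository script), a VIF-based feature dropper could look something like this:

```python
# Illustrative sketch only: not the exact script shipped to the CodeCommit repository.
# Assumes a pandas DataFrame and a configuration dict such as {"threshold": 1.2}.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def drop_high_vif_features(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Iteratively drop the numerical feature with the highest VIF above the threshold."""
    threshold = config.get("threshold", 1.2)
    numeric_cols = list(df.select_dtypes(include="number").columns)

    while len(numeric_cols) > 1:
        # Compute the VIF for every remaining numerical feature
        vif = pd.Series(
            [variance_inflation_factor(df[numeric_cols].values, i) for i in range(len(numeric_cols))],
            index=numeric_cols,
        )
        worst = vif.idxmax()
        if vif[worst] <= threshold:
            break
        # Drop the most collinear feature and recompute on the next iteration
        numeric_cols.remove(worst)
        df = df.drop(columns=[worst])
    return df
```

Because the custom transform step in this post uses the Python (PySpark) framework, the Spark DataFrame that Data Wrangler provides would need to be converted (for example, with toPandas()) before calling a pandas-based function like this, as shown in the next step.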
- On the Data Wrangler console, navigate to your data flow.
- Choose the plus sign next to Data types and choose Add transform.
- Choose + Add step.
- Choose Custom transform.
- Optionally, enter a name in the Name field.
- Choose Python (PySpark) on the drop-down menu.
- For Your custom transform, enter the following code (provide the name of the CodeCommit repository and Region where the repository is located):
The code uses the AWS SDK for Python (Boto3) to access CodeCommit API functions. We use the get_file API function to pull files from the CodeCommit repository into the Data Wrangler environment.
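The exact listing isn’t reproduced here; the following sketch shows what such a custom transform could look like. The repository name, Region, file paths, and function name are placeholders to adapt to your own setup, and the pandas conversion assumes the repository script operates on a pandas DataFrame.

```python
# Sketch of a Data Wrangler custom transform using the Python (PySpark) framework.
# The repository name, Region, file paths, and function name are placeholders.
import importlib.util
import json

import boto3
from pyspark.sql import SparkSession

REPO_NAME = "your-codecommit-repo"  # name of your CodeCommit repository
REGION = "us-east-1"                # Region where the repository was created
SCRIPT_PATH = "vif_selection.py"    # script stored in the repository (placeholder name)
CONFIG_PATH = "parameter.json"      # configuration file stored in the repository

codecommit = boto3.client("codecommit", region_name=REGION)

# Pull the script and the configuration file from the repository's default branch
script_bytes = codecommit.get_file(repositoryName=REPO_NAME, filePath=SCRIPT_PATH)["fileContent"]
config_bytes = codecommit.get_file(repositoryName=REPO_NAME, filePath=CONFIG_PATH)["fileContent"]

# Write the script locally and load it as a Python module
local_script = "/tmp/vif_selection.py"
with open(local_script, "wb") as f:
    f.write(script_bytes)
spec = importlib.util.spec_from_file_location("vif_selection", local_script)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)

# Data Wrangler exposes the dataset as the Spark DataFrame `df`; convert it,
# call the loaded function with the configuration, and convert the result back
config = json.loads(config_bytes)
result = module.drop_high_vif_features(df.toPandas(), config)
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(result)
```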
- Choose Preview.
In the Output pane, a table is displayed showing the different numerical features and their corresponding VIF values. For this exercise, the VIF threshold value is set to 1.2. However, you can modify this threshold value in the parameter.json file found in your CodeCommit repository. You will notice that two columns have been dropped (pdays and previous), bringing the total column count to 15.
- Choose Add.
Encode categorical features
Some of the features are categorical variables that need to be transformed into numerical form. Use the one-hot encode built-in transform to achieve this data transformation. Let’s create numerical features representing the unique values in each categorical feature in the dataset. Complete the following steps:
- Choose + Add step.
- Choose the Encode categorical transform.
- On the Transform drop-down menu, choose One-hot encode.
- For Input column, choose all categorical features, including poutcome, y, month, marital, contact, default, education, housing, job, and loan.
- For Output style, choose Columns.
- Choose Preview to preview the results.
One-hot encoding might take a while to generate results, given the number of features and unique values within each feature.
- Choose Add.
For each numerical feature created with one-hot encoding, the name is the categorical feature name followed by an underscore (_) and the unique categorical value within that feature.
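For example, encoding the marital column produces columns such as marital_married and marital_single. The following standalone pandas snippet (illustrative only, not part of the Data Wrangler flow) mirrors that naming convention:

```python
# Illustrative only: pandas one-hot encoding that mirrors the Data Wrangler column naming.
import pandas as pd

sample = pd.DataFrame({"marital": ["married", "single", "divorced"]})
encoded = pd.get_dummies(sample, columns=["marital"], prefix_sep="_")
print(list(encoded.columns))  # ['marital_divorced', 'marital_married', 'marital_single']
```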
Drop column
The y_yes feature is the target column for this exercise, so we drop the y_no feature.
- Choose + Add step.
- Choose Manage columns.
- Choose Drop column under Transform.
- Choose y_no under Columns to drop.
- Choose Preview, then choose Add.
Create a Data Wrangler job
Now that you have created all the transform steps, you can create a Data Wrangler job to process your input data and store the output in Amazon S3. Complete the following steps:
- Choose Data flow to go back to the Data Flow page.
- Choose the plus sign on the last tile of your flow visualization.
- Choose Add destination and choose Amazon S3.
- Enter the name of the output file for Dataset name.
- Choose Browse and choose the bucket destination for Amazon S3 location.
- Choose Add destination.
- Choose Create job.
- Change the Job name value as you see fit.
- Choose Next, 2. Configure job.
- Change Instance count to 1 to reduce the cost incurred, because we’re working with a relatively small dataset.
- Choose Create.
This will start an Amazon SageMaker Processing job to process your Data Wrangler flow file and store the output in the specified S3 bucket.
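You can monitor the job on the SageMaker console under Processing jobs, or programmatically with the SDK. A minimal sketch follows; the name filter is an assumption, so check the console for the actual job name.

```python
# Minimal sketch: list recent SageMaker Processing jobs and check their status.
import boto3

sm = boto3.client("sagemaker")

# Data Wrangler processing jobs typically include "data-wrangler" in their name;
# adjust the filter if your job is named differently.
jobs = sm.list_processing_jobs(
    NameContains="data-wrangler", SortBy="CreationTime", SortOrder="Descending", MaxResults=5
)["ProcessingJobSummaries"]
for job in jobs:
    print(job["ProcessingJobName"], job["ProcessingJobStatus"])
```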
Automation
Now that you have created your Data Wrangler flow file, you can schedule your Data Wrangler jobs to automatically run at specific times and frequency. This is a feature that comes out of the box with Data Wrangler and simplifies the process of scheduling Data Wrangler jobs. Furthermore, CRON expressions are supported and provide additional customization and flexibility in scheduling your Data Wrangler jobs.
However, this post shows how you can automate the Data Wrangler job to run every time there is a modification to the files in the CodeCommit repository. This automation technique ensures that any changes to the custom code functions or changes to values in the configuration file in CodeCommit trigger a Data Wrangler job to reflect these changes immediately.
Therefore, you don’t have to manually start a Data Wrangler job to get the output data that reflects the changes you just made. With this automation, you can improve the agility and scale of your Data Wrangler workloads. To automate your Data Wrangler jobs, you configure the following:
- Amazon SageMaker Pipelines – Pipelines helps you create machine learning (ML) workflows with an easy-to-use Python SDK, and you can visualize and manage your workflow using Studio.
- Amazon EventBridge – EventBridge facilitates connection to AWS services, software as a service (SaaS) applications, and custom applications as event producers to launch workflows.
Create a SageMaker pipeline
First, you need to create a SageMaker pipeline for your Data Wrangler job. Then complete the following steps to export your Data Wrangler flow to a SageMaker pipeline:
- Choose the plus sign on your last transform tile (the transform tile before the Destination tile).
- Choose Export to.
- Choose SageMaker Inference Pipeline (via Jupyter Notebook).
This creates a new Jupyter notebook prepopulated with code to create a SageMaker pipeline for your Data Wrangler job. Before running all the cells in the notebook, you may want to change certain variables.
- To add a training step to your pipeline, change the add_training_step variable to True.
Be aware that running a training job will incur additional costs on your account.
- Set the target_attribute_name variable to y_yes.
- To change the name of the pipeline, change the pipeline_name variable.
- Lastly, run the entire notebook by choosing Run and Run All Cells.
This creates a SageMaker pipeline and runs the Data Wrangler job.
- To view your pipeline, choose the home icon on the navigation pane and choose Pipelines.
You can see the new SageMaker pipeline created.
- Choose the newly created pipeline to see the run list.
- Note the name of the SageMaker pipeline, as you will use it later.
- Choose the first run and choose Graph to see a Directed Acyclic Graph (DAG) flow of your SageMaker pipeline.
As shown in the following screenshot, we didn’t add a training step to our pipeline. If you added a training step to your pipeline, it will display in your pipeline run Graph tab under DataWranglerProcessingStep.
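You can also inspect or launch pipeline executions programmatically. The following is a minimal sketch, with a placeholder pipeline name that you should replace with the name you noted earlier:

```python
# Minimal sketch: list recent executions of the pipeline and start a new one on demand.
import boto3

PIPELINE_NAME = "DataWranglerPipeline"  # placeholder; use the pipeline name you noted earlier
sm = boto3.client("sagemaker")

# List recent executions and their status
summaries = sm.list_pipeline_executions(PipelineName=PIPELINE_NAME)["PipelineExecutionSummaries"]
for execution in summaries:
    print(execution["PipelineExecutionArn"], execution["PipelineExecutionStatus"])

# Start a new execution on demand
response = sm.start_pipeline_execution(PipelineName=PIPELINE_NAME)
print("Started:", response["PipelineExecutionArn"])
```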
Create an EventBridge rule
After successfully creating your SageMaker pipeline for the Data Wrangler job, you can move on to setting up an EventBridge rule. This rule listens to activities in your CodeCommit repository and triggers the run of the pipeline in the event of a modification to any file in the CodeCommit repository. We use a CloudFormation template to automate creating this EventBridge rule. Complete the following steps:
- Choose Launch Stack:
- Select the Region you are working in.
- Enter a name for Stack name.
- Enter a name for your EventBridge rule for EventRuleName.
- Enter the name of the pipeline you created for PipelineName.
- Enter the name of the CodeCommit repository you are working with for RepoName.
- Select the acknowledgement box.
The IAM resources that this CloudFormation template uses provide the minimum permissions to successfully create the EventBridge rule.
- Choose Create stack.
It takes a few minutes for the CloudFormation template to run successfully. When the status changes to CREATE_COMPLETE, you can navigate to the EventBridge console to see the created rule.
Now that you have created this rule, any changes you make to the file in your CodeCommit repository will trigger the run of the SageMaker pipeline.
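The CloudFormation stack creates the rule for you; purely for reference, a boto3 sketch of a comparable rule and target might look like the following. The account ID, Region, names, and role ARN are placeholders, and the stack’s actual rule definition may differ in its details.

```python
# Sketch only: a boto3 approximation of the EventBridge rule the CloudFormation stack creates.
# The account ID, Region, repository name, pipeline name, and role ARN are placeholders.
import json

import boto3

REGION = "us-east-1"
ACCOUNT_ID = "111122223333"
REPO_NAME = "your-codecommit-repo"
PIPELINE_NAME = "DataWranglerPipeline"
TARGET_ROLE_ARN = f"arn:aws:iam::{ACCOUNT_ID}:role/EventBridgeStartPipelineRole"

events = boto3.client("events", region_name=REGION)

# Fire on any reference update (for example, a commit) in the CodeCommit repository
event_pattern = {
    "source": ["aws.codecommit"],
    "detail-type": ["CodeCommit Repository State Change"],
    "resources": [f"arn:aws:codecommit:{REGION}:{ACCOUNT_ID}:{REPO_NAME}"],
    "detail": {"event": ["referenceUpdated"]},
}

events.put_rule(
    Name="datawrangler-codecommit-rule",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# Target the SageMaker pipeline; EventBridge assumes the role to start the execution
events.put_targets(
    Rule="datawrangler-codecommit-rule",
    Targets=[
        {
            "Id": "StartDataWranglerPipeline",
            "Arn": f"arn:aws:sagemaker:{REGION}:{ACCOUNT_ID}:pipeline/{PIPELINE_NAME}",
            "RoleArn": TARGET_ROLE_ARN,
        }
    ],
)
```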
To test the pipeline, edit a file in your CodeCommit repository: modify the VIF threshold in your parameter.json file to a different number, then go to the SageMaker pipeline details page to see a new run of your pipeline created.
In this new pipeline run, Data Wrangler drops the numerical features with a VIF value greater than the threshold you specified in the parameter.json file in CodeCommit.
You have successfully automated and decoupled your Data Wrangler job. Furthermore, you can add more steps to your SageMaker pipeline. You can also modify the custom scripts in CodeCommit to implement various functions in your Data Wrangler flow.
It’s also possible to store your scripts and files in Amazon S3 and download them into your Data Wrangler custom transform step as an alternative to CodeCommit. In addition, you ran your custom transform step using the Python (PySpark) framework. However, you can also use the Python (Pandas) framework for your custom transform step, allowing you to run custom Python scripts. You can test this out by changing the framework in the custom transform step to Python (Pandas) and modifying your custom transform step code to pull and run the Python script version stored in your CodeCommit repository. However, the PySpark option in Data Wrangler provides better performance when working on a large dataset compared to the Python (Pandas) option.
Clean up
After you’re done experimenting with this use case, clean up the resources you created to avoid incurring additional charges to your account:
- Stop the underlying instance used to create your Data Wrangler flow.
- Delete the resources created by the various CloudFormation templates.
- If you see a DELETE_FAILED state when deleting a CloudFormation stack, delete the stack again to remove it successfully.
Summary
This post showed you how to decouple your Data Wrangler custom transform step by pulling scripts from CodeCommit. We also showed how to automate your Data Wrangler jobs using SageMaker Pipelines and EventBridge.
Now you can operationalize and scale your Data Wrangler jobs without modifying your Data Wrangler flow file. You can also scan your custom code in CodeCommit using CodeGuru or any third-party application for vulnerabilities before implementing it in Data Wrangler. To learn more about end-to-end machine learning operations (MLOps) on AWS, visit Amazon SageMaker for MLOps.
About the Author
Uchenna Egbe is an Associate Solutions Architect at AWS. He spends his free time researching herbs, teas, superfoods, and how to incorporate them into his daily diet.