Data scientists often work towards understanding the effects of various data preprocessing and feature engineering strategies in combination with different model architectures and hyperparameters. Doing so requires you to cover large parameter spaces iteratively, and it can be overwhelming to keep track of previously run configurations and results while keeping experiments reproducible.
This post walks you through an example of how to track your experiments across code, data, artifacts, and metrics by using Amazon SageMaker Experiments in conjunction with Data Version Control (DVC). We show how you can use DVC side by side with Amazon SageMaker processing and training jobs. We train different CatBoost models on the California housing dataset from the StatLib repository, and change holdout strategies while keeping track of the data version with DVC. In each individual experiment, we track input and output artifacts, code, and metrics using SageMaker Experiments.
SageMaker Experiments
SageMaker Experiments is an AWS service for tracking machine learning (ML) experiments. The SageMaker Experiments Python SDK is a high-level interface to this service that helps you track experiment information using Python.
The goal of SageMaker Experiments is to make it as simple as possible to create experiments, populate them with trials, add tracking and lineage information, and run analytics across trials and experiments.
When discussing SageMaker Experiments, we refer to the following concepts:
- Experiment – A collection of related trials. You add trials to an experiment that you want to compare together.
- Trial – A description of a multi-step ML workflow. Each step in the workflow is described by a trial component.
- Trial component – A description of a single step in an ML workflow, such as data cleaning, feature extraction, model training, or model evaluation.
- Tracker – A Python context manager for logging information about a single trial component (for example, parameters, metrics, or artifacts).
Data Version Control
Data Version Control (DVC) is a new type of data versioning, workflow, and experiment management software that builds upon Git (although it can work standalone). DVC reduces the gap between established engineering toolsets and data science needs, allowing you to take advantage of new features while reusing existing skills and intuition.
Data science experiment sharing and collaboration can be done through a regular Git flow (commits, branching, tagging, pull requests) the same way it works for software engineers. With Git and DVC, data science and ML teams can version experiments, manage large datasets, and make projects reproducible.
DVC has the following features:
- DVC is a free, open-source command line tool.
- DVC works on top of Git repositories and has a similar command line interface and flow as Git. DVC can also work standalone, but without versioning capabilities.
- Data versioning is enabled by replacing large files, dataset directories, ML models, and so on with small metafiles (easy to handle with Git). These placeholders point to the original data, which is decoupled from source code management.
- You can use on-premises or cloud storage to store the project’s data separate from its code base. This is how data scientists can transfer large datasets or share a GPU-trained model with others.
- DVC makes data science projects reproducible by creating lightweight pipelines using implicit dependency graphs, and by codifying the data and artifacts involved.
- DVC is platform agnostic. It runs on all major operating systems (Linux, macOS, and Windows), and works independently of the programming languages (Python, R, Julia, shell scripts, and so on) or ML libraries (Keras, TensorFlow, PyTorch, SciPy, and more) used in the project.
- DVC is quick to install and doesn’t require special infrastructure, nor does it depend on APIs or external services. It’s a standalone CLI tool.
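To make the metafile idea more concrete, the following is a simplified illustration of what such a placeholder can contain: a content hash, a size, and a path. It is not DVC’s actual implementation; the helper name build_placeholder and the returned layout are assumptions, loosely modeled on the outs entries of a .dvc file.

```python
import hashlib
import os
import tempfile


def build_placeholder(data_path: str) -> dict:
    """Return a small, Git-friendly description of a (possibly large) data file.

    Hypothetical helper: it mimics the idea behind a DVC metafile
    (content hash plus path), not the exact format DVC writes.
    """
    md5 = hashlib.md5()
    with open(data_path, "rb") as handle:
        # Read in chunks so that large files don't have to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "outs": [
            {
                "md5": md5.hexdigest(),
                "size": os.path.getsize(data_path),
                "path": os.path.basename(data_path),
            }
        ]
    }


# Usage example with a throwaway file standing in for a dataset.
with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "housing.csv")
    with open(sample_path, "wb") as handle:
        handle.write(b"hello")
    placeholder = build_placeholder(sample_path)
```

Only the small dictionary would be committed to Git; the data file itself lives in separate storage and is looked up by its hash.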
SageMaker Experiments and DVC sample
The following GitHub sample shows how to use DVC within the SageMaker environment. In particular, we look at how to build a custom image with DVC libraries installed by default to provide a consistent development environment to your data scientists in Amazon SageMaker Studio, and how to run DVC alongside SageMaker managed infrastructure for processing and training. Furthermore, we show how to enrich SageMaker tracking information with data versioning information from DVC, and visualize them within the Studio console.
The following diagram illustrates the solution architecture and workflow.
Build a custom Studio image with DVC already installed
In this GitHub repository, we explain how to create a custom image for Studio that has DVC already installed. The advantage of creating an image and making it available to all Studio users is that it creates a consistent environment for the Studio users, which they could also run locally. Although the sample is based on AWS Cloud9, you can also build the container on your local machine as long as you have Docker installed and running. This sample is based on the following Dockerfile and environment.yml. The resulting Docker image is stored in Amazon Elastic Container Registry (Amazon ECR) in your AWS account. See the following code:
You can now create a new Studio domain or update an existing Studio domain that has access to the newly created Docker image.
We use AWS Cloud Development Kit (AWS CDK) to create the following resources via AWS CloudFormation:
- A SageMaker execution role with the right permissions to your new or existing Studio domain
- A SageMaker image and SageMaker image version from the Docker image conda-env-dvc-kernel that we created earlier
- An AppImageConfig that specifies how the kernel gateway should be configured
- A Studio user (data-scientist-dvc) with the correct SageMaker execution role and the custom Studio image available to it
For detailed instructions, refer to Associate a custom image to SageMaker Studio.
Run the lab
To run the lab, complete the following steps:
- In the Studio domain, launch Studio for the data-scientist-dvc user.
- Choose the Git icon, then choose Clone a Repository.
- Enter the URL of the repository (https://github.com/aws-samples/amazon-sagemaker-experiments-dvc-demo) and choose Clone.
- In the file browser, choose the amazon-sagemaker-experiments-dvc-demo repository.
- Open the dvc_sagemaker_script_mode.ipynb notebook.
- For Custom Image, choose the image conda-env-dvc-kernel.
- Choose Select.
Configure DVC for data versioning
We create a subdirectory where we prepare the data: sagemaker-dvc-sample. Within this subdirectory, we initialize a new Git repository and set the remote to a repository we create in AWS CodeCommit. The goal is to have DVC configurations and files for data tracking versioned in this repository. Git also offers native capabilities to manage subprojects, for example git submodules and git subtrees, and you can extend this sample to use whichever of these tools best fits your workflow.
The main advantage of using CodeCommit with SageMaker in our case is its integration with AWS Identity and Access Management (IAM) for authentication and authorization, meaning we can use IAM roles to push and pull data without the need to fetch credentials (or SSH keys). Setting the appropriate permissions on the SageMaker execution role also allows the Studio notebook and the SageMaker training and processing job to interact securely with CodeCommit.
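The setup described above can be sketched as a short list of Git and DVC commands. This is a minimal sketch under stated assumptions, not the notebook’s actual code: the Region, the CodeCommit repository name, the S3 location, and the DVC remote name storage are placeholders, and the codecommit:: URL form assumes the git-remote-codecommit helper is installed.

```python
import subprocess
from typing import List


def dvc_setup_commands(region: str, repo_name: str, s3_remote: str) -> List[List[str]]:
    """Build the commands that initialize Git and DVC in a data directory."""
    # URL form understood by the git-remote-codecommit helper (assumed installed).
    codecommit_url = f"codecommit::{region}://{repo_name}"
    return [
        ["git", "init"],
        ["git", "remote", "add", "origin", codecommit_url],
        ["dvc", "init"],
        # "-d" makes this remote the default target for dvc push and dvc pull.
        ["dvc", "remote", "add", "-d", "storage", s3_remote],
    ]


def run_all(commands: List[List[str]], cwd: str) -> None:
    """Run each command inside cwd and stop at the first failure."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


# Placeholder values; replace them with your own Region, repository, and bucket.
commands = dvc_setup_commands(
    region="eu-west-1",
    repo_name="sagemaker-dvc-sample",
    s3_remote="s3://my-bucket/sagemaker-dvc-sample",
)
# run_all(commands, cwd="sagemaker-dvc-sample")  # requires git and dvc on the PATH
```

Building the command list separately from running it makes it easy to review what will be executed before touching the repository.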
Although you can replace CodeCommit with any other source control service, such as GitHub, GitLab, or Bitbucket, you need to consider how to handle the credentials for your system. One possibility is to store these credentials in AWS Secrets Manager and fetch them at run time from the Studio notebook as well as from the SageMaker processing and training jobs.
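If you go down the Secrets Manager route, fetching the credentials at run time could look like the following sketch. The secret layout (a JSON document with username and token keys), the helper names, and the HTTPS URL form are assumptions made for illustration; adapt them to how your Git provider expects credentials.

```python
import json
from urllib.parse import quote


def parse_git_secret(secret_string: str) -> dict:
    """Extract the fields we need from the JSON stored in the secret (assumed layout)."""
    data = json.loads(secret_string)
    return {"username": data["username"], "token": data["token"]}


def fetch_git_credentials(secret_id: str, region: str) -> dict:
    """Read the secret from AWS Secrets Manager at run time."""
    import boto3  # imported lazily so the pure helpers work without boto3

    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return parse_git_secret(response["SecretString"])


def authenticated_url(host_and_path: str, credentials: dict) -> str:
    """Build an HTTPS clone URL, percent-encoding the credentials."""
    user = quote(credentials["username"], safe="")
    token = quote(credentials["token"], safe="")
    return f"https://{user}:{token}@{host_and_path}"


# Usage example with a made-up secret value (no AWS call involved).
example = parse_git_secret('{"username": "ci-bot", "token": "a/b"}')
clone_url = authenticated_url("github.com/org/repo.git", example)
```

The execution role used by the notebook and by the jobs needs permission to read the secret for fetch_git_credentials to succeed.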
Process and train with DVC and SageMaker
In this section, we explore two different approaches to tackle our problem and how we can keep track of the two tests using SageMaker Experiments according to the high-level conceptual architecture we showed you earlier.
Set up a SageMaker experiment
To track these tests in SageMaker, we need to create an experiment. We also need to define the trials within the experiment. For the sake of simplicity, we consider one trial per test, but you can have any number of trials within an experiment, for example, if you want to test different algorithms.
We create an experiment named DEMO-sagemaker-experiments-dvc with two trials, dvc-trial-single-file and dvc-trial-multi-files, each representing a different version of the dataset.
Let’s create the DEMO-sagemaker-experiments-dvc experiment:
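A minimal sketch of this step with the SageMaker Experiments Python SDK (the smexperiments package) might look like the following. The description text and the helper names are assumptions; the experiment and trial names are the ones used in this post.

```python
EXPERIMENT_NAME = "DEMO-sagemaker-experiments-dvc"
TRIAL_NAMES = ("dvc-trial-single-file", "dvc-trial-multi-files")


def create_experiment(description: str = "Track data versions with DVC"):
    """Create the experiment (requires AWS credentials and smexperiments)."""
    from smexperiments.experiment import Experiment  # imported lazily

    return Experiment.create(experiment_name=EXPERIMENT_NAME, description=description)


def create_trial(trial_name: str):
    """Create one trial and attach it to the experiment."""
    from smexperiments.trial import Trial  # imported lazily

    return Trial.create(trial_name=trial_name, experiment_name=EXPERIMENT_NAME)


def experiment_config_for(trial_name: str) -> dict:
    """Build the experiment_config dictionary passed to processing and training jobs."""
    if trial_name not in TRIAL_NAMES:
        raise ValueError(f"Unknown trial: {trial_name}")
    return {"ExperimentName": EXPERIMENT_NAME, "TrialName": trial_name}


# Pure usage example; no AWS call is made here.
single_file_config = experiment_config_for("dvc-trial-single-file")
```

Passing this dictionary as experiment_config when starting a job is what associates the resulting trial component with the trial.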
Test 1: Generate single files for training and validation
In this section, we create a processing script that fetches the raw data directly from Amazon Simple Storage Service (Amazon S3) as input; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. Furthermore, we show how you can track output artifacts generated by DVC with SageMaker when running processing and training jobs and via SageMaker Experiments.
First, we create the dvc-trial-single-file trial and add it to the DEMO-sagemaker-experiments-dvc experiment. By doing so, we keep all trial components related to this test organized in a meaningful way.
Use DVC in a SageMaker processing job to create the single file version
In this section, we create a processing script that gets the raw data directly from Amazon S3 as input using the managed data loading capability of SageMaker; processes it to create the train, validation, and test datasets; and stores the results back to Amazon S3 using DVC. It’s very important to understand that when using DVC to store data to Amazon S3 (or pull data from Amazon S3), we’re losing SageMaker managed data loading capabilities, which can potentially have an impact on performance and costs of our processing and training jobs, especially when working with very large datasets. For more information on the different SageMaker native input mode capabilities, refer to Access Training Data.
Finally, we unify DVC tracking capabilities with SageMaker tracking capabilities when running processing jobs via SageMaker Experiments.
The processing script expects the address of the Git repository and the branch we want to create to store the DVC metadata, both passed via environment variables. The datasets themselves are stored in Amazon S3 by DVC. Although environment variables are automatically tracked in SageMaker Experiments and visible in the trial component parameters, we might want to enrich the trial components with further information, which then becomes available for visualization in the Studio UI, by using a tracker object. In our case, the trial component parameters include the following:
- DVC_REPO_URL
- DVC_BRANCH
- USER
- data_commit_hash
- train_test_split_ratio
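As a rough sketch of how a processing script can pick up these inputs, the following reads the split ratio from the command line and the DVC settings from environment variables. The default values and the helper names are assumptions for illustration, not the values used by the script in the repository.

```python
import argparse
import os


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command-line arguments passed to the processing job."""
    parser = argparse.ArgumentParser()
    # The default of 0.3 is an arbitrary placeholder.
    parser.add_argument("--train-test-split-ratio", type=float, default=0.3)
    return parser.parse_args(argv)


def read_dvc_settings(environ=None) -> dict:
    """Collect the DVC-related environment variables set on the job."""
    environ = os.environ if environ is None else environ
    return {
        "DVC_REPO_URL": environ["DVC_REPO_URL"],  # required
        "DVC_BRANCH": environ["DVC_BRANCH"],      # required
        "USER": environ.get("USER", "sagemaker"),  # assumed fallback value
    }


# Usage example with explicit inputs instead of the real job environment.
args = parse_args(["--train-test-split-ratio", "0.2"])
settings = read_dvc_settings(
    {
        "DVC_REPO_URL": "codecommit::eu-west-1://sagemaker-dvc-sample",
        "DVC_BRANCH": "dvc-trial-single-file",
    }
)
```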
The preprocessing script clones the Git repository; generates the train, validation, and test datasets; and syncs them using DVC. As mentioned earlier, when using DVC, we can’t take advantage of native SageMaker data loading capabilities. Aside from the performance penalties we might suffer on large datasets, we also lose the automatic tracking capabilities for the output artifacts. However, thanks to the tracker and the DVC Python API, we can compensate for these shortcomings, retrieve such information at run time, and store it in the trial component with little effort. The added value of doing so is a single view of the input and output artifacts that belong to this specific processing job.
The full preprocessing Python script is available in the GitHub repo.
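To give an idea of the shape of that logic, here is a condensed sketch of the sync and tracking steps. It is not the script from the repository: the dataset directory name, the commit message, the artifact names, and the helper names are assumptions. It relies on Tracker.load() resolving the trial component from the job environment and on dvc.api.get_url returning the storage location of a tracked path.

```python
import subprocess
from typing import Dict, List


def sync_commands(branch: str, dataset_dir: str = "dataset") -> List[List[str]]:
    """Commands that version the generated datasets on a new branch."""
    return [
        ["git", "checkout", "-b", branch],
        ["dvc", "add", dataset_dir],  # writes <dataset_dir>.dvc and updates .gitignore
        ["git", "add", f"{dataset_dir}.dvc", ".gitignore"],
        ["git", "commit", "-m", f"Add dataset for {branch}"],
        ["dvc", "push"],  # uploads the data to the DVC remote in Amazon S3
        ["git", "push", "--set-upstream", "origin", branch],
    ]


def current_commit_hash(repo_dir: str) -> str:
    """Return the Git commit that pins this version of the data."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def log_data_version(repo_dir: str, split_ratio: float, outputs: Dict[str, str]) -> None:
    """Record the data version and DVC-managed outputs in the trial component.

    outputs maps an artifact name to a DVC-tracked path (hypothetical layout).
    """
    import dvc.api
    from smexperiments.tracker import Tracker

    with Tracker.load() as tracker:
        tracker.log_parameters(
            {
                "data_commit_hash": current_commit_hash(repo_dir),
                "train_test_split_ratio": split_ratio,
            }
        )
        for name, path in outputs.items():
            tracker.log_output(name=name, value=dvc.api.get_url(path, repo=repo_dir))


commands = sync_commands("dvc-trial-single-file")
```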
SageMaker gives us the possibility to run our processing script on container images managed by AWS that are optimized to run on the AWS infrastructure. If our script requires additional dependencies, we can supply a requirements.txt file. When we start the processing job, SageMaker uses pip install to install all the libraries we need (for example, DVC-related libraries). If you need tighter control over the libraries installed on the containers, you can bring your own container to SageMaker, for example for processing and training.
We now have all the ingredients to run our SageMaker processing job:
- A processing script that can process several arguments (such as --train-test-split-ratio) and two environment variables (DVC_REPO_URL and DVC_BRANCH)
- A requirements.txt file
- A Git repository (in CodeCommit)
- A SageMaker experiment and trial
We then run the processing job with the preprocessing-experiment.py script, experiment_config, dvc_repo_url, and dvc_branch we defined earlier.
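One way to launch such a job with the SageMaker Python SDK is the FrameworkProcessor, sketched below. The framework version, instance type, source directory name, and container input path are assumptions; the sketch also assumes the requirements.txt file sits next to the script in the source directory so that it’s installed when the job starts.

```python
def processing_job_settings(dvc_repo_url: str, dvc_branch: str, split_ratio: float = 0.2) -> dict:
    """Collect the environment variables and arguments for the processing job."""
    return {
        "env": {
            "DVC_REPO_URL": dvc_repo_url,
            "DVC_BRANCH": dvc_branch,
            "USER": "sagemaker",  # assumed value
        },
        "arguments": ["--train-test-split-ratio", str(split_ratio)],
    }


def run_processing_job(role: str, s3_data_path: str, experiment_config: dict, settings: dict) -> None:
    """Start the processing job (requires AWS credentials and the sagemaker SDK)."""
    from sagemaker.processing import FrameworkProcessor, ProcessingInput
    from sagemaker.sklearn.estimator import SKLearn

    processor = FrameworkProcessor(
        estimator_cls=SKLearn,
        framework_version="0.23-1",    # assumed version
        role=role,
        instance_count=1,
        instance_type="ml.m5.xlarge",  # assumed instance type
        env=settings["env"],
    )
    processor.run(
        code="preprocessing-experiment.py",
        source_dir="source_dir",  # assumed folder holding the script and requirements.txt
        inputs=[ProcessingInput(source=s3_data_path, destination="/opt/ml/processing/input")],
        arguments=settings["arguments"],
        experiment_config=experiment_config,
    )


settings = processing_job_settings(
    "codecommit::eu-west-1://sagemaker-dvc-sample", "dvc-trial-single-file"
)
```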
The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the single file dataset.
The following screenshot shows where you can find the stored information within Studio. Note the values for dvc-trial-single-file in DVC_BRANCH, DVC_REPO_URL, and data_commit_hash on the Parameters tab.
Also note the input and output details on the Artifacts tab.
Create an estimator and fit the model with single file data version
To use the DVC integration inside a SageMaker training job, we pass dvc_repo_url and dvc_branch as environment variables when we create the Estimator object.
We train on the dvc-trial-single-file branch first.
When pulling data with DVC, we use the following dataset structure:
Now we create a Scikit-learn Estimator using the SageMaker Python SDK. This allows us to specify the following:
- The path to the Python source file, which should be run as the entry point to training.
- The IAM role that controls permissions for accessing Amazon S3 and CodeCommit data and running SageMaker functions.
- A list of dictionaries that define the metrics used to evaluate the training jobs.
- The number and type of training instances. We use one ml.m5.large instance.
- Hyperparameters that are used for training.
- Environment variables to use during the training job. We use DVC_REPO_URL, DVC_BRANCH, and USER.
We call the fit method of the Estimator with the experiment_config we defined earlier to start the training.
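The settings listed above map onto the SKLearn estimator of the SageMaker Python SDK roughly as follows. This is a sketch under stated assumptions: the entry point name, source directory, framework version, metric name and regular expression, and hyperparameter values are placeholders, not the ones used in the notebook.

```python
import re

# Hypothetical metric definition: SageMaker applies the regex to the training logs.
METRIC_DEFINITIONS = [
    {"Name": "validation-rmse", "Regex": r"validation-rmse: ([0-9.]+)"},
]


def create_estimator(role: str, dvc_repo_url: str, dvc_branch: str):
    """Build the estimator (requires the sagemaker SDK and AWS credentials)."""
    from sagemaker.sklearn.estimator import SKLearn

    return SKLearn(
        entry_point="train.py",     # assumed script name
        source_dir="source_dir",    # assumed folder
        role=role,
        framework_version="0.23-1",  # assumed version
        instance_count=1,
        instance_type="ml.m5.large",
        metric_definitions=METRIC_DEFINITIONS,
        hyperparameters={"learning_rate": 1, "depth": 6},  # illustrative values
        environment={
            "DVC_REPO_URL": dvc_repo_url,
            "DVC_BRANCH": dvc_branch,
            "USER": "sagemaker",
        },
    )


# estimator = create_estimator(role, dvc_repo_url, dvc_branch)
# estimator.fit(experiment_config=experiment_config)

# Check the regex against a made-up log line.
match = re.search(METRIC_DEFINITIONS[0]["Regex"], "epoch 10 validation-rmse: 0.482")
captured_value = match.group(1)
```

Testing the regular expression against a sample log line before launching the job is a cheap way to avoid a training run that reports no metrics.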
The training job takes approximately 5 minutes to complete. The logs show the following lines, indicating the files pulled by DVC:
Test 2: Generate multiple files for training and validation
We create a new dvc-trial-multi-files trial and add it to the current DEMO-sagemaker-experiments-dvc experiment.
Unlike the first processing script, we now create multiple files for training and validation from the original dataset and store the DVC metadata in a different branch.
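As an illustration of the multi-file idea, the following sketch writes a set of rows into several CSV files of nearly equal size. It uses only the standard library and is not the logic of the second preprocessing script; the file name prefix, the number of parts, and the helper name are assumptions.

```python
import csv
import os
import tempfile
from typing import List, Sequence


def write_in_parts(rows: Sequence[Sequence], header: Sequence[str],
                   out_dir: str, prefix: str, num_parts: int) -> List[str]:
    """Write rows into num_parts CSV files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for part in range(num_parts):
        # Stride slicing spreads rows evenly; part sizes differ by at most one row.
        part_rows = rows[part::num_parts]
        path = os.path.join(out_dir, f"{prefix}_{part + 1}.csv")
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            writer.writerows(part_rows)
        paths.append(path)
    return paths


# Usage example with ten made-up rows split into three files.
with tempfile.TemporaryDirectory() as workdir:
    rows = [[i, i * 2] for i in range(10)]
    paths = write_in_parts(rows, ["x", "y"], os.path.join(workdir, "train"), "california_train", 3)
    file_names = [os.path.basename(p) for p in paths]
    row_counts = []
    for p in paths:
        with open(p, newline="") as handle:
            row_counts.append(sum(1 for _ in csv.reader(handle)) - 1)  # exclude header
```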
You can explore the second preprocessing Python script on GitHub.
The processing job takes approximately 5 minutes to complete. Now you can view the trial details for the multi-file dataset.
The following screenshots show where you can find the stored information within SageMaker Experiments in the Trial components section within the Studio UI. Note the values for dvc-trial-multi-files in DVC_BRANCH, DVC_REPO_URL, and data_commit_hash on the Parameters tab.
You can also review the input and output details on the Artifacts tab.
We now train on the dvc-trial-multi-files branch. When pulling data with DVC, we use the following dataset structure:
As before, we create a new Scikit-learn Estimator with the trial name dvc-trial-multi-files and start the training job.
The training job takes approximately 5 minutes to complete. In the training job logs output to the notebook, you can see the following lines, indicating the files pulled by DVC:
Host your model in SageMaker
After you train your ML model, you can deploy it using SageMaker. To deploy a persistent, real-time endpoint that makes one prediction at a time, we use SageMaker real-time hosting services.
First, we get the latest test dataset locally on the development notebook in Studio. For this purpose, we can use dvc.api.read() to load the raw data that was stored in Amazon S3 by the SageMaker processing job.
Then we prepare the data using Pandas, load a test CSV file, and call predictor.predict to invoke the SageMaker endpoint created earlier, with data, and get predictions.
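The following sketch shows the general shape of these two steps. To stay dependency-free, it parses the CSV with the standard library instead of Pandas; the file path inside the repository, the column names, and the helper names are assumptions.

```python
import csv
import io
from typing import List, Tuple


def load_test_csv(repo: str, rev: str, path: str = "dataset/test/california_test.csv") -> str:
    """Read a DVC-tracked file as text; repo is a local path or URL of the Git repository.

    The default path is hypothetical; use the path tracked in your repository.
    """
    import dvc.api  # imported lazily so the parsing helper works without DVC

    return dvc.api.read(path, repo=repo, rev=rev)


def split_features_and_target(csv_text: str, target_column: str) -> Tuple[List[List[float]], List[float]]:
    """Separate the target column from the feature columns."""
    features, targets = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        targets.append(float(row.pop(target_column)))
        features.append([float(value) for value in row.values()])
    return features, targets


# Usage example with made-up data instead of a real DVC read.
sample_text = "medianHouseValue,medianIncome,housingMedianAge\n4.5,8.3,41\n"
features, targets = split_features_and_target(sample_text, "medianHouseValue")

# With a deployed endpoint, the features would then be sent for inference:
# predictions = predictor.predict(features)
```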
Delete the endpoint
You should delete endpoints when they’re no longer in use to avoid unexpected costs, because they’re billed for the time they’re deployed (for more information, see Amazon SageMaker Pricing).
Clean up
Before you remove all the resources you created, make sure that all apps are deleted from the data-scientist-dvc user, including all KernelGateway apps, as well as the default JupyterServer app.
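If you prefer to script this check instead of using the console, a sketch with boto3 could look like the following. It only handles apps that are in service and ignores pagination, so treat it as a starting point; the helper names and the sample data are assumptions.

```python
from typing import Dict, List


def in_service_apps(apps: List[Dict]) -> List[Dict]:
    """Keep only the apps that are currently running and can be deleted."""
    return [app for app in apps if app["Status"] == "InService"]


def delete_user_apps(domain_id: str, user_profile_name: str = "data-scientist-dvc") -> None:
    """Delete the running apps of a Studio user (requires boto3 and AWS credentials)."""
    import boto3

    client = boto3.client("sagemaker")
    response = client.list_apps(
        DomainIdEquals=domain_id,
        UserProfileNameEquals=user_profile_name,
    )
    for app in in_service_apps(response["Apps"]):
        client.delete_app(
            DomainId=domain_id,
            UserProfileName=user_profile_name,
            AppType=app["AppType"],
            AppName=app["AppName"],
        )


# Usage example of the filter with made-up app records.
sample_apps = [
    {"AppType": "KernelGateway", "AppName": "kernel-app-1", "Status": "InService"},
    {"AppType": "JupyterServer", "AppName": "default", "Status": "Deleted"},
]
remaining = [app["AppName"] for app in in_service_apps(sample_apps)]
```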
Then you can destroy the AWS CDK stack by running the following command:
If you used an existing domain, also run the following commands:
Conclusion
In this post, you walked through an example of how to track your experiments across code, data, artifacts, and metrics by using SageMaker Experiments and SageMaker processing and training jobs in conjunction with DVC. We created a Docker image containing DVC, which was required for Studio as the development notebook, and showed how you can use processing and training jobs with DVC. We prepared two versions of the data and used DVC to manage it with Git. Then you used SageMaker Experiments to track the processing and training with the two versions of the data in order to have a unified view of parameters, artifacts, and metrics in a single pane of glass. Finally, you deployed the model to a SageMaker endpoint and used a testing dataset from the second dataset version to invoke the SageMaker endpoint and get predictions.
As a next step, you can extend the existing notebook, introduce your own feature engineering strategy, and use DVC and SageMaker to run your experiments. Let’s go build!
For further reading, refer to the following resources:
- Amazon SageMaker Experiments – Organize, Track And Compare Your Machine Learning Training
- Manage Machine Learning with Amazon SageMaker Experiments
- SageMaker Experiments Python SDK
- DVC – Get Started
- DVC – Get Started: Data and Model Access
About the Authors
Paolo Di Francesco is a solutions architect at AWS. He has experience in telecommunications and software engineering. He is passionate about machine learning and is currently focusing on using his experience to help customers reach their goals on AWS, in particular in discussions around MLOps. Outside of work, he enjoys playing football and reading.
Eitan Sela is a Machine Learning Specialist Solutions Architect with Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them build and operate machine learning solutions on AWS. In his spare time, Eitan enjoys jogging and reading the latest machine learning articles.