Managing your machine learning lifecycle with MLflow and Amazon SageMaker

With the rapid adoption of machine learning (ML) and MLOps, enterprises want to increase the velocity of ML projects from experimentation to production.

During the initial phase of an ML project, data scientists collaborate and share experiment results to find a solution to a business need. During the operational phase, you also need to manage the different model versions going to production and their lifecycle. In this post, we’ll show how the open-source platform MLflow helps address these issues. For those interested in a fully managed solution, Amazon Web Services recently announced Amazon SageMaker Pipelines at re:Invent 2020, the first purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). You can learn more about SageMaker Pipelines in this post.

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, deployment, and a central model registry. It includes the following components:

  • Tracking – Record and query experiments: code, data, configuration, and results
  • Projects – Package data science code in a format to reproduce runs on any platform
  • Models – Deploy ML models in diverse serving environments
  • Registry – Store, annotate, discover, and manage models in a central repository

The following diagram illustrates our architecture.

In the following sections, we show how to deploy MLflow on AWS Fargate and use it during your ML project with Amazon SageMaker. We use SageMaker to develop, train, tune, and deploy a Scikit-learn based ML model (random forest) using the Boston House Prices dataset. During our ML workflow, we track experiment runs and our models with MLflow.

SageMaker is a fully managed service that provides developers and data scientists the ability to build, train, and deploy ML models quickly. SageMaker removes the heavy lifting from each step of the ML process to make it easier to develop high-quality models.

Walkthrough overview

This post demonstrates how to do the following:

  • Host a serverless MLflow server on Fargate
  • Set Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS) as artifact and backend stores, respectively
  • Track experiments running on SageMaker with MLflow
  • Register models trained in SageMaker in the MLflow Model Registry
  • Deploy an MLflow model into a SageMaker endpoint

The detailed step-by-step code walkthrough is available in the GitHub repo.

Architecture overview

You can set up a central MLflow tracking server during your ML project. You use this remote MLflow server to manage experiments and models collaboratively. In this section, we show you how you can Dockerize your MLflow tracking server and host it on Fargate.

An MLflow tracking server also has two components for storage: a backend store and an artifact store.

We use an S3 bucket as our artifact store and an Amazon RDS for MySQL instance as our backend store.

The following diagram illustrates this architecture.

Running an MLflow tracking server on a Docker container

You can install MLflow using pip install mlflow and start your tracking server with the mlflow server command.

By default, the server runs on port 5000, so we expose it in our container. Use 0.0.0.0 to bind to all addresses if you want to access the tracking server from other machines. We install boto3 and pymysql dependencies for the MLflow server to communicate with the S3 bucket and the RDS for MySQL database. See the following code:

FROM python:3.8.0

RUN pip install \
    mlflow \
    pymysql \
    boto3 && \
    mkdir /mlflow/

EXPOSE 5000

## Environment variables made available through the Fargate task.
## Do not enter values
CMD mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --default-artifact-root ${BUCKET} \
    --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE}

Hosting an MLflow tracking server with Fargate

In this section, we show how you can run your MLflow tracking server on a Docker container that is hosted on Fargate.

Fargate is an easy way to deploy your containers on AWS. It allows you to use containers as a fundamental compute primitive without having to manage the underlying instances. All you need to do is specify an image to deploy and the amount of CPU and memory it requires. Fargate handles updating and securing the underlying Linux OS, Docker daemon, and Amazon Elastic Container Service (Amazon ECS) agent, as well as all the infrastructure capacity management and scaling.

For more information about running an application on Fargate, see Building, deploying, and operating containerized applications with AWS Fargate.

The MLflow container first needs to be built and pushed to an Amazon Elastic Container Registry (Amazon ECR) repository. The container image URI is then used when registering our Amazon ECS task definition. The ECS task has an AWS Identity and Access Management (IAM) role attached to it, allowing it to interact with AWS services such as Amazon S3.

The following screenshot shows our task configuration.

The Fargate service is set up with autoscaling and a network load balancer so it can adjust to the required compute load with minimal maintenance effort on our side.

When running our ML project, we set mlflow.set_tracking_uri(<load balancer uri>) to interact with the MLflow server via the load balancer.

Using Amazon S3 as the artifact store and Amazon RDS for MySQL as backend store

The artifact store is a location suitable for large data, such as an S3 bucket or a shared NFS file system, and is where clients log their artifact output (for example, models). MLflow natively supports Amazon S3 as an artifact store, and you can use --default-artifact-root ${BUCKET} to refer to the S3 bucket of your choice.

The backend store is where the MLflow tracking server stores experiment and run metadata, as well as parameters, metrics, and tags for runs. MLflow supports two types of backend stores: file store and database-backed store. It’s better to use an external database-backed store to persist the metadata.

As of this writing, you can use databases such as MySQL, SQLite, and PostgreSQL as a backend store with MLflow. For more information, see Backend Stores.

Amazon Aurora, a MySQL- and PostgreSQL-compatible relational database, can also be used as the backend store.

For this example, we set up an RDS for MySQL instance. Amazon RDS makes it easy to set up, operate, and scale MySQL deployments in the cloud. With Amazon RDS, you can deploy scalable MySQL servers in minutes with cost-efficient and resizable hardware capacity.

You can use --backend-store-uri mysql+pymysql://${USERNAME}:${PASSWORD}@${HOST}:${PORT}/${DATABASE} to refer MLflow to the MySQL database of your choice.

Launching the example MLflow stack

To launch your MLflow stack, follow these steps:

  1. Launch the AWS CloudFormation stack provided in the GitHub repo.
  2. Choose Next.
  3. Leave all options as default until you reach the final screen.
  4. Select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create.

The stack takes a few minutes to launch the MLflow server on Fargate, with an S3 bucket and a MySQL database on RDS. The load balancer URI is available on the Outputs tab of the stack.

You can then use the load balancer URI to access the MLflow UI.
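
If you prefer to retrieve the URI programmatically, the following is a minimal sketch using boto3 to read the stack outputs. The stack name and output key are placeholders and depend on how the CloudFormation template in the repo names them:

import boto3

# Hypothetical stack name and output key; adjust them to match the template in the repo
cfn = boto3.client('cloudformation')
stack = cfn.describe_stacks(StackName='<YOUR STACK NAME>')['Stacks'][0]

outputs = {o['OutputKey']: o['OutputValue'] for o in stack['Outputs']}
tracking_uri = outputs['LoadBalancerDNSName']  # assumed output key
print(tracking_uri)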

In this illustrative example stack, our load balancer is launched on a public subnet and is internet facing.

For security purposes, you may want to provision an internal load balancer in your VPC private subnets where there is no direct connectivity from the outside world. For more information, see Access Private applications on AWS Fargate using Amazon API Gateway PrivateLink.

Tracking SageMaker runs with MLflow

You now have a remote MLflow tracking server running and accessible through a REST API via the load balancer URI.

You can use the MLflow Tracking API to log parameters, metrics, and models when running your ML project with SageMaker. For this, you need to install the MLflow library in your SageMaker environment and set the remote tracking URI to your load balancer address.

The following Python API command allows you to point your code running on SageMaker to your MLflow remote server:

import mlflow
mlflow.set_tracking_uri('<YOUR LOAD BALANCER URI>')

Connect to your notebook instance and set the remote tracking URI. The following diagram shows the updated architecture.

Managing your ML lifecycle with SageMaker and MLflow

You can follow this example lab by running the notebooks in the GitHub repo.

This section describes how to develop, train, tune, and deploy a random forest model using Scikit-learn with the SageMaker Python SDK. We use the Boston Housing dataset, which is included in Scikit-learn, and log our ML runs in MLflow.

You can find the original lab in the SageMaker Examples GitHub repo for more details on using custom Scikit-learn scripts with SageMaker.

Creating an experiment and tracking ML runs

In this project, we create an MLflow experiment named boston-house and launch training jobs for our model in SageMaker. For each training job run in SageMaker, our Scikit-learn script records a new run in MLflow to keep track of input parameters, metrics, and the generated random forest model.

The following example API calls can help you start and manage MLflow runs:

  • start_run() – Starts a new MLflow run, setting it as the active run under which metrics and parameters are logged
  • log_params() – Logs a batch of parameters under the current run
  • log_metric() – Logs a metric under the current run
  • sklearn.log_model() – Logs a Scikit-learn model as an MLflow artifact for the current run

For a complete list of commands, see MLflow Tracking.

The following code demonstrates how you can use those API calls in your train.py script:

# set remote mlflow server
mlflow.set_tracking_uri(args.tracking_uri)
mlflow.set_experiment(args.experiment_name)

with mlflow.start_run():
    params = {
        "n-estimators": args.n_estimators,
        "min-samples-leaf": args.min_samples_leaf,
        "features": args.features
    }
    mlflow.log_params(params)
    
    # TRAIN
    logging.info('training model')
    model = RandomForestRegressor(
        n_estimators=args.n_estimators,
        min_samples_leaf=args.min_samples_leaf,
        n_jobs=-1
    )

    model.fit(X_train, y_train)

    # ABS ERROR AND LOG COUPLE PERF METRICS
    logging.info('evaluating model')
    abs_err = np.abs(model.predict(X_test) - y_test)

    for q in [10, 50, 90]:
        logging.info(f'AE-at-{q}th-percentile: {np.percentile(a=abs_err, q=q)}')
        mlflow.log_metric(f'AE-at-{str(q)}th-percentile', np.percentile(a=abs_err, q=q))

    # SAVE MODEL
    logging.info('saving model in MLflow')
    mlflow.sklearn.log_model(model, "model")

Your train.py script needs to know which MLflow tracking_uri and experiment_name to use to log the runs. You can pass those values to your script using the hyperparameters of the SageMaker training jobs. See the following code:

# uri of your remote mlflow server
tracking_uri = '<YOUR LOAD BALANCER URI>' 
experiment_name = 'boston-house'

hyperparameters = {
    'tracking_uri': tracking_uri,
    'experiment_name': experiment_name,
    'n-estimators': 100,
    'min-samples-leaf': 3,
    'features': 'CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT',
    'target': 'target'
}

estimator = SKLearn(
    entry_point='train.py',
    source_dir='source_dir',
    role=role,
    metric_definitions=metric_definitions,
    hyperparameters=hyperparameters,
    train_instance_count=1,
    train_instance_type='local',
    framework_version='0.23-1',
    base_job_name='mlflow-rf',
)

Performing automatic model tuning with SageMaker and tracking with MLflow

SageMaker automatic model tuning, also known as Hyperparameter Optimization (HPO), finds the best version of a model by running many training jobs on your dataset using the algorithm and ranges of hyperparameters that you specify. It then chooses the hyperparameter values that result in a model that performs the best, as measured by a metric that you choose.

In the 2_track_experiments_hpo.ipynb example notebook, we show how you can launch a SageMaker tuning job and track its training jobs with MLflow. It uses the same train.py script and data as the single training jobs, so you can accelerate the hyperparameter search for your MLflow model with minimal effort.
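
The notebook has the complete setup; as a rough sketch, a tuning job wrapping the estimator defined earlier could look like the following. The objective metric name, hyperparameter ranges, and input channels below are assumptions for illustration and should match your metric_definitions and data channels:

from sagemaker.tuner import HyperparameterTuner, IntegerParameter

# Assumed objective metric and ranges; align them with your metric_definitions
tuner = HyperparameterTuner(
    estimator=estimator,                            # the SKLearn estimator defined earlier
    objective_metric_name='AE-at-50th-percentile',  # assumed metric name
    objective_type='Minimize',
    metric_definitions=metric_definitions,
    hyperparameter_ranges={
        'n-estimators': IntegerParameter(50, 200),
        'min-samples-leaf': IntegerParameter(1, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# 'train' and 'test' channels are assumptions; use the channels your train.py script expects
tuner.fit({'train': train_path, 'test': test_path})

Because each training job started by the tuner calls the same train.py script, every trial is also recorded as a run in your boston-house MLflow experiment.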

When the SageMaker jobs are complete, you can navigate to the MLflow UI and compare results of different runs (see the following screenshot).

This can be useful to promote collaboration within your development team.

Managing models trained with SageMaker using the MLflow Model Registry

The MLflow Model Registry component allows you and your team to collaboratively manage the lifecycle of a model. You can add, modify, update, transition, or delete models created during the SageMaker training jobs in the Model Registry through the UI or the API.

In your project, you can select a run with the best model performance and register it into the MLflow Model Registry. The following screenshot shows example registry details.

After a model is registered, you can navigate to the Registered Models page and view its properties.
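
If you prefer the API over the UI, the following is a minimal sketch for registering the model of a given run and transitioning it between stages. The run ID, model name, and stage are placeholders:

import mlflow
from mlflow.tracking import MlflowClient

mlflow.set_tracking_uri('<YOUR LOAD BALANCER URI>')

# Hypothetical run ID and model name for illustration
run_id = '<RUN ID OF YOUR BEST RUN>'
model_name = 'boston-house-rf'

# Register the model logged under that run in the Model Registry
result = mlflow.register_model(f'runs:/{run_id}/model', model_name)

# Transition the new version to a stage (for example, Staging)
client = MlflowClient()
client.transition_model_version_stage(
    name=model_name,
    version=result.version,
    stage='Staging',
)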

Deploying your model in SageMaker using MLflow

This section shows how to use the mlflow.sagemaker module provided by MLflow to deploy a model into a SageMaker-managed endpoint. As of this writing, MLflow only supports deployments to SageMaker endpoints, but you can use the model binaries from the Amazon S3 artifact store and adapt them to your deployment scenarios.

First, you need to build a Docker container with the inference code and push it to Amazon ECR.

You can build your own image or use the mlflow sagemaker build-and-push-container command to have MLflow create one for you. This builds an image locally and pushes it to an Amazon ECR repository called mlflow-pyfunc.

The following example code shows how to use mlflow.sagemaker.deploy to deploy your model into a SageMaker endpoint:

# URL of the ECR-hosted Docker image the model should be deployed into
image_uri = '<YOUR mlflow-pyfunc ECR IMAGE URI>'
endpoint_name = 'boston-housing'
# The location, in URI format, of the MLflow model to deploy to SageMaker.
model_uri = '<YOUR MLFLOW MODEL LOCATION>'

mlflow.sagemaker.deploy(
    mode='create',
    app_name=endpoint_name,
    model_uri=model_uri,
    image_url=image_uri,
    execution_role_arn=role,
    instance_type='ml.m5.xlarge',
    instance_count=1,
    region_name=region
)

The command launches a SageMaker endpoint into your account, and you can use the following code to generate predictions in real time:

import json

import boto3
import pandas as pd
from sklearn.datasets import load_boston

# load boston dataset
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)

runtime = boto3.client('runtime.sagemaker')
# predict on the first row of the dataset
payload = df.iloc[[0]].to_json(orient="split")

runtime_response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=payload
)
result = json.loads(runtime_response['Body'].read().decode())
print(f'Payload: {payload}')
print(f'Prediction: {result}')

Current limitation on user access control

As of this writing, the open-source version of MLflow doesn’t provide user access control features in case you have multiple tenants on your MLflow server. This means any user with access to the server can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to keep strong model governance for audit purposes.

Summary

In this post, we covered how you can host an open-source MLflow server on AWS using Fargate, Amazon S3, and Amazon RDS. We then showed an example ML project lifecycle of tracking SageMaker training and tuning jobs with MLflow, managing model versions in the MLflow Model Registry, and deploying an MLflow model into a SageMaker endpoint for prediction. Try out the solution on your own by accessing the GitHub repo and let us know if you have any questions in the comments!


About the Authors

Sofian Hamiti is an AI/ML specialist Solutions Architect at AWS. He helps customers across industries accelerate their AI/ML journey by helping them build and operationalize end-to-end machine learning solutions.

Shreyas Subramanian is a Principal AI/ML specialist Solutions Architect who helps manufacturing, industrial, automotive, and aerospace customers build machine learning and optimization architectures to solve their business challenges using the AWS platform.
