Deploy Amazon SageMaker Autopilot models to serverless inference endpoints

Amazon SageMaker Autopilot automatically builds, trains, and tunes the best machine learning (ML) models based on your data, while allowing you to maintain full control and visibility. Autopilot can also deploy trained models to real-time inference endpoints automatically.

If you have workloads with spiky or unpredictable traffic patterns that can tolerate cold starts, then deploying the model to a serverless inference endpoint would be more cost efficient.

Amazon SageMaker Serverless Inference is a purpose-built inference option ideal for workloads with unpredictable traffic patterns and that can tolerate cold starts. Unlike a real-time inference endpoint, which is backed by a long-running compute instance, serverless endpoints provision resources on demand with built-in auto scaling. Serverless endpoints scale automatically based on the number of incoming requests and scale down resources to zero when there are no incoming requests, helping you minimize your costs.

In this post, we show how to deploy Autopilot trained models to serverless inference endpoints using the Boto3 libraries for Amazon SageMaker.

Autopilot training modes

Before creating an Autopilot experiment, you can either let Autopilot select the training mode automatically, or you can select the training mode manually.

Autopilot currently supports three training modes:

  • Auto – Based on dataset size, Autopilot automatically chooses either ensembling or HPO mode. For datasets larger than 100 MB, Autopilot chooses HPO; otherwise, it chooses ensembling.
  • Ensembling – Autopilot uses the AutoGluon ensembling technique using model stacking and produces an optimal predictive model.
  • Hyperparameter optimization (HPO) – Autopilot finds the best version of a model by tuning hyperparameters using Bayesian optimization or multi-fidelity optimization while running training jobs on your dataset. HPO mode selects the algorithms most relevant to your dataset and selects the best range of hyperparameters to tune your models.

To learn more about Autopilot training modes, refer to Training modes.

Solution overview

In this post, we use the UCI Bank Marketing dataset to predict if a client will subscribe to a term deposit offered by the bank. This is a binary classification problem type.

We launch two Autopilot jobs using the Boto3 libraries for SageMaker. The first job uses ensembling as the chosen training mode. We then deploy the single ensemble model generated to a serverless endpoint and send inference requests to this hosted endpoint.

The second job uses the HPO training mode. For classification problem types, Autopilot generates three inference containers. We extract these three inference containers and deploy them to separate serverless endpoints. Then we send inference requests to these hosted endpoints.

For more information about regression and classification problem types, refer to Inference container definitions for regression and classification problem types.

We can also launch Autopilot jobs from the Amazon SageMaker Studio UI. If you launch jobs from the UI, ensure to turn off the Auto deploy option in the Deployment and Advanced settings section. Otherwise, Autopilot will deploy the best candidate to a real-time endpoint.

Prerequisites

Ensure you have the latest version of Boto3 and the SageMaker Python packages installed:

pip install -U boto3 sagemaker

We need the SageMaker package version >= 2.110.0 and Boto3 version >= boto3-1.24.84.

Launch an Autopilot job with ensembling mode

To launch an Autopilot job using the SageMaker Boto3 libraries, we use the create_auto_ml_job API. We then pass in AutoMLJobConfig, InputDataConfig, and AutoMLJobObjective as inputs to the create_auto_ml_job. See the following code:

bucket = session.default_bucket()
role = sagemaker.get_execution_role()
prefix = "autopilot/bankadditional"
sm_client = boto3.Session().client(service_name='sagemaker',region_name=region)

timestamp_suffix = strftime('%d%b%Y-%H%M%S', gmtime())
automl_job_name = f"uci-bank-marketing-{timestamp_suffix}"
max_job_runtime_seconds = 3600
max_runtime_per_job_seconds = 1200
target_column = "y"
problem_type="BinaryClassification"
objective_metric = "F1"
training_mode = "ENSEMBLING"

automl_job_config = {
    'CompletionCriteria': {
      'MaxRuntimePerTrainingJobInSeconds': max_runtime_per_job_seconds,
      'MaxAutoMLJobRuntimeInSeconds': max_job_runtime_seconds
    },    
    "Mode" : training_mode
}

automl_job_objective= { "MetricName": objective_metric }

input_data_config = [
    {
      'DataSource': {
        'S3DataSource': {
          'S3DataType': 'S3Prefix',
          'S3Uri': f's3://{bucket}/{prefix}/raw/bank-additional-full.csv'
        }
      },
      'TargetAttributeName': target_column
    }
  ]

output_data_config = {
	    'S3OutputPath': f's3://{bucket}/{prefix}/output'
	}


sm_client.create_auto_ml_job(
				AutoMLJobName=auto_ml_job_name,
				InputDataConfig=input_data_config,
				OutputDataConfig=output_data_config,
				AutoMLJobConfig=automl_job_config,
				ProblemType=problem_type,
				AutoMLJobObjective=automl_job_objective,
				RoleArn=role)

Autopilot returns the BestCandidate model object that has the InferenceContainers required to deploy the models to inference endpoints. To get the BestCandidate for the preceding job, we use the describe_automl_job function:

job_response = sm_client.describe_auto_ml_job(AutoMLJobName=automl_job_name)
best_candidate = job_response['BestCandidate']
inference_container = job_response['BestCandidate']['InferenceContainers'][0]
print(inference_container)

Deploy the trained model

We now deploy the preceding inference container to a serverless endpoint. The first step is to create a model from the inference container, then create an endpoint configuration in which we specify the MemorySizeInMB and MaxConcurrency values for the serverless endpoint along with the model name. Finally, we create an endpoint with the endpoint configuration created above.

We recommend choosing your endpoint’s memory size according to your model size. The memory size should be at least as large as your model size. Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB).

The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB.

To help determine whether a serverless endpoint is the right deployment option from a cost and performance perspective, we encourage you to refer to the SageMaker Serverless Inference Benchmarking Toolkit, which tests different endpoint configurations and compares the most optimal one against a comparable real-time hosting instance.

Note that serverless endpoints only accept SingleModel for inference containers. Autopilot in ensembling mode generates a single model, so we can deploy this model container as is to the endpoint. See the following code:

# Create Model
	model_response = sm_client.create_model(
				ModelName=model_name,
				ExecutionRoleArn=role,
				Containers=[inference_container]
	)

# Create Endpoint Config
	epc_response = sm_client.create_endpoint_config(
		EndpointConfigName = endpoint_config_name,
		ProductionVariants=[
			{
				"ModelName": model_name,
				"VariantName": "AllTraffic",
				"ServerlessConfig": {
					"MemorySizeInMB": memory,
					"MaxConcurrency": max_concurrency
				}
			}
		]
	)

# Create Endpoint
	ep_response = sm_client.create_endpoint(
		EndpointName=endpoint_name,
		EndpointConfigName=endpoint_config_name
	)

When the serverless inference endpoint is InService, we can test the endpoint by sending an inference request and observe the predictions. The following diagram illustrates the architecture of this setup.

Note that we can send raw data as a payload to the endpoint. The ensemble model generated by Autopilot automatically incorporates all required feature-transform and inverse-label transform steps, along with the algorithm model and packages, into a single model.

Send inference request to the trained model

Use the following code to send inference on your model trained using ensembling mode:

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer


payload = "34,blue-collar,married,basic.4y,no,no,no,telephone,may,tue,800,4,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0"

predictor = Predictor(
        endpoint_name=endpoint,
        sagmaker_session=session,
        serializer=CSVSerializer(),
    )

prediction = predictor.predict(payload).decode(‘utf-8’)
print(prediction)

Launch an Autopilot Job with HPO mode

In HPO mode, for CompletionCriteria, besides MaxRuntimePerTrainingJobInSeconds and MaxAutoMLJobRuntimeInSeconds, we could also specify the MaxCandidates to limit the number of candidates an Autopilot job will generate. Note that these are optional parameters and are only set to limit the job runtime for demonstration. See the following code:

training_mode = "HYPERPARAMETER_TUNING"

automl_job_config["Mode"] = training_mode
automl_job_config["CompletionCriteria"]["MaxCandidates"] = 15
hpo_automl_job_name =  f"{model_prefix}-HPO-{timestamp_suffix}"

response = sm_client.create_auto_ml_job(
					  AutoMLJobName=hpo_automl_job_name,
					  InputDataConfig=input_data_config,
					  OutputDataConfig=output_data_config,
					  AutoMLJobConfig=automl_job_config,
					  ProblemType=problem_type,
					  AutoMLJobObjective=automl_job_objective,
					  RoleArn=role,
					  Tags=tags_config
				)

To get the BestCandidate for the preceding job, we can again use the describe_automl_job function:

job_response = sm_client.describe_auto_ml_job(AutoMLJobName=automl_job_name)
best_candidate = job_response['BestCandidate']
inference_containers = job_response['BestCandidate']['InferenceContainers']
print(inference_containers)

Deploy the trained model

Autopilot in HPO mode for the classification problem type generates three inference containers.

The first container handles the feature-transform steps. Next, the algorithm container generates the predicted_label with the highest probability. Finally, the post-processing inference container performs an inverse transform on the predicted label and maps it to the original label. For more information, refer to Inference container definitions for regression and classification problem types.

We extract these three inference containers and deploy them to a separate serverless endpoints. For inference, we invoke the endpoints in sequence by sending the payload first to the feature-transform container, then passing the output from this container to the algorithm container, and finally passing the output from the previous inference container to the post-processing container, which outputs the predicted label.

The following diagram illustrates the architecture of this setup. diagram that illustrates Autopilot model in HPO mode deployed to three serverless endpoints

We extract the three inference containers from the BestCandidate with the following code:

job_response = sm_client.describe_auto_ml_job(AutoMLJobName=automl_job_name)
inference_containers = job_response['BestCandidate']['InferenceContainers']

models = list()
endpoint_configs = list()
endpoints = list()

# For brevity, we've encapsulated create_model, create endpoint_config and create_endpoint as helper functions
for idx, container in enumerate(inference_containers):
    (status, model_arn) = create_autopilot_model(
								    sm_client,
								    automl_job_name,
            						role,
								    container,
								    idx)
    model_name = model_arn.split('/')[1]
    models.append(model_name)

    endpoint_config_name = f"epc-{model_name}"
    endpoint_name = f"ep-{model_name}"
    (status, epc_arn) = create_serverless_endpoint_config(
								    sm_client,
								    endpoint_config_name,
								    model_name,
            						memory=2048,
								    max_concurrency=10)
	endpoint_configs.append(endpoint_config_name)

	response = create_serverless_endpoint(
								    sm_client,
								    endpoint_name,
								    endpoint_config_name)
	endpoints.append(endpoint_name)

Send inference request to the trained model

For inference, we send the payload in sequence: first to the feature-transform container, then to the model container, and finally to the inverse-label transform container.

visual of inference request flow of three inference containers from HPO mode

See the following code:

from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

payload = "51,technician,married,professional.course,no,yes,no,cellular,apr,thu,687,1,0,1,success,-1.8,93.075,-47.1,1.365,5099.1"


for _, endpoint in enumerate(endpoints):
    try:
        print(f"payload: {payload}")
        predictor = Predictor(
            endpoint_name=endpoint,
            sagemaker_session=session,
            serializer=CSVSerializer(),
        )
        prediction = predictor.predict(payload)
        payload=prediction
    except Exception as e:
        print(f"Error invoking Endpoint; {endpoint} n {e}")
        break

The full implementation of this example is available in the following jupyter notebook.

Clean up

To clean up resources, you can delete the created serverless endpoints, endpoint configs, and models:

sm_client = boto3.Session().client(service_name='sagemaker',region_name=region)

for _, endpoint in enumerate(endpoints):
    try:
        sm_client.delete_endpoint(EndpointName=endpoint)
    except Exception as e:
        print(f"Exception:n{e}")
        continue
        
for _, endpoint_config in enumerate(endpoint_configs):
    try:
        sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config)
    except Exception as e:
        print(f"Exception:n{e}")
        continue

for _, autopilot_model in enumerate(models):
    try:
        sm_client.delete_model(ModelName=autopilot_model)
    except Exception as e:
        print(f"Exception:n{e}")
        continue

Conclusion

In this post, we showed how we can deploy Autopilot generated models both in ensemble and HPO modes to serverless inference endpoints. This solution can speed up your ability to use and take advantage of cost-efficient and fully managed ML services like Autopilot to generate models quickly from raw data, and then deploy them to fully managed serverless inference endpoints with built-in auto scaling to reduce costs.

We encourage you to try this solution with a dataset relevant to your business KPIs. You can refer to the solution implemented in a Jupyter notebook in the GitHub repo.

Additional references


About the Author

Praveen Chamarthi is a Senior AI/ML Specialist with Amazon Web Services. He is passionate about AI/ML and all things AWS. He helps customers across the Americas to scale, innovate, and operate ML workloads efficiently on AWS. In his spare time, Praveen loves to read and enjoys sci-fi movies.

Read More