AWS recently released the new Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. The new SageMaker ACK Operators make it easier for machine learning (ML) developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in Amazon SageMaker without signing in to the SageMaker console.
Kubernetes and SageMaker
Building scalable ML workflows involves many iterative steps, including sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and monitoring workloads after deployment.
SageMaker is a fully managed service designed and optimized specifically for managing these ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference. Compute resources are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing near 100% utilization. SageMaker provides many performance and cost optimizations for distributed training, spot training, automatic model tuning, inference latency, and multi-model endpoints.
Many AWS customers who have portability requirements implement a hybrid cloud approach or run on premises, and use Kubernetes, an open-source, general-purpose container orchestration system, to set up repeatable ML pipelines running training and inference workloads. However, to support ML workloads, these developers still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. Kubernetes customers therefore want to use fully managed ML services such as SageMaker for cost-optimized and managed infrastructure, while their platform and infrastructure teams continue using Kubernetes for orchestration and pipeline management to retain standardization and portability.
To address this need, AWS allows you to train, tune, and deploy models in SageMaker by using the new SageMaker ACK Operators, which include a set of custom resource definitions for SageMaker resources that extend the Kubernetes API. With the SageMaker ACK Operators, you can take advantage of fully managed SageMaker infrastructure, tools, and optimizations natively from Kubernetes.
How did we get here?
In late 2019, AWS introduced the SageMaker Operators for Kubernetes to enable developers and data scientists to manage the end-to-end SageMaker training and production lifecycle using Kubernetes as the control plane. SageMaker operators were installed from the GitHub repo by downloading a YAML configuration file that configured your Kubernetes cluster with the custom resource definitions and operator controller service.
In 2020, AWS introduced ACK to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the SageMaker controller.
Going forward, new functionality will be added to the SageMaker Operators for Kubernetes through the ACK project.
How does ACK work?
The following diagram illustrates how ACK works.
In this example, Alice is a Kubernetes user. She wants to run model training on SageMaker from within the Kubernetes cluster using the Kubernetes API. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource defining her SageMaker training job. kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node (Step 1 in the workflow diagram).
The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether Alice has permissions to create a custom resource of kind sagemaker.services.k8s.aws/TrainingJob, and whether the custom resource is properly formatted (Step 2).
If Alice is authorized and the custom resource is valid, the Kubernetes API server writes (Step 3) the custom resource to its etcd data store and then responds back (Step 4) to Alice that the custom resource has been created.
The SageMaker controller, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified (Step 5) that a new custom resource of kind sagemaker.services.k8s.aws/TrainingJob has been created.
The SageMaker controller then communicates (Step 6) with the SageMaker API, calling the SageMaker CreateTrainingJob API to create the training job in AWS. After communicating with the SageMaker API, the SageMaker controller calls the Kubernetes API server to update (Step 7) the custom resource’s status with information it received from SageMaker. The SageMaker controller therefore provides the same information to developers that they would have received using the AWS SDK, resulting in a consistent developer experience.
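To make the reconciliation pattern concrete, here is a toy sketch of the loop an ACK controller conceptually runs: compare the desired spec from the custom resource with the observed state in AWS, create what’s missing, and write status back. All names below are invented for illustration; this is not the actual ACK runtime code.

```python
# Toy sketch of a controller reconcile loop (illustrative only; names invented).
# An ACK controller compares the desired spec from the custom resource with the
# observed state from the AWS API, creates what is missing, and reports status.
def reconcile(desired_spec, observed_jobs, create_training_job):
    name = desired_spec["trainingJobName"]
    if name not in observed_jobs:
        # Analogous to Step 6: call the service API (CreateTrainingJob)
        observed_jobs[name] = create_training_job(desired_spec)
    # Analogous to Step 7: write the observed status back to the custom resource
    return {"trainingJobName": name, "status": observed_jobs[name]}

observed = {}  # stand-in for what the controller sees in AWS
status = reconcile(
    {"trainingJobName": "demo-job"},
    observed,
    create_training_job=lambda spec: "InProgress",  # fake SageMaker call
)
print(status)
```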
Machine learning use case
For this post, we follow the SageMaker example provided in the following notebook. However, you can reuse the components in this example with your choice of SageMaker built-in or custom algorithms and your own datasets.
We use the Abalone dataset, originally from the UCI data repository [1]. In the libsvm-converted version, the nominal feature (male/female/infant) has been converted into a real-valued feature. The age of an abalone is predicted from eight physical measurements. This dataset is already processed and stored in Amazon Simple Storage Service (Amazon S3). We train an XGBoost model on the UCI Abalone dataset to replicate the flow in the example Jupyter notebook.
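For readers unfamiliar with the libsvm format this dataset uses, each line is a label followed by sparse index:value feature pairs. A small parsing sketch (the sample row below is made up and only mimics the style of the converted Abalone data):

```python
def parse_libsvm_line(line):
    # A libsvm line looks like: "<label> <index>:<value> <index>:<value> ..."
    parts = line.strip().split()
    label = float(parts[0])
    features = {}
    for token in parts[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)
    return label, features

# Hypothetical row mimicking the converted Abalone data (label = age in rings)
label, features = parse_libsvm_line("12 1:0.455 2:0.365 3:0.095")
print(label, features)
```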
Prerequisites
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- An existing Amazon Elastic Kubernetes Service (Amazon EKS) cluster running Kubernetes version 1.16+. For automated cluster creation using eksctl, see Getting started with Amazon EKS – eksctl and create your cluster with Amazon EC2 Linux managed nodes.
Install the following tools on the client machine used to access your Kubernetes cluster (you can use AWS Cloud9, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup):
- kubectl – A command line tool for working with Kubernetes clusters.
- Helm version 3.7+ – A tool for installing and managing Kubernetes applications.
- AWS Command Line Interface (AWS CLI) – A command line tool for interacting with AWS services.
- eksctl – A command line tool for working with Amazon EKS clusters that automates many individual tasks.
- yq – A command line YAML processor. (For Linux environments, use the wget plain binary installation).
Set up IAM role-based authentication for the controller Pod
IAM roles for service accounts (IRSA) allows fine-grained roles at the Kubernetes Pod level by combining an OpenID Connect (OIDC) identity provider with Kubernetes service account annotations. In this section, we associate the Amazon EKS cluster with an OIDC provider and create an AWS Identity and Access Management (IAM) role that is assumed by the ACK controller Pod via its service account to access AWS services.
Create a cluster and OIDC ID provider
Make sure you’re connected to the right cluster. Substitute the values for CLUSTER_NAME and CLUSTER_REGION below:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0
# Set the cluster name, region where the cluster exists
export CLUSTER_NAME=<CLUSTER_NAME>
export CLUSTER_REGION=<CLUSTER_REGION>
export RANDOM_VAR=$RANDOM
aws eks update-kubeconfig --name $CLUSTER_NAME --region $CLUSTER_REGION
kubectl config get-contexts
# Ensure cluster has compute
kubectl get nodes
Set up the OIDC ID provider (IdP) in AWS and associate it with your Amazon EKS cluster:
eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
    --region ${CLUSTER_REGION} --approve
Get the identity issuer URL by running the following code:
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --region $CLUSTER_REGION --query "cluster.identity.oidc.issuer" --output text | cut -c9-)
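The cut -c9- above simply drops the leading https:// (eight characters) from the issuer URL, because the IAM OIDC provider is identified without the scheme. The equivalent transformation in Python, using a made-up issuer URL:

```python
# Hypothetical issuer URL of the shape returned by `aws eks describe-cluster`
issuer = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B716D3041E"

# Same effect as piping through: cut -c9- (drop the first 8 characters, "https://")
oidc_provider_url = issuer[len("https://"):]
print(oidc_provider_url)
```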
Set up an IAM role
Next, let’s set up the IAM role that defines the access to the SageMaker and Application Auto Scaling services. For this, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account (for example, ack-sagemaker-controller) to assume the IAM role.
Create a file named trust.json and insert the following trust relationship code block required for the IAM role:
printf '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::'$AWS_ACCOUNT_ID':oidc-provider/'$OIDC_PROVIDER_URL'"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "'$OIDC_PROVIDER_URL':aud": "sts.amazonaws.com",
                    "'$OIDC_PROVIDER_URL':sub": [
                        "system:serviceaccount:ack-system:ack-sagemaker-controller",
                        "system:serviceaccount:ack-system:ack-applicationautoscaling-controller"
                    ]
                }
            }
        }
    ]
}
' > ./trust.json
Updating an Application Auto Scaling Scalable Target requires additional permissions. First, create a service-linked role for Application Auto Scaling:
aws iam create-service-linked-role --aws-service-name sagemaker.application-autoscaling.amazonaws.com
Create a file named pass_role_policy.json and insert the following policy required for the IAM role:
printf '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::'$AWS_ACCOUNT_ID':role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint"
        }
    ]
}
' > ./pass_role_policy.json
Run the following command to create a role with the trust relationship defined in trust.json. This trust relationship is required so that Amazon EKS (via a webhook) can inject the environment variables and mount the volumes into the Pod that the AWS SDK needs to assume this role.
OIDC_ROLE_NAME=ack-controller-role-$CLUSTER_NAME
aws iam create-role --role-name $OIDC_ROLE_NAME --assume-role-policy-document file://trust.json
# Attach the AmazonSageMakerFullAccess Policy to the Role. This policy provides full access to
# Amazon SageMaker. Also provides select access to related services (e.g., Application Autoscaling,
# S3, ECR, CloudWatch Logs).
aws iam attach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
# Attach the iam:PassRole policy required for updating ApplicationAutoscaling ScalableTarget
aws iam put-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy" --policy-document file://pass_role_policy.json
export IAM_ROLE_ARN_FOR_IRSA=$(aws iam get-role --role-name $OIDC_ROLE_NAME --output text --query 'Role.Arn')
echo $IAM_ROLE_ARN_FOR_IRSA
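For background on how the role is actually assumed at runtime: with IRSA, the EKS pod identity webhook mounts a projected service account token and injects two environment variables, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, which the AWS SDK’s web identity credential provider reads to call sts:AssumeRoleWithWebIdentity. A sketch with placeholder values (nothing here talks to AWS):

```python
import os

# Placeholder values standing in for what the webhook injects into the controller Pod
os.environ["AWS_ROLE_ARN"] = "arn:aws:iam::111122223333:role/ack-controller-role-demo"
os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"] = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"

# Inside the Pod, an AWS SDK reads these two variables to exchange the
# projected service account token for temporary IAM credentials
role_arn = os.environ["AWS_ROLE_ARN"]
token_file = os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]
print(role_arn, token_file)
```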
Install SageMaker and Application Auto Scaling controllers
Choose an AWS Region for the SageMaker and automatic scaling resources we create in this post. For convenience, we recommend using us-east-1:
export SERVICE_REGION="us-east-1"
# Namespace for controller
export ACK_K8S_NAMESPACE="ack-system"
Now, let’s install the SageMaker and Application Auto Scaling controllers using the following helper script. This script pulls the Helm charts from ACK’s public Amazon Elastic Container Registry (Amazon ECR) repository and configures the values of the AWS account, the default Region for resources to be created, and the IAM role (created in the previous step) in the service account to be used by the controller Pod to assume the role. Create a file named install-controllers.sh and insert the following code block:
#!/usr/bin/env bash
# Deploy ACK Helm Charts
export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function install_ack_controller() {
    local service="$1"
    local release_version="$2"
    local chart_export_path=/tmp/chart
    local chart_ref=$service-chart
    local chart_repo=public.ecr.aws/aws-controllers-k8s/$chart_ref
    local chart_package=$chart_ref-$release_version.tgz

    # Download helm chart
    mkdir -p $chart_export_path
    helm pull oci://"$chart_repo" --version "$release_version" -d $chart_export_path
    tar xvf "$chart_export_path"/"$chart_package" -C "$chart_export_path"

    # Update the values in helm chart
    pushd $chart_export_path/$service-chart
    yq e '.aws.region = env(SERVICE_REGION)' -i values.yaml
    yq e '.serviceAccount.annotations."eks.amazonaws.com/role-arn" = env(IAM_ROLE_ARN_FOR_IRSA)' -i values.yaml
    popd

    # Create a namespace and install the helm chart
    helm install -n $ACK_K8S_NAMESPACE --create-namespace ack-$service-controller $chart_export_path/$service-chart
}

install_ack_controller "sagemaker" "v0.3.0"
install_ack_controller "applicationautoscaling" "v0.2.0"
Run the script:
chmod +x install-controllers.sh
./install-controllers.sh
The output contains the following:
Pulled: public.ecr.aws/aws-controllers-k8s/sagemaker-chart:v0.3.0
...
NAME: ack-sagemaker-controller
LAST DEPLOYED: Tue Nov 16 01:53:34 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Pulled: public.ecr.aws/aws-controllers-k8s/applicationautoscaling-chart:v0.2.0
...
NAME: ack-applicationautoscaling-controller
LAST DEPLOYED: Tue Nov 16 01:53:35 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Next, we run the following commands to verify custom resource definitions were applied and controller Pods are running:
kubectl get crds | grep "services.k8s.aws"
The output of the command should contain a number of custom resource definitions related to SageMaker (such as trainingjobs or endpoints) and Application Auto Scaling (such as scalingpolicies and scalabletargets):
# Get pods in controller namespace
kubectl get pods -n $ACK_K8S_NAMESPACE
We see one controller Pod per service running in the ack-system namespace:
NAME READY STATUS RESTARTS AGE
ack-applicationautoscaling-controller-7479dc78dd-ts9ng 1/1 Running 0 4m52s
ack-sagemaker-controller-788858fc98-6fgr6 1/1 Running 0 4m56s
Prepare SageMaker resources
Next, we create an S3 bucket and IAM role for SageMaker.
To train a model with SageMaker, we need an S3 bucket to store the dataset and artifacts from the training process. We simply use the preprocessed dataset at s3://sagemaker-sample-files/datasets/tabular/uci_abalone [1].
Let’s create a variable for the S3 bucket:
export SAGEMAKER_BUCKET=ack-sagemaker-bucket-$RANDOM_VAR
Create a file named create-bucket.sh and insert the following code block:
printf '#!/usr/bin/env bash
# create bucket
if [[ $SERVICE_REGION != "us-east-1" ]]; then
    aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION" --create-bucket-configuration LocationConstraint="$SERVICE_REGION"
else
    aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION"
fi
# sync dataset
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/train
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/validation s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/validation
' > ./create-bucket.sh
Run the script to create the S3 bucket and copy the dataset:
chmod +x create-bucket.sh
./create-bucket.sh
The SageMaker training job that we run later in the post needs an IAM role to access Amazon S3 and SageMaker. Run the following commands to create a SageMaker execution IAM role that is used by SageMaker to access AWS resources:
export SAGEMAKER_EXECUTION_ROLE_NAME=ack-sagemaker-execution-role-$RANDOM_VAR
TRUST='{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }'
aws iam create-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --assume-role-policy-document "$TRUST"
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
SAGEMAKER_EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --output text --query 'Role.Arn')
echo $SAGEMAKER_EXECUTION_ROLE_ARN
Note down the execution role ARN to use in later steps.
Train an XGBoost model
Now, we create a training.yaml file to specify the parameters for a SageMaker training job. SageMaker training jobs enable remote training of ML models. You can customize each training job to run your own ML scripts with custom architectures, data loaders, hyperparameters, and more. To submit a SageMaker training job, we require a job name. Let’s create that variable first:
export JOB_NAME=ack-xgboost-training-job-$RANDOM_VAR
In the following code, we create a training.yaml file that contains the hyperparameters for the training job as well as the location of the training and validation data. It’s also where we specify the Amazon ECR image used for training.
Note: If your $SERVICE_REGION isn’t us-east-1, change the following image URI. For the XGBoost algorithm version 1.2-1 Region-specific image URI, see Docker Registry Paths and Example Code.
export XGBOOST_IMAGE=683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.2-1
printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: '$JOB_NAME'
spec:
  # Name that will appear in SageMaker console
  trainingJobName: '$JOB_NAME'
  hyperParameters:
    max_depth: "5"
    gamma: "4"
    eta: "0.2"
    min_child_weight: "6"
    subsample: "0.7"
    objective: "reg:linear"
    num_round: "50"
    verbosity: "2"
  algorithmSpecification:
    trainingImage: '$XGBOOST_IMAGE'
    trainingInputMode: File
  roleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
  outputDataConfig:
    # The output path of our model
    s3OutputPath: s3://'$SAGEMAKER_BUCKET'
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 3600
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our train data
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/train/abalone.train
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our validation data
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/validation/abalone.validation
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
' > ./training.yaml
Now, we can create the training job:
kubectl apply -f training.yaml
You should see the following output:
trainingjob.sagemaker.services.k8s.aws/ack-xgboost-training-job-7420 created
You can watch the status of the training job. It takes a few minutes for STATUS to show as Completed.
kubectl get trainingjob.sagemaker --watch
NAME SECONDARYSTATUS STATUS
ack-xgboost-training-job-7420 Starting InProgress
ack-xgboost-training-job-7420 Downloading InProgress
ack-xgboost-training-job-7420 Training InProgress
ack-xgboost-training-job-7420 Completed Completed
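When the job completes, SageMaker writes the model artifact under the s3OutputPath configured in training.yaml, following the layout <s3OutputPath>/<job name>/output/model.tar.gz; the deploy step in the next section points modelDataURL at exactly this location. A quick sketch of the path construction (bucket and job name below are placeholders):

```python
# Placeholder values; in the walkthrough these come from $SAGEMAKER_BUCKET and $JOB_NAME
bucket = "ack-sagemaker-bucket-12345"
job_name = "ack-xgboost-training-job-12345"

# Artifact layout used by SageMaker training jobs:
# <s3OutputPath>/<job name>/output/model.tar.gz
model_data_url = f"s3://{bucket}/{job_name}/output/model.tar.gz"
print(model_data_url)
```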
Deploy the results of the SageMaker training job
To deploy the model, we need to specify a model name, an endpoint config name, and an endpoint name:
export MODEL_NAME=ack-xgboost-model-$RANDOM_VAR
export ENDPOINT_CONFIG_NAME=ack-xgboost-endpoint-config-$RANDOM_VAR
export ENDPOINT_NAME=ack-xgboost-endpoint-$RANDOM_VAR
We deploy this model on an ml.c5.large instance. In the following YAML file, we define the model, the endpoint config, and the endpoint:
printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: '$MODEL_NAME'
spec:
  modelName: '$MODEL_NAME'
  primaryContainer:
    containerHostname: xgboost
    # The source of the model data
    modelDataURL: s3://'$SAGEMAKER_BUCKET'/'$JOB_NAME'/output/model.tar.gz
    image: '$XGBOOST_IMAGE'
  executionRoleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: '$ENDPOINT_CONFIG_NAME'
spec:
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
  productionVariants:
    - modelName: '$MODEL_NAME'
      variantName: AllTraffic
      instanceType: ml.c5.large
      initialInstanceCount: 1
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: '$ENDPOINT_NAME'
spec:
  endpointName: '$ENDPOINT_NAME'
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
' > ./deploy.yaml
Now, we can apply the manifest to deploy the endpoint:
kubectl apply -f deploy.yaml
You should see the following output:
model.sagemaker.services.k8s.aws/ack-xgboost-model-7420 created
endpointconfig.sagemaker.services.k8s.aws/ack-xgboost-endpoint-config-7420 created
endpoint.sagemaker.services.k8s.aws/ack-xgboost-endpoint-7420 created
We can observe that the model and endpoint config were created. Deploying the endpoint may take some time:
kubectl describe models.sagemaker
kubectl describe endpointconfigs.sagemaker
kubectl describe endpoints.sagemaker
We can watch this process using the following command:
kubectl get endpoints.sagemaker --watch
After some time, the STATUS changes to InService:
NAME STATUS
ack-xgboost-endpoint-7420 Creating
ack-xgboost-endpoint-7420 InService
This indicates the deployed endpoint is ready for use.
Verify the inference capabilities of the trained model
We invoke the model endpoint using Python to emulate a typical use case. We reuse the code from the SageMaker example notebook.
We first download the test set from Amazon S3. Then we load a single sample from the test set and use it to invoke the endpoint we deployed in the previous section. Download the test file with the following code:
pip install boto3 numpy
aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/test/abalone.test abalone.test
head -1 abalone.test > abalone.single.test
We use the Python interpreter to test inference against the deployed endpoint.
Create a file named predict.py and insert the following code block:
printf '
import sys
import math
import json
import boto3
import numpy as np
import os

region = os.environ.get("SERVICE_REGION")
endpoint_name = os.environ.get("ENDPOINT_NAME")
runtime_client = boto3.client("runtime.sagemaker", region_name=region)

file_name = "abalone.single.test"
with open(file_name, "r") as f:
    payload = f.read().strip()

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="text/x-libsvm", Body=payload
)
result = response["Body"].read().decode("utf-8").split(",")
result = [math.ceil(float(i)) for i in result]
label = payload.strip(" ").split()[0]
print("Label: " + label)
print("Prediction: " + str(result[0]))
' > ./predict.py
python predict.py
Running this sample should give us the following result:
Label: 12
Prediction: 13
The ML model estimates the age of the abalone in the test example to be 13; the actual age was 12. This suggests that our ML model has been trained and provides reasonable predictions. However, experienced ML users may note that we haven’t performed hyperparameter tuning or applied other methods of increasing accuracy, which is outside the scope of this post.
Dynamically scale the endpoint according to the load
SageMaker ACK Operators support custom resource definitions for automatic scaling (using ScalableTarget and ScalingPolicy) for your hosted models. The following resources adjust the number of instances (minimum 1 to maximum 20) provisioned for a model in response to changes in the SageMakerVariantInvocationsPerInstance metric, which is the average number of times per minute that each instance for a variant is invoked:
printf '
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalableTarget
metadata:
  name: ack-scalable-target-predfined
spec:
  maxCapacity: 20
  minCapacity: 1
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
---
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalingPolicy
metadata:
  name: ack-scaling-policy-predefined
spec:
  policyName: ack-scaling-policy-predefined
  policyType: TargetTrackingScaling
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
  targetTrackingScalingPolicyConfiguration:
    targetValue: 60
    scaleInCooldown: 700
    scaleOutCooldown: 300
    predefinedMetricSpecification:
      predefinedMetricType: SageMakerVariantInvocationsPerInstance
' > ./scale-endpoint.yaml
Apply with the following code:
kubectl apply -f scale-endpoint.yaml
You should see the following output:
scalabletarget.applicationautoscaling.services.k8s.aws/ack-scalable-target-predfined created
scalingpolicy.applicationautoscaling.services.k8s.aws/ack-scaling-policy-predefined created
We can observe that the scalingpolicy was created:
kubectl describe scalingpolicy.applicationautoscaling
The output of scalingpolicy looks like the following:
Status:
Ack Resource Metadata:
Arn: arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:b33d12b8-aa81-4cb8-855e-c7b6dcb9d6e7:resource/SageMaker/endpoint/ack-xgboost-endpoint/variant/AllTraffic:policyName/ack-scaling-policy-predefined
Owner Account ID: 123456789012
Alarms:
Alarm ARN: arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
Alarm Name: TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
Alarm ARN: arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
Alarm Name: TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
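To build intuition for the targetValue of 60: target tracking scales capacity roughly proportionally, so the desired instance count is approximately currentCapacity * currentMetricValue / targetValue, clamped to the configured min/max capacity. A simplified sketch of that arithmetic (this approximates the documented target-tracking behavior and ignores cooldowns and alarm evaluation periods):

```python
import math

# Simplified sketch of target tracking: capacity scales proportionally so
# per-instance invocations move toward the target value. Real scaling also
# honors cooldowns and CloudWatch alarm evaluation.
def desired_capacity(current_capacity, invocations_per_instance,
                     target=60, min_capacity=1, max_capacity=20):
    if invocations_per_instance <= 0:
        return current_capacity
    desired = math.ceil(current_capacity * invocations_per_instance / target)
    return max(min_capacity, min(max_capacity, desired))

print(desired_capacity(1, 180))  # one overloaded instance -> scale out to 3
print(desired_capacity(3, 20))   # light load -> scale in to 1
```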
Clean up
Run the following commands to delete the resources created in this post:
kubectl delete -f scale-endpoint.yaml
kubectl delete -f deploy.yaml
kubectl delete -f training.yaml
Create a file named uninstall-controller.sh and insert the following code block, which deletes the controller and custom resource definitions:
printf '#!/usr/bin/env bash
# Uninstall Controller
export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function uninstall_ack_controller() {
    local service="$1"
    local chart_export_path=/tmp/chart
    helm uninstall -n $ACK_K8S_NAMESPACE ack-$service-controller
    kubectl delete -f $chart_export_path/$service-chart/crds
}

uninstall_ack_controller "sagemaker"
uninstall_ack_controller "applicationautoscaling"
' > ./uninstall-controller.sh
Run the following commands to uninstall the controller and custom resource definitions, and delete the namespace, IAM roles, and S3 bucket you created:
# uninstall controller and remove CRDs
chmod +x uninstall-controller.sh
./uninstall-controller.sh
# Delete controller namespace
kubectl delete namespace $ACK_K8S_NAMESPACE
# Delete S3 bucket
aws s3 rb s3://$SAGEMAKER_BUCKET --force
# Delete SageMaker execution role
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-role --role-name $SAGEMAKER_EXECUTION_ROLE_NAME
# Delete application autoscaling service linked role
aws iam delete-service-linked-role --role-name AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint
# Delete IAM role created for IRSA
aws iam detach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy"
aws iam delete-role --role-name $OIDC_ROLE_NAME
Conclusion
SageMaker ACK Operators provide engineering teams with a native Kubernetes experience for creating and interacting with ML jobs on SageMaker, either through the Kubernetes API or through Kubernetes command line utilities such as kubectl. You can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these controllers, all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with fully managed SageMaker training, tuning, and inference jobs just as they would with Kubernetes jobs running locally. Logs from SageMaker jobs stream back to Kubernetes, letting you natively view logs for your model training, tuning, and prediction jobs at the command line.
ACK is a community-driven project and will soon include service controllers for other AWS service APIs.
Links
[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
About the Authors
Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.
Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools to easily get started with and scale machine learning workloads on AWS. He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.
Archis Joglekar is an AI/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.