Use Amazon SageMaker ACK Operators to train and deploy machine learning models

AWS recently released the new Amazon SageMaker Operators for Kubernetes using the AWS Controllers for Kubernetes (ACK). ACK is a framework for building Kubernetes custom controllers, where each controller communicates with an AWS service API. These controllers allow Kubernetes users to provision AWS resources like databases or message queues simply by using the Kubernetes API. The new SageMaker ACK Operators make it easier for machine learning (ML) developers and data scientists who use Kubernetes as their control plane to train, tune, and deploy ML models in Amazon SageMaker without signing in to the SageMaker console.

Kubernetes and SageMaker

Building scalable ML workflows involves many iterative steps, including sourcing and preparing data, building ML models, training and evaluating these models, deploying them to production, and monitoring workloads after deployment.

SageMaker is a fully managed service designed and optimized specifically for managing these ML workflows. It removes the undifferentiated heavy lifting of infrastructure management and eliminates the need to invest in IT and DevOps to manage clusters for ML model building, training, and inference. Compute resources are only provisioned when requested, scaled as needed, and shut down automatically when jobs complete, thereby providing near 100% utilization. SageMaker provides many performance and cost optimizations for distributed training, spot training, automatic model tuning, inference latency, and multi-model endpoints.

Many AWS customers with portability requirements take a hybrid cloud approach or run on premises, and use Kubernetes, an open-source, general-purpose container orchestration system, to set up repeatable ML pipelines for training and inference workloads. However, to support ML workloads, these developers still need to write custom code to optimize the underlying ML infrastructure, provide high availability and reliability, provide data science productivity tools, and comply with appropriate security and regulatory requirements. Kubernetes customers therefore want to use fully managed ML services such as SageMaker for cost-optimized and managed infrastructure, while their platform and infrastructure teams continue to use Kubernetes for orchestration and pipeline management to retain standardization and portability.

To address this need, AWS allows you to train, tune, and deploy models in SageMaker by using the new SageMaker ACK Operators, which include a set of custom resource definitions for SageMaker resources that extend the Kubernetes API. With the SageMaker ACK Operators, you can take advantage of fully managed SageMaker infrastructure, tools, and optimizations natively from Kubernetes.

How did we get here?

In late 2019, AWS introduced the SageMaker Operators for Kubernetes to enable developers and data scientists to manage the end-to-end SageMaker training and production lifecycle using Kubernetes as the control plane. SageMaker operators were installed from the GitHub repo by downloading a YAML configuration file that configured your Kubernetes cluster with the custom resource definitions and operator controller service.

In 2020, AWS introduced ACK to facilitate a Kubernetes-native way of managing AWS Cloud resources. ACK includes a common controller runtime, a code generator, and a set of AWS service-specific controllers, one of which is the SageMaker controller.

Going forward, new functionality will be added to the SageMaker Operators for Kubernetes through the ACK project.

How does ACK work?

The following diagram illustrates how ACK works.

In this example, Alice is a Kubernetes user. She wants to run model training on SageMaker from within the Kubernetes cluster using the Kubernetes API. Alice issues a call to kubectl apply, passing in a file that describes a Kubernetes custom resource describing her SageMaker training job. kubectl apply passes this file, called a manifest, to the Kubernetes API server running in the Kubernetes controller node (Step 1 in the workflow diagram).

The Kubernetes API server receives the manifest with the SageMaker training job specification and determines whether Alice has permissions to create a custom resource of kind sagemaker.services.k8s.aws/TrainingJob, and whether the custom resource is properly formatted (Step 2).

If Alice is authorized and the custom resource is valid, the Kubernetes API server writes (Step 3) the custom resource to its etcd data store and then responds back (Step 4) to Alice that the custom resource has been created.

The SageMaker controller, which is running on a Kubernetes worker node within the context of a normal Kubernetes Pod, is notified (Step 5) that a new custom resource of kind sagemaker.services.k8s.aws/TrainingJob has been created.

The SageMaker controller then communicates (Step 6) with the SageMaker API, calling the SageMaker CreateTrainingJob API to create the training job in AWS. After communicating with the SageMaker API, the SageMaker controller calls the Kubernetes API server to update (Step 7) the custom resource’s status with information it received from SageMaker. The SageMaker controller therefore provides the same information to the developers that they would have received using the AWS SDK. This results in a better and consistent developer experience.
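
To make the order of events concrete, the seven steps can be sketched as a toy, in-memory simulation. This is only an illustration: ToyApiServer, FakeSageMaker, and reconcile are hypothetical names invented for this sketch, and none of it is actual ACK or Kubernetes code.

```python
# Toy, in-memory walk-through of Steps 1-7. ToyApiServer, FakeSageMaker, and
# reconcile are hypothetical stand-ins, not ACK or Kubernetes code.

GROUP = "sagemaker.services.k8s.aws"


class ToyApiServer:
    """Plays the role of the Kubernetes API server plus its etcd store."""

    def __init__(self, allowed_users):
        self.allowed_users = set(allowed_users)
        self.store = {}    # stands in for etcd
        self.events = []   # names of resources the controller has not seen yet

    def apply(self, user, manifest):
        # Step 2: check permissions and the shape of the custom resource.
        if user not in self.allowed_users:
            raise PermissionError(f"{user} may not create TrainingJob resources")
        name = manifest.get("metadata", {}).get("name")
        well_formed = (
            manifest.get("apiVersion", "").startswith(GROUP + "/")
            and manifest.get("kind") == "TrainingJob"
            and bool(name)
        )
        if not well_formed:
            raise ValueError("malformed TrainingJob custom resource")
        # Step 3: persist the resource.
        self.store[name] = {**manifest, "status": {}}
        # Queue the notification that reaches the controller in Step 5.
        self.events.append(name)
        # Step 4: acknowledge creation to the caller.
        return f"trainingjob.{GROUP}/{name} created"

    def update_status(self, name, status):
        # Step 7: record what the AWS API reported.
        self.store[name]["status"] = status


class FakeSageMaker:
    """Stands in for the SageMaker API and returns a canned response."""

    def create_training_job(self, spec):
        return {"trainingJobStatus": "InProgress", "secondaryStatus": "Starting"}


def reconcile(api_server, sagemaker):
    """Steps 5-7: react to new resources, call AWS, write the status back."""
    while api_server.events:
        name = api_server.events.pop(0)                  # Step 5
        spec = api_server.store[name].get("spec", {})
        status = sagemaker.create_training_job(spec)     # Step 6
        api_server.update_status(name, status)           # Step 7


if __name__ == "__main__":
    server = ToyApiServer(allowed_users=["alice"])
    manifest = {
        "apiVersion": GROUP + "/v1alpha1",
        "kind": "TrainingJob",
        "metadata": {"name": "demo-job"},
        "spec": {"trainingJobName": "demo-job"},
    }
    print(server.apply("alice", manifest))   # Step 1: Alice runs kubectl apply
    reconcile(server, FakeSageMaker())
    print(server.store["demo-job"]["status"])
```

The point of the sketch is the separation of concerns: the API server only validates and stores the resource, and the controller is the only component that talks to AWS.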

Machine learning use case

For this post, we follow the SageMaker example provided in the following notebook. However, you can reuse the components in this example with your preference of SageMaker built-in or custom algorithms and your own datasets.

We use the Abalone dataset originally from the UCI data repository [1]. In the libsvm converted version, the nominal feature (male/female/infant) has been converted into a real valued feature. The age of abalone is to be predicted from eight physical measurements. This dataset is already processed and stored in Amazon Simple Storage Service (Amazon S3). We train an XGBoost model on the UCI Abalone dataset to replicate the flow in the example Jupyter notebook.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account.
  • An existing Amazon Elastic Kubernetes Service (Amazon EKS) cluster running Kubernetes version 1.16 or later. For automated cluster creation using eksctl, see Getting started with Amazon EKS – eksctl and create your cluster with Amazon EC2 Linux managed nodes.

Install the following tools on the client machine used to access your Kubernetes cluster (you can use AWS Cloud9, a cloud-based integrated development environment (IDE) for the Kubernetes cluster setup):

  • kubectl – A command line tool for working with Kubernetes clusters.
  • Helm version 3.7+ – A tool for installing and managing Kubernetes applications.
  • AWS Command Line Interface (AWS CLI) – A command line tool for interacting with AWS services.
  • eksctl – A command line tool for working with Amazon EKS clusters that automates many individual tasks.
  • yq – A command line YAML processor. (For Linux environments, use the wget plain binary installation.)

Set up IAM role-based authentication for the controller Pod

IAM roles for service accounts (IRSA) allows fine-grained roles at the Kubernetes Pod level by combining an OpenID Connect (OIDC) identity provider with Kubernetes service account annotations. In this section, we associate the Amazon EKS cluster with an OIDC provider and create an AWS Identity and Access Management (IAM) role that is assumed by the ACK controller Pod via its service account to access AWS services.

Create a cluster and OIDC ID provider

Make sure you’re connected to the right cluster. Substitute the values for CLUSTER_NAME and CLUSTER_REGION below:

# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: MIT-0

# Set the cluster name, region where the cluster exists
export CLUSTER_NAME=<CLUSTER_NAME>
export CLUSTER_REGION=<CLUSTER_REGION>
export RANDOM_VAR=$RANDOM

aws eks update-kubeconfig --name $CLUSTER_NAME --region $CLUSTER_REGION
kubectl config get-contexts 

# Ensure cluster has compute
kubectl get nodes

Set up the OIDC ID provider (IdP) in AWS and associate it with your Amazon EKS cluster:

eksctl utils associate-iam-oidc-provider --cluster ${CLUSTER_NAME} \
  --region ${CLUSTER_REGION} --approve

Get the identity issuer URL by running the following code:

export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query "Account" --output text)
OIDC_PROVIDER_URL=$(aws eks describe-cluster --name $CLUSTER_NAME --region $CLUSTER_REGION --query "cluster.identity.oidc.issuer" --output text | cut -c9-)
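
The cut -c9- at the end of the command drops the first eight characters of the issuer URL, that is, the https:// scheme, because the IAM trust policy refers to the provider without it. A minimal Python sketch of the same transformation follows; the function name and the issuer value are made up for illustration.

```python
def oidc_provider_from_issuer(issuer_url: str) -> str:
    """Return the issuer URL without its leading "https://" scheme."""
    prefix = "https://"
    if not issuer_url.startswith(prefix):
        raise ValueError(f"unexpected issuer URL: {issuer_url!r}")
    return issuer_url[len(prefix):]


if __name__ == "__main__":
    # Hypothetical issuer value; a real one comes from `aws eks describe-cluster`.
    issuer = "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234"
    print(oidc_provider_from_issuer(issuer))
```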

Set up an IAM role

Next, let’s set up the IAM role that defines the access to the SageMaker and Application Auto Scaling services. For this, we also need to have an IAM trust policy in place, allowing the specified Kubernetes service account (for example, ack-sagemaker-controller) to assume the IAM role.

Create a file named trust.json and insert the following trust relationship code block required for IAM role:

printf '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::'$AWS_ACCOUNT_ID':oidc-provider/'$OIDC_PROVIDER_URL'"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "'$OIDC_PROVIDER_URL':aud": "sts.amazonaws.com",
          "'$OIDC_PROVIDER_URL':sub": [
            "system:serviceaccount:ack-system:ack-sagemaker-controller",
            "system:serviceaccount:ack-system:ack-applicationautoscaling-controller"
          ]
        }
      }
    }
  ]
}
' > ./trust.json
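
If you prefer to generate the trust policy programmatically instead of with printf, the same document can be built as a Python dictionary. This is a sketch under the assumption that the structure mirrors trust.json above; build_trust_policy is a hypothetical helper, and the account ID and provider value in the usage example are placeholders.

```python
import json


def build_trust_policy(account_id, oidc_provider, namespace, service_accounts):
    """Build the IRSA trust policy for the given Kubernetes service accounts."""
    subjects = [f"system:serviceaccount:{namespace}:{sa}" for sa in service_accounts]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{oidc_provider}:aud": "sts.amazonaws.com",
                        f"{oidc_provider}:sub": subjects,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy(
        account_id="111122223333",  # placeholder account ID
        oidc_provider="oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE1234",  # placeholder
        namespace="ack-system",
        service_accounts=[
            "ack-sagemaker-controller",
            "ack-applicationautoscaling-controller",
        ],
    )
    print(json.dumps(policy, indent=2))
```

Building the policy as data makes it harder to leave a quote or bracket unbalanced, and json.dumps always emits valid JSON.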

Updating an Application Auto Scaling Scalable Target requires additional permissions. First, create a service-linked role for Application Auto Scaling.

aws iam create-service-linked-role --aws-service-name sagemaker.application-autoscaling.amazonaws.com

Create a file named pass_role_policy.json to create the policy required for the IAM role.

printf '{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::'$AWS_ACCOUNT_ID':role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint"
    }
  ]
}
' > ./pass_role_policy.json

Run the following command to create a role with the trust relationship defined in trust.json. This trust relationship is required so that Amazon EKS (via a webhook) can inject the necessary environment variables and mount volumes into the Pod that are required by the AWS SDK to assume this role.

OIDC_ROLE_NAME=ack-controller-role-$CLUSTER_NAME

aws iam create-role --role-name $OIDC_ROLE_NAME --assume-role-policy-document file://trust.json

# Attach the AmazonSageMakerFullAccess Policy to the Role. This policy provides full access to 
# Amazon SageMaker. Also provides select access to related services (e.g., Application Autoscaling,
# S3, ECR, CloudWatch Logs).
aws iam attach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

# Attach the iam:PassRole policy required for updating ApplicationAutoscaling ScalableTarget
aws iam put-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy" --policy-document file://pass_role_policy.json

export IAM_ROLE_ARN_FOR_IRSA=$(aws iam get-role --role-name $OIDC_ROLE_NAME --output text --query 'Role.Arn')
echo $IAM_ROLE_ARN_FOR_IRSA

Install SageMaker and Application Auto Scaling controllers

Choose an AWS Region for the SageMaker and automatic scaling resources we create in this post. For convenience, we recommend using us-east-1:

export SERVICE_REGION="us-east-1"
# Namespace for controller
export ACK_K8S_NAMESPACE="ack-system"

Now, let’s install the SageMaker and Application Auto Scaling controllers using the following helper script. This script pulls the Helm charts from ACK’s public Amazon Elastic Container Registry (Amazon ECR) repository and sets the default Region for created resources and the IAM role (created in the previous step) on the service account that the controller Pod uses to assume the role. Create a file named install-controllers.sh and insert the following code block:

#!/usr/bin/env bash

# Deploy ACK Helm Charts
export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function install_ack_controller() {
    local service="$1"
    local release_version="$2"
    local chart_export_path=/tmp/chart
    local chart_ref=$service-chart
    local chart_repo=public.ecr.aws/aws-controllers-k8s/$chart_ref
    local chart_package=$chart_ref-$release_version.tgz
    
    # Download helm chart
    mkdir -p $chart_export_path
    helm pull oci://"$chart_repo" --version "$release_version" -d $chart_export_path
    tar xvf "$chart_export_path"/"$chart_package" -C "$chart_export_path"

    # Update the values in helm chart
    pushd $chart_export_path/$service-chart
        yq e '.aws.region = env(SERVICE_REGION)' -i values.yaml 
        yq e '.serviceAccount.annotations."eks.amazonaws.com/role-arn" = env(IAM_ROLE_ARN_FOR_IRSA)' -i values.yaml
    popd

    # Create a namespace and install the helm chart
    helm install -n $ACK_K8S_NAMESPACE --create-namespace ack-$service-controller $chart_export_path/$service-chart
}

install_ack_controller "sagemaker" "v0.3.0"
install_ack_controller "applicationautoscaling" "v0.2.0"

Run the script:

chmod +x install-controllers.sh
./install-controllers.sh

The output contains the following:

Pulled: public.ecr.aws/aws-controllers-k8s/sagemaker-chart:v0.3.0
...

NAME: ack-sagemaker-controller
LAST DEPLOYED: Tue Nov 16 01:53:34 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
Pulled: public.ecr.aws/aws-controllers-k8s/applicationautoscaling-chart:v0.2.0
...

NAME: ack-applicationautoscaling-controller
LAST DEPLOYED: Tue Nov 16 01:53:35 2021
NAMESPACE: ack-system
STATUS: deployed
REVISION: 1
TEST SUITE: None

Next, we run the following commands to verify custom resource definitions were applied and controller Pods are running:

kubectl get crds | grep "services.k8s.aws"

The output should list custom resource definitions for SageMaker (such as trainingjobs and endpoints) and Application Auto Scaling (such as scalingpolicies and scalabletargets). Next, check the Pods in the controller namespace:

# Get pods in controller namespace
kubectl get pods -n $ACK_K8S_NAMESPACE

We see one controller Pod per service running in the ack-system namespace:

NAME                                                     READY   STATUS    RESTARTS   AGE
ack-applicationautoscaling-controller-7479dc78dd-ts9ng   1/1     Running   0          4m52s
ack-sagemaker-controller-788858fc98-6fgr6                1/1     Running   0          4m56s

Prepare SageMaker resources

Next, we create an S3 bucket and IAM role for SageMaker.

To train a model with SageMaker, we need an S3 bucket to store the dataset and artifacts from the training process. We simply use the preprocessed dataset at s3://sagemaker-sample-files/datasets/tabular/uci_abalone [1].

Let’s create a variable for the S3 bucket:

export SAGEMAKER_BUCKET=ack-sagemaker-bucket-$RANDOM_VAR

Create a file named create-bucket.sh and insert the following code block:

printf '#!/usr/bin/env bash
# create bucket
if [[ $SERVICE_REGION != "us-east-1" ]]; then
  aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION" --create-bucket-configuration LocationConstraint="$SERVICE_REGION"
else
  aws s3api create-bucket --bucket "$SAGEMAKER_BUCKET" --region "$SERVICE_REGION"
fi
# sync dataset
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/train s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/train
aws s3 sync s3://sagemaker-sample-files/datasets/tabular/uci_abalone/validation s3://"$SAGEMAKER_BUCKET"/datasets/tabular/uci_abalone/validation
' > ./create-bucket.sh

Run the script to create the S3 bucket and copy the dataset:

chmod +x create-bucket.sh
./create-bucket.sh
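
The if/else in the script exists because Amazon S3 treats us-east-1 as the default location: a create-bucket request for that Region must not carry a LocationConstraint, while every other Region needs one. The branching can be sketched in Python as a helper that builds the arguments for boto3's create_bucket call; create_bucket_kwargs is a hypothetical name, and the bucket names are placeholders.

```python
def create_bucket_kwargs(bucket: str, region: str) -> dict:
    """Build keyword arguments for an S3 create_bucket call in the given Region."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and takes no LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


if __name__ == "__main__":
    print(create_bucket_kwargs("ack-sagemaker-bucket-demo", "us-east-1"))
    print(create_bucket_kwargs("ack-sagemaker-bucket-demo", "eu-west-1"))
```

With credentials configured, the result could be passed on as boto3.client("s3", region_name=region).create_bucket(**kwargs).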

The SageMaker training job that we run later in the post needs an IAM role to access Amazon S3 and SageMaker. Run the following commands to create a SageMaker execution IAM role that is used by SageMaker to access AWS resources:

export SAGEMAKER_EXECUTION_ROLE_NAME=ack-sagemaker-execution-role-$RANDOM_VAR

TRUST='{ "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "sagemaker.amazonaws.com" }, "Action": "sts:AssumeRole" } ] }'
aws iam create-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --assume-role-policy-document "$TRUST"
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam attach-role-policy --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess

SAGEMAKER_EXECUTION_ROLE_ARN=$(aws iam get-role --role-name ${SAGEMAKER_EXECUTION_ROLE_NAME} --output text --query 'Role.Arn')

echo $SAGEMAKER_EXECUTION_ROLE_ARN

Note down the execution role ARN to use in later steps.

Train an XGBoost model

Now, we create a training.yaml file to specify the parameters for a SageMaker training job. SageMaker training jobs enable remote training of ML models. You can customize each training job to run your own ML scripts with custom architectures, data loaders, hyperparameters, and more. To submit a SageMaker training job, we require a job name. Let’s create that variable first:

export JOB_NAME=ack-xgboost-training-job-$RANDOM_VAR

In the following code, we create a training.yaml file that contains the hyperparameters for the training job as well as the location of the training and validation data. It’s also where we specify the Amazon ECR image used for training.

Note: If your $SERVICE_REGION isn’t us-east-1, change the following image URI. For the XGBoost algorithm version 1.2-1 Region-specific image URI, see Docker Registry Paths and Example Code.

export XGBOOST_IMAGE=683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-xgboost:1.2-1

printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: TrainingJob
metadata:
  name: '$JOB_NAME'
spec:
  # Name that will appear in SageMaker console
  trainingJobName: '$JOB_NAME'
  hyperParameters: 
    max_depth: "5"
    gamma: "4"
    eta: "0.2"
    min_child_weight: "6"
    subsample: "0.7"
    objective: "reg:linear"
    num_round: "50"
    verbosity: "2"
  algorithmSpecification:
    trainingImage: '$XGBOOST_IMAGE'
    trainingInputMode: File
  roleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
  outputDataConfig:
    # The output path of our model
    s3OutputPath: s3://'$SAGEMAKER_BUCKET'
  resourceConfig:
    instanceCount: 1
    instanceType: ml.m4.xlarge
    volumeSizeInGB: 5
  stoppingCondition:
    maxRuntimeInSeconds: 3600
  inputDataConfig:
    - channelName: train
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our train data 
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/train/abalone.train
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None
    - channelName: validation
      dataSource:
        s3DataSource:
          s3DataType: S3Prefix
          # The input path of our validation data 
          s3URI: s3://'$SAGEMAKER_BUCKET'/datasets/tabular/uci_abalone/validation/abalone.validation
          s3DataDistributionType: FullyReplicated
      contentType: text/libsvm
      compressionType: None 
' > ./training.yaml

Now, we can create the training job:

kubectl apply -f training.yaml

You should see the following output:

trainingjob.sagemaker.services.k8s.aws/ack-xgboost-training-job-7420 created

You can watch the status of the training job. It takes a few minutes for STATUS to show as Completed.

kubectl get trainingjob.sagemaker --watch
NAME                            SECONDARYSTATUS   STATUS
ack-xgboost-training-job-7420   Starting          InProgress
ack-xgboost-training-job-7420   Downloading       InProgress
ack-xgboost-training-job-7420   Training          InProgress
ack-xgboost-training-job-7420   Completed         Completed
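
The same STATUS and SECONDARYSTATUS values are available from the SageMaker API, so you can cross-check the job from the AWS side if you want to. The following sketch polls describe_training_job until the job reaches a terminal status. wait_for_training_job and ScriptedClient are hypothetical helpers; the scripted client only replays canned responses so the example runs offline.

```python
import time

# Statuses after which a training job no longer changes.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(client, job_name, poll_seconds=30, max_polls=120):
    """Poll describe_training_job until the job reaches a terminal status."""
    for _ in range(max_polls):
        description = client.describe_training_job(TrainingJobName=job_name)
        status = description["TrainingJobStatus"]
        print(f"{job_name}: {description.get('SecondaryStatus', '-')} / {status}")
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_name} still running after {max_polls} polls")


class ScriptedClient:
    """Offline stand-in that replays a fixed sequence of responses."""

    def __init__(self, responses):
        self._responses = iter(responses)

    def describe_training_job(self, TrainingJobName):
        return next(self._responses)


if __name__ == "__main__":
    fake = ScriptedClient([
        {"TrainingJobStatus": "InProgress", "SecondaryStatus": "Starting"},
        {"TrainingJobStatus": "InProgress", "SecondaryStatus": "Training"},
        {"TrainingJobStatus": "Completed", "SecondaryStatus": "Completed"},
    ])
    print(wait_for_training_job(fake, "demo-job", poll_seconds=0))
    # With real credentials, pass boto3.client("sagemaker", region_name=...) instead.
```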

Deploy the results of the SageMaker training job

To deploy the model, we need to specify a model name, an endpoint config name, and an endpoint name:

export MODEL_NAME=ack-xgboost-model-$RANDOM_VAR
export ENDPOINT_CONFIG_NAME=ack-xgboost-endpoint-config-$RANDOM_VAR
export ENDPOINT_NAME=ack-xgboost-endpoint-$RANDOM_VAR

We deploy this model on an ml.c5.large instance. In the following .yaml file, we define the model, the endpoint config, and the endpoint:

printf '
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Model
metadata:
  name: '$MODEL_NAME'
spec:
  modelName: '$MODEL_NAME'
  primaryContainer:
    containerHostname: xgboost
    # The source of the model data
    modelDataURL: s3://'$SAGEMAKER_BUCKET'/'$JOB_NAME'/output/model.tar.gz
    image: '$XGBOOST_IMAGE'
  executionRoleARN: '$SAGEMAKER_EXECUTION_ROLE_ARN'
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: EndpointConfig
metadata:
  name: '$ENDPOINT_CONFIG_NAME'
spec:
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
  productionVariants:
  - modelName: '$MODEL_NAME'
    variantName: AllTraffic
    instanceType: ml.c5.large
    initialInstanceCount: 1
---
apiVersion: sagemaker.services.k8s.aws/v1alpha1
kind: Endpoint
metadata:
  name: '$ENDPOINT_NAME'
spec:
  endpointName: '$ENDPOINT_NAME'
  endpointConfigName: '$ENDPOINT_CONFIG_NAME'
' > ./deploy.yaml
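
The modelDataURL in the Model resource relies on where SageMaker writes training artifacts: under the s3OutputPath from training.yaml, in a folder named after the training job, as output/model.tar.gz. A small sketch of that convention follows; model_artifact_url is a hypothetical helper, and the names in the usage example are placeholders.

```python
def model_artifact_url(s3_output_path: str, training_job_name: str) -> str:
    """Return the S3 location where a training job stores model.tar.gz."""
    return f"{s3_output_path.rstrip('/')}/{training_job_name}/output/model.tar.gz"


if __name__ == "__main__":
    print(model_artifact_url("s3://ack-sagemaker-bucket-demo", "ack-xgboost-training-job-demo"))
```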

Now, the endpoint is ready to be deployed:

kubectl apply -f deploy.yaml

You should see the following output:

model.sagemaker.services.k8s.aws/ack-xgboost-model-7420 created
endpointconfig.sagemaker.services.k8s.aws/ack-xgboost-endpoint-config-7420 created
endpoint.sagemaker.services.k8s.aws/ack-xgboost-endpoint-7420 created

We can observe that the model and endpoint config were created. Deploying the endpoint may take some time:

kubectl describe models.sagemaker
kubectl describe endpointconfigs.sagemaker
kubectl describe endpoints.sagemaker

We can watch this process using the following command:

kubectl get endpoints.sagemaker --watch

After some time, the STATUS changes to InService:

NAME                        STATUS
ack-xgboost-endpoint-7420   Creating         
ack-xgboost-endpoint-7420   InService        

This indicates the deployed endpoint is ready for use.

Verify the inference capabilities of the trained model

We invoke the model endpoint using Python to emulate a typical use case. We reuse the code in the SageMaker example notebook.

We first download the test set from Amazon S3. Then we load a single sample from the test set and use it to invoke the endpoint we deployed in the previous section. Download the test file with the following code:

pip install boto3 numpy
aws s3 cp s3://sagemaker-sample-files/datasets/tabular/uci_abalone/test/abalone.test abalone.test
head -1 abalone.test > abalone.single.test

Use the Python interpreter to test inference. Where it’s available, the interpreter is usually installed as /usr/local/bin/python<version>; putting /usr/local/bin in your Unix/Linux shell’s search path makes it possible to start it by entering the python command.

Create a file named predict.py and insert the following code block:

printf '
import sys
import math
import json
import boto3
import numpy as np
import os

region = os.environ.get("SERVICE_REGION")
endpoint_name = os.environ.get("ENDPOINT_NAME")

runtime_client = boto3.client("runtime.sagemaker", region_name=region)

file_name = "abalone.single.test"
with open(file_name, "r") as f:
    payload = f.read().strip()

response = runtime_client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="text/x-libsvm", Body=payload
)

result = response["Body"].read().decode("utf-8").split(",")
result = [math.ceil(float(i)) for i in result]
label = payload.strip(" ").split()[0]
print("Label: " + label)
print("Prediction: " + str(result[0]))
' > ./predict.py
python predict.py

Running this sample should give us the following result:

Label: 12
Prediction: 13

The age of the abalone that is provided in the test example is estimated to be 13 by the ML model. The actual age was 12. This suggests that our ML model has been trained and provides reasonable predictions. However, the experienced ML user may realize that we haven’t performed hyperparameter tuning and other methods of increasing accuracy yet, which is outside the scope of this post.
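
The post-processing in predict.py is easy to check offline. In a libsvm line the label is the first whitespace-separated token, and the script rounds each value in the comma-separated response body up to a whole number. The sketch below isolates those two steps; the function names, the sample line, and the response body are made up for illustration and are not real data from the test set.

```python
import math


def parse_label(libsvm_line: str) -> str:
    """Return the label, the first whitespace-separated token of a libsvm line."""
    return libsvm_line.split()[0]


def parse_predictions(body: str) -> list:
    """Convert a comma-separated response body to rounded-up integers."""
    return [math.ceil(float(value)) for value in body.split(",")]


if __name__ == "__main__":
    sample_line = "12 1:2 2:0.5 3:0.4"   # made-up libsvm record
    sample_body = "12.3"                 # made-up endpoint response
    print("Label: " + parse_label(sample_line))
    print("Prediction: " + str(parse_predictions(sample_body)[0]))
```

Because math.ceil always rounds up, a raw prediction only slightly above the label is reported as one year higher.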

Dynamically scale the endpoint according to the load

SageMaker ACK Operators support custom resource definitions for automatic scaling (using ScalableTarget and ScalingPolicy) for your hosted models. The following resources adjust the number of instances (minimum 1 to maximum 20) provisioned for a model in response to changes in the SageMakerVariantInvocationsPerInstance tracking metric, which is the average number of times per minute that each instance for a variant is invoked:

printf '
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalableTarget
metadata:
  name: ack-scalable-target-predfined
spec:
  maxCapacity: 20
  minCapacity: 1
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
---
apiVersion: applicationautoscaling.services.k8s.aws/v1alpha1
kind: ScalingPolicy
metadata:
  name: ack-scaling-policy-predefined
spec:
  policyName: ack-scaling-policy-predefined
  policyType: TargetTrackingScaling
  resourceID: endpoint/'$ENDPOINT_NAME'/variant/AllTraffic
  scalableDimension: "sagemaker:variant:DesiredInstanceCount"
  serviceNamespace: sagemaker
  targetTrackingScalingPolicyConfiguration:
    targetValue: 60
    scaleInCooldown: 700
    scaleOutCooldown: 300
    predefinedMetricSpecification:
        predefinedMetricType: SageMakerVariantInvocationsPerInstance
 ' > ./scale-endpoint.yaml

Apply with the following code:

kubectl apply -f scale-endpoint.yaml

You should see the following output:

scalabletarget.applicationautoscaling.services.k8s.aws/ack-scalable-target-predfined created
scalingpolicy.applicationautoscaling.services.k8s.aws/ack-scaling-policy-predefined created

We can observe that scalingpolicy was created:

kubectl describe scalingpolicy.applicationautoscaling

The output of scalingpolicy looks like the following:

Status:
  Ack Resource Metadata:
    Arn:               arn:aws:autoscaling:us-east-1:123456789012:scalingPolicy:b33d12b8-aa81-4cb8-855e-c7b6dcb9d6e7:resource/SageMaker/endpoint/ack-xgboost-endpoint/variant/AllTraffic:policyName/ack-scaling-policy-predefined
    Owner Account ID:  123456789012
  Alarms:
    Alarm ARN:   arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
    Alarm Name:  TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmHigh-966b8232-a9b9-467d-99f3-95436f5c0383
    Alarm ARN:   arn:aws:cloudwatch:us-east-1:123456789012:alarm:TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
    Alarm Name:  TargetTracking-endpoint/ack-xgboost-endpoint/variant/AllTraffic-AlarmLow-71e39f85-1afb-401d-9703-b788cdc10a93
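
As a rough guide to what this policy aims for, the steady-state instance count can be estimated from the target value of 60 invocations per instance per minute and the capacity limits of 1 and 20. The following is a deliberately simplified sketch: the real service reacts through the CloudWatch alarms shown above and honors the scale-in and scale-out cooldowns, none of which is modeled here, and desired_instance_count is a hypothetical helper.

```python
import math


def desired_instance_count(invocations_per_minute, target_per_instance=60,
                           min_capacity=1, max_capacity=20):
    """Estimate the instance count that keeps per-instance load at the target."""
    needed = math.ceil(invocations_per_minute / target_per_instance)
    # Clamp to the ScalableTarget limits.
    return max(min_capacity, min(max_capacity, needed))


if __name__ == "__main__":
    for load in (0, 60, 61, 600, 5000):
        print(load, "invocations/min ->", desired_instance_count(load), "instance(s)")
```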

Clean up

Run the following commands to delete the resources created in this post:

kubectl delete -f scale-endpoint.yaml
kubectl delete -f deploy.yaml
kubectl delete -f training.yaml

Create a file named uninstall-controller.sh and insert the following code block required for deleting the controller and custom resource definitions:

printf '#!/usr/bin/env bash

# Uninstall Controller

export HELM_EXPERIMENTAL_OCI=1
export ACK_K8S_NAMESPACE=${ACK_K8S_NAMESPACE:-"ack-system"}

function uninstall_ack_controller() {
   local service="$1"
   local chart_export_path=/tmp/chart
   
   helm uninstall -n $ACK_K8S_NAMESPACE ack-$service-controller
   kubectl delete -f $chart_export_path/$service-chart/crds
}

uninstall_ack_controller "sagemaker"
uninstall_ack_controller "applicationautoscaling"
' > ./uninstall-controller.sh

Run the following commands to uninstall the controller and custom resource definitions, and delete the namespace, IAM roles, and S3 bucket you created:

# uninstall controller and remove CRDs
chmod +x uninstall-controller.sh
./uninstall-controller.sh

# Delete controller namespace
kubectl delete namespace $ACK_K8S_NAMESPACE

# Delete S3 bucket
aws s3 rb s3://$SAGEMAKER_BUCKET --force

# Delete SageMaker execution role
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam detach-role-policy --role-name $SAGEMAKER_EXECUTION_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-role --role-name $SAGEMAKER_EXECUTION_ROLE_NAME

# Delete application autoscaling service linked role
aws iam delete-service-linked-role --role-name AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint

# Delete IAM role created for IRSA
aws iam detach-role-policy --role-name $OIDC_ROLE_NAME --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-role-policy --role-name $OIDC_ROLE_NAME --policy-name "iam-pass-role-policy"
aws iam delete-role --role-name $OIDC_ROLE_NAME

Conclusion

SageMaker ACK Operators provide engineering teams with a native Kubernetes experience for creating and interacting with the ML jobs on SageMaker, either with the Kubernetes API or with Kubernetes command line utilities such as kubectl. You can build automation, tooling, and custom interfaces for data scientists in Kubernetes by using these controllers—all without building, maintaining, or optimizing ML infrastructure. Data scientists and developers familiar with Kubernetes can compose and interact with fully managed SageMaker training, tuning, and inference jobs, as you would with Kubernetes jobs running locally. Logs from SageMaker jobs stream back to Kubernetes, allowing you to natively view logs for your model training, tuning, and prediction jobs in the command line.

ACK is a community-driven project and will soon include service controllers for other AWS service APIs.

Links

[1] Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.


About the Authors

Kanwaljit Khurmi is a Senior Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Suraj Kota is a Software Engineer specialized in Machine Learning infrastructure. He builds tools to easily get started and scale machine learning workloads on AWS. He worked on the AWS Deep Learning Containers, Deep Learning AMI, SageMaker Operators for Kubernetes, and other open source integrations like Kubeflow.

Archis Joglekar is an AI/ML Partner Solutions Architect in the Emerging Technologies team. He is interested in performant, scalable deep learning and scientific computing using the building blocks at AWS. His past experiences range from computational physics research to machine learning platform development in academia, national labs, and startups. His time away from the computer is spent playing soccer and with friends and family.
