Optimizing costs for Amazon SageMaker Canvas with automatic shutdown of idle apps

Amazon SageMaker Canvas is a rich, no-code Machine Learning (ML) and Generative AI workspace that has allowed customers all over the world to more easily adopt ML technologies to solve old and new challenges thanks to its visual, no-code interface. It does so by covering the ML workflow end-to-end: whether you’re looking for powerful data preparation and AutoML, managed endpoint deployment, simplified MLOps capabilities, and ready-to-use models powered by AWS AI services and Generative AI, SageMaker Canvas can help you to achieve your goals.

As companies of all sizes adopt SageMaker Canvas, customers asked for ways to optimize cost. As defined in the AWS Well-Architected Framework, a cost-optimized workload fully uses all resources, meets your functional requirements, and achieves an outcome at the lowest possible price point.

Today, we’re introducing a new way to further optimize costs for SageMaker Canvas applications. SageMaker Canvas now collects Amazon CloudWatch metrics that provide insight into app usage and idleness. Customers can use this information to shut down automatically idle SageMaker Canvas applications to avoiding incurring unintended costs.

In this post, we’ll show you how to automatically shut down idle SageMaker Canvas apps to control costs by using a simple serverless architecture. Templates used in this post are available in GitHub.

Understanding and tracking costs

Education is always the first step into understanding and controlling costs for any workload, either on-premises or in the cloud. Let’s start by reviewing the SageMaker Canvas pricing model. In a nutshell, SageMaker Canvas has a pay-as-you-go pricing model, based on two dimensions:

  • Workspace instance: ­ formerly known as session time, is the cost associated with running the SageMaker Canvas app
  • AWS service charges: ­ costs associated with training the models, deploying the endpoints, generating inferences (resources to spin up SageMaker Canvas).

Customers always have full control over the resources that are launched by SageMaker Canvas and can keep track of costs associated with the SageMaker Canvas app by using the AWS Billing and Cost Management service. For more information, refer to Manage billing and cost in SageMaker Canvas.

To limit the cost associated with the workspace instances, as a best practice, you must log out, do not close the browser tab. To log out, choose the Log out button on the left panel of the SageMaker Canvas app.

Automatically shutting down SageMaker Canvas applications

For IT Administrators that are looking to provide automated controls for shutting down SageMaker Canvas applications and keeping costs under control, there are two approaches:

  1. Shutdown applications on a schedule (every day at 19:00 or every Friday at 18:00)
  2. Shutdown automatically idle applications (when the application hasn’t been used for two hours)

Shutdown applications on a schedule

Canvas Scheduled Shutdown Architecture

Scheduled shutdown of SageMaker Canvas applications can be achieved with very little effort by using a cron expression (with Amazon EventBridge Cron Rule), a compute component (an AWS Lambda function) that calls the Amazon SageMaker API DeleteApp. This approach has been discussed in the Provision and manage ML environments with Amazon SageMaker Canvas using AWS CDK and AWS Service Catalog post, and implemented in the associated GitHub repository.

One of the advantages of the above architecture is that it is very simple to duplicate it to achieve scheduled creation of the SageMaker Canvas app. By using a combination of scheduled creation and scheduled deletion, a cloud administrator can make sure that the SageMaker Canvas application is ready to be used whenever users start their business day (e.g. 9AM on a work day), and that the app also automatically shuts down at the end of the business day (e.g. 7PM on a work day, always shut down during weekends). All that is needed to do is change the line of code calling the DeleteApp API into CreateApp, as well as updating the cron expression to reflect the desired app creation time.

While this approach is very easy to implement and test, a drawback of the suggested architecture is that it does not take into account whether an application is currently being used or not, shutting it down regardless of its current activity status. According to different situations, this might cause friction with active users, which might suddenly see their session terminated.

You can retrieve the template associated to this architecture from the following GitHub repository:

Shutdown automatically idle applications

Canvas Shutdown on Idle Architecture

Starting today, Amazon SageMaker Canvas emits CloudWatch metrics that provide insight into app usage and idleness. This allows an administrator to define a solution that reads the idleness metric, compares it against a threshold, and defines a specific logic for automatic shutdown. A more detailed overview of the idleness metric emitted by SageMaker Canvas is shown in the following paragraph.

To achieve automatic shutdown of SageMaker Canvas applications based on the idleness metrics, we provide an AWS CloudFormation template. This template consists of three main components:

  1. An Amazon CloudWatch Alarm, which runs a query to check the MAX value of the TimeSinceLastActive metric. If this value is greater than a threshold provided as input to the CloudFormation template, it triggers the rest of the automation. This query can be run on a single user profile, on a single domain, or across all domains. According to the level of control that you wish to have, you can use:
    1. the all-domains-all-users template, which checks this across all users and all domains in the region where the template is deployed
    2. the one-domain-all-users template, which checks this across all users in one domain in the region where the template is deployed
    3. the one-domain-one-user template, which checks this for one user profile, in one domain, in the region where the template is deployed
  2. The alarm state change creates an event on the default event bus in Amazon EventBridge, which has an Amazon EventBridge Rule set up to trigger an AWS Lambda function
  3. The AWS Lambda function identifies which SageMaker Canvas app has been running in idle for more than the specified threshold, and deletes it with the DeleteApp API.

You can retrieve the AWS CloudFormation templates associated to this architecture from the following GitHub repository:

How SageMaker Canvas idleness metric work

SageMaker Canvas emits a TimeSinceLastActive metric in the /aws/sagemaker/Canvas/AppActivity namespace, which shows the number of seconds that the app has been idle with no user activity. We can use this new metric to trigger an automatic shutdown of the SageMaker Canvas app when it has been idle for a defined period. SageMaker Canvas exposes the TimeSinceLastActive with the following schema:

{
    "Namespace": "/aws/sagemaker/Canvas/AppActivity",
    "Dimensions": [
        [
            "DomainId",
            "UserProfileName"
        ]
    ],
    "Metrics": [
        {
            "Name": "TimeSinceLastActive",
            "Unit": "Seconds",
            "Value": 12345
        }
    ]
}

The key components of this metric are as follows:

  • Dimensions, in particular DomainID and UserProfileName, that allow an administrator to pinpoint which applications are idle across all domains and users
  • Value of the metric, which indicates the number of seconds since the last activity in the SageMaker Canvas applications. SageMaker Canvas considers the following as activity:
    • Any action taken in the SageMaker Canvas application (clicking a button, transforming a dataset, generating an in-app inference, deploying a model);
    • Using a ready-to-use model or interacting with the Generative AI models using chat interface;
    • A batch inference scheduled to run at a specific time; for more information, refer to  Manage automations.

This metric can be read via Amazon CloudWatch API such as get_metric_data. For example, using the AWS SDK for Python (boto3):

import boto3, datetime

cw = boto3.client('cloudwatch')
metric_data_results = cw.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "q1",
            "Expression": 'SELECT MAX(TimeSinceLastActive) FROM "/aws/sagemaker/Canvas/AppActivity" GROUP BY DomainId, UserProfileName',
            "Period": 900
        }
    ],
    StartTime=datetime.datetime(2023, 1, 1),
    EndTime=datetime.datetime.now(),
    ScanBy='TimestampAscending'
)

The Python query extracts the MAX value of TimeSinceLastActive from the namespace associated to SageMaker Canvas after grouping these values by DomainID and UserProfileName.

Deploying and testing the auto-shutdown solution

To deploy the auto-shutdown stack, do the following:

  1. Download the AWS CloudFormation template that refers to the solution you want to implement from the above GitHub repository. Choose whether you want to implement a solution for all SageMaker Domains, for a single SageMaker Domain, or for a single user;
  2. Update template parameters:
    1. The idle timeout – time (in seconds) that the SageMaker Canvas app is allowed to stay in idle before it gets shutdown; default value is 2 hours
    2. The alarm period – aggregation time (in seconds) used by CloudWatch Alarm to compute the idle timeout; default value is 20 minutes
    3. (optional) SageMaker Domain ID and user profile name
  3. Deploy the CloudFormation stack to create the resources

Once deployed (should take less than two minutes), the AWS Lambda function and Amazon CloudWatch alarm are configured to automatically shut down the Canvas app when idle. To test the auto-shutdown script, do the following:

  1. Make sure that the SageMaker Canvas app is running within the right domain and with the right user profile (if you have configured them).
  2. Stop using the SageMaker Canvas app and wait for the idle timeout period (default, 2 hours)
  3. Check that the app is stopped after being idle for the threshold time by checking that the CloudWatch alarm has been triggered and, after triggering the automation, it has gone back to the normal state.

In our test, we have set the idle timeout period to two hours (7200 seconds). In the following graph plotted by Amazon CloudWatch Metrics, you can see that the SageMaker Canvas app has been emitting the TimeSinceLastActive metric until the threshold was met (1), which triggered the alarm. Once the alarm was triggered, the AWS Lambda function was executed, which deleted the app and brought the metric back below the threshold (2).

Canvas Auto-shutdown Metrics Plot

Conclusion

In this post, we implemented an automated shutdown solution for idle SageMaker Canvas apps using AWS Lambda and CloudWatch Alarm and the newly emitted metric of idleness from SageMaker Canvas. Thanks to this solution, customers not only can optimize costs for their ML workloads but can also avoid unintended charges for applications that they forgot were running in their SageMaker Domain.

We’re looking forward to seeing what new use cases and workloads customers can solve with the peace of mind brought by this solution. For more examples of how SageMaker Canvas can help you achieve your business goals, refer to the following posts:

To learn how you can run production-level workloads with Amazon SageMaker Canvas, refer to the following posts:


About the authors


Davide Gallitelli is a Senior Specialist Solutions Architect for AI/ML. He is based in Brussels and works closely with customers all around the globe that are looking to adopt Low-Code/No-Code Machine Learning technologies, and Generative AI. He has been a developer since he was very young, starting to code at the age of 7. He started learning AI/ML at university, and has fallen in love with it since then.


Huong Nguyen is a Sr. Product Manager at AWS. She is leading the data ecosystem integration for SageMaker, with 14 years of experience building customer-centric and data-driven products for both enterprise and consumer spaces.


Gunjan Garg is a Principal Engineer at Amazon SageMaker team in AWS, providing technical leadership for the product. She has worked in several roles in the AI/ML org for last 5 years and is currently focused on Amazon SageMaker Canvas.


Ziyao Huang is a Software Development Engineer with Amazon SageMaker Data Wrangler. He is passionate about building great product that makes ML easy for the customers. Outside of work, Ziyao likes to read, and hang out with his friends.

Read More