Speed up your AI inference workloads with new NVIDIA-powered capabilities in Amazon SageMaker

This post is co-written with Abhishek Sawarkar, Eliuth Triana, Jiahong Liu and Kshitiz Gupta from NVIDIA. 

At re:Invent 2024, we are excited to announce new capabilities to speed up your AI inference workloads with NVIDIA accelerated computing and software offerings on Amazon SageMaker. These advancements build upon our collaboration with NVIDIA, which includes adding support for inference-optimized GPU instances and integration with NVIDIA technologies. They represent our continued commitment to delivering scalable, cost-effective, and flexible GPU-accelerated AI inference capabilities to our customers.

Today, we are introducing three key advancements that further expand our AI inference capabilities:

  1. NVIDIA NIM microservices are now available in AWS Marketplace for SageMaker Inference deployments, providing customers with easy access to state-of-the-art generative AI models.
  2. NVIDIA Nemotron-4 is now available on Amazon SageMaker JumpStart, significantly expanding the range of high-quality, pre-trained models available to our customers. This integration provides a powerful multilingual model that excels in reasoning benchmarks.
  3. Inference-optimized P5e and G6e instances are now generally available on Amazon SageMaker, giving customers access to NVIDIA H200 Tensor Core and L40S GPUs for AI inference workloads.

In this post, we will explore how you can use these new capabilities to enhance your AI inference on Amazon SageMaker. We’ll walk through the process of deploying NVIDIA NIM microservices from AWS Marketplace for SageMaker Inference. We’ll then dive into NVIDIA’s model offerings on SageMaker JumpStart, showcasing how to access and deploy the Nemotron-4 model directly in the JumpStart interface. This will include step-by-step instructions on how to find the Nemotron-4 model in the JumpStart catalog, select it for your use case, and deploy it with a few clicks. We’ll also demonstrate how to fine-tune and optimize this model for your specific requirements. Additionally, we’ll introduce you to the new inference-optimized P5e and G6e instances powered by NVIDIA H200 and L40S GPUs, showcasing how they can significantly boost your AI inference performance. By the end of this post, you’ll have a practical understanding of how to implement these advancements in your own AI projects, enabling you to accelerate your inference workloads and drive innovation in your organization.

Announcing NVIDIA NIM in AWS Marketplace for SageMaker Inference

NVIDIA NIM, part of the NVIDIA AI Enterprise software platform, offers a set of high-performance microservices designed to help organizations rapidly deploy and scale generative AI applications on NVIDIA-accelerated infrastructure. SageMaker Inference is a fully managed capability for customers to run generative AI and machine learning models at scale, providing purpose-built features and a broad array of inference-optimized instances. AWS Marketplace serves as a curated digital catalog where customers can find, buy, deploy, and manage third-party software, data, and services needed to build solutions and run businesses. We’re excited to announce that AWS customers can now access NVIDIA NIM microservices for SageMaker Inference deployments through the AWS Marketplace, simplifying the deployment of generative AI models and helping partners and enterprises to scale their AI capabilities. The initial availability includes a portfolio of models packaged as NIM microservices, expanding the options for AI inference on Amazon SageMaker, including:

  • NVIDIA Nemotron-4: a cutting-edge large language model (LLM) designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains.
  • Llama 3.1 8B-Instruct: an 8-billion-parameter multilingual LLM that is a pre-trained and instruction-tuned generative model optimized for language understanding, reasoning, and text generation use cases.
  • Llama 3.1 70B-Instruct: a 70-billion-parameter pre-trained, instruction-tuned model optimized for multilingual dialogue.
  • Mixtral 8x7B Instruct v0.1: a high-quality sparse mixture of experts model (SMoE) with open weights that can follow instructions, complete requests, and generate creative text formats.

Key benefits of deploying NIM on AWS

  • Ease of deployment: AWS Marketplace integration makes it straightforward to select and deploy models directly, eliminating complex setup processes. Select your preferred model from the marketplace, configure your infrastructure options, and deploy within minutes.
  • Seamless integration with AWS services: AWS offers robust infrastructure options, including GPU-optimized instances for inference, managed AI services such as SageMaker, and Kubernetes support with EKS, helping your deployments scale effectively.
  • Security and control: Maintain full control over your infrastructure settings on AWS, allowing you to optimize your runtime environments to match specific use cases.

How to get started with NVIDIA NIM on AWS

To deploy NVIDIA NIM microservices from the AWS Marketplace, follow these steps:

  1. Visit the NVIDIA NIM page on the AWS Marketplace and select your desired model, such as Llama 3.1 or Mixtral.
  2. Choose the AWS Regions to deploy to, the GPU instance types, and the resource allocations that fit your needs.
  3. Use the notebook examples to start your deployment: SageMaker creates the model, configures the endpoint, and deploys the model, while AWS handles the orchestration of resources, networking, and scaling as needed. A minimal programmatic sketch of this flow follows these steps.
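
If you prefer to script the deployment outside the provided notebooks, a minimal boto3 sketch of the flow might look like the following; the model package ARN, role ARN, names, and instance type are placeholders to replace with the values from your AWS Marketplace subscription and account:

import boto3

sagemaker_client = boto3.client("sagemaker")

# Placeholder values -- replace with the model package ARN from your Marketplace
# subscription, your SageMaker execution role, and your preferred names
model_package_arn = "arn:aws:sagemaker:us-east-1:<account-id>:model-package/<nim-model-package>"
role = "<your-sagemaker-execution-role-arn>"
model_name = "nim-marketplace-model"
endpoint_config_name = f"{model_name}-config"
endpoint_name = f"{model_name}-endpoint"

# Create a SageMaker model from the subscribed Marketplace model package
sagemaker_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    PrimaryContainer={"ModelPackageName": model_package_arn},
    EnableNetworkIsolation=True,
)

# Create an endpoint configuration on a GPU instance type
sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InstanceType": "ml.g5.2xlarge",  # replace with an instance type supported by the model package
            "InitialInstanceCount": 1,
        }
    ],
)

# Deploy the endpoint
sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)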

NVIDIA NIM microservices in the AWS Marketplace facilitate seamless deployment in SageMaker so that organizations across various industries can develop, deploy, and scale their generative AI applications more quickly and effectively than ever.

SageMaker JumpStart now includes NVIDIA models: Introducing NVIDIA NIM microservices for Nemotron models

SageMaker JumpStart is a model hub and no-code solution within SageMaker that makes advanced AI inference capabilities more accessible to AWS customers by providing a streamlined path to access and deploy popular models from different providers. It offers an intuitive interface where organizations can easily deploy popular AI models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, fine-tuning and customization capabilities, and collaboration tools, all while giving customers full control of their deployment.

We are excited to announce that NVIDIA models are now available in SageMaker JumpStart, marking a significant milestone in our ongoing collaboration. This integration brings NVIDIA’s cutting-edge AI models directly to SageMaker Inference customers, starting with the powerful Nemotron-4 model. With JumpStart, customers can access NVIDIA’s state-of-the-art models within the SageMaker ecosystem, combining NVIDIA’s AI models with the scalable, price-performant inference that SageMaker provides.

Support for Nemotron-4 – A multilingual and fine-grained reasoning model

We are also excited to announce that NVIDIA Nemotron-4 is now available in the JumpStart model hub. Nemotron-4 is a cutting-edge LLM designed to generate diverse synthetic data that closely mimics real-world data, enhancing the performance and robustness of custom LLMs across various domains. Compact yet powerful, it has been fine-tuned on carefully curated datasets that emphasize high-quality sources and underrepresented domains. This refined approach enables strong results in commonsense reasoning, mathematical problem-solving, and programming tasks. Moreover, Nemotron-4 exhibits outstanding multilingual capabilities compared to similarly sized models, and even outperforms those over four times larger and those explicitly specialized for multilingual tasks.

Nemotron-4 – performance and optimization benefits

Nemotron-4 demonstrates strong performance on commonsense reasoning tasks such as SIQA, ARC, PIQA, and HellaSwag, with an average score of 73.4, outperforming similarly sized models and performing on par with larger ones such as Llama-2 34B. Its exceptional multilingual capabilities also surpass specialized models like mGPT 13B and XGLM 7.5B on benchmarks like XCOPA and TyDiQA, highlighting its versatility and efficiency. When deployed through NVIDIA NIM microservices on SageMaker, these models deliver optimized inference performance, allowing businesses to generate and validate synthetic data with unprecedented speed and accuracy.

Through SageMaker JumpStart, customers can access pre-optimized models from NVIDIA that significantly simplify deployment and management. These containers are specifically tuned for NVIDIA GPUs on AWS, providing optimal performance out of the box. NIM microservices deliver efficient deployment and scaling, allowing organizations to focus on their use cases rather than infrastructure management.

Quick start guide

  1. From the SageMaker Studio console, select JumpStart and choose the NVIDIA model family as shown in the following image.
  2. Select the NVIDIA Nemotron-4 NIM microservice.
  3. On the model details page, choose Deploy, and a pop-up window will remind you that you need an AWS Marketplace subscription. If you haven’t subscribed to this model, you can choose Subscribe, which will direct you to the AWS Marketplace to complete the subscription. Otherwise, you can choose Deploy to proceed with model deployment.
  4. On the model deployment page, you can configure the endpoint name, select the endpoint instance type and instance count, in addition to other advanced settings, such as IAM role and VPC setting.
  5. After you finish setting up the endpoint and choose Deploy at the bottom right corner, the NVIDIA Nemotron-4 model will be deployed to a SageMaker endpoint. After the endpoint’s status is In Service, you can start testing the model by invoking the endpoint using the following code. Take a look at the example notebook if you want to deploy the model programmatically.
    import json
    import boto3

    # SageMaker Runtime client used to invoke the endpoint
    client = boto3.client("sagemaker-runtime")

    messages = [
        {"role": "user", "content": "Hello! How are you?"},
        {"role": "assistant", "content": "Hi! I am quite well, how can I help you today?"},
        {"role": "user", "content": "Write a short limerick about the wonders of GPU Computing."}
    ]
    payload = {
        "model": payload_model,  # model name expected by the NIM container
        "messages": messages,
        "max_tokens": 100,
        "stream": True
    }
    response = client.invoke_endpoint_with_response_stream(
        EndpointName=endpoint_name,
        Body=json.dumps(payload),
        ContentType="application/json",
        Accept="application/jsonlines",
    )

  6. To clean up the endpoint, you can delete the endpoint from the SageMaker Studio console or call the delete endpoint API.
    sagemaker_client = boto3.client("sagemaker")
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

SageMaker JumpStart provides an additional streamlined path to access and deploy NVIDIA NIM microservices, making advanced AI capabilities even more accessible to AWS customers. Through JumpStart’s intuitive interface, organizations can deploy Nemotron models with a few clicks, eliminating the complexity typically associated with model deployment and infrastructure management. The integration offers enterprise-grade features including model evaluation metrics, customization capabilities, and collaboration tools, all while maintaining data privacy within the customer’s VPC. This comprehensive integration enables organizations to accelerate their AI initiatives while using the combined strengths of the scalable infrastructure provided by AWS and NVIDIA’s optimized models.

P5e and G6e instances powered by NVIDIA H200 Tensor Core and L40S GPUs are now available on SageMaker Inference

SageMaker now supports new P5e and G6e instances, powered by NVIDIA GPUs for AI inference.

P5e instances use NVIDIA H200 Tensor Core GPUs for AI and machine learning. These instances offer 1.7 times larger GPU memory and 1.4 times higher memory bandwidth than previous generations. With eight powerful H200 GPUs per instance connected using NVIDIA NVLink for seamless GPU-to-GPU communication and blazing-fast 3,200 Gbps multi-node networking through EFA technology, P5e instances are purpose-built for deploying and training even the most demanding ML models. These instances deliver performance, reliability, and scalability for your cutting-edge inference applications.

G6e instances, powered by NVIDIA L40S GPUs, are one of the most cost-efficient GPU instances for deploying generative AI models and the highest-performance universal GPU instances for spatial computing, AI, and graphics workloads. They offer 2 times higher GPU memory (48 GB) and 2.9 times faster GPU memory bandwidth compared to G6 instances. G6e instances deliver up to 2.5 times better performance compared to G5 instances. Customers can use G6e instances to deploy LLMs and diffusion models for generating images, video, and audio. G6e instances feature up to eight NVIDIA L40S GPUs with 384 GB of total GPU memory (48 GB of memory per GPU) and third-generation AMD EPYC processors. They also support up to 192 vCPUs, up to 400 Gbps of network bandwidth, up to 1.536 TB of system memory, and up to 7.6 TB of local NVMe SSD storage.

Both instance families are now available on SageMaker Inference. Check out AWS Region availability and pricing on our pricing page.
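
As a quick illustration, deploying a JumpStart model onto one of these instances with the SageMaker Python SDK might look like the following sketch; the model ID and instance type are placeholders to adapt to your use case:

from sagemaker.jumpstart.model import JumpStartModel

# Hypothetical model ID -- look up the exact ID of the model you want in the
# SageMaker JumpStart catalog
model = JumpStartModel(model_id="<jumpstart-model-id>")

# Deploy on a G6e (NVIDIA L40S) instance; larger models may call for a
# P5e (NVIDIA H200) instance such as ml.p5e.48xlarge
predictor = model.deploy(
    instance_type="ml.g6e.12xlarge",
    initial_instance_count=1,
    accept_eula=True,  # required for some gated model families
)

print(predictor.endpoint_name)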

Conclusion

These new capabilities let you deploy NVIDIA NIM microservices on SageMaker through the AWS Marketplace, use new NVIDIA Nemotron models, and tap the latest GPU instance types to power your ML workloads. We encourage you to give these offerings a look and use them to accelerate your AI workloads on SageMaker Inference.


About the authors

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Eliuth Triana is a Developer Relations Manager at NVIDIA empowering Amazon’s AI MLOps, DevOps, Scientists and AWS technical experts to master the NVIDIA computing stack for accelerating and optimizing Generative AI Foundation models spanning from data curation, GPU training, model inference and production deployment on AWS GPU instances. In addition, Eliuth is a passionate mountain biker, skier, tennis and poker player.

Abhishek Sawarkar is a product manager in the NVIDIA AI Enterprise team working on integrating NVIDIA AI Software in Cloud MLOps platforms. He focuses on integrating the NVIDIA AI end-to-end stack within Cloud platforms & enhancing user experience on accelerated computing.

Jiahong Liu is a Solutions Architect on the Cloud Service Provider team at NVIDIA. He assists clients in adopting machine learning and AI solutions that leverage NVIDIA-accelerated computing to address their training and inference challenges. In his leisure time, he enjoys origami, DIY projects, and playing basketball.

Kshitiz Gupta is a Solutions Architect at NVIDIA. He enjoys educating cloud customers about the GPU AI technologies NVIDIA has to offer and assisting them with accelerating their machine learning and deep learning applications. Outside of work, he enjoys running, hiking, and wildlife watching.

Read More

Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints maintained a minimum number of instances to provide continuous availability, even during periods of low or no traffic. With this update, available when using SageMaker inference components, you have more options to align your resource usage with your specific needs and traffic patterns.

Refer to the accompanying notebooks to get started with the new scale down to zero feature.

The new feature expands the possibilities for managing SageMaker inference endpoints. It allows you to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. With this feature, you can closely match your compute resource usage to your actual needs, potentially reducing costs during times of low demand. This enhancement builds upon the existing auto scaling capabilities in SageMaker, offering more granular control over resource allocation. You can now configure your scaling policies to include scaling to zero, allowing for more precise management of your AI inference infrastructure.

The scale down to zero feature presents new opportunities for how businesses can approach their cloud-based ML operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, you are encouraged to carefully evaluate how it fits into your overall architecture and operational needs, considering factors such as response times and the specific requirements of your applications.

In this post, we explore the new scale to zero feature for SageMaker inference endpoints, demonstrating how to implement and use this capability to optimize costs and manage resources more effectively. We cover the key scenarios where scaling to zero is beneficial, provide best practices for optimizing scale-up time, and walk through the step-by-step process of implementing this functionality. Additionally, we discuss how to set up scheduled scaling actions for predictable traffic patterns and test the behavior of your scaled-to-zero endpoints.

Determining when to scale to zero

Before we dive into the implementation details of the new scale to zero feature, it’s crucial to understand when and why you should consider using it. Although the ability to scale SageMaker inference endpoints to zero instances offers significant cost-saving potential, not all scenarios benefit equally from scaling to zero, and in some cases, it may even impact the performance of your applications. Let’s look at how to identify the scenarios where this feature provides the most value.

The ability to scale SageMaker inference endpoints to zero instances is particularly beneficial in three key scenarios:

  • Predictable traffic patterns – If your inference traffic is predictable and follows a consistent schedule, you can use this scaling functionality to automatically scale down to zero during periods of low or no usage. This eliminates the need to manually delete and recreate inference components and endpoints.
  • Sporadic or variable traffic – For applications that experience sporadic or variable inference traffic patterns, scaling down to zero instances can provide significant cost savings. However, scaling from zero instances back up to serving traffic is not instantaneous. During the scale-out process, any requests sent to the endpoint will fail, and these NoCapacityInvocationFailures will be captured in Amazon CloudWatch.
  • Development and testing environments – The scale to zero functionality is also beneficial when testing and evaluating new ML models. During model development and experimentation, you might create temporary inference endpoints to test different configurations. However, it’s possible to forget to delete these endpoints when you’re done. Scaling down to zero makes sure these test endpoints automatically scale back to zero instances when not in use, preventing unwanted charges. This allows you to freely experiment without closely monitoring infrastructure usage or remembering to manually delete endpoints. The automatic scaling to zero provides a cost-effective way to test out ideas and iterate on your ML solutions.

By carefully evaluating your specific use case against these scenarios, you can make informed decisions about implementing scale to zero functionality. This approach makes sure you maximize cost savings without compromising on the performance and availability requirements of your ML applications. It’s important to note that although scaling to zero can provide significant benefits, it also introduces a trade-off in terms of initial response time when scaling back up. Therefore, it’s crucial to assess whether your application can tolerate this potential delay and to implement appropriate strategies to manage it. In the following sections, we dive deeper into each scenario and provide guidance on how to determine if scaling to zero is the right choice for your specific needs. We also discuss best practices for implementation and strategies to mitigate potential drawbacks.

Scale down to zero is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

Now that we understand when to use the scale to zero feature, let’s dive into how to optimize its performance and implement it effectively. Scaling up from zero instances to serving traffic introduces a brief delay (cold start), which can impact your application’s responsiveness. To mitigate this, we first explore best practices for minimizing scale-up time. Then we walk through the step-by-step process of implementing the scale to zero functionality for your SageMaker inference endpoints.

Optimizing scale-up time best practices

When using the scale to zero feature, it’s crucial to minimize the time it takes for your endpoint to scale up and begin serving requests. The following are several best practices you can implement to decrease the scale-out time for your SageMaker inference endpoints:

  • Decrease model or container download time – Use uncompressed model format to reduce the time it takes to download the model artifacts when scaling up. Compressed model files may save storage space, but they require additional time to uncompress and files can’t be downloaded in parallel, which can slow down the scale-up process. To learn more, see Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.
  • Reduce model server startup time – Look for ways to optimize the startup and initialization of your model server container. This could include techniques like building packages into the image, using multi-threading, or minimizing unnecessary initialization steps. For more details, see Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1.
  • Use faster auto scaling metrics – Take advantage of more granular auto scaling metrics like ConcurrentRequestsPerCopy to more accurately monitor and react to changes in inference traffic. These sub-minute metrics can help trigger scale-out actions more precisely, reducing the number of NoCapacityInvocationFailures your users might experience. For more information, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
  • Handle failed requests – When scaling from zero instances, there will be a brief period where requests fail due to NoCapacityInvocationFailures because SageMaker provisions resources. To handle this, you can use queues or implement client-side retries:
  • Use a serverless queue like Amazon Simple Queue Service (Amazon SQS) to buffer requests during scale-out. When a failure occurs, enqueue the request and dequeue after the model copies have scaled up from zero.
  • Alternatively, have your client catch failed requests and retry after some time, once the model copies have scaled out. You can retrieve the number of copies of an inference component at any time by making the DescribeInferenceComponent API call and checking the CurrentCopyCount. This allows time for the model copies to scale out from zero, transparently handling the transition for end-users. A minimal retry sketch follows this list.
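
The following is a minimal client-side retry sketch under those assumptions (JSON payloads, boto3 clients, and a hypothetical helper named invoke_with_retry); it polls CurrentCopyCount between attempts so you can observe the scale-out progress:

import json
import time
import boto3
from botocore.exceptions import ClientError

sagemaker_client = boto3.client("sagemaker")
smr_client = boto3.client("sagemaker-runtime")

def invoke_with_retry(endpoint_name, inference_component_name, payload,
                      max_wait_seconds=600, poll_seconds=30):
    """Retry invocations while the inference component scales out from zero."""
    deadline = time.time() + max_wait_seconds
    while True:
        try:
            response = smr_client.invoke_endpoint(
                EndpointName=endpoint_name,
                InferenceComponentName=inference_component_name,
                Body=json.dumps(payload),
                ContentType="application/json",
            )
            return response["Body"].read().decode("utf-8")
        except ClientError as error:
            if time.time() > deadline:
                raise
            # Zero copies means the scale-out from zero has not completed yet
            runtime_config = sagemaker_client.describe_inference_component(
                InferenceComponentName=inference_component_name
            )["RuntimeConfig"]
            print(f"Invocation failed ({error.response['Error']['Code']}), "
                  f"current copies: {runtime_config['CurrentCopyCount']}; retrying...")
            time.sleep(poll_seconds)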

By implementing these best practices, you can help make sure your SageMaker inference endpoints can scale out quickly and efficiently to meet changes in traffic, providing a responsive and reliable experience for your end-users.

Solution overview

With these best practices in mind, let’s now walk through the process of enabling your SageMaker inference endpoints to scale down to zero instances. This process involves a few key steps that are crucial for optimizing your endpoint’s performance and cost-efficiency:

  • Configure your endpoint – The first and most critical step is to enable managed instance scaling for your SageMaker endpoint. This is the foundational action that allows you to implement advanced scaling features, including scaling to zero. By enabling managed instance scaling, you’re creating an inference component endpoint, which is essential for the fine-grained control over scaling behaviors we discuss later in this post. After you configure managed instance scaling, you then configure the SageMaker endpoint to set the MinInstanceCount parameter to 0. This parameter allows the endpoint to scale all the way down to zero instances when not in use, maximizing cost-efficiency. Enabling managed instance scaling and setting MinInstanceCount to 0 work together to provide a highly flexible and cost-effective endpoint configuration. However, scaling up from zero will introduce cold starts, potentially impacting response times for initial requests after periods of inactivity. The inference component endpoint created through managed instance scaling serves as the foundation for implementing the sophisticated scaling policies we explore in the next step.
  • Define scaling policies – Next, you need to create two scaling policies that work in tandem to manage the scaling behavior of your endpoint effectively:
    • Scaling policy for inference component copies – This target tracking scaling policy will manage the scaling of your inference component copies. It’s a dynamic policy that adjusts the number of copies based on a specified metric, such as CPU utilization or request count. The policy is designed to scale the copy count to zero when there is no traffic, making sure you’re not paying for unused resources. Conversely, it will scale back up to your desired capacity when needed, allowing your endpoint to handle incoming requests efficiently. When configuring this policy, you need to carefully choose the target metric and threshold that best reflect your workload patterns and performance requirements.
    • Scale out from zero policy – This policy is crucial for enabling your endpoint to scale out from zero model copies when traffic arrives. It’s implemented as a step scaling policy that adds model copies when triggered by incoming requests. This allows SageMaker to provision the necessary instances to support the model copies and handle the incoming traffic. When configuring this policy, you need to consider factors such as the expected traffic patterns, the desired responsiveness of your endpoint, and the potential cold start latency. You may want to set up multiple steps in your policy to handle different levels of incoming traffic more granularly.

By implementing these scaling policies, you create a flexible and cost-effective infrastructure that can automatically adjust to your workload demands and scale to zero when needed.

Now let’s see how to use this feature step by step.

Set up your endpoint

The first crucial step in enabling your SageMaker endpoint to scale to zero is properly configuring the endpoint and its associated components. This process involves three main steps:

  1. Create the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale down all the way to zero instances when not in use.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ExecutionRoleArn=role,
        ProductionVariants=[
            {
                "VariantName": variant_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                "ManagedInstanceScaling": {
                    "Status": "ENABLED",
                    "MinInstanceCount": 0,
                    "MaxInstanceCount": max_instance_count,
                },
                "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            }
        ],
    )

  2. Create the SageMaker endpoint:
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )

  3. Create the inference component for your endpoint:
    sagemaker_client.create_inference_component(
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            "ModelName": model_name,
            "StartupParameters": {
                "ModelDataDownloadTimeoutInSeconds": 3600,
                "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            },
            "ComputeResourceRequirements": {
                "MinMemoryRequiredInMb": 1024,
                "NumberOfAcceleratorDevicesRequired": 1,
            },
        },
        RuntimeConfig={
            "CopyCount": 1,
        },
    )

Add scaling policies

After the endpoint is deployed and InService, you can add the necessary scaling policies:

  • A target tracking policy that can scale the copy count for our inference component down to zero when there is no traffic, and between 1 and n copies when there is
  • A step scaling policy that will allow the endpoint to scale up from zero

Scaling policy for inference component model copies

After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. In the following code block, you set MinCapacity to 0, which is required for your endpoint to scale down to zero:

# Register scalable target
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=0,
    MaxCapacity=max_copy_count,  # Replace with your desired maximum number of model copies
)

After you have registered your new scalable target, the next step is to define your target tracking policy. In the following code example, we set the TargetValue to 5. This setting instructs the auto scaling system to increase capacity when the number of concurrent requests per model reaches or exceeds 5.

# Create Target Tracking Scaling Policy

aas_client.put_scaling_policy(
    PolicyName="inference-component-target-tracking-scaling-policy",
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        # Low TPS + load TPS
        "TargetValue": 5,  # you need to adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 1 minute (using one 1-minute data point), and the second triggers scale-in after 15 minutes (using 90 10-second data points). In practice, the scaling action usually takes 1–2 minutes longer than these intervals to trigger, because it takes time for the endpoint to publish metrics to CloudWatch and for Application Auto Scaling to react.
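
If you want to confirm which alarms were created, a quick way to list them (a sketch reusing the aas_client and identifiers from the preceding code) is the following:

# Inspect the CloudWatch alarms that Application Auto Scaling created for the
# target tracking policy (one scale-out alarm, one scale-in alarm)
policies = aas_client.describe_scaling_policies(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
)["ScalingPolicies"]

for policy in policies:
    print(policy["PolicyName"])
    for alarm in policy.get("Alarms", []):
        print(f"  {alarm['AlarmName']}")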

Scale out from zero model copies policy

To enable your endpoint to scale out from zero instances, complete the following steps:

  1. Create a step scaling policy that defines when and how to scale out from zero. This policy will add one model copy when triggered, enabling SageMaker to provision the instances required to handle incoming requests after being idle. The following code shows you how to define a step scaling policy. Here we have configured the policy to scale from zero to one model copy ("ScalingAdjustment": 1). Depending on your use case, you can adjust ScalingAdjustment as required.
    aas_client.put_scaling_policy(
        PolicyName="inference-component-step-scaling-policy",
        PolicyType="StepScaling",
        ServiceNamespace=service_namespace,
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments":
              [
                 {
                   "MetricIntervalLowerBound": 0,
                   "ScalingAdjustment": 1 # you need to adjust this value based on your use case
                 }
              ]
        },
    )

  2. Create a CloudWatch alarm with the metric NoCapacityInvocationFailures.

When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see the SageMaker documentation.

We have also set the following:

  • EvaluationPeriods to 1
  • DatapointsToAlarm to 1
  • ComparisonOperator to GreaterThanOrEqualToThreshold

This results in waiting approximately 1 minute for the step scaling policy to trigger after our endpoint receives a single request.

cw_client.put_metric_alarm(
    AlarmName='ic-step-scaling-policy-alarm',
    AlarmActions=[step_scaling_policy_arn],  # Replace with your actual scaling policy ARN
    MetricName='NoCapacityInvocationFailures',
    Namespace='AWS/SageMaker',
    Statistic='Maximum',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name  # Replace with actual InferenceComponentName
        }
    ],
    Period=30,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)

Replace step_scaling_policy_arn with the Amazon Resource Name (ARN) of the scaling policy you created in the previous step.

Notice the "MinInstanceCount": 0 setting in the endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker inference endpoint will now be able to automatically scale down to zero instances when not in use.

Test the solution

When our SageMaker endpoint doesn’t receive requests for 15 minutes, it will automatically scale the number of model copies down to zero:

import sys
import time

# Give the scale-in alarm time to fire, then poll the inference component
# until it reaches a terminal status
time.sleep(500)
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)

desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
print(desc)

After 10 additional minutes of inactivity, SageMaker automatically stops all underlying instances of the endpoint, eliminating all associated instance costs.
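
To verify that scale-in completed, one quick check (a sketch reusing the boto3 client and names from the preceding snippets; the field names come from the DescribeInferenceComponent and DescribeEndpoint responses) is the following:

# Confirm that both the model copies and the underlying instances are at zero
ic_desc = sagemaker_client.describe_inference_component(
    InferenceComponentName=inference_component_name
)
print("Current copy count:", ic_desc["RuntimeConfig"]["CurrentCopyCount"])

ep_desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
for variant in ep_desc["ProductionVariants"]:
    print(variant["VariantName"], "instance count:", variant["CurrentInstanceCount"])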

If we try to invoke our endpoint while instances are scaled down to zero, the request fails with a validation error. The invocation uses the SageMaker Runtime client:

import json
import boto3

sagemaker_runtime_client = boto3.client("sagemaker-runtime")

sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to be",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 256,
                "min_new_tokens": 256,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.

However, after 1 minute, our step scaling policy should start. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests.
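
If you prefer to wait programmatically rather than retry by hand, a small polling sketch along these lines (reusing the clients and variable names from the preceding snippets) can block until at least one model copy is back before re-sending the request:

import json
import time

# Poll until the step scaling policy has restored at least one model copy
while True:
    runtime_config = sagemaker_client.describe_inference_component(
        InferenceComponentName=inference_component_name
    )["RuntimeConfig"]
    if runtime_config["CurrentCopyCount"] >= 1:
        break
    print("Waiting for scale out from zero...")
    time.sleep(30)

# Retry the request once capacity is available
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    Body=json.dumps({"inputs": "The diamondback terrapin was the first reptile to be"}),
    ContentType="application/json",
)
print(response["Body"].read().decode("utf8"))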

Schedule scaling down to zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:

  • Weekend scale-in (Friday evening) – Configure a scheduled action to reduce the number of model copies to zero. This instructs SageMaker to scale the number of instances behind the endpoint to zero as well, completely eliminating costs during the weekend period of no usage.
  • Workweek scale-out (Monday morning) – Set up a complementary scheduled action to restore the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the UpdateInferenceComponentRuntimeConfig API. This approach maintains your endpoint configuration while eliminating compute costs during periods of inactivity.

sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        'CopyCount': 0
    }
)

Amazon EventBridge Scheduler can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. For more information about how to create this role, see Set up the execution role. The specific permissions needed depend on the target API being called.
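
As an illustration, a minimal sketch of such an execution role created with boto3 is shown below; the role name and the wildcard Resource are assumptions, and in practice you should scope the policy down to your inference component ARNs:

import json
import boto3

iam = boto3.client("iam")

# Hypothetical role name -- adjust to your naming conventions
role_name = "eventbridge-scheduler-sagemaker-role"

# Allow EventBridge Scheduler to assume the role
assume_role_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "scheduler.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName=role_name,
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)["Role"]

# Grant only the SageMaker actions that the schedules below invoke
iam.put_role_policy(
    RoleName=role_name,
    PolicyName="sagemaker-inference-component-scheduling",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateInferenceComponentRuntimeConfig",
                "sagemaker:CreateInferenceComponent",
                "sagemaker:DeleteInferenceComponent",
            ],
            "Resource": "*",
        }],
    }),
)

role_arn = role["Arn"]  # pass this as RoleArn in the schedule targets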

The following code creates two scheduled actions for the inference component during 2024–2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, and the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedule will start on November 29, 2024, end on December 31, 2025, and be deleted after completion.

import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 0}, "InferenceComponentName": inference_component_name })
}

# Scale in our endpoint to 0 every friday at 18:00 UTC+1, starting on November 29, 2024
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 2}, "InferenceComponentName": inference_component_name })
}

# Scale out our endpoint every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

The second method is to delete the inference components by calling the DeleteInferenceComponent API. This approach achieves the same cost-saving benefit while completely removing the components from your configuration. The following code creates a scheduled action that automatically deletes the inference component every Friday at 18:00 UTC+1 during 2024–2025. It also creates a complementary scheduled action that recreates the inference component every Monday at 07:00 UTC+1.

import json
scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name })
}

# Scale in our endpoint by deleting the IC every friday at 18:00 UTC+1
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale up schedule
input_config = {
  "EndpointName": endpoint_name,
  "InferenceComponentName": inference_component_name,
  "RuntimeConfig": {
    "CopyCount": 2
  },
  "Specification": {
    "ModelName": model_name,
    "StartupParameters": {
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    },
    "ComputeResourceRequirements": {
      "MinMemoryRequiredInMb": 1024,
      "NumberOfAcceleratorDevicesRequired": 1
    }
  },
  "VariantName": variant_name
}

scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config)
}

# Scale out our endpoint by recreating the IC every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

To scale to zero on an endpoint with multiple inference components, all components must be either set to 0 or deleted. You can also automate this process by using EventBridge Scheduler to trigger an AWS Lambda function that handles either deletion or zero-setting of all inference components.
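
As a sketch of that automation, a Lambda handler along the following lines (the event shape and environment variable are assumptions) could list every inference component on the endpoint and set its copy count to zero:

import os
import boto3

sagemaker_client = boto3.client("sagemaker")

def lambda_handler(event, context):
    """Scale every inference component on an endpoint down to zero copies."""
    endpoint_name = event.get("EndpointName", os.environ.get("ENDPOINT_NAME"))

    next_token = None
    while True:
        kwargs = {"EndpointNameEquals": endpoint_name}
        if next_token:
            kwargs["NextToken"] = next_token
        response = sagemaker_client.list_inference_components(**kwargs)
        for component in response["InferenceComponents"]:
            name = component["InferenceComponentName"]
            sagemaker_client.update_inference_component_runtime_config(
                InferenceComponentName=name,
                DesiredRuntimeConfig={"CopyCount": 0},
            )
            print(f"Set CopyCount to 0 for {name}")
        next_token = response.get("NextToken")
        if not next_token:
            break

    return {"status": "scaled-to-zero", "endpoint": endpoint_name}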

Performance evaluation

We evaluated the performance implications of the Scale to Zero feature by conducting tests using a Llama3-8B Instruct model. These tests utilized container caching and optimized model loading techniques, and were performed with both Target Tracking and Step Scaling policies in place. Our findings for Llama3-8B Instruct show that when using the Target Tracking policy, SageMaker scales the endpoint to zero model copies in approximately 15 minutes and then takes an additional 10 minutes to fully scale down the underlying instances, for a total scale-in time of 25 minutes. Conversely, when scaling the endpoint back up from zero, the Step Scaling policy triggers in around 1 minute, provisioning the instance(s) takes approximately 1.748 minutes, and instantiating the model copies takes approximately 2.28 minutes, resulting in a total scale-out time of around 5.028 minutes.

The performance tests on LLaMa3.1 models (8B and 70B variants) demonstrate the effectiveness of SageMaker’s Scale to Zero feature, with intentionally conservative scaling times to prevent endpoint thrashing and accommodate spiky traffic patterns. For both model sizes, scaling in takes a total of 25 minutes, allowing a 15-minute buffer before initiating scale-down and an additional 10 minutes to fully decommission instances. This cautious approach helps avoid premature scaling during temporary lulls in traffic. When scaling out, the 8B model takes about 5 minutes, while the 70B model needs approximately 6 minutes. These times include a 1-minute trigger delay, followed by instance provisioning and model copy instantiation. The slightly longer scale-out times, especially for larger models, provide a balance between responsiveness and stability, ensuring the system can handle sudden traffic increases without constantly scaling up and down. This measured approach to scaling helps maintain consistent performance and cost-efficiency in environments with variable workloads.

LLaMa3.1 8B Instruct

Scale in
| Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min) |
| 15 | 10 | 25 |

Scale out
| Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
| 1 | 1.748 | 2.28 | 5.028 |

LLaMa3.1 70B

Scale in
| Time to trigger target tracking (min) | Time to scale in instance count to zero (min) | Total time (min) |
| 15 | 10 | 25 |

Scale out
| Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
| 1 | 3.018 | 1.984 | 6.002 |

Scale up Trials

LLaMa3.1 8B Instruct

| Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
| 1 | 1 | 1.96 | 3.1 | 6.06 |
| 2 | 1 | 1.75 | 2.6 | 5.35 |
| 3 | 1 | 1.4 | 2.1 | 4.5 |
| 4 | 1 | 1.96 | 1.9 | 4.86 |
| 5 | 1 | 1.67 | 1.7 | 4.37 |
| Average | 1 | 1.748 | 2.28 | 5.028 |

LLaMa3.1 70B

| Trial | Time to trigger step scaling policy (min) | Time to provision instance(s) (min) | Time to instantiate a new model copy (min) | Total time (min) |
| 1 | 1 | 3.1 | 1.98 | 6.08 |
| 2 | 1 | 2.92 | 1.98 | 5.9 |
| 3 | 1 | 2.82 | 1.98 | 5.8 |
| 4 | 1 | 3.27 | 2 | 6.27 |
| 5 | 1 | 2.98 | 1.98 | 5.96 |
| Average | 1 | 3.018 | 1.984 | 6.002 |

  • Target Tracking: Scale Model Copies to Zero (min) – This refers to the time it took target tracking to trigger the alarm and SageMaker to decrease model copies to zero on the instance
  • Scale in Instance Count to Zero (min) – This refers to the time it takes SageMaker to scale the instances down to zero after all inference component model copies are zero
  • Step Scaling: Scale up Model Copies from Zero (min) – This refers to the time it took step scaling to trigger the alarm and for SageMaker to provision the instances
  • Scale out Instance Count from Zero (min) – This refers to the time it takes for SageMaker to scale out and add inference component model copies

If you want more customization and faster scaling, consider using step scaling to scale model copies instead of target tracking.
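
For example, a step scaling policy with multiple steps (the thresholds here are illustrative, and the policy still needs to be attached to a CloudWatch alarm on a concurrency or invocation metric) could add copies more aggressively as load grows, reusing the identifiers registered earlier:

# A step scaling policy that adds copies more aggressively as the backlog of
# requests grows; thresholds are relative to the associated alarm's threshold
aas_client.put_scaling_policy(
    PolicyName="inference-component-aggressive-step-scaling",
    PolicyType="StepScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            # threshold <= metric < threshold + 20: add 1 copy
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1},
            # metric >= threshold + 20: add 3 copies at once
            {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 3},
        ],
    },
)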

Customer testimonials

The new Scale to Zero feature for SageMaker inference endpoints has sparked considerable interest among customers. We gathered initial reactions from companies who have previewed and evaluated this capability, highlighting its potential impact on AI and machine learning operations.

Atlassian, headquartered in Sydney, Australia, is a software company specializing in collaboration tools for software development and project management:

“The new Scale to Zero feature for SageMaker inference strongly aligns with our commitment to efficiency and innovation. We’re enthusiastic about its potential to revolutionize how we manage our machine learning inference resources, and we look forward to integrating it into our operations”

– Guarav Awadhwal – Senior Engineering Manager at Atlassian

iFood is a Latin American online food delivery firm based in Brazil. It works with over 300,000 restaurants, connecting them with millions of customers every month.

“The Scale to Zero feature for SageMaker Endpoints will be fundamental for iFood’s Machine Learning Operations. Over the years, we’ve collaborated closely with the SageMaker team to enhance our inference capabilities. This feature represents a significant advancement, as it allows us to improve cost efficiency without compromising the performance and quality of our ML services, given that inference constitutes a substantial part of our infrastructure expenses.”

– Daniel Vieira, MLOps Engineer Manager at iFood

VIDA, headquartered in Jakarta, Indonesia, is a leading digital identity provider that enables individuals and businesses to conduct business in a safe and secure digital environment.

“SageMaker’s new Scale to Zero feature for GPU inference endpoints shows immense promise for deep fake detection operations. The potential to efficiently manage our face liveness and document verification inference models while optimizing infrastructure costs aligns perfectly with our goals. We’re excited to leverage this capability to enhance our identity verification solutions.”

– Keshav Sharma, ML Platform Architect at VIDA

APOIDEA Group is a leading AI-focused FinTech ISV company headquartered in Hong Kong. Leveraging cutting-edge generative AI and deep learning technologies, the company develops innovative AI FinTech solutions for multinational banks. APOIDEA’s products automate repetitive human analysis tasks, extracting valuable financial insights from extensive financial documents to accelerate AI-driven transformation across the industry.

“SageMaker’s Scale to Zero feature is a game changer for our AI financial analysis solution in operations. It delivers significant cost savings by scaling down endpoints during quiet periods, while maintaining the flexibility we need for batch inference and model testing. This capability is transforming how we manage our GenAI workloads and evaluate new models. We’re eager to harness its power to further optimize our deep learning and NLP model deployments.”

– Mickey Yip, VP of Product at APOIDEA Group

Fortiro, based in Melbourne, Australia, is a FinTech company specializing in automated document fraud detection and financial verification for trusted financial institutions.

“The new Scale-to-Zero capability in SageMaker is a game-changer for our MLOps and delivers great cost savings. Being able to easily scale inference endpoints and GPUs means we can take advantage of a fast, highly responsive environment, without incurring unnecessary costs. Our R&D teams constantly experiment with new AI-based document fraud detection methods, which involves a lot of testing and repeating. This capability empowers us to do this both faster and more efficiently.”

– Amir Vahid, Chief Technology Officer at Fortiro

These testimonials underscore the anticipation for SageMaker’s Scale to Zero feature. As organizations begin to implement this capability, we expect to see innovative applications that balance cost efficiency with performance in machine learning deployments.

Conclusion

In this post, we introduced the new scale to zero feature in SageMaker, an innovative capability that enables you to optimize costs by automatically scaling in your inference endpoints when they’re not in use. We guided you through the detailed process of implementing this feature, including configuring endpoints, setting up auto scaling policies, and managing inference components for both automatic and scheduled scaling scenarios.

This cost-saving functionality presents new possibilities for how you can approach your ML operations. With this feature, you can closely align your compute resource usage with actual needs, potentially reducing costs during periods of low demand. We encourage you to try this capability and start optimizing your SageMaker inference costs today.

To help you get started quickly, we’ve prepared a comprehensive notebook containing an end-to-end example of how to configure an endpoint to scale to zero.


About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using AWS’s comprehensive suite of tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Raj Vippagunta is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team in AWS. He uses his vast experience of 18+ years in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI and ML space. He has helped build various at-scale solutions for AWS and Amazon. In his spare time, he likes reading books, pursuing long-distance running, and exploring new places with his family.

Read More

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce the new Container Caching capability in Amazon SageMaker, which significantly reduces the time required to scale generative AI models for inference. This innovation allows you to scale your models faster, with up to a 56% reduction in latency when scaling a new model copy and up to 30% when adding a model copy on a new instance. These improvements are available across a wide range of SageMaker’s Deep Learning Containers (DLCs), including Large Model Inference (LMI, powered by vLLM and multiple other frameworks), Hugging Face Text Generation Inference (TGI), PyTorch (powered by TorchServe), and NVIDIA Triton. Fast container startup times are critical to scale generative AI models effectively, making sure end-users aren’t negatively impacted as inference demand increases.

As generative AI models and their hosting containers grow in size and complexity, scaling these models efficiently for inference becomes increasingly challenging. Until now, each time SageMaker scaled up an inference endpoint by adding new instances, it needed to pull the container image (often several tens of gigabytes in size) from Amazon Elastic Container Registry (Amazon ECR), a process that could take minutes. For generative AI models requiring multiple instances to handle high-throughput inference requests, this added significant overhead to the total scaling time, potentially impacting application performance during traffic spikes.

Container Caching addresses this scaling challenge by pre-caching the container image, eliminating the need to download it when scaling up. This new feature brings several key benefits for generative AI inference workloads: dramatically faster scaling to handle traffic spikes, improved resource utilization on GPU instances, and potential cost savings through more efficient scaling and reduced idle time during scale-up events. These benefits are particularly impactful for popular frameworks and tools like vLLM-powered LMI, Hugging Face TGI, PyTorch with TorchServe, and NVIDIA Triton, which are widely used in deploying and serving generative AI models on SageMaker inference.

In our tests, we’ve seen substantial improvements in scaling times for generative AI model endpoints across various frameworks. The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. This enhancement allows customers running high-throughput production workloads to handle sudden traffic spikes more efficiently, providing more predictable scaling behavior and minimal impact on end-user latency across their ML infrastructure, regardless of the chosen inference framework.

In this post, we explore the new Container Caching feature for SageMaker inference, addressing the challenges of deploying and scaling large language models (LLMs). We discuss how this innovation significantly reduces container download and load times during scaling events, a major bottleneck in LLM and generative AI inference. You’ll learn about the key benefits of Container Caching, including faster scaling, improved resource utilization, and potential cost savings. We showcase its real-world impact on various applications, from chatbots to content moderation systems. We then guide you through getting started with Container Caching, explaining its automatic enablement for SageMaker provided DLCs and how to reference cached versions. Finally, we delve into the supported frameworks, with a focus on LMI, PyTorch, Hugging Face TGI, and NVIDIA Triton, and conclude by discussing how this feature fits into our broader efforts to enhance machine learning (ML) workloads on AWS.

This feature is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

The challenge of deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute resources
  • Downloading container image
  • Loading container image
  • Loading the model weights into memory
  • Initializing the inference runtime
  • Shifting traffic to serve new requests

The cumulative time for these steps can range from several minutes to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during traffic spikes, making it a critical area for optimization in the field of AI inference infrastructure.

The introduction of Container Caching for SageMaker DLCs brings several key benefits for inference workloads:

  • Faster scaling – By having the latest DLCs pre-cached, the time required to scale inference endpoints in response to traffic spikes is substantially reduced. This provides a more consistent and responsive experience for inference hosting, allowing systems to adapt quickly to changing demand patterns. ML engineers can now design more aggressive auto scaling policies, knowing that new instances can be brought online in a fraction of the time previously required.
  • Quick endpoint startup – Using pre-cached containers significantly decreases the startup time for new model deployments. This acceleration in the deployment pipeline enables more frequent model updates and iterations, fostering a more agile development cycle. AI and ML engineers can now move from model training to production deployment with unprecedented speed, reducing time-to-market for new AI features and improvements.
  • Improved resource utilization – Container Caching minimizes idle time on expensive GPU instances during the initialization phase. Instead of waiting for container downloads, these high-performance resources can immediately focus on inference tasks. This optimization provides more efficient use of computational resources, potentially allowing for higher throughput and better cost-effectiveness.
  • Cost savings – The cumulative effect of faster deployments and more efficient scaling can lead to significant reductions in overall inference costs. By minimizing idle time and improving resource utilization, organizations can potentially serve the same workload with fewer instances or handle increased demand without proportional increases in infrastructure costs. Additionally, the improved responsiveness can lead to better user experiences, potentially driving higher engagement and revenue in customer-facing applications.
  • Enhanced compatibility – By focusing on the latest SageMaker DLCs, this caching mechanism makes sure users always have quick access to the most recent and optimized environments for their models. This can be particularly beneficial for teams working with cutting-edge AI technologies that require frequent updates to the underlying frameworks and libraries.

Container Caching represents a significant advancement in AI inference infrastructure. It addresses a critical bottleneck in the deployment process, empowering organizations to build more responsive, cost-effective, and scalable AI systems.

Getting started with Container Caching for inference

Container Caching is automatically enabled for popular SageMaker DLCs like LMI, Hugging Face TGI, NVIDIA Triton, and PyTorch used for inference. To use cached containers, you only need to make sure you’re using a supported SageMaker container. No additional configuration or steps are required.

The following table lists the supported DLCs.

SageMaker DLC Starting Version Starting Container
LMI 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
LMI-TRT 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124
LMI-Neuron 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1
TGI-GPU 2.4.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0
TGI-Neuron 2.1.2 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0
Pytorch-GPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
Pytorch-CPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-cpu-py311-ubuntu22.04-sagemaker
Triton 24.09 763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.09-py3

In the following sections, we discuss how to get started with several popular SageMaker DLCs.
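The per-framework snippets that follow use a simplified create_inference_component(...) shorthand for readability. As a rough, hedged sketch, an equivalent request with the boto3 SageMaker client could look like the following; the endpoint, variant, and component names, along with the compute requirements, are placeholders, and the image is one of the cached DLCs from the preceding table.

import boto3

sagemaker_client = boto3.client("sagemaker")

# Create an inference component that references a supported (cached) DLC image
sagemaker_client.create_inference_component(
    InferenceComponentName="my-tgi-ic",          # placeholder component name
    EndpointName="my-endpoint",                  # existing SageMaker endpoint
    VariantName="AllTraffic",
    Specification={
        "Container": {
            "Image": "763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
            "ArtifactUrl": "s3://path/to/your/model/artifacts",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,      # placeholder memory requirement
        },
    },
    RuntimeConfig={"CopyCount": 1},
)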

Hugging Face TGI

Developed by Hugging Face, TGI is an inference framework for deploying and serving LLMs, offering a purpose-built solution that combines security, performance, and ease of management. TGI is specifically designed to deliver high-performance text generation through advanced features like tensor parallelism and continuous batching. It supports a wide range of popular open source LLMs, making it a popular choice for diverse AI applications. What sets TGI apart is its optimization for both NVIDIA GPUs and AWS accelerators with AWS Inferentia and AWS Trainium, providing optimal performance across different hardware configurations.

With the introduction of Container Caching, customers using the latest release of TGI containers on SageMaker will experience improved scaling performance. The caching mechanism works automatically, requiring no additional configuration or code changes. This seamless integration means that organizations can immediately benefit from faster scaling without any operational overhead.

Philipp Schmid, Technical Lead at Hugging Face, shares his perspective on this enhancement: “Hugging Face TGI containers are widely used by SageMaker inference customers, offering a powerful solution optimized for running popular models from the Hugging Face Hub. We are excited to see Container Caching speed up auto scaling for users, expanding the reach and adoption of open models from Hugging Face.”

You can use Container Caching with Hugging Face TGI using the following code:

# Using Container Caching for Hugging Face TGI
# Create an inference component (IC) with the Hugging Face TGI image

create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
    model_url="s3://path/to/your/model/artifacts"
)

** The latest versions of the currently maintained images are cached. For the full list of SageMaker framework containers, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only

NVIDIA Triton

NVIDIA Triton Inference Server is a model server from NVIDIA that supports multiple deep learning frameworks and model formats. On SageMaker, Triton offers a comprehensive serving stack with support for various backends, including TensorRT, PyTorch, Python, and more. Triton is particularly powerful because of its ability to optimize inference across different hardware configurations while providing features like dynamic batching, concurrent model execution, and ensemble models. The Triton architecture enables efficient model serving through features like multi-framework support, optimized GPU utilization, and flexible model management.

With Container Caching, Triton deployments on SageMaker become even more efficient, especially when scaling large-scale inference workloads. This is particularly beneficial when deploying multiple models using Triton’s Python backend or when running model ensembles that require complex preprocessing and postprocessing pipelines. This improves the deployment and scaling experience for Triton workloads by eliminating the need to repeatedly download container images during scaling events.

Eliuth Triana, Global Lead Amazon Developer Relations at NVIDIA, comments on this enhancement:

“The integration of Container Caching with NVIDIA Triton Inference Server on SageMaker represents a significant advancement in serving machine learning models at scale. This feature perfectly complements Triton’s advanced serving capabilities by reducing deployment latency and optimizing resource utilization during scaling events. For customers running production workloads with Triton’s multi-framework support and dynamic batching, Container Caching provides faster response to demand spikes while maintaining Triton’s performance optimizations.”

To use Container Caching with NVIDIA Triton, use the following code:

# Using Container Caching for Triton
create_inference_component( 
    image="763104351884.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:24.09-py3", 
    model_url="s3://path/to/your/model/artifacts" 
)

PyTorch and TorchServe (now with vLLM engine integration)

The SageMaker Deep Learning Container for PyTorch is powered by TorchServe. It offers a comprehensive solution for deploying and serving PyTorch models, including large language models (LLMs), in production environments. TorchServe provides robust model serving capabilities through HTTP REST APIs, flexible configuration options, and performance optimization features such as server-side batching, multi-model serving, and dynamic model loading. The container supports a wide range of models and advanced features, including quantization and parameter-efficient methods like LoRA.

The latest version of the PyTorch DLC uses TorchServe integrated with the vLLM engine, which provides advanced capabilities such as vLLM’s state-of-the-art inference engine with PagedAttention and continuous batching. It supports single-node, multi-GPU distributed inference, enabling tensor parallel sharding for larger models. The integration of Container Caching significantly reduces scaling times, which is particularly beneficial for large models during auto scaling events. TorchServe’s handler system allows for easy customization of pre- and post-processing logic, making it adaptable to various use cases. With its growing feature set, TorchServe is a popular choice among inference customers for deploying and scaling machine learning models.

You can use Container Caching with PyTorch using the following code:

# Using Container Caching for PyTorch
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker",
    model_url="s3://path/to/your/model/artifacts"
)

LMI container

The Large Model Inference (LMI) container is a high-performance serving solution that can be used through a no-code interface with smart defaults that can be extended to fit your unique needs. LMI delivers performance differentiation through advanced optimizations, outpacing open source backends like vLLM, TensorRT-LLM, and Transformers NeuronX while offering a unified UI.

Popular features such as continuous batching, token streaming, and speculative decoding are available out of the box to provide superior throughput, latency, and scalability. LMI supports a wide array of use cases like multi-node inference and model personalization through LoRA adapters, and performance optimizations like quantization and compilation.

With Container Caching, LMI containers deliver even faster scaling capabilities, particularly beneficial for large-scale LLM deployments where container startup times can significantly impact auto scaling responsiveness. This enhancement works seamlessly across all supported backends while maintaining the container’s advanced features and optimization capabilities.

Contributors of LMI containers comment on this enhancement:

“The addition of Container Caching to LMI containers represents a significant step forward in making LLM deployment more efficient and responsive. This feature complements our efforts to speed up model loading through pre-sharding, weight streaming, and compiler caching, enabling customers to achieve both high-performance inference and rapid scaling capabilities, which is crucial for production LLM workloads.”

To use Container Caching with LMI, use the following code:

# Using Container Caching for LMI
create_inference_component(
    image= "763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124",
    model_url="s3://path/to/your/model/artifacts"
)

Performance evaluation

The implementation of Container Caching for running Llama3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. For the scenario of scaling the model by adding a new instance, the E2E scaling time decreased from 580 seconds (9.67 minutes) to 407 seconds (6.78 minutes), yielding an improvement of 172 seconds (2.87 minutes), which translates to a 30% reduction in scaling time. These results demonstrate that Container Caching substantially and reliably enhances the efficiency of model scaling operations, particularly for large language models like Llama3.1 70B, with more pronounced benefits observed when scaling on existing instances.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
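As a hedged illustration of that setup, a target tracking policy driven by a sub-minute, high-resolution concurrency metric could be attached to the inference component roughly as follows; the resource ID and the predefined metric type name are assumptions based on the launch referenced above, so confirm the exact values in that post.

import boto3

aas_client = boto3.client("application-autoscaling")

# Target tracking on a high-resolution concurrency metric
# (the predefined metric type below is an assumption, not a confirmed name)
aas_client.put_scaling_policy(
    PolicyName="concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/my-inference-component",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target concurrent requests per model copy
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)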

The following table summarizes our setup.

Region: CMH
Instance Type: p4d.24xlarge
Container: LMI V13.31
Container Image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model: Llama 3.1 70B

Scaling the model by adding a new instance

For this scenario, we explore scaling the model by adding a new instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 223 339 602
2 40 203 339 582
3 40 175 339 554
4 40 210 339 589
5 40 191 339 570
Average . 200 339 580

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 185 173 398
2 40 175 188 403
3 40 164 208 412
4 40 185 187 412
5 40 185 187 412
Average . 178.8 188.6 407.4

Scaling the model on an available instance

In this scenario, we explore scaling the model on an available instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 339 379
2 40 339 379
3 40 339 379
4 40 339 379
5 40 339 379
Average . 339 379

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 150 190
2 40 122 162
3 40 121 161
4 40 119 159
5 40 119 159
Average . 126.2 166.2

Summary of findings

The following table summarizes our results in both scenarios.

. End-to-End Scaling Time Before (seconds) End-to-End Scaling Time After (seconds) Absolute Improvement (seconds) % Improvement
Scaling the model on an available instance 379 166 213 56
Scaling the model by adding a new instance 580 407 172 30

Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may see lower times to spin up new instances compared to On-Demand capacity, depending on the instance type.

Conclusion

Container Caching for inference is just one of the many ways SageMaker can improve the efficiency and performance of ML workloads on AWS. We encourage you to try out this new feature for your inference workloads and share your experiences with us. Your feedback is invaluable as we continue to innovate and improve our ML platform.

To learn more about Container Caching and other SageMaker features for inference, refer to Amazon SageMaker Documentation or check out our GitHub repositories for examples and tutorials on deploying models for inference.


About the Authors

Wenzhao Sun, PhD, is a Sr. Software Dev Engineer with the SageMaker Inference team. He possesses a strong passion for pushing the boundaries of technical solutions, striving to maximize their theoretical potential. His primary focus is on delivering secure, high-performance, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys traveling and video games.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Aakash Deep is a Software Development Engineering Manager with the Amazon SageMaker Inference team. He enjoys working on machine learning and distributed systems. His mission is to deliver secure, highly performant, highly scalable, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys hiking and traveling.

Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.

Read More

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1

The generative AI landscape has been rapidly evolving, with large language models (LLMs) at the forefront of this transformation. These models have grown exponentially in size and complexity, with some now containing hundreds of billions of parameters and requiring hundreds of gigabytes of memory. As LLMs continue to expand, AI engineers face increasing challenges in deploying and scaling these models efficiently for inference. One of the primary bottlenecks in the inference deployment process has been the time required to load these massive models onto accelerators. With LLMs now reaching hundreds of gigabytes in size, it has become increasingly difficult for many users to address bursty traffic patterns and scale quickly. For LLMs that often require high throughput and low-latency inference requests, this loading process can add significant overhead to the total deployment and scaling time, potentially impacting application performance during traffic spikes. SageMaker Large Model Inference (LMI) is a deep learning container that helps customers quickly get started with LLM deployments on SageMaker Inference.

Today at AWS re:Invent 2024, we are excited to announce a new capability in Amazon SageMaker Inference that significantly reduces the time required to deploy and scale LLMs for inference using LMI: Fast Model Loader. This innovation allows you to scale your models faster, observing up to 19% reduction in latency when scaling a new model copy on a new instance for inference. It represents a substantial leap forward in loading large models efficiently. Fast Model Loader introduces a novel approach by streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster model loading.

In our internal testing, we observed that Fast Model Loader can load large models up to 15 times faster compared to the traditional loading methods. This dramatic improvement in loading speed opens up new possibilities for responsive AI systems, potentially enabling faster scaling and more dynamic applications that can adapt quickly to changing demands. During our performance testing we were able to load the llama-3.1-70B model on an ml.p4d.24xlarge instance in just 1 minute. This model, with its 70 billion parameters, typically requires over 140 GB of memory in full precision, underscoring the magnitude of the loading challenge that Fast Model Loader addresses.

Fast Model Loader is designed to tackle scaling challenges, potentially leading to improved resource utilization on GPU instances and more efficient scaling during autoscaling events. This feature aims to provide you with a powerful new option for managing the deployment and scaling of your LLMs on SageMaker inference, whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM-based services.

This post is Part 1 of a series exploring Fast Model Loader. In this post, we delve into the technical details of Fast Model Loader, explore its integration with existing SageMaker workflows, discuss how you can get started with this powerful new feature, and share customer success stories. In Part 2, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments.

Challenges in deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute instances
  • Downloading the container image
  • Loading the container image
  • Downloading the model artifacts from Amazon S3 to disk
  • Loading the model artifacts on the host (using CPU and memory)
  • Preparing the model to be loaded on GPU (quantization, model sharding, and so on)
  • Loading the final model artifacts on the GPU

The cumulative time for these steps can take up to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during scaling activities, making it a critical area for optimization in the field of AI inference infrastructure.

To reduce the time it takes to download and load the container image, SageMaker now supports container caching. To learn more about this new feature, refer to Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.

For model loading, a typical deployment follows the steps described in this section, which can lead to non-ideal deployment latency. Requests can sit in the queue waiting to be processed while the deployment concludes, or can be dropped when timeouts are exceeded, as shown in the following diagrams.

One way to optimize deployment is ahead-of-time (AOT) compilation of your model. This requires you to create or use existing pre-sharded models so that the model doesn’t need to be processed during runtime deployment. By taking on the cost of pre-creating these artifacts and referencing them as persisted objects, you pay that latency cost ahead of time. This can significantly reduce the time it takes to scale up a model, especially for larger models.

The benefits of this approach are particularly noticeable for larger models:

  • Reduced scaling time – Pre-sharded models can be loaded more quickly, decreasing the time required to bring new instances online during scaling events
  • Improved resource utilization – By offloading the compilation and sharding process, more computational resources are available for inference tasks during runtime
  • Consistency – Pre-compiled artifacts provide consistent performance across deployments

Although there is an upfront cost in creating these artifacts, the long-term savings in reduced scaling times and improved resource utilization can be substantial, especially for models that are frequently deployed or require rapid scaling. This approach can significantly reduce the time it takes to scale up a model, particularly for larger models, leading to more responsive and efficient AI systems. The following figures illustrate the proposed way to load models.

Additionally, disk becomes a bottleneck during model loading due to its limited I/O bandwidth. Traditional storage systems struggle with the high throughput required for large-scale model loading, like Meta Llama 3.1 70B. Disk read/write speeds are often much slower than network or GPU memory bandwidths, creating delays in transferring model weights. This issue can be alleviated by streaming data directly from Amazon S3 to GPU memory, bypassing disk entirely.

With Fast Model Loader, we can now take a significant step further by also removing the work performed on host resources and the sequential steps between downloading the model artifacts and loading them onto the GPU.

Weight streaming

Fast Model Loader streams weights directly from Amazon S3 to GPUs. This is accomplished by cutting out the intermediary steps—the bytes representing model weights are downloaded to the CPU memory and immediately copied over to the GPU using Direct Memory Access (DMA). This simplifies the model loading workflow and makes it straightforward to maximize the model loading throughput. It presents the following key advantages:

  • No waiting – In the traditional approach, each step in the loading process (download, load to host’s CPU, GPU copy) needs to complete for a tensor or a layer before the next step could begin. This creates synchronous bottlenecks, where components are idle while waiting for the previous step to finish. Fast Model Loader’s direct streaming approach eliminates these synchronous blocking operations, allowing all components to operate at their maximum potential concurrently.
  • No accumulation – Instead of downloading the entire model to disk or CPU memory before processing, Fast Model Loader streams the model weights in small chunks directly to the GPU. This avoids the need to accumulate the full model in system storage or memory, reducing the overall resource requirements and footprint.
  • Maximum throughput – By simplifying the model loading workflow and eliminating intermediate steps, Fast Model Loader can more effectively take advantage of the high-throughput capabilities of Amazon S3 and the generous network bandwidth available on the large instances typically used for hosting LLMs. This allows the model loading process to achieve maximum throughput and minimize latency.

The following figure compares model load times for sequential vs. parallel processes.

Model sharding for streaming

The weight streaming paradigm described in the previous section requires that the model weights be prepared appropriately prior to streaming. To stream the model weights a few bytes at a time, we need to store the model in a format consistent with this access pattern.

The traditional approach to storing and distributing LLM weights often relies on the SafeTensors format. Although SafeTensors provides a standardized way to package and distribute model weights, it presents some challenges when it comes to the weight streaming paradigm used by Fast Model Loader. In the SafeTensors format, the fundamental unit of storage is the tensor. Tensors are multi-dimensional arrays that represent the various weights and parameters of a machine learning model. However, the size of these tensors can vary significantly, ranging from a few megabytes to several gigabytes, depending on the complexity and scale of the model. This non-uniform distribution of tensor sizes poses a problem for Fast Model Loader’s weight streaming approach. The variable tensor sizes in the SafeTensors format make it difficult to achieve consistent throughput. Larger tensors require more time and resources to load, whereas smaller tensors are underutilized, leading to inefficiencies in the overall loading process.
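If you want to see this non-uniform size distribution for yourself, a small sketch using the safetensors library on a locally downloaded weight file could look like the following; the file name is illustrative, and the framework="pt" option assumes PyTorch is installed.

from safetensors import safe_open

# Print the size of each tensor in a SafeTensors shard (file name is illustrative)
with safe_open("model-00001-of-00030.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        size_mb = tensor.numel() * tensor.element_size() / 1e6
        print(f"{name}: {size_mb:.1f} MB")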

The following figure illustrates loading SafeTensors weights of various sizes.

Fast Model Loader introduces a new model format with the following key advantages:

  • Pre-sharding – The explosion in model sizes has seen them outgrow GPUs. The largest models available today are over 1 TB in size, whereas the largest GPUs offer less than 200 GB of memory. This has led us to embrace distributed inference strategies like tensor parallelism, which involves splitting a model into portions (shards) and distributing them across multiple GPUs. However, this involves quite a few computations in deciding how to split the model at every layer and calculating offsets based on tensor size and available GPU memory. Fast Model Loader performs this optimization pre-deployment, which avoids the overhead during scaling activities. The preparation only happens one time, and the model can be deployed to any number of instances with the same distributed inference strategy. The following figure provides an overview of pre-sharding.

  • Uniform size distribution – The model weights are stored in uniform 8 MB chunks, which are less complicated to parallelize for concurrent processing. The following figure illustrates uniform chunks being parallelized across cores.

  • Out of order processing – Objects in Amazon S3 typically have to be downloaded in-order. To read the middle of the object, Amazon S3 starts by reading objects from the start, until it gets to the middle. This requires model weights to be downloaded synchronously, which runs contrary to our fast model loading paradigm. Storing model weights in uniform chunks of 8 MB each allows you to access any piece of the model at any time without synchronization. The following figures illustrate how breaking tensors into chunks allows for asynchronous, out of order retrieval.
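To make the out-of-order access pattern concrete, the following is a minimal sketch (not the Fast Model Loader implementation) of fetching fixed-size 8 MB chunks from Amazon S3 concurrently and in arbitrary order using ranged GET requests; the bucket and key are placeholders.

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client("s3")
CHUNK_BYTES = 8 * 1024 * 1024  # uniform 8 MB chunks, as described above

def fetch_chunk(bucket, key, index):
    # A ranged GET lets us read any chunk independently, in any order
    start = index * CHUNK_BYTES
    byte_range = f"bytes={start}-{start + CHUNK_BYTES - 1}"
    response = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)
    return index, response["Body"].read()

# Fetch the first 16 chunks concurrently; completion order does not matter
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(lambda i: fetch_chunk("my-bucket", "sharded-model/chunks.bin", i), range(16)))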

Performance testing

The implementation of Fast Model Loader demonstrates significant improvements in end-to-end (E2E) scaling time for large language models. Across five simulations performed with Llama 3.1 70B, we observed remarkably consistent results, reinforcing the reliability of our findings. For these tests, we used a container with Container Caching enabled. When using CUDA Graphs, the average scaling time was reduced from 407 seconds (6.78 minutes) to 334.6 seconds (5.58 minutes), marking a substantial 21.64% improvement. Similarly, without CUDA Graphs, the average scaling time decreased from 379 seconds (6.32 minutes) to 306 seconds (5.10 minutes), resulting in a 19.26% reduction. By cutting scaling times by approximately one-fifth in all observed cases, this feature enables more responsive scaling and better handling of dynamic workloads, ultimately leading to improved performance and resource utilization in AI inference.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models and Container Caching.

The following table summarizes our setup.

Region: CMH
Instance Type: p4d.24xlarge
Container: LMI
Container Image: 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model: Meta Llama 3.1 70B

For this scenario, we illustrate scaling the model by adding a new instance.
The following table summarizes the results when the models are not sharded.
** All numbers presented here are in seconds.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy CUDA Graphs Capture Overhead E2E Scaling Latency
. . . . . With CUDA Graphs Without CUDA Graphs
1 40 185 173 28 398 370
2 40 175 188 29 403 374
3 40 164 208 29 412 383
4 40 185 187 30 412 382
5 40 185 187 28 412 384
Average . 179 189 29 407 379

The following table summarizes the results after the models are sharded.
** All numbers presented here are in seconds.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy CUDA Graphs Capture Overhead E2E Scaling Latency
. . . . . With CUDA Graphs Without CUDA Graphs
1 40 185 119 28 344 316
2 40 175 119 30 334 304
3 40 169 119 28 328 300
4 40 169 120 28 329 301
5 40 179 119 29 338 309
Average . 175.4 119.2 28.6 334.6 306

The following diagram summarizes the impact on E2E scaling time.
** All numbers presented here are in seconds.

. Before After % Improvement
Scaling with CUDA Graphs 407 334.6 21.64%
Scaling without CUDA Graphs 379 306 19.26%

Note: Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may see lower times to spin up new instances compared to On-Demand capacity, depending on the instance type.

Impact on ready-to-serve time

The benchmarks below show that SageMaker Fast Model Loader can load large models significantly faster than traditional approaches. For the Llama 3.1 70B model on the ml.p4d.24xlarge instance, we compared the download and load times against two traditional methods: downloading the model from the Hugging Face Hub using transformers, and downloading the model from Amazon S3 using vLLM’s default downloader. In both cases, we used vLLM’s default loader to load the model after the download.

** All numbers presented here are in seconds.

. Download Load % Improvement with Fast Model Loader Speedup with Fast Model Loader
Transformers Downloader + vLLM Model Loader 602 138 93.24% 15x
vLLM Downloader + vLLM Model Loader 127 138 81.13% 5x
Fast Model Loader 50 . .

The load time here indicates the time taken to get the model fully ready to serve, including time taken to initialize the KV cache.

How to get started

You can start using Fast Model Loader now through the Amazon SageMaker Studio console using Amazon SageMaker JumpStart or programmatically using the SageMaker Python SDK.

From the SageMaker Studio JumpStart hub, you can pick a model and choose Optimize to run the inference optimization job and then deploy the optimized model to a SageMaker endpoint. For more detailed instructions, refer to Part 2 of this post.

Though SageMaker Studio provides a user-friendly interface for model optimization through SageMaker JumpStart, you can also achieve the same functionality programmatically using the SageMaker Python SDK. The ModelBuilder class offers a streamlined way to optimize and deploy large models, requiring just a few lines of code to prepare your model for fast loading and inference. The following code snippet shows the core implementation to use ModelBuilder to prepare and optimize the model for Fast Model Loader. You can find an end-to-end example notebook in the following GitHub repo.

# Create a model builder object
# (assumes role, sess, and output_path are already defined, as in the example notebook)
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test")
)

# Run the model optimization job to prepare the sharded model artifacts
model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)

Customer testimonials

The introduction of Fast Model Loader in SageMaker has generated significant excitement among our customers, particularly those working with LLMs. We’ve collected early feedback from customers that have had the opportunity to preview this new capability. Their responses underscore the potential of Fast Model Loader to transform the deployment and scaling of AI models, especially in scenarios requiring rapid response to changing demands.

Atomicwork is a modern ITSM and ESM solution that revolutionizes internal support for organizations through AI-powered chat interfaces, replacing traditional ticketing portals.

“Amazon SageMaker Fast Model Loader is a game changer for our AI-driven enterprise workflows. It significantly accelerates the deployment and scaling of the large language models, which are critical for providing responsive, chat-based IT support, HR processes, and customer service operations. We look forward to adopting this feature that allows us to optimize our computational resources while maintaining the agility our enterprise customers expect, helping us deliver a truly intelligent service management platform.”

– Kiran Darisi, Co-founder and CTO of Atomicwork.

Conclusion

In this post, we discussed how loading large model artifacts can be the bottleneck in loading and scaling FMs. SageMaker has launched a new feature called Fast Model Loader to address challenges in deploying and scaling FMs for inference. Fast Model Loader can load large models up to 15 times faster by streaming model weights directly from Amazon S3 to the accelerator, reducing scaling and deployment times significantly.

In Part 2 of this post, we demonstrate how you can try out this new feature through either the SageMaker Python SDK or SageMaker Studio console.


About the Authors

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.

Read More

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models onto accelerators. By streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader can achieve up to 15 times faster loading times compared to traditional methods.

As the AI landscape continues to evolve and models grow even larger, innovations like Fast Model Loader become increasingly crucial. By significantly reducing model loading times, this feature has the potential to transform the way you deploy and scale your LLMs, enabling more responsive and efficient AI applications across a wide range of use cases.

In this post, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments. We explore two approaches: using the SageMaker Python SDK for programmatic implementation, and using the Amazon SageMaker Studio UI for a more visual, interactive experience. Whether you’re a developer who prefers working with code or someone who favors a graphical interface, you’ll learn how to take advantage of this powerful feature to accelerate your LLM deployments.

Solution overview

Fast Model Loader is currently integrated with SageMaker Large Model Inference (LMI) containers (starting with v13) for GPU instances. It introduces two key techniques to enable lightning-fast model loads:

  • Weight streaming
  • Model sharding for streaming

Use Fast Model Loader with the SageMaker Python SDK

In this section, we show how to use the SageMaker Python SDK to use this new feature. You can find the example notebook in the following GitHub repo. Complete the following steps:

  1. First, use ModelBuilder to prepare and package the model inference components.

To learn more about the ModelBuilder class, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. In this example, you deploy the Meta Llama 3.1 70B model with the model name meta-textgeneration-llama-3-1-70b in Amazon SageMaker JumpStart.

The SchemaBuilder parameter is used to infer the serialization and deserialization methods for the model. For more information on SchemaBuilder, refer to Define serialization and deserialization methods.

You can choose to specify OPTION_TENSOR_PARALLEL_DEGREE as a ModelBuilder environment variable as shown in the following commented lines, or in the next step as part of the ModelBuilder sharding_config:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

# Define sample input and output for the model
prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."
# Create the input schema structure
sample_input = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 32}
}
# Define the expected output format
sample_output = [{"generated_text": response}]

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
    #env_vars={
    #   "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
)
  2. Next, use the optimize() function to prepare the model shards for deployment.

The optimize() function will start a model optimization job and will take a few minutes to finish. The tensor parallel degree should be set to how many GPUs you want each inference component to have access to. You can find the model shards at the output_path S3 location under a folder starting with sagemaker-fast-model-loader-xxx.

model_builder.optimize(
    instance_type="ml.p4d.24xlarge", 
    accept_eula=True, 
    output_path=output_path, 
    sharding_config={
            "OverrideEnvironment": {
            # The value must be equal to the subsequent number of GPUs that will be used for each IC. 
                "OPTION_TENSOR_PARALLEL_DEGREE": "8"
            }
    }
)

You can reuse the sharded model that was generated by previous optimization jobs. The following code sample demonstrates how to use model_metadata to overwrite the model path, which needs to point to the Amazon S3 location of the existing model shards:

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    model_metadata={
        "CUSTOM_MODEL_PATH": output_path,
    },
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
    role_arn=role,
    instance_type="ml.p4d.24xlarge",
)
  3. When the model optimization job is complete, you can use the build() function to generate the artifacts according to the model server:
    # use the build() function to generate the artifacts according to the model server
    final_model = model_builder.build()

  4. If you’re using existing model shards without running an optimization job, you need to make sure the _is_sharded_model value is set to True and the EnableNetworkIsolation is set to False because Fast Model Loader requires network access:
    # You only need to set the values if you are using existing sharded models 
    if not final_model._is_sharded_model:
     final_model._is_sharded_model = True 
    if final_model._enable_network_isolation:
     final_model._enable_network_isolation = False

  5. Use the deploy() function to deploy the model to an endpoint, where you can specify the required resources, such as GPU memory and number of accelerators:
    from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
    
    resources_required = ResourceRequirements(
        requests={
            "memory" : 204800,
            "num_accelerators": 8
        }
    )
    
    # deploy the optimized model to an endpoint
    final_model.deploy(
        instance_type="ml.p4d.24xlarge", 
        accept_eula=True, 
        endpoint_logging=False, 
        resources=resources_required
    )

  6. After the endpoint is up and running, you can test the endpoint using the following code example:
    from sagemaker.predictor import retrieve_default 
    endpoint_name = final_model.endpoint_name 
    predictor = retrieve_default(endpoint_name) 
    payload = { "inputs": "I believe the meaning of life is", 
                "parameters": { 
                    "max_new_tokens": 64, 
                    "top_p": 0.9, 
                    "temperature": 0.6 
                } 
            }
    response = predictor.predict(payload) 
    print(response)

  7. To clean up, run the following code cell to delete the resources created for the endpoint:
    predictor.delete_predictor()
    predictor.delete_endpoint()

Use Fast Model Loader with SageMaker Studio

In this section, we show how to use the faster model loading feature through the SageMaker Studio UI. Complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Choose your model.
  3. On the model details page, choose Optimize.
  4. Accept the EULA and proceed to the optimization configurations.
  5. Select Fast model loading and set the OPTION_TENSOR_PARALLEL_DEGREE to 8, because this example uses an ml.p4d.24xlarge instance that has 8 GPUs. If you’re using an instance with a different number of GPUs, set the value to match the instance.
  6. Set the output path to the Amazon S3 path where the sharded model will be stored.
  7. Choose Create job.

After the inference optimization job starts, you can check the status of the job on the Inference optimization page. Each job has tags associated with it that indicate which optimization configuration was used.

  8. View the details of the job by choosing the job ID.
  9. Deploy the optimized model by choosing Deploy on the optimized job page.
  10. Verify the endpoint settings and choose Deploy to initiate a SageMaker endpoint deployment.

You will get a notification on the SageMaker Studio UI, and the status will change to In service when the endpoint creation is complete.

You can now send a sample inference request to test the model.
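For example, a minimal sketch of sending a test request with the SageMaker Runtime client is shown below; the endpoint and inference component names are placeholders, and the InferenceComponentName parameter only applies if the model was deployed as an inference component.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {"max_new_tokens": 64, "temperature": 0.6},
}

response = runtime.invoke_endpoint(
    EndpointName="my-fast-model-loader-endpoint",   # placeholder endpoint name
    InferenceComponentName="my-llama-ic",           # placeholder; include only if applicable
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))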

After the test, you can delete the endpoint from the SageMaker Studio console to clean up the resources created in this example.

Conclusion

Fast Model Loader represents a significant advancement in how you can deploy and scale LLMs on SageMaker. In this post, we walked through the step-by-step process of implementing this feature through both the SageMaker Python SDK and SageMaker Studio UI. By using weight streaming and model sharding techniques, you can now achieve dramatically faster model loading times, enabling more responsive scaling for your LLM-based applications.

The integration with SageMaker LMI containers (starting from LMI v13) makes it straightforward to adopt this feature in your existing workflows. Whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM services, Fast Model Loader provides the tools you need to optimize your model deployment pipeline.

Try out Fast Model Loader for your own use case, and leave your feedback and questions in the comments.


About the Authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.
