Unlock cost savings with the new scale down to zero feature in SageMaker Inference

Today at AWS re:Invent 2024, we are excited to announce a new feature for Amazon SageMaker inference endpoints: the ability to scale SageMaker inference endpoints to zero instances. This long-awaited capability is a game changer for our customers using the power of AI and machine learning (ML) inference in the cloud. Previously, SageMaker inference endpoints maintained a minimum number of instances to provide continuous availability, even during periods of low or no traffic. With this update, available when using SageMaker inference components, you have more options to align your resource usage with your specific needs and traffic patterns.

Refer to the accompanying notebooks to get started with the new scale down to zero feature.

The new feature expands the possibilities for managing SageMaker inference endpoints. It allows you to configure the endpoints so they can scale to zero instances during periods of inactivity, providing an additional tool for resource management. With this feature, you can closely match your compute resource usage to your actual needs, potentially reducing costs during times of low demand. This enhancement builds upon the existing auto scaling capabilities in SageMaker, offering more granular control over resource allocation. You can now configure your scaling policies to include scaling to zero, allowing for more precise management of your AI inference infrastructure.

The scale down to zero feature presents new opportunities for how businesses can approach their cloud-based ML operations. It provides additional options for managing resources across various scenarios, from development and testing environments to production deployments with variable traffic patterns. As with any new feature, you are encouraged to carefully evaluate how it fits into your overall architecture and operational needs, considering factors such as response times and the specific requirements of your applications.

In this post, we explore the new scale to zero feature for SageMaker inference endpoints, demonstrating how to implement and use this capability to optimize costs and manage resources more effectively. We cover the key scenarios where scaling to zero is beneficial, provide best practices for optimizing scale-up time, and walk through the step-by-step process of implementing this functionality. Additionally, we discuss how to set up scheduled scaling actions for predictable traffic patterns and test the behavior of your scaled-to-zero endpoints.

Determining when to scale to zero

Before we dive into the implementation details of the new scale to zero feature, it's important to understand when and why you should consider using it. Although the ability to scale SageMaker inference endpoints to zero instances offers significant cost-saving potential, not all scenarios benefit equally from scaling to zero, and in some cases it may even impact the performance of your applications. Let's explore why it's important to carefully consider when to implement this feature and how to identify the scenarios where it provides the most value.

The ability to scale SageMaker inference endpoints to zero instances is particularly beneficial in three key scenarios:

  • Predictable traffic patterns – If your inference traffic is predictable and follows a consistent schedule, you can use this scaling functionality to automatically scale down to zero during periods of low or no usage. This eliminates the need to manually delete and recreate inference components and endpoints.
  • Sporadic or variable traffic – For applications that experience sporadic or variable inference traffic patterns, scaling down to zero instances can provide significant cost savings. However, scaling from zero instances back up to serving traffic is not instantaneous. During the scale-out process, any requests sent to the endpoint will fail, and these NoCapacityInvocationFailures will be captured in Amazon CloudWatch.
  • Development and testing environments – The scale to zero functionality is also beneficial when testing and evaluating new ML models. During model development and experimentation, you might create temporary inference endpoints to test different configurations. However, it’s possible to forget to delete these endpoints when you’re done. Scaling down to zero makes sure these test endpoints automatically scale back to zero instances when not in use, preventing unwanted charges. This allows you to freely experiment without closely monitoring infrastructure usage or remembering to manually delete endpoints. The automatic scaling to zero provides a cost-effective way to test out ideas and iterate on your ML solutions.

By carefully evaluating your specific use case against these scenarios, you can make informed decisions about implementing scale to zero functionality. This approach makes sure you maximize cost savings without compromising on the performance and availability requirements of your ML applications. It’s important to note that although scaling to zero can provide significant benefits, it also introduces a trade-off in terms of initial response time when scaling back up. Therefore, it’s crucial to assess whether your application can tolerate this potential delay and to implement appropriate strategies to manage it. In the following sections, we dive deeper into each scenario and provide guidance on how to determine if scaling to zero is the right choice for your specific needs. We also discuss best practices for implementation and strategies to mitigate potential drawbacks.

Scale down to zero is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

Now that we understand when to use the scale to zero feature, let’s dive into how to optimize its performance and implement it effectively. Scaling up from zero instances to serving traffic introduces a brief delay (cold start), which can impact your application’s responsiveness. To mitigate this, we first explore best practices for minimizing scale-up time. Then we walk through the step-by-step process of implementing the scale to zero functionality for your SageMaker inference endpoints.

Optimizing scale-up time best practices

When using the scale to zero feature, it’s crucial to minimize the time it takes for your endpoint to scale up and begin serving requests. The following are several best practices you can implement to decrease the scale-out time for your SageMaker inference endpoints:

  • Decrease model or container download time – Use uncompressed model format to reduce the time it takes to download the model artifacts when scaling up. Compressed model files may save storage space, but they require additional time to uncompress and files can’t be downloaded in parallel, which can slow down the scale-up process. To learn more, see Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.
  • Reduce model server startup time – Look for ways to optimize the startup and initialization of your model server container. This could include techniques like pre-installing packages in the container image, using multi-threading, or minimizing unnecessary initialization steps. For more details, see Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1.
  • Use faster auto scaling metrics – Take advantage of more granular auto scaling metrics like ConcurrentRequestsPerCopy to more accurately monitor and react to changes in inference traffic. These sub-minute metrics can help trigger scale-out actions more precisely, reducing the number of NoCapacityInvocationFailures your users might experience. For more information, see Amazon SageMaker inference launches faster auto scaling for generative AI models.
  • Handle failed requests – When scaling from zero instances, there will be a brief period where requests fail with NoCapacityInvocationFailures while SageMaker provisions resources. To handle this, you can use queues or implement client-side retries:
    • Use a serverless queue like Amazon Simple Queue Service (Amazon SQS) to buffer requests during scale-out. When a failure occurs, enqueue the request and dequeue it after the model copies have scaled up from zero.
    • Alternatively, have your client catch failed requests and retry after some time, once the model copies have scaled. You can retrieve the number of copies of an inference component at any time by making the DescribeInferenceComponent API call and checking the CurrentCopyCount. This allows time for the model copies to scale out from zero, transparently handling the transition for end users.

By implementing these best practices, you can help make sure your SageMaker inference endpoints can scale out quickly and efficiently to meet changes in traffic, providing a responsive and reliable experience for your end-users.
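The client-side retry approach described above can be sketched as a small wrapper. This is an illustrative example, not part of the SageMaker SDK: the names invoke_with_retry and is_no_capacity_error are assumptions, and the capacity check keys off the error message text shown later in this post.

```python
import time


def invoke_with_retry(invoke_fn, is_capacity_error, max_attempts=5, base_delay=2.0):
    """Call invoke_fn, retrying with exponential backoff while the
    scaled-to-zero endpoint is still provisioning capacity."""
    for attempt in range(max_attempts):
        try:
            return invoke_fn()
        except Exception as exc:
            # Re-raise anything that is not a capacity error,
            # or give up once we are out of attempts.
            if not is_capacity_error(exc) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))


def is_no_capacity_error(exc):
    # Heuristic: capacity failures surface as a ValidationError whose
    # message says the inference component has no capacity.
    return "no capacity" in str(exc).lower()
```

In practice, you would pass a closure around sagemaker_client.invoke_endpoint as invoke_fn, and tune max_attempts and base_delay so the total backoff covers your observed scale-out time.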

Solution overview

With these best practices in mind, let’s now walk through the process of enabling your SageMaker inference endpoints to scale down to zero instances. This process involves a few key steps that are crucial for optimizing your endpoint’s performance and cost-efficiency:

  • Configure your endpoint – The first and most critical step is to enable managed instance scaling for your SageMaker endpoint. This is the foundational action that allows you to implement advanced scaling features, including scaling to zero. By enabling managed instance scaling, you’re creating an inference component endpoint, which is essential for the fine-grained control over scaling behaviors we discuss later in this post. After you configure managed instance scaling, you then configure the SageMaker endpoint to set the MinInstanceCount parameter to 0. This parameter allows the endpoint to scale all the way down to zero instances when not in use, maximizing cost-efficiency. Enabling managed instance scaling and setting MinInstanceCount to 0 work together to provide a highly flexible and cost-effective endpoint configuration. However, scaling up from zero will introduce cold starts, potentially impacting response times for initial requests after periods of inactivity. The inference component endpoint created through managed instance scaling serves as the foundation for implementing the sophisticated scaling policies we explore in the next step.
  • Define scaling policies – Next, you need to create two scaling policies that work in tandem to manage the scaling behavior of your endpoint effectively:
    • Scaling policy for inference component copies – This target tracking scaling policy will manage the scaling of your inference component copies. It’s a dynamic policy that adjusts the number of copies based on a specified metric, such as CPU utilization or request count. The policy is designed to scale the copy count to zero when there is no traffic, making sure you’re not paying for unused resources. Conversely, it will scale back up to your desired capacity when needed, allowing your endpoint to handle incoming requests efficiently. When configuring this policy, you need to carefully choose the target metric and threshold that best reflect your workload patterns and performance requirements.
    • Scale out from zero policy – This policy is crucial for enabling your endpoint to scale out from zero model copies when traffic arrives. It’s implemented as a step scaling policy that adds model copies when triggered by incoming requests. This allows SageMaker to provision the necessary instances to support the model copies and handle the incoming traffic. When configuring this policy, you need to consider factors such as the expected traffic patterns, the desired responsiveness of your endpoint, and the potential cold start latency. You may want to set up multiple steps in your policy to handle different levels of incoming traffic more granularly.

By implementing these scaling policies, you create a flexible and cost-effective infrastructure that can automatically adjust to your workload demands and scale to zero when needed.

Now let’s see how to use this feature step by step.

Set up your endpoint

The first crucial step in enabling your SageMaker endpoint to scale to zero is properly configuring the endpoint and its associated components. This process involves three main steps:

  1. Create the endpoint configuration and set MinInstanceCount to 0. This allows the endpoint to scale down all the way to zero instances when not in use.
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=endpoint_config_name,
        ExecutionRoleArn=role,
        ProductionVariants=[
            {
                "VariantName": variant_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
                "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
                "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
                "ManagedInstanceScaling": {
                    "Status": "ENABLED",
                    "MinInstanceCount": 0,
                    "MaxInstanceCount": max_instance_count,
                },
                "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
            }
        ],
    )

  2. Create the SageMaker endpoint:
    sagemaker_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=endpoint_config_name,
    )

  3. Create the inference component for your endpoint:
    sagemaker_client.create_inference_component(
        InferenceComponentName=inference_component_name,
        EndpointName=endpoint_name,
        VariantName=variant_name,
        Specification={
            "ModelName": model_name,
            "StartupParameters": {
                "ModelDataDownloadTimeoutInSeconds": 3600,
                "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
            },
            "ComputeResourceRequirements": {
                "MinMemoryRequiredInMb": 1024,
                "NumberOfAcceleratorDevicesRequired": 1,
            },
        },
        RuntimeConfig={
            "CopyCount": 1,
        },
    )

Add scaling policies

After the endpoint is deployed and InService, you can add the necessary scaling policies:

  • A target tracking policy that scales the copy count for our inference component model copies between 1 and n, and down to zero when there is no traffic
  • A step scaling policy that will allow the endpoint to scale up from zero

Scaling policy for inference components model copies

After you create your SageMaker endpoint and inference components, you register a new auto scaling target for Application Auto Scaling. In the following code block, you set MinCapacity to 0, which is required for your endpoint to scale down to zero:

# Register scalable target
resource_id = f"inference-component/{inference_component_name}"
service_namespace = "sagemaker"
scalable_dimension = "sagemaker:inference-component:DesiredCopyCount"

aas_client.register_scalable_target(
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    MinCapacity=0,
    MaxCapacity=max_copy_count,  # Replace with your desired maximum number of model copies
)

After you have registered your new scalable target, the next step is to define your target tracking policy. In the following code example, we set the TargetValue to 5. This setting instructs the auto scaling system to increase capacity when the number of concurrent requests per model reaches or exceeds 5.

# Create Target Tracking Scaling Policy

aas_client.put_scaling_policy(
    PolicyName="inference-component-target-tracking-scaling-policy",
    PolicyType="TargetTrackingScaling",
    ServiceNamespace=service_namespace,
    ResourceId=resource_id,
    ScalableDimension=scalable_dimension,
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution",
        },
        # Low TPS + load TPS
        "TargetValue": 5,  # you need to adjust this value based on your use case
        "ScaleInCooldown": 300,  # default
        "ScaleOutCooldown": 300,  # default
    },
)

Application Auto Scaling creates two CloudWatch alarms per scaling target. The first triggers scale-out actions after 1 minute (using one 1-minute data point), and the second triggers scale-in after 15 minutes (using 90 10-second data points). In practice, the scaling action usually triggers 1–2 minutes later than these windows suggest, because it takes time for the endpoint to publish metrics to CloudWatch and for Application Auto Scaling to react.
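The evaluation windows behind these two alarms can be sanity-checked with simple arithmetic (the data point counts come from the alarm configuration described above):

```python
# One 1-minute data point gates scale-out; 90 ten-second data points gate scale-in.
scale_out_window_min = (1 * 60) / 60    # 1.0 minute before scale-out can trigger
scale_in_window_min = (90 * 10) / 60    # 15.0 minutes before scale-in can trigger

print(scale_out_window_min, scale_in_window_min)
```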

Scale out from zero model copies policy

To enable your endpoint to scale out from zero instances, complete the following steps:

  1. Create a step scaling policy that defines when and how to scale out from zero. This policy will add one model copy when triggered, enabling SageMaker to provision the instances required to handle incoming requests after being idle. The following code shows how to define a step scaling policy. Here we have configured it to scale from zero to one model copy ("ScalingAdjustment": 1). Depending on your use case, you can adjust ScalingAdjustment as required.
    aas_client.put_scaling_policy(
        PolicyName="inference-component-step-scaling-policy",
        PolicyType="StepScaling",
        ServiceNamespace=service_namespace,
        ResourceId=resource_id,
        ScalableDimension=scalable_dimension,
        StepScalingPolicyConfiguration={
            "AdjustmentType": "ChangeInCapacity",
            "MetricAggregationType": "Maximum",
            "Cooldown": 60,
            "StepAdjustments":
              [
                 {
                   "MetricIntervalLowerBound": 0,
                   "ScalingAdjustment": 1 # you need to adjust this value based on your use case
                 }
              ]
        },
    )

  2. Create a CloudWatch alarm with the metric NoCapacityInvocationFailures.

When triggered, the alarm initiates the previously defined scaling policy. For more information about the NoCapacityInvocationFailures metric, see the SageMaker endpoint invocation metrics documentation.

We have also set the following:

  • EvaluationPeriods to 1
  • DatapointsToAlarm to 1
  • ComparisonOperator to GreaterThanOrEqualToThreshold

This results in waiting approximately 1 minute for the step scaling policy to trigger after our endpoint receives a single request.

cw_client.put_metric_alarm(
    AlarmName='ic-step-scaling-policy-alarm',
    AlarmActions=[step_scaling_policy_arn],  # Replace with the ARN of your step scaling policy
    MetricName='NoCapacityInvocationFailures',
    Namespace='AWS/SageMaker',
    Statistic='Maximum',
    Dimensions=[
        {
            'Name': 'InferenceComponentName',
            'Value': inference_component_name  # Replace with actual InferenceComponentName
        }
    ],
    Period=30,
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator='GreaterThanOrEqualToThreshold',
    TreatMissingData='missing'
)

Replace the step scaling policy ARN placeholder with the Amazon Resource Name (ARN) of the scaling policy you created in the previous step.

Notice the "MinInstanceCount": 0 setting in the endpoint configuration, which allows the endpoint to scale down to zero instances. With the scaling policy, CloudWatch alarm, and minimum instances set to zero, your SageMaker inference endpoint will now be able to automatically scale down to zero instances when not in use.

Test the solution

When our SageMaker endpoint doesn't receive requests for 15 minutes, it automatically scales the number of model copies down to zero:

import sys
import time

time.sleep(500)
while True:
    desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
    status = desc["InferenceComponentStatus"]
    print(status)
    sys.stdout.flush()
    if status in ["InService", "Failed"]:
        break
    time.sleep(30)
    
desc = sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name)
print(desc)

After 10 additional minutes of inactivity, SageMaker automatically stops all underlying instances of the endpoint, eliminating all associated instance costs.

If we try to invoke our endpoint while instances are scaled down to zero:

sagemaker_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name,
    Body=json.dumps(
        {
            "inputs": "The diamondback terrapin was the first reptile to be",
            "parameters": {
                "do_sample": True,
                "max_new_tokens": 256,
                "min_new_tokens": 256,
                "temperature": 0.3,
                "watermark": True,
            },
        }
    ),
    ContentType="application/json",
)["Body"].read().decode("utf8")

We get a validation error:

An error occurred (ValidationError) when calling the InvokeEndpoint operation: Inference Component has no capacity to process this request. ApplicationAutoScaling may be in-progress (if configured) or try to increase the capacity by invoking UpdateInferenceComponentRuntimeConfig API.

However, after 1 minute, our step scaling policy should start. SageMaker will then start provisioning a new instance and deploy our inference component model copy to handle requests.
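To avoid hammering the endpoint while it recovers, a client can poll the copy count before retrying. The following sketch (the wait_for_copies helper is illustrative, not a SageMaker API) uses the DescribeInferenceComponent pattern mentioned earlier; it takes a describe callable as a parameter so it can be exercised without an AWS connection:

```python
import time


def wait_for_copies(describe_fn, min_copies=1, timeout_s=600, interval_s=15):
    """Poll until the inference component reports at least min_copies,
    or raise TimeoutError. describe_fn should return the response of
    DescribeInferenceComponent, which includes RuntimeConfig.CurrentCopyCount."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        desc = describe_fn()
        if desc["RuntimeConfig"]["CurrentCopyCount"] >= min_copies:
            return desc
        time.sleep(interval_s)
    raise TimeoutError("inference component did not scale out in time")
```

In practice you would pass something like lambda: sagemaker_client.describe_inference_component(InferenceComponentName=inference_component_name), with a timeout comfortably above your measured scale-out time.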

Schedule scaling down to zero

In some scenarios, you might observe consistent weekly traffic patterns: a steady workload Monday through Friday, and no traffic on weekends. You can optimize costs and performance by configuring scheduled actions that align with these patterns:

  • Weekend scale-in (Friday evening) – Configure a scheduled action to reduce the number of model copies to zero. This instructs SageMaker to scale the number of instances behind the endpoint to zero, completely eliminating costs during the weekend period of no usage.
  • Workweek scale-out (Monday morning) – Set up a complementary scheduled action to restore the required model capacity for the inference component on Monday morning, so your application is ready for weekday operations.

You can scale your endpoint to zero in two ways. The first method is to set the number of model copies to zero in your inference component using the UpdateInferenceComponentRuntimeConfig API. This approach maintains your endpoint configuration while eliminating compute costs during periods of inactivity.

sagemaker_client.update_inference_component_runtime_config(
    InferenceComponentName=inference_component_name,
    DesiredRuntimeConfig={
        'CopyCount': 0
    }
)

Amazon EventBridge Scheduler can automate SageMaker API calls using cron/rate expressions for recurring schedules or one-time invocations. To function, EventBridge Scheduler requires an execution role with appropriate permissions to invoke the target API operations on your behalf. For more information about how to create this role, see Set up the execution role. The specific permissions needed depend on the target API being called.
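As a minimal sketch of the permissions such an execution role might carry (this policy is illustrative, not an official template; scope the Resource to your inference component ARN in practice), the role needs to allow the SageMaker actions the schedules below invoke:

```python
import json

# Illustrative permissions policy for the EventBridge Scheduler execution role.
# Covers the three SageMaker calls scheduled in this post; remove any
# actions you do not schedule, and narrow Resource to your component's ARN.
scheduler_role_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:UpdateInferenceComponentRuntimeConfig",
                "sagemaker:CreateInferenceComponent",
                "sagemaker:DeleteInferenceComponent",
            ],
            "Resource": "*",  # narrow this to your inference component ARN
        }
    ],
}

print(json.dumps(scheduler_role_policy, indent=2))
```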

The following code creates two scheduled actions for the inference component during 2024–2025. The first schedule scales in the CopyCount to zero every Friday at 18:00 UTC+1, and the second schedule restores model capacity every Monday at 07:00 UTC+1. The schedule will start on November 29, 2024, end on December 31, 2025, and be deleted after completion.

import json
import boto3

scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 0}, "InferenceComponentName": inference_component_name })
}

# Scale in our endpoint to 0 every friday at 18:00 UTC+1, starting on November 29, 2024
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale out schedule
scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:updateInferenceComponentRuntimeConfig",
    "Input": json.dumps({ "DesiredRuntimeConfig": {"CopyCount": 2}, "InferenceComponentName": inference_component_name })
}

# Scale out our endpoint every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

The second method is to delete the inference component by calling the DeleteInferenceComponent API. This approach achieves the same cost-saving benefit while completely removing the component from your configuration. The following code creates a scheduled action that automatically deletes the inference component every Friday at 18:00 UTC+1 during 2024–2025, along with a complementary scheduled action that recreates it every Monday at 07:00 UTC+1.

import json
import boto3

scheduler = boto3.client('scheduler')

flex_window = {
    "Mode": "OFF"
}

# We specify the SageMaker target API for the scale in schedule
scale_in_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:deleteInferenceComponent",
    "Input": json.dumps({"InferenceComponentName": inference_component_name })
}

# Scale in our endpoint by deleting the IC every friday at 18:00 UTC+1
scheduler.create_schedule(
    Name="scale-to-zero-schedule",
    ScheduleExpression="cron(00 18 ? * 6 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_in_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

# Specify the SageMaker target API for the scale up schedule
input_config = {
  "EndpointName": endpoint_name,
  "InferenceComponentName": inference_component_name,
  "RuntimeConfig": {
    "CopyCount": 2
  },
  "Specification": {
    "ModelName": model_name,
    "StartupParameters": {
        "ModelDataDownloadTimeoutInSeconds": 3600,
        "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
    },
    "ComputeResourceRequirements": {
      "MinMemoryRequiredInMb": 1024,
      "NumberOfAcceleratorDevicesRequired": 1
    }
  },
  "VariantName": variant_name
}

scale_out_target = {
    "RoleArn": role,
    "Arn": "arn:aws:scheduler:::aws-sdk:sagemaker:createInferenceComponent",
    "Input": json.dumps(input_config)
}

# Scale out our endpoint by recreating the IC every Monday at 07:00 UTC+1
scheduler.create_schedule(
    Name="scale-out-schedule",
    ScheduleExpression="cron(00 07 ? * 2 2024-2025)",
    ScheduleExpressionTimezone="UTC+1", # Set the correct timezone for your application
    Target=scale_out_target,
    FlexibleTimeWindow=flex_window,
    ActionAfterCompletion="DELETE",
    StartDate="2024-11-29T00:00:00",
    EndDate="2025-12-31T23:59:59"
)

To scale to zero on an endpoint with multiple inference components, all components must be either set to 0 or deleted. You can also automate this process by using EventBridge Scheduler to trigger an AWS Lambda function that handles either deletion or zero-setting of all inference components.
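The Lambda-based approach for endpoints with multiple inference components can be sketched as a small helper (the function name is an assumption, as is the stub client used to verify it; pagination of ListInferenceComponents is omitted for brevity):

```python
def scale_endpoint_components_to_zero(sm_client, endpoint_name):
    """Set CopyCount to 0 for every inference component on the endpoint.
    sm_client is a boto3 SageMaker client (or a compatible stub)."""
    # List all inference components attached to this endpoint
    resp = sm_client.list_inference_components(EndpointNameEquals=endpoint_name)
    names = [ic["InferenceComponentName"] for ic in resp["InferenceComponents"]]
    # Scale each component's copy count to zero
    for name in names:
        sm_client.update_inference_component_runtime_config(
            InferenceComponentName=name,
            DesiredRuntimeConfig={"CopyCount": 0},
        )
    return names
```

A Lambda handler would simply call this function with boto3.client("sagemaker") and the endpoint name from the event payload; swapping the update call for delete_inference_component gives the deletion variant.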

Performance evaluation

We evaluated the performance implications of the Scale to Zero feature by conducting tests using a Llama3-8B Instruct model. These tests utilized container caching and optimized model loading techniques, and were performed with both Target Tracking and Step Scaling policies in place. Our findings for Llama3-8B Instruct show that when using the Target Tracking policy, SageMaker scales the endpoint to zero model copies in approximately 15 minutes, then takes an additional 10 minutes to fully scale down the underlying instances, for a total scale-in time of 25 minutes. Conversely, when scaling the endpoint back up from zero, the Step Scaling policy triggers in around 1 minute, provisioning the instance(s) takes approximately 1.748 minutes, and instantiating the model copies takes approximately 2.28 minutes, resulting in a total scale-out time of around 5.028 minutes.

The performance tests on LLaMa3.1 models (8B and 70B variants) demonstrate SageMaker’s Scale to Zero feature’s effectiveness, with intentionally conservative scaling times to prevent endpoint thrashing and accommodate spiky traffic patterns. For both model sizes, scaling in takes a total of 25 minutes, allowing a 15-minute buffer before initiating scale-down and an additional 10 minutes to fully decommission instances. This cautious approach helps avoid premature scaling during temporary lulls in traffic. When scaling out, the 8B model takes about 5 minutes, while the 70B model needs approximately 6 minutes. These times include a 1-minute trigger delay, followed by instance provisioning and model copy instantiation. The slightly longer scale-out times, especially for larger models, provide a balance between responsiveness and stability, ensuring the system can handle sudden traffic increases without constantly scaling up and down. This measured approach to scaling helps maintain consistent performance and cost-efficiency in environments with variable workloads.

LLaMa3.1 8B Instruct

Scale in: time to trigger target tracking: 15 min; time to scale in instance count to zero: 10 min; total time: 25 min
Scale out: time to trigger step scaling policy: 1 min; time to provision instance(s): 1.748 min; time to instantiate a new model copy: 2.28 min; total time: 5.028 min

LLaMa3.1 70B

Scale in: time to trigger target tracking: 15 min; time to scale in instance count to zero: 10 min; total time: 25 min
Scale out: time to trigger step scaling policy: 1 min; time to provision instance(s): 3.018 min; time to instantiate a new model copy: 1.984 min; total time: 6.002 min

Scale-up trials

LLaMa3.1 8B Instruct
Trial Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 1 1.96 3.1 6.06
2 1 1.75 2.6 5.35
3 1 1.4 2.1 4.5
4 1 1.96 1.9 4.86
5 1 1.67 1.7 4.37
Average 1 1.748 2.28 5.028
LLaMa3.1 70B
Trial Time to trigger step scaling policy (min) Time to provision instance(s) (min) Time to instantiate a new model copy (min) Total time (min)
1 1 3.1 1.98 6.08
2 1 2.92 1.98 5.9
3 1 2.82 1.98 5.8
4 1 3.27 2 6.27
5 1 2.98 1.98 5.96
Average 1 3.018 1.984 6.002
  • Target Tracking: Scale Model Copies to Zero (min) – This refers to the time it took target tracking to trigger the alarm and SageMaker to decrease model copies to zero on the instance
  • Scale in Instance Count to Zero (min) – This refers to the time it takes SageMaker to scale the instances down to zero after all inference component model copies are zero
  • Step Scaling: Scale up Model Copies from Zero (min) – This refers to the time it took step scaling to trigger the alarm and for SageMaker to provision the instances
  • Scale out Instance Count from Zero (min) – This refers to the time it takes for SageMaker to scale out and add inference component model copies

If you want more customization and faster scaling, consider using step scaling to scale model copies instead of target tracking.
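As a sketch of that alternative, a step scaling policy can be attached to the same copy-count dimension and driven by a CloudWatch alarm. The names, thresholds, and the no-capacity metric below are assumptions for illustration, and boto3 is imported lazily so the configuration is inspectable without AWS credentials:

```python
# Hypothetical inference component name for illustration.
IC_NAME = "my-llama-ic"

# Step scaling reacts as soon as the alarm fires, instead of waiting out
# target tracking's longer evaluation windows.
step_scaling_policy = {
    "PolicyName": "scale-out-from-zero",
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"inference-component/{IC_NAME}",
    "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
        ],
    },
}

# A CloudWatch alarm on invocations that arrive while no capacity exists
# (metric name assumed) triggers the policy; wire the policy ARN returned by
# put_scaling_policy into AlarmActions.
cloudwatch_alarm = {
    "AlarmName": "ic-scale-out-from-zero",
    "MetricName": "NoCapacityInvocationFailures",  # assumed metric name
    "Namespace": "AWS/SageMaker",
    "Dimensions": [{"Name": "InferenceComponentName", "Value": IC_NAME}],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 1,
    "Threshold": 1,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
}

def attach_step_scaling(client=None):
    """Attach the step scaling policy (requires AWS credentials)."""
    import boto3  # imported lazily; the dicts above are inspectable offline
    client = client or boto3.client("application-autoscaling")
    return client.put_scaling_policy(**step_scaling_policy)
```

With a one-minute period and a single evaluation period, this configuration matches the roughly 1-minute trigger delay reported in the scale-out tables above.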

Customer testimonials

The new Scale to Zero feature for SageMaker inference endpoints has sparked considerable interest among customers. We gathered initial reactions from companies that have previewed and evaluated this capability, highlighting its potential impact on AI and machine learning operations.

Atlassian, headquartered in Sydney, Australia, is a software company specializing in collaboration tools for software development and project management:

“The new Scale to Zero feature for SageMaker inference strongly aligns with our commitment to efficiency and innovation. We’re enthusiastic about its potential to revolutionize how we manage our machine learning inference resources, and we look forward to integrating it into our operations.”

– Guarav Awadhwal – Senior Engineering Manager at Atlassian

iFood is a Latin American online food delivery firm based in Brazil. It works with over 300,000 restaurants, connecting them with millions of customers every month.

“The Scale to Zero feature for SageMaker Endpoints will be fundamental for iFood’s Machine Learning Operations. Over the years, we’ve collaborated closely with the SageMaker team to enhance our inference capabilities. This feature represents a significant advancement, as it allows us to improve cost efficiency without compromising the performance and quality of our ML services, given that inference constitutes a substantial part of our infrastructure expenses.”

– Daniel Vieira, MLOps Engineer Manager at iFood

VIDA, headquartered in Jakarta, Indonesia, is a leading digital identity provider that enables individuals and businesses to conduct business in a safe and secure digital environment.

“SageMaker’s new Scale to Zero feature for GPU inference endpoints shows immense promise for deep fake detection operations. The potential to efficiently manage our face liveness and document verification inference models while optimizing infrastructure costs aligns perfectly with our goals. We’re excited to leverage this capability to enhance our identity verification solutions.”

– Keshav Sharma, ML Platform Architect at VIDA

APOIDEA Group is a leading AI-focused FinTech ISV company headquartered in Hong Kong. Leveraging cutting-edge generative AI and deep learning technologies, the company develops innovative AI FinTech solutions for multinational banks. APOIDEA’s products automate repetitive human analysis tasks, extracting valuable financial insights from extensive financial documents to accelerate AI-driven transformation across the industry.

“SageMaker’s Scale to Zero feature is a game changer for our AI financial analysis solution in operations. It delivers significant cost savings by scaling down endpoints during quiet periods, while maintaining the flexibility we need for batch inference and model testing. This capability is transforming how we manage our GenAI workloads and evaluate new models. We’re eager to harness its power to further optimize our deep learning and NLP model deployments.”

– Mickey Yip, VP of Product at APOIDEA Group

Fortiro, based in Melbourne, Australia, is a FinTech company specializing in automated document fraud detection and financial verification for trusted financial institutions.

“The new Scale-to-Zero capability in SageMaker is a game-changer for our MLOps and delivers great cost savings. Being able to easily scale inference endpoints and GPUs means we can take advantage of a fast, highly responsive environment, without incurring unnecessary costs. Our R&D teams constantly experiment with new AI-based document fraud detection methods, which involves a lot of testing and repeating. This capability empowers us to do this both faster and more efficiently.”

– Amir Vahid, Chief Technology Officer at Fortiro

These testimonials underscore the anticipation for SageMaker’s Scale to Zero feature. As organizations begin to implement this capability, we expect to see innovative applications that balance cost efficiency with performance in machine learning deployments.

Conclusion

In this post, we introduced the new scale to zero feature in SageMaker, an innovative capability that enables you to optimize costs by automatically scaling in your inference endpoints when they’re not in use. We guided you through the detailed process of implementing this feature, including configuring endpoints, setting up auto scaling policies, and managing inference components for both automatic and scheduled scaling scenarios.

This cost-saving functionality presents new possibilities for how you can approach your ML operations. With this feature, you can closely align your compute resource usage with actual needs, potentially reducing costs during periods of low demand. We encourage you to try this capability and start optimizing your SageMaker inference costs today.

To help you get started quickly, we’ve prepared comprehensive notebooks containing end-to-end examples of how to configure an endpoint to scale to zero.



About the authors

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Christian Kamwangala is an AI/ML and Generative AI Specialist Solutions Architect at AWS, based in Paris, France. He helps enterprise customers architect and implement cutting-edge AI solutions using AWS’s comprehensive suite of tools, with a focus on production-ready systems that follow industry best practices. In his spare time, Christian enjoys exploring nature and spending time with family and friends.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Raghu Ramesha is a Senior GenAI/ML Solutions Architect on the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in computer science from UT Dallas. In his free time, he enjoys traveling and photography.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Raj Vippagunta is a Principal Engineer on the Amazon SageMaker Machine Learning (ML) platform team at AWS. He uses his vast experience of 18+ years in large-scale distributed systems and his passion for machine learning to build practical service offerings in the AI and ML space. He has helped build various at-scale solutions for AWS and Amazon. In his spare time, he likes reading books, pursuing long-distance running, and exploring new places with his family.

Read More

Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference


Today at AWS re:Invent 2024, we are excited to announce the new Container Caching capability in Amazon SageMaker, which significantly reduces the time required to scale generative AI models for inference. This innovation allows you to scale your models faster, with up to a 56% reduction in latency when scaling a new model copy and up to 30% when adding a model copy on a new instance. These improvements are available across a wide range of SageMaker’s Deep Learning Containers (DLCs), including Large Model Inference (LMI, powered by vLLM and multiple other frameworks), Hugging Face Text Generation Inference (TGI), PyTorch (powered by TorchServe), and NVIDIA Triton. Fast container startup times are critical to scale generative AI models effectively, making sure end-users aren’t negatively impacted as inference demand increases.

As generative AI models and their hosting containers grow in size and complexity, scaling these models efficiently for inference becomes increasingly challenging. Until now, each time SageMaker scaled up an inference endpoint by adding new instances, it needed to pull the container image (often several tens of gigabytes in size) from Amazon Elastic Container Registry (Amazon ECR), a process that could take minutes. For generative AI models requiring multiple instances to handle high-throughput inference requests, this added significant overhead to the total scaling time, potentially impacting application performance during traffic spikes.

Container Caching addresses this scaling challenge by pre-caching the container image, eliminating the need to download it when scaling up. This new feature brings several key benefits for generative AI inference workloads: dramatically faster scaling to handle traffic spikes, improved resource utilization on GPU instances, and potential cost savings through more efficient scaling and reduced idle time during scale-up events. These benefits are particularly impactful for popular frameworks and tools like vLLM-powered LMI, Hugging Face TGI, PyTorch with TorchServe, and NVIDIA Triton, which are widely used in deploying and serving generative AI models on SageMaker inference.

In our tests, we’ve seen substantial improvements in scaling times for generative AI model endpoints across various frameworks. The implementation of Container Caching for running the Llama 3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. This enhancement allows customers running high-throughput production workloads to handle sudden traffic spikes more efficiently, providing more predictable scaling behavior and minimal impact on end-user latency across their ML infrastructure, regardless of the chosen inference framework.

In this post, we explore the new Container Caching feature for SageMaker inference, addressing the challenges of deploying and scaling large language models (LLMs). We discuss how this innovation significantly reduces container download and load times during scaling events, a major bottleneck in LLM and generative AI inference. You’ll learn about the key benefits of Container Caching, including faster scaling, improved resource utilization, and potential cost savings. We showcase its real-world impact on various applications, from chatbots to content moderation systems. We then guide you through getting started with Container Caching, explaining its automatic enablement for SageMaker provided DLCs and how to reference cached versions. Finally, we delve into the supported frameworks, with a focus on LMI, PyTorch, Hugging Face TGI, and NVIDIA Triton, and conclude by discussing how this feature fits into our broader efforts to enhance machine learning (ML) workloads on AWS.

This feature is only supported when using inference components. For more information on inference components, see Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

The challenge of deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute resources
  • Downloading container image
  • Loading container image
  • Loading the model weights into memory
  • Initializing the inference runtime
  • Shifting traffic to serve new requests

The cumulative time for these steps can range from several minutes to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during traffic spikes, making it a critical area for optimization in the field of AI inference infrastructure.

The introduction of Container Caching for SageMaker DLCs brings several key benefits for inference workloads:

  • Faster scaling – By having the latest DLCs pre-cached, the time required to scale inference endpoints in response to traffic spikes is substantially reduced. This provides a more consistent and responsive experience for inference hosting, allowing systems to adapt quickly to changing demand patterns. ML engineers can now design more aggressive auto scaling policies, knowing that new instances can be brought online in a fraction of the time previously required.
  • Quick endpoint startup – Using pre-cached containers significantly decreases the startup time for new model deployments. This acceleration in the deployment pipeline enables more frequent model updates and iterations, fostering a more agile development cycle. AI and ML engineers can now move from model training to production deployment with unprecedented speed, reducing time-to-market for new AI features and improvements.
  • Improved resource utilization – Container Caching minimizes idle time on expensive GPU instances during the initialization phase. Instead of waiting for container downloads, these high-performance resources can immediately focus on inference tasks. This optimization provides more efficient use of computational resources, potentially allowing for higher throughput and better cost-effectiveness.
  • Cost savings – The cumulative effect of faster deployments and more efficient scaling can lead to significant reductions in overall inference costs. By minimizing idle time and improving resource utilization, organizations can potentially serve the same workload with fewer instances or handle increased demand without proportional increases in infrastructure costs. Additionally, the improved responsiveness can lead to better user experiences, potentially driving higher engagement and revenue in customer-facing applications.
  • Enhanced compatibility – By focusing on the latest SageMaker DLCs, this caching mechanism makes sure users always have quick access to the most recent and optimized environments for their models. This can be particularly beneficial for teams working with cutting-edge AI technologies that require frequent updates to the underlying frameworks and libraries.

Container Caching represents a significant advancement in AI inference infrastructure. It addresses a critical bottleneck in the deployment process, empowering organizations to build more responsive, cost-effective, and scalable AI systems.

Getting started with Container Caching for inference

Container Caching is automatically enabled for popular SageMaker DLCs like LMI, Hugging Face TGI, NVIDIA Triton, and PyTorch used for inference. To use cached containers, you only need to make sure you’re using a supported SageMaker container. No additional configuration or steps are required.

The following table lists the supported DLCs.

SageMaker DLC Starting Version Starting Container
LMI 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
LMI-TRT 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124
LMI-Neuron 0.29.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.29.0-neuronx-sdk2.19.1
TGI-GPU 2.4.0 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0
TGI-Neuron 2.1.2 763104351884.dkr.ecr.us-west-2.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.25-neuronx-py310-ubuntu22.04-v1.0
Pytorch-GPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker
Pytorch-CPU 2.5.1 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference:2.5.1-cpu-py311-ubuntu22.04-sagemaker
Triton 24.09 763104351884.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.09-py3
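The image URIs in the table are region-specific: the repository and tag stay the same while the ECR host changes with the Region. A small helper for building the TGI-GPU URI from the table is sketched below; note that the 763104351884 registry account covers most commercial Regions, but a few Regions use a different account, so verify against the available-images list before relying on it:

```python
# Tag taken from the TGI-GPU row of the table above.
TGI_GPU_TAG = "2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0"

def tgi_image_uri(region, account="763104351884"):
    """Build the region-specific TGI-GPU DLC image URI.

    The default account applies to most commercial Regions; pass a different
    account for Regions that use another DLC registry.
    """
    return (
        f"{account}.dkr.ecr.{region}.amazonaws.com/"
        f"huggingface-pytorch-tgi-inference:{TGI_GPU_TAG}"
    )
```

For example, `tgi_image_uri("us-west-2")` reproduces the TGI-GPU entry shown in the table.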

In the following sections, we discuss how to get started with several popular SageMaker DLCs.

Hugging Face TGI

Developed by Hugging Face, TGI is an inference framework for deploying and serving LLMs, offering a purpose-built solution that combines security, performance, and ease of management. TGI is specifically designed to deliver high-performance text generation through advanced features like tensor parallelism and continuous batching. It supports a wide range of popular open source LLMs, making it a popular choice for diverse AI applications. What sets TGI apart is its optimization for both NVIDIA GPUs and AWS accelerators with AWS Inferentia and AWS Trainium, providing optimal performance across different hardware configurations.

With the introduction of Container Caching, customers using the latest release of TGI containers on SageMaker will experience improved scaling performance. The caching mechanism works automatically, requiring no additional configuration or code changes. This seamless integration means that organizations can immediately benefit from faster scaling without any operational overhead.

Philipp Schmid, Technical Lead at Hugging Face, shares his perspective on this enhancement: “Hugging Face TGI containers are widely used by SageMaker inference customers, offering a powerful solution optimized for running popular models from the Hugging Face Hub. We are excited to see Container Caching speed up auto scaling for users, expanding the reach and adoption of open models from Hugging Face.”

You can use Container Caching with Hugging Face TGI using the following code:

# Using Container Caching for Hugging Face TGI
# Create an inference component with the Hugging Face TGI image
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/huggingface-pytorch-tgi-inference:2.4.0-tgi2.4.0-gpu-py311-cu124-ubuntu22.04-v2.0",
    model_url="s3://path/to/your/model/artifacts"
)

Note: SageMaker caches the latest version of currently maintained images. For the full list, see https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only.

NVIDIA Triton

NVIDIA Triton Inference Server is a model server from NVIDIA that supports multiple deep learning frameworks and model formats. On SageMaker, Triton offers a comprehensive serving stack with support for various backends, including TensorRT, PyTorch, Python, and more. Triton is particularly powerful because of its ability to optimize inference across different hardware configurations while providing features like dynamic batching, concurrent model execution, and ensemble models. The Triton architecture enables efficient model serving through features like multi-framework support, optimized GPU utilization, and flexible model management.

With Container Caching, Triton deployments on SageMaker become even more efficient, especially when scaling large-scale inference workloads. This is particularly beneficial when deploying multiple models using Triton’s Python backend or when running model ensembles that require complex preprocessing and postprocessing pipelines. This improves the deployment and scaling experience for Triton workloads by eliminating the need to repeatedly download container images during scaling events.

Eliuth Triana, Global Lead Amazon Developer Relations at NVIDIA, comments on this enhancement:

“The integration of Container Caching with NVIDIA Triton Inference Server on SageMaker represents a significant advancement in serving machine learning models at scale. This feature perfectly complements Triton’s advanced serving capabilities by reducing deployment latency and optimizing resource utilization during scaling events. For customers running production workloads with Triton’s multi-framework support and dynamic batching, Container Caching provides faster response to demand spikes while maintaining Triton’s performance optimizations.”

To use Container Caching with NVIDIA Triton, use the following code:

# Using Container Caching for Triton
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:24.09-py3",
    model_url="s3://path/to/your/model/artifacts"
)

PyTorch and TorchServe (now with vLLM engine integration)

The SageMaker Deep Learning Container for PyTorch is powered by TorchServe. It offers a comprehensive solution for deploying and serving PyTorch models, including LLMs, in production environments. TorchServe provides robust model serving capabilities through HTTP REST APIs, flexible configuration options, and performance optimization features such as server-side batching, multi-model serving, and dynamic model loading. The container supports a wide range of models and advanced features, including quantization and parameter-efficient methods like LoRA.

The latest version of the PyTorch container uses TorchServe integrated with the vLLM engine, taking advantage of vLLM’s state-of-the-art inference engine with PagedAttention and continuous batching. It supports single-node, multi-GPU distributed inference, enabling tensor parallel sharding for larger models. The integration of Container Caching significantly reduces scaling times, which is particularly beneficial for large models during auto scaling events. TorchServe’s handler system allows for easy customization of pre- and post-processing logic, making it adaptable to various use cases. With its growing feature set, TorchServe is a popular choice for deploying and scaling machine learning models among inference customers.

You can use Container Caching with PyTorch using the following code:

# Using Container Caching for PyTorch
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-inference:2.5.1-gpu-py311-cu124-ubuntu22.04-sagemaker",
    model_url="s3://path/to/your/model/artifacts"
)

LMI container

The Large Model Inference (LMI) container is a high-performance serving solution that can be used through a no-code interface with smart defaults that can be extended to fit your unique needs. LMI delivers performance differentiation through advanced optimizations, outpacing open source backends like vLLM, TensorRT-LLM, and Transformers NeuronX while offering a unified UI.

Popular features such as continuous batching, token streaming, and speculative decoding are available out of the box to provide superior throughput, latency, and scalability. LMI supports a wide array of use cases like multi-node inference and model personalization through LoRA adapters, and performance optimizations like quantization and compilation.

With Container Caching, LMI containers deliver even faster scaling capabilities, particularly beneficial for large-scale LLM deployments where container startup times can significantly impact auto scaling responsiveness. This enhancement works seamlessly across all supported backends while maintaining the container’s advanced features and optimization capabilities.

Contributors of LMI containers comment on this enhancement:

“The addition of Container Caching to LMI containers represents a significant step forward in making LLM deployment more efficient and responsive. This feature complements our efforts to speed up model loading through pre-sharding, weight streaming, and compiler caching, enabling customers to achieve both high-performance inference and rapid scaling capabilities, which is crucial for production LLM workloads.”

To use Container Caching with LMI, use the following code:

# Using Container Caching for LMI
create_inference_component(
    image="763104351884.dkr.ecr.<region>.amazonaws.com/djl-inference:0.30.0-lmi12.0.0-cu124",
    model_url="s3://path/to/your/model/artifacts"
)

Performance evaluation

The implementation of Container Caching for running the Llama 3.1 70B model showed significant and consistent improvements in end-to-end (E2E) scaling times. We ran 5+ scaling simulations and observed consistent performance with low variations across trials. When scaling the model on an available instance, the E2E scaling time was reduced from 379 seconds (6.32 minutes) to 166 seconds (2.77 minutes), resulting in an absolute improvement of 213 seconds (3.55 minutes), or a 56% reduction in scaling time. For the scenario of scaling the model by adding a new instance, the E2E scaling time decreased from 580 seconds (9.67 minutes) to 407 seconds (6.78 minutes), yielding an improvement of 172 seconds (2.87 minutes), which translates to a 30% reduction in scaling time. These results demonstrate that Container Caching substantially and reliably enhances the efficiency of model scaling operations, particularly for large language models like Llama 3.1 70B, with more pronounced benefits observed when scaling on existing instances.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models.

The following table summarizes our setup.

Region us-east-2 (CMH)
Instance Type p4d.24xlarge
Container LMI V13.31
Container Image 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model Llama 3.1 70B

Scaling the model by adding a new instance

For this scenario, we explore scaling the model by adding a new instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 223 339 602
2 40 203 339 582
3 40 175 339 554
4 40 210 339 589
5 40 191 339 570
Average 40 200 339 580

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 185 173 398
2 40 175 188 403
3 40 164 208 412
4 40 185 187 412
5 40 185 187 412
Average 40 178.8 188.6 407.4

Scaling the model on an available instance

In this scenario, we explore scaling the model on an available instance.

The following table summarizes the results when containers are not cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 339 379
2 40 339 379
3 40 339 379
4 40 339 379
5 40 339 379
Average 40 339 379

The following table summarizes the results after containers are cached.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Instantiate a New Model Copy End-to-End Scaling Latency
1 40 150 190
2 40 122 162
3 40 121 161
4 40 119 159
5 40 119 159
Average 40 126.2 166.2
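The averages in the preceding table follow directly from the trial data (a 40-second detection time plus the per-trial model copy instantiation times):

```python
# Trial data from the cached-container, available-instance table above.
detect_seconds = 40
instantiate_trials = [150, 122, 121, 119, 119]

# Average time to instantiate a new model copy, and the resulting E2E latency.
avg_instantiate = sum(instantiate_trials) / len(instantiate_trials)  # 126.2
avg_e2e = detect_seconds + avg_instantiate                           # 166.2
```

The same arithmetic reproduces the averages in the other benchmark tables.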

Summary of findings

The following table summarizes our results in both scenarios.

Scenario End-to-End Scaling Time Before (s) End-to-End Scaling Time After (s) Absolute Improvement (s) % Improvement
Scaling the model on an available instance 379 166 213 56
Scaling the model by adding a new instance 580 407 172 30

Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may see faster instance spin-up times than with on-demand capacity, depending on the instance type.

Conclusion

Container Caching for inference is just one of the many ways SageMaker can improve the efficiency and performance of ML workloads on AWS. We encourage you to try out this new feature for your inference workloads and share your experiences with us. Your feedback is invaluable as we continue to innovate and improve our ML platform.

To learn more about Container Caching and other SageMaker features for inference, refer to Amazon SageMaker Documentation or check out our GitHub repositories for examples and tutorials on deploying models for inference.


About the Authors

Wenzhao Sun, PhD, is a Sr. Software Dev Engineer with the SageMaker Inference team. He possesses a strong passion for pushing the boundaries of technical solutions, striving to maximize their theoretical potential. His primary focus is on delivering secure, high-performance, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys traveling and video games.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Aakash Deep is a Software Development Engineering Manager with the Amazon SageMaker Inference team. He enjoys working on machine learning and distributed systems. His mission is to deliver secure, highly performant, highly scalable, and user-friendly machine learning features for AWS customers. Outside of work, he enjoys hiking and traveling.

Anisha Kolla is a Software Development Engineer with the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.

Read More

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1


The generative AI landscape has been rapidly evolving, with large language models (LLMs) at the forefront of this transformation. These models have grown exponentially in size and complexity, with some now containing hundreds of billions of parameters and requiring hundreds of gigabytes of memory. As LLMs continue to expand, AI engineers face increasing challenges in deploying and scaling these models efficiently for inference. One of the primary bottlenecks in the inference deployment process has been the time required to load these massive models onto accelerators. With LLMs now reaching hundreds of gigabytes in size, it has become increasingly difficult for many users to address bursty traffic patterns and scale quickly. For LLMs that often require high throughput and low-latency inference requests, this loading process can add significant overhead to the total deployment and scaling time, potentially impacting application performance during traffic spikes. SageMaker Large Model Inference (LMI) is a deep learning container that helps customers quickly get started with LLM deployments on SageMaker Inference.

Today at AWS re:Invent 2024, we are excited to announce a new capability in Amazon SageMaker Inference that significantly reduces the time required to deploy and scale LLMs for inference using LMI: Fast Model Loader. This innovation allows you to scale your models faster, with an observed reduction of up to 19% in the latency of scaling a new model copy on a new instance for inference. It represents a substantial leap forward in loading large models efficiently. Fast Model Loader introduces a novel approach by streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster model loading.

In our internal testing, we observed that Fast Model Loader can load large models up to 15 times faster compared to the traditional loading methods. This dramatic improvement in loading speed opens up new possibilities for responsive AI systems, potentially enabling faster scaling and more dynamic applications that can adapt quickly to changing demands. During our performance testing, we were able to load the Llama 3.1 70B model on an ml.p4d.24xlarge instance in just 1 minute. This model, with its 70 billion parameters, typically requires over 140 GB of memory in full precision, underscoring the magnitude of the loading challenge that Fast Model Loader addresses.

Fast Model Loader is designed to tackle scaling challenges, potentially leading to improved resource utilization on GPU instances and more efficient scaling during autoscaling events. This feature aims to provide you with a powerful new option for managing the deployment and scaling of your LLMs on SageMaker inference, whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM-based services.

This post is Part 1 of a series exploring Fast Model Loader. In this post, we delve into the technical details of Fast Model Loader, explore its integration with existing SageMaker workflows, discuss how you can get started with this powerful new feature, and share customer success stories. In Part 2, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments.

Challenges in deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute instances
  • Downloading the container image
  • Loading the container image
  • Downloading the model artifacts from Amazon S3 to disk
  • Loading the model artifacts on the host (using CPU and memory)
  • Preparing the model to be loaded on GPU (quantization, model sharding, and so on)
  • Loading the final model artifacts on the GPU

The cumulative time for these steps can take up to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during scaling activities, making it a critical area for optimization in the field of AI inference infrastructure.
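The cumulative effect of these steps can be sketched with a back-of-the-envelope model. All per-step numbers below are made up for illustration; real values depend on the model size, runtime, and infrastructure:

```python
# Hypothetical per-step latencies (seconds) for scaling up one LLM serving
# instance. These numbers are illustrative only; actual values vary widely.
scale_up_steps = {
    "provision_instance": 180,
    "download_container_image": 120,
    "load_container_image": 60,
    "download_model_to_disk": 240,
    "load_model_on_host": 90,
    "prepare_model_for_gpu": 120,  # quantization, sharding, and so on
    "load_model_on_gpu": 60,
}

total_s = sum(scale_up_steps.values())
print(f"End-to-end scale-up: {total_s} s ({total_s / 60:.1f} minutes)")
```

With these illustrative numbers, the total comes to 870 seconds (14.5 minutes), on the order of the tens of minutes described above; note that the steps are additive because each one blocks the next.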

To reduce the time it takes to download and load the container image, SageMaker now supports container caching. To learn more about this new feature, refer to Supercharge your auto scaling for generative AI inference – Introducing Container Caching in SageMaker Inference.

For model loading, a typical deployment follows the steps described in this section, which can lead to non-ideal deployment latency. This can leave requests sitting in the queue waiting to be processed while the deployment concludes, or can result in dropped requests when timeouts are exceeded, as shown in the following diagrams.

One way to optimize deployment is ahead-of-time (AOT) compilation of your model. This requires you to create or use existing pre-sharded models to avoid processing the model during runtime deployment. By taking on the cost of pre-creating these artifacts and referencing them as persisted objects, you pay that latency ahead of time. This can significantly reduce the time it takes to scale up a model, especially if it's larger in size.

The benefits of this approach are particularly noticeable for larger models:

  • Reduced scaling time – Pre-sharded models can be loaded more quickly, decreasing the time required to bring new instances online during scaling events
  • Improved resource utilization – By offloading the compilation and sharding process, more computational resources are available for inference tasks during runtime
  • Consistency – Pre-compiled artifacts provide consistent performance across deployments

Although there is an upfront cost in creating these artifacts, the long-term savings in reduced scaling times and improved resource utilization can be substantial, especially for models that are frequently deployed or require rapid scaling. This approach can significantly reduce the time it takes to scale up a model, particularly for larger models, leading to more responsive and efficient AI systems. The following figures illustrate the proposed way to load models.

Additionally, disk becomes a bottleneck during model loading due to its limited I/O bandwidth. Traditional storage systems struggle with the high throughput required for large-scale model loading, like Meta Llama 3.1 70B. Disk read/write speeds are often much slower than network or GPU memory bandwidths, creating delays in transferring model weights. This issue can be alleviated by streaming data directly from Amazon S3 to GPU memory, bypassing disk entirely.

With Fast Model Loader, we can now take a significant step forward by also addressing the work performed on host resources and the sequential steps between downloading the model artifacts and loading them onto the GPU.

Weight streaming

Fast Model Loader streams weights directly from Amazon S3 to GPUs. This is accomplished by cutting out the intermediary steps: the bytes representing model weights are downloaded to CPU memory and immediately copied over to the GPU using Direct Memory Access (DMA). This simplifies the model loading workflow and makes it straightforward to maximize the model loading throughput. It presents the following key advantages:

  • No waiting – In the traditional approach, each step in the loading process (download, load to host’s CPU, GPU copy) needs to complete for a tensor or a layer before the next step can begin. This creates synchronous bottlenecks, where components are idle while waiting for the previous step to finish. Fast Model Loader’s direct streaming approach eliminates these synchronous blocking operations, allowing all components to operate at their maximum potential concurrently.
  • No accumulation – Instead of downloading the entire model to disk or CPU memory before processing, Fast Model Loader streams the model weights in small chunks directly to the GPU. This avoids the need to accumulate the full model in system storage or memory, reducing the overall resource requirements and footprint.
  • Maximum throughput – By simplifying the model loading workflow and eliminating intermediate steps, Fast Model Loader can more effectively take advantage of the high-throughput capabilities of Amazon S3 and the generous network bandwidth available on the large instances typically used for hosting LLMs. This allows the model loading process to achieve maximum throughput and minimize latency.

The following figure compares model load times for sequential vs. parallel processes.
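A toy latency model illustrates why removing these synchronous barriers matters. The function names and per-chunk timings below are invented for this sketch and are not SageMaker internals:

```python
def staged_load_time(n_chunks, download_s, host_load_s, gpu_copy_s):
    """Traditional path: every stage for a chunk blocks the next, with no overlap."""
    return n_chunks * (download_s + host_load_s + gpu_copy_s)

def streamed_load_time(n_chunks, download_s, host_load_s, gpu_copy_s):
    """Pipelined path: stages overlap across chunks, so total time approaches
    the pipeline fill for the first chunk plus the slowest stage per chunk."""
    fill = download_s + host_load_s + gpu_copy_s
    bottleneck = max(download_s, host_load_s, gpu_copy_s)
    return fill + (n_chunks - 1) * bottleneck

# ~140 GB of weights in 8 MB chunks is 17,920 chunks; per-chunk times are made up
staged = staged_load_time(17920, 0.010, 0.002, 0.003)
streamed = streamed_load_time(17920, 0.010, 0.002, 0.003)
print(f"staged: {staged:.0f} s, streamed: {streamed:.0f} s")
```

In this model, the pipelined path is bounded by the slowest stage rather than the sum of all stages, which is why overlapping downloads and GPU copies pays off as chunk counts grow.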

Model sharding for streaming

The weight streaming paradigm described in the previous section requires that the model weights be prepared appropriately prior to streaming. In order to stream the model weights a few bytes at a time, we need to store the model in a format consistent with our expectation.

The traditional approach to storing and distributing LLM weights often relies on the SafeTensors format. Although SafeTensors provides a standardized way to package and distribute model weights, it presents some challenges when it comes to the weight streaming paradigm used by Fast Model Loader. In the SafeTensors format, the fundamental unit of storage is the tensor. Tensors are multi-dimensional arrays that represent the various weights and parameters of a machine learning model. However, the size of these tensors can vary significantly, ranging from a few megabytes to several gigabytes, depending on the complexity and scale of the model. This non-uniform distribution of tensor sizes poses a problem for Fast Model Loader’s weight streaming approach. The variable tensor sizes in the SafeTensors format make it difficult to achieve consistent throughput. Larger tensors require more time and resources to load, whereas smaller tensors are underutilized, leading to inefficiencies in the overall loading process.

The following figure illustrates loading SafeTensors weights of various sizes.

Fast Model Loader introduces a new model format with the following key advantages:

  • Pre-sharding – The explosion in model sizes has seen them outgrow GPUs. The largest models available today are over 1 TB in size, whereas the largest GPUs fall short of 200 GB of memory. This has led us to embrace distributed inference strategies like tensor parallelism, which involves splitting a model into portions (shards) and distributing them across multiple GPUs. However, this requires quite a few computations in deciding how to split the model at every layer and calculating offsets based on tensor size and available GPU memory. Fast Model Loader performs this optimization pre-deployment, which avoids the overhead during scaling activities. The preparation only happens one time, and the model can be deployed to any number of instances with the same distributed inference strategy. The following figure provides an overview of pre-sharding.

  • Uniform size distribution – The model weights are stored in uniform 8 MB chunks, which are less complicated to parallelize for concurrent processing. The following figure illustrates uniform chunks being parallelized across cores.

  • Out of order processing – Objects in Amazon S3 typically have to be downloaded in order. To read the middle of an object, a download typically starts at the beginning of the object and proceeds until it reaches the middle. This requires model weights to be downloaded synchronously, which runs contrary to our fast model loading paradigm. Storing model weights in uniform chunks of 8 MB each allows you to access any piece of the model at any time without synchronization. The following figures illustrate how breaking tensors into chunks allows for asynchronous, out of order retrieval.
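The chunked layout can be sketched in a few lines. The 8 MB chunk size comes from the description above; the helper function and the 140 GB example size are illustrative:

```python
CHUNK_SIZE = 8 * 1024 * 1024  # uniform 8 MB chunks, as described above

def chunk_ranges(total_size, chunk_size=CHUNK_SIZE):
    """Inclusive (start, end) byte ranges covering an object of total_size bytes."""
    return [
        (start, min(start + chunk_size, total_size) - 1)
        for start in range(0, total_size, chunk_size)
    ]

# Roughly 140 GB of full-precision Llama 3.1 70B weights split into 8 MB chunks
ranges = chunk_ranges(140 * 1024**3)

# Each chunk can be fetched independently and out of order; indexing results
# by byte offset means reassembly never depends on arrival order.
received = {start: b"" for start, _end in reversed(ranges)}
```

Because every chunk has the same size, its byte offset is a simple multiple of 8 MB, so any piece of the model can be located and retrieved without reading what comes before it.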

Performance testing

The implementation of Fast Model Loader demonstrates significant improvements in end-to-end (E2E) scaling time for large language models. Across five simulations performed with Llama 3.1 70B, we observed remarkably consistent results, reinforcing the reliability of our findings. For these tests, we used a container with container caching enabled. When using CUDA Graphs, the average scaling time was reduced from 407 seconds (6.78 minutes) to 334.6 seconds (5.58 minutes), a 17.79% improvement. Similarly, without CUDA Graphs, the average scaling time decreased from 379 seconds (6.32 minutes) to 306 seconds (5.10 minutes), a 19.26% reduction. By cutting scaling times by nearly one-fifth in all observed cases, this feature enables more responsive scaling and better handling of dynamic workloads, ultimately leading to improved performance and resource utilization in AI inference.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models and Container Caching.

The following table summarizes our setup.

Region CMH
Instance Type p4d.24xlarge
Container LMI
Container Image 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model Meta Llama3.1 70B

For this scenario, we illustrate scaling the model by adding a new instance.
The following table summarizes the results when the models are not sharded.
** All numbers presented here are in seconds.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy CUDA Graphs Capture Overhead E2E Scaling Latency
. . . . . With CUDA Graphs Without CUDA Graphs
1 40 185 173 28 398 370
2 40 175 188 29 403 374
3 40 164 208 29 412 383
4 40 185 187 30 412 382
5 40 185 187 28 412 384
Average . 179 189 29 407 379

The following table summarizes the results after the models are sharded.
** All numbers presented here are in seconds.

Meta Llama 3.1 70B
Trial Time to Detect Need for Scaling Time to Spin Up an Instance Time to Instantiate a New Model Copy CUDA Graphs Capture Overhead E2E Scaling Latency
. . . . . With CUDA Graphs Without CUDA Graphs
1 40 185 119 28 344 316
2 40 175 119 30 334 304
3 40 169 119 28 328 300
4 40 169 120 28 329 301
5 40 179 119 29 338 309
Average . 175.4 119.2 28.6 334.6 306

The following diagram summarizes the impact on E2E scaling time.
** All numbers presented here are in seconds.

.. Before After % Improvements
Scaling with CUDA Graphs 407 334.6 17.79%
Scaling without CUDA Graphs 379 306 19.26%
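As a quick arithmetic check, the improvement figures can be recomputed from the table averages, taking each reduction relative to the before value:

```python
# Average E2E scaling times (seconds) from the tables above
before = {"with_cuda_graphs": 407.0, "without_cuda_graphs": 379.0}
after = {"with_cuda_graphs": 334.6, "without_cuda_graphs": 306.0}

# Reduction relative to the before value
reductions = {
    mode: (before[mode] - after[mode]) / before[mode] * 100
    for mode in before
}
for mode, pct in reductions.items():
    print(f"{mode}: {pct:.2f}% reduction in E2E scaling time")
```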

Note: Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may experience lower times to spin up new instances compared to on-demand capacity, depending on the instance type.

Impact on ready-to-serve time

The benchmarks below show that SageMaker Fast Model Loader can load large models significantly faster than traditional methods. For the Llama 3.1 70B model on an ml.p4d.24xlarge instance, we compared the download and load times against two traditional methods: downloading the model from the Hugging Face Hub using transformers, and downloading the model from Amazon S3 using vLLM's default downloader. In both cases, we used vLLM's default loader to load the model after the download.

** All numbers presented here are in seconds.

. Download Load % Improvement with Fast Model Loader Speedup with Fast Model Loader
Transformers Downloader + vLLM Model Loader 602 138 93.24% 15x
vLLM Downloader + vLLM Model Loader 127 138 81.13% 5x
Fast Model Loader 50 . .

The load time here indicates the time taken to get the model fully ready to serve, including time taken to initialize the KV cache.
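As a quick check, the totals and speedups in this table follow from the raw download and load times:

```python
# Total ready-to-serve time = download time + load time, in seconds
traditional_transformers = 602 + 138  # Hugging Face Hub download + vLLM load
traditional_vllm = 127 + 138          # vLLM S3 download + vLLM load
fast_model_loader = 50                # streams and loads in one step

for name, total in [
    ("transformers + vLLM", traditional_transformers),
    ("vLLM downloader + vLLM", traditional_vllm),
]:
    speedup = total / fast_model_loader
    improvement = (total - fast_model_loader) / total * 100
    print(f"{name}: {total} s total, {speedup:.1f}x vs Fast Model Loader, "
          f"{improvement:.2f}% improvement")
```

The results, 14.8x and 5.3x, correspond to the roughly 15x and 5x speedups cited above.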

How to get started

You can start using Fast Model Loader now through the Amazon SageMaker Studio console using Amazon SageMaker JumpStart or programmatically using the SageMaker Python SDK.

From the SageMaker Studio JumpStart hub, you can pick a model and choose Optimize to run the inference optimization job and then deploy the optimized model to a SageMaker endpoint. For more detailed instructions, refer to Part 2 of this post.

Though SageMaker Studio provides a user-friendly interface for model optimization through SageMaker JumpStart, you can also achieve the same functionality programmatically using the SageMaker Python SDK. The ModelBuilder class offers a streamlined way to optimize and deploy large models, requiring just a few lines of code to prepare your model for fast loading and inference. The following code snippet shows the core implementation to use ModelBuilder to prepare and optimize the model for Fast Model Loader. You can find an end-to-end example notebook in the following GitHub repo.

# Create a model builder object
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test")
)
# Run the model optimization job
model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)

Customer testimonials

The introduction of Fast Model Loader in SageMaker has generated significant excitement among our customers, particularly those working with LLMs. We’ve collected early feedback from customers that have had the opportunity to preview this new capability. Their responses underscore the potential of Fast Model Loader to transform the deployment and scaling of AI models, especially in scenarios requiring rapid response to changing demands.

Atomicwork is a modern ITSM and ESM solution that revolutionizes internal support for organizations through AI-powered chat interfaces, replacing traditional ticketing portals.

“Amazon SageMaker Fast Model Loader is a game changer for our AI-driven enterprise workflows. It significantly accelerates the deployment and scaling of the large language models, which are critical for providing responsive, chat-based IT support, HR processes, and customer service operations. We look forward to adopting this feature that allows us to optimize our computational resources while maintaining the agility our enterprise customers expect, helping us deliver a truly intelligent service management platform.”

– Kiran Darisi, Co-founder and CTO of Atomicwork.

Conclusion

In this post, we discussed how loading large model artifacts can be the bottleneck in loading and scaling foundation models (FMs). SageMaker has launched a new feature called Fast Model Loader to address challenges in deploying and scaling FMs for inference. Fast Model Loader can load large models up to 15 times faster by streaming model weights directly from Amazon S3 to the accelerator, reducing scaling and deployment times significantly.

In Part 2 of this post, we demonstrate how you can try out this new feature through either the SageMaker Python SDK or SageMaker Studio console.


About the Authors

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Anisha Kolla is a Software Development Engineer with the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.

Read More

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2


In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models onto accelerators. By streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader can achieve up to 15 times faster loading times compared to traditional methods.

As the AI landscape continues to evolve and models grow even larger, innovations like Fast Model Loader become increasingly crucial. By significantly reducing model loading times, this feature has the potential to transform the way you deploy and scale your LLMs, enabling more responsive and efficient AI applications across a wide range of use cases.

In this post, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments. We explore two approaches: using the SageMaker Python SDK for programmatic implementation, and using the Amazon SageMaker Studio UI for a more visual, interactive experience. Whether you’re a developer who prefers working with code or someone who favors a graphical interface, you’ll learn how to take advantage of this powerful feature to accelerate your LLM deployments.

Solution overview

Fast Model Loader is currently integrated with SageMaker Large Model Inference (LMI) containers (starting with v13) for GPU instances. It introduces two key techniques to enable lightning-fast model loads:

  • Weight streaming
  • Model sharding for streaming

Use Fast Model Loader with the SageMaker Python SDK

In this section, we show how to use the SageMaker Python SDK to use this new feature. You can find the example notebook in the following GitHub repo. Complete the following steps:

  1. First, use ModelBuilder to prepare and package the model inference components.

To learn more about the ModelBuilder class, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. In this example, you deploy the Meta Llama 3.1 70B model with the model name meta-textgeneration-llama-3-1-70b in Amazon SageMaker JumpStart.

The SchemaBuilder parameter is used to infer the serialization and deserialization methods for the model. For more information on SchemaBuilder, refer to Define serialization and deserialization methods.

You can choose to specify OPTION_TENSOR_PARALLEL_DEGREE as a ModelBuilder environment variable as shown in the following commented lines, or in the next step as part of the ModelBuilder sharding_config:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

# Define sample input and output for the model
prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."
# Create the input schema structure
sample_input = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 32}
}
# Define the expected output format
sample_output = [{"generated_text": response}]

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
    #env_vars={
    #   "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
)
  2. Next, use the optimize() function to prepare the model shards for deployment.

The optimize() function will start a model optimization job and will take a few minutes to finish. The tensor parallel degree should be set to how many GPUs you want each inference component to have access to. You can find the model shards at the output_path S3 location under a folder starting with sagemaker-fast-model-loader-xxx.

model_builder.optimize(
    instance_type="ml.p4d.24xlarge", 
    accept_eula=True, 
    output_path=output_path, 
    sharding_config={
            "OverrideEnvironment": {
            # The value must be equal to the subsequent number of GPUs that will be used for each IC. 
                "OPTION_TENSOR_PARALLEL_DEGREE": "8"
            }
    }
)

You can reuse the sharded model that was generated by previous optimization jobs. The following code sample demonstrates how to use model_metadata to overwrite the model path, which needs to point to the Amazon S3 location of the existing model shards:

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    model_metadata={
        "CUSTOM_MODEL_PATH": output_path,
    },
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
    role_arn=role,
    instance_type="ml.p4d.24xlarge",
)
  3. When the model optimization job is complete, you can use the build() function to generate the artifacts according to the model server:
    # use the build() function to generate the artifacts according to the model server
    final_model = model_builder.build()

  4. If you’re using existing model shards without running an optimization job, you need to make sure the _is_sharded_model value is set to True and EnableNetworkIsolation is set to False, because Fast Model Loader requires network access:
    # You only need to set these values if you are using existing sharded models
    if not final_model._is_sharded_model:
        final_model._is_sharded_model = True
    if final_model._enable_network_isolation:
        final_model._enable_network_isolation = False

  5. Use the deploy() function to deploy the model to an endpoint, where you can specify the required resources, such as GPU memory and number of accelerators:
    from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
    
    resources_required = ResourceRequirements(
        requests={
            "memory" : 204800,
            "num_accelerators": 8
        }
    )
    
    # deploy the optimized model to an endpoint
    final_model.deploy(
        instance_type="ml.p4d.24xlarge", 
        accept_eula=True, 
        endpoint_logging=False, 
        resources=resources_required
    )

  6. After the endpoint is up and running, you can test the endpoint using the following code example:
    from sagemaker.predictor import retrieve_default 
    endpoint_name = final_model.endpoint_name 
    predictor = retrieve_default(endpoint_name) 
    payload = { "inputs": "I believe the meaning of life is", 
                "parameters": { 
                    "max_new_tokens": 64, 
                    "top_p": 0.9, 
                    "temperature": 0.6 
                } 
            }
    response = predictor.predict(payload) 
    print(response)

  7. To clean up, run the following code cell to delete the resources created for the endpoint:
    predictor.delete_predictor()
    predictor.delete_endpoint()

Use Fast Model Loader with SageMaker Studio

In this section, we show how to use the faster model loading feature through the SageMaker Studio UI. Complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Choose your model.
  3. On the model details page, choose Optimize.
  4. Accept the EULA and proceed to the optimization configurations.
  5. Select Fast model loading and set the OPTION_TENSOR_PARALLEL_DEGREE to 8, because this example uses an ml.p4d.24xlarge instance that has 8 GPUs. If you’re using an instance with a different number of GPUs, set the value to match the instance.
  6. Set the output path to the Amazon S3 path where the sharded model will be stored.
  7. Choose Create job.

After the inference optimization job starts, you can check the status of the job on the Inference optimization page. Here, each job has tags associated with it indicating which optimization configuration was used.

  8. View the details of the job by choosing the job ID.
  9. Deploy the optimized model by choosing Deploy on the optimized job page.
  10. Verify the endpoint settings and choose Deploy to initiate a SageMaker endpoint deployment.

You will get a notification on the SageMaker Studio UI, and the status will change to In service when the endpoint creation is complete.

You can now send a sample inference request to test the model.

After the test, you can delete the endpoint from the SageMaker Studio console to clean up the resources created in this example.

Conclusion

Fast Model Loader represents a significant advancement in how you can deploy and scale LLMs on SageMaker. In this post, we walked through the step-by-step process of implementing this feature through both the SageMaker Python SDK and SageMaker Studio UI. By using weight streaming and model sharding techniques, you can now achieve dramatically faster model loading times, enabling more responsive scaling for your LLM-based applications.

The integration with SageMaker LMI containers (starting from LMI v13) makes it straightforward to adopt this feature in your existing workflows. Whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM services, Fast Model Loader provides the tools you need to optimize your model deployment pipeline.

Try out Fast Model Loader for your own use case, and leave your feedback and questions in the comments.


About the Authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Read More

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon

Chronos-Bolt is the newest addition to AutoGluon-TimeSeries, delivering accurate zero-shot forecasting up to 250 times faster than the original Chronos models [1].

Time series forecasting plays a vital role in guiding key business decisions across industries such as retail, energy, finance, and healthcare. Traditionally, forecasting has relied on statistical models [2] like ETS and ARIMA, which remain strong baselines, particularly when training data is limited. Over the past decade, advancements in deep learning have spurred a shift toward so-called global models such as DeepAR [3] and PatchTST [4]. These approaches train a single deep learning model across multiple time series in a dataset—for example, sales across a broad e-commerce catalog or observability metrics for thousands of customers.

Foundation models (FMs) such as Chronos [1] have taken the idea of training a single model across multiple time series a significant step further. These models are pretrained on a vast corpus of real and synthetic time series data, covering diverse domains, frequencies, and history lengths. As a result, they enable zero-shot forecasting—delivering accurate predictions on unseen time series datasets. This lowers the entry barrier to forecasting and greatly simplifies forecasting pipelines by providing accurate forecasts without the need for training. Chronos models have been downloaded over 120 million times from Hugging Face and are available for Amazon SageMaker customers through AutoGluon-TimeSeries and Amazon SageMaker JumpStart.

In this post, we introduce Chronos-Bolt, our latest FM for forecasting that has been integrated into AutoGluon-TimeSeries.

Introducing Chronos-Bolt

Chronos-Bolt is based on the T5 encoder-decoder architecture [5] and has been trained on nearly 100 billion time series observations. It chunks the historical time series context into patches of multiple observations, which are then input into the encoder. The decoder then uses these representations to directly generate quantile forecasts across multiple future steps—a method known as direct multi-step forecasting. This differs from the original Chronos models that rely on autoregressive decoding. The chunking of time series and direct multi-step forecasting makes Chronos-Bolt up to 250 times faster and 20 times more memory-efficient than the original Chronos models.
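The patching idea can be illustrated with a conceptual sketch (this is not the model's actual code, and the patch size below is illustrative, not the value Chronos-Bolt uses):

```python
import numpy as np

# Conceptual sketch of input patching: chunk a historical context of 512
# observations into fixed-length patches that are fed to the encoder.
context = np.arange(512, dtype=float)
PATCH_SIZE = 32  # illustrative patch length, not the model's actual value
patches = context.reshape(-1, PATCH_SIZE)  # shape: (16, 32)

# With direct multi-step decoding, the decoder then emits quantile forecasts for
# all horizon steps at once, instead of autoregressing one step at a time.
print(patches.shape)
```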

The following plot compares the inference time of Chronos-Bolt against the original Chronos models for forecasting 1024 time series with a context length of 512 observations and a prediction horizon of 64 steps.

Inference speed comparison between Chronos and Chronos-Bolt

Chronos-Bolt models are not only significantly faster, but also more accurate than the original Chronos models. The following plot reports the probabilistic and point forecasting performance of Chronos-Bolt in terms of the Weighted Quantile Loss (WQL) and the Mean Absolute Scaled Error (MASE), respectively, aggregated over 27 datasets (see [1] for dataset details). Remarkably, despite having no prior exposure to these datasets during training, the zero-shot Chronos-Bolt models outperform commonly used statistical models and deep learning models that have been trained on these datasets (highlighted by *). Furthermore, they also perform better than other FMs, denoted by a +, which indicates that these models were pretrained on certain datasets in our benchmark and are not entirely zero-shot. Notably, Chronos-Bolt (Base) also surpasses the original Chronos (Large) model in terms of the forecasting accuracy while being over 600 times faster.

Zero-shot benchmark for Chronos-Bolt

Chronos-Bolt models are now available on Hugging Face in four sizes—Tiny (9M), Mini (21M), Small (48M), and Base (205M)—and can also be used on the CPU.

Solution overview

In this post, we showcase how to use Chronos-Bolt models using the familiar interface of AutoGluon-TimeSeries. AutoGluon-TimeSeries enables SageMaker customers to build and deploy models for time series forecasting, including FMs such as Chronos-Bolt and other global models, and effortlessly ensemble them with statistical models to maximize accuracy.

Perform zero-shot forecasting with Chronos-Bolt

To get started, you need to install AutoGluon v1.2 by running the following command in an Amazon SageMaker Studio notebook or in the terminal:

pip install autogluon.timeseries~=1.2.0

AutoGluon-TimeSeries uses the TimeSeriesDataFrame to work with time series datasets. The TimeSeriesDataFrame expects data in the long dataframe format with at least three columns: an ID column denoting the IDs of individual time series in the dataset, a timestamp column, and a target column that contains the raw time series values. The timestamps must be uniformly spaced; missing observations can be denoted by NaN, and Chronos-Bolt will handle them appropriately. The following snippet loads the Australian Electricity dataset [6], which contains electricity demand data at 30-minute intervals for five Australian states, into a TimeSeriesDataFrame:

from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

The next step involves fitting a TimeSeriesPredictor on this data:

predictor = TimeSeriesPredictor(prediction_length=48).fit(train_data, presets="bolt_base")

We have specified that the TimeSeriesPredictor should produce forecasts for the next 48 steps, or 1 day in this case. AutoGluon-TimeSeries offers various presets that can be used when fitting the predictor. The bolt_base preset, used in this example, employs the Base (205M) variant of Chronos-Bolt for zero-shot inference. Because no model fitting is required for zero-shot inference, the call to fit() returns almost instantaneously. The predictor is now ready to generate zero-shot forecasts, which can be done through the predict method:

predictions = predictor.predict(train_data)

AutoGluon-TimeSeries generates both point and probabilistic (quantile) forecasts for the target value. The probabilistic forecast captures the uncertainty of the target value, which is essential for many planning tasks.
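The predictions frame contains a mean column alongside quantile columns named by level (for example, "0.1" and "0.9"). The following sketch, with illustrative values rather than real model output, shows how an 80% prediction interval width can be derived from those columns:

```python
import pandas as pd

# Illustrative shape of the predictions frame: a "mean" column plus quantile
# columns named by level, e.g. "0.1" and "0.9" bound an 80% interval.
predictions = pd.DataFrame({"mean": [5.0, 5.2], "0.1": [4.1, 4.3], "0.9": [5.9, 6.1]})

# The width of the 80% prediction interval quantifies forecast uncertainty.
predictions["interval_width"] = predictions["0.9"] - predictions["0.1"]
print(predictions["interval_width"].tolist())
```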

We can also visualize the predictions and compare them against the ground truth target value over the forecast horizon:

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

predictor.plot(test_data, predictions, max_history_length=200, item_ids=["T000002"])

Chronos-Bolt generates an accurate zero-shot forecast, as shown in the following plot illustrating point forecasts and the 80% prediction intervals.

Forecasts Qualitative

Fine-tune Chronos-Bolt with AutoGluon

So far, we have used Chronos-Bolt in inference-only mode for zero-shot forecasting. However, AutoGluon-TimeSeries also allows you to fine-tune Chronos-Bolt on your specific datasets. We recommend using a GPU instance such as g5.2xlarge for fine-tuning. The following snippet specifies two settings for the Chronos-Bolt (Small, 48M) model: zero-shot and fine-tuned. AutoGluon-TimeSeries will perform a lightweight fine-tuning of the pretrained model on the provided training data. We add name suffixes to identify the zero-shot and fine-tuned versions of the model.

predictor = TimeSeriesPredictor(prediction_length=48, eval_metric="MASE").fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {"model_path": "bolt_small", "fine_tune": True, "ag_args": {"name_suffix": "FineTuned"}},
        ]
    },
    enable_ensemble=False,
    time_limit=600,
)

The predictor will be fitted for at most 10 minutes, as specified by the time_limit. After fitting, we can evaluate the two model variants on the test data and generate a leaderboard:

predictor.leaderboard(test_data)

Fine-tuning Leaderboard

Fine-tuning resulted in a significantly improved forecast accuracy, as shown by the test MASE scores. All AutoGluon-TimeSeries models report scores in a “higher is better” format, meaning that most forecasting error metrics like MASE are multiplied by -1 when reported.

Augment Chronos-Bolt with exogenous information

Chronos-Bolt is a univariate model, meaning it relies solely on the historical data of the target time series for making predictions. However, in real-world scenarios, additional exogenous information related to the target series (such as holidays or promotions) is often available. Using this information when making predictions can improve forecast accuracy. AutoGluon-TimeSeries now features covariate regressors, which can be combined with univariate models like Chronos-Bolt to incorporate exogenous information. A covariate regressor in AutoGluon-TimeSeries is a tabular regression model that is fit on the known covariates and static features to predict the target column at each time step. The predictions of the covariate regressor are subtracted from the target column, and the univariate model then forecasts the residuals.
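The residual mechanism can be sketched with a toy example (this illustrates the idea, not AutoGluon's internals; the least-squares regressor below stands in for the tabular model):

```python
import numpy as np

# Toy illustration: a covariate regressor predicts the target from known
# covariates; the univariate model then forecasts only the residuals.
rng = np.random.default_rng(0)
n = 200
promo = rng.integers(0, 2, size=n)              # known covariate (promotion flag)
target = 10 + 5 * promo + rng.normal(0, 1, n)   # sales lift during promotions

# Step 1: fit a regression of the target on the covariates (least squares here).
X = np.column_stack([np.ones(n), promo])
coef, *_ = np.linalg.lstsq(X, target, rcond=None)

# Step 2: subtract the regressor's predictions; the univariate model sees residuals.
residuals = target - X @ coef

# Step 3 (at prediction time): the univariate forecast of the residuals plus the
# regressor's prediction from the known future covariates gives the final forecast.
future_promo = np.array([0, 1])
regressor_part = np.column_stack([np.ones(2), future_promo]) @ coef
final_forecast = residuals.mean() + regressor_part
```

Because the promotion effect is explained by the regressor, the residual series has much lower variance, which is an easier target for the univariate forecaster.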

We use a grocery sales dataset to demonstrate how Chronos-Bolt can be combined with a covariate regressor. This dataset includes three known covariates: scaled_price, promotion_email, and promotion_homepage, and the task is to forecast the unit_sales:

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

Grocery Sales DataFrame

The following code fits a TimeSeriesPredictor to forecast unit_sales for the next 7 weeks. We have specified the target column we are interested in forecasting and the names of known covariates while constructing the TimeSeriesPredictor. Two configurations are defined for Chronos-Bolt: a zero-shot setting, which uses only the historical context of unit_sales without considering the known covariates, and a covariate regressor setting, which employs a CatBoost model as the covariate_regressor. We also use the target_scaler, which brings the time series to a comparable scale before training and typically results in better accuracy.

predictor = TimeSeriesPredictor(
    prediction_length=7,
    eval_metric="MASE",
    target="unit_sales",
    known_covariates_names=["scaled_price", "promotion_email", "promotion_homepage"],
).fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {
                "model_path": "bolt_small",
                "covariate_regressor": "CAT",
                "target_scaler": "standard",
                "ag_args": {"name_suffix": "WithRegressor"},
            },
        ],
    },
    time_limit=600,
    enable_ensemble=False,
)

After the predictor has been fit, we can evaluate it on the test dataset and generate the leaderboard. Using the covariate regressor with Chronos-Bolt improves over its univariate zero-shot performance considerably.

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)
predictor.leaderboard(test_data)

Covariate Regressor Results

The covariates might not always be useful—for some datasets, the zero-shot model might achieve better accuracy. Therefore, it’s important to try multiple models and select the one that achieves the best accuracy on held-out data.

Conclusion

Chronos-Bolt models empower practitioners to generate high-quality forecasts rapidly in a zero-shot manner. AutoGluon-TimeSeries enhances this capability by enabling users to fine-tune Chronos-Bolt models effortlessly, integrate them with covariate regressors, and ensemble them with a diverse range of forecasting models. For advanced users, it provides a comprehensive set of features to customize forecasting models beyond what was demonstrated in this post. AutoGluon predictors can be seamlessly deployed to SageMaker using AutoGluon-Cloud and the official Deep Learning Containers.

To learn more about using AutoGluon-TimeSeries to build accurate and robust forecasting models, explore our tutorials. Stay updated by following AutoGluon on X (formerly Twitter) and starring us on GitHub!

References

[1] Ansari, Abdul Fatir, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, et al. “Chronos: Learning the language of time series.” Transactions on Machine Learning Research (2024).
[2] Hyndman, R. J., and G. Athanasopoulos. “Forecasting: principles and practice 3rd Ed.” O Texts (2018).
[3] Salinas, David, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. “DeepAR: Probabilistic forecasting with autoregressive recurrent networks.” International Journal of Forecasting 36, no. 3 (2020): 1181-1191.
[4] Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: long-term forecasting with transformers.” In The Eleventh International Conference on Learning Representations (2023).
[5] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21, no. 140 (2020): 1-67.
[6] Godahewa, Rakshitha, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. “Monash time series forecasting archive.” In NeurIPS Track on Datasets and Benchmarks (2021).


About the Authors

Abdul Fatir Ansari is a Senior Applied Scientist at Amazon Web Services, specializing in machine learning and forecasting, with a focus on foundation models for structured data, such as time series. He received his PhD from the National University of Singapore, where his research centered on deep generative models for images and time series.

Caner Turkmen is a Senior Applied Scientist at Amazon Web Services, where he works on research problems at the intersection of machine learning and forecasting. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications sectors. He holds a PhD in Computer Engineering from Bogazici University in Istanbul.

Oleksandr Shchur is a Senior Applied Scientist at Amazon Web Services, where he works on time series forecasting in AutoGluon. Before joining AWS, he completed a PhD in Machine Learning at the Technical University of Munich, Germany, doing research on probabilistic models for event data. His research interests include machine learning for temporal data and generative modeling.

Lorenzo Stella is a Senior Applied Scientist at Amazon Web Services, working on machine learning, forecasting, and generative AI for analytics and decision-making. He holds a PhD in Computer Science and Electrical Engineering from IMTLucca (Italy) and KU Leuven (Belgium), where his research focused on numerical optimization algorithms for machine learning and optimal control applications.

Read More

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock

Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and go through multiple policy documents containing standard operating procedures (SOPs) relevant to the query. This back-and-forth communication process often takes from hours to days, primarily because analysts, especially the new hires, don’t have immediate access to the necessary information. They spend hours consulting SMEs and reviewing extensive policy documents.

To address this challenge, Amazon Finance Automation developed a large language model (LLM)-based question-answer chat assistant on Amazon Bedrock. This solution empowers analysts to rapidly retrieve answers to customer queries, generating prompt responses within the same communication thread. As a result, it drastically reduces the time required to address customer queries.

In this post, we share how Amazon Finance Automation built this generative AI Q&A chat assistant using Amazon Bedrock.

Solution overview

The solution is based on a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock, as shown in the following diagram. When a user submits a query, RAG works by first retrieving relevant documents from a knowledge base, then generating a response with the LLM from the retrieved documents.
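The retrieve-then-generate flow can be sketched as follows. This is a minimal illustration, not the production pipeline: the toy lexical scorer stands in for vector search over OpenSearch Service, and the document snippets are invented:

```python
# Minimal sketch of retrieve-then-generate: fetch the top-k relevant documents,
# then build the LLM prompt from the retrieved context.
docs = {
    "ap-sop": "Invoices above 10k USD require manager approval.",
    "ar-sop": "Refunds are processed within 5 business days.",
}

def retrieve(query, k=1):
    # Toy lexical scorer standing in for embedding-based vector search.
    score = lambda text: len(set(query.lower().split()) & set(text.lower().split()))
    return sorted(docs.values(), key=score, reverse=True)[:k]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer from the context only."

prompt = build_prompt("How are refunds processed?")
```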

The solution consists of the following key components:

  1. Knowledge base – We used Amazon OpenSearch Service as the vector store for embedding documents. For performance evaluation, we processed and indexed multiple Amazon finance policy documents into the knowledge base. Alternatively, Amazon Bedrock Knowledge Bases provides fully managed support for end-to-end RAG workflows. We’re planning to migrate to Amazon Bedrock Knowledge Bases to eliminate cluster management and add extensibility to our pipeline.
  2. Embedding model – At the time of writing, we’re using the Amazon Titan Multimodal Embeddings G1 model on Amazon Bedrock. The model is pre-trained on large and unique datasets and corpora from Amazon and provides accuracy that is higher than or comparable to other embedding models on the market based on our comparative analysis.
  3. Generator model – We used a foundation model (FM) provided by Amazon Bedrock for its balanced ability to deliver highly accurate answers quickly.
  4. Diversity ranker – It’s responsible for rearranging the results obtained from the vector index to avoid skewness or bias towards any specific document or section.
  5. Lost in the middle ranker – It’s responsible for efficiently distributing the most relevant results towards the top and bottom of the prompt, maximizing the impact of the prompt’s content.
  6. Guardrails – We used Amazon Bedrock Guardrails to detect personal identifiable information (PII) and safeguard against prompt injection attacks.
  7. Validation engine – Removes PII from the response and checks whether the generated answer aligns with the retrieved context. If not, it returns a hardcoded “I don’t know” response to prevent hallucinations.
  8. Chat assistant UI – We developed the UI using Streamlit, an open source Python library for web-based application development on machine learning (ML) use cases.
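The validation engine's grounding check can be sketched with a simple token-overlap heuristic. This is our simplified illustration of the idea, not the production check, and the overlap threshold is an arbitrary choice:

```python
# Minimal sketch of a grounding check: verify that the generated answer is
# supported by the retrieved context, and fall back to a hardcoded response
# otherwise, preventing hallucinated answers from reaching the user.
FALLBACK = "I don't know, I don't have the complete context needed to answer this question."

def validate_answer(answer: str, context: str, min_overlap: float = 0.5) -> str:
    # Naive heuristic: fraction of answer tokens that also appear in the context.
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return FALLBACK
    overlap = len(answer_tokens & context_tokens) / len(answer_tokens)
    return answer if overlap >= min_overlap else FALLBACK
```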

Evaluate RAG performance

The accuracy of the chat assistant is the most critical performance metric to Amazon Finance Operations. After we built the first version of the chat assistant, we measured the bot response accuracy by submitting questions to the chat assistant. The SMEs manually evaluated the RAG responses one by one, and found only 49% of the responses were correct. This was far below the expectation, and the solution needed improvement.

However, manually evaluating the RAG isn’t sustainable—it requires hours of effort from finance operations and engineering teams. Therefore, we adopted the following automated performance evaluation approach:

  • Prepare testing data – We constructed a test dataset with three data fields:
    • question – This consists of 100 questions from policy documents where answers reside in a variety of sources, such as policy documents and engineering SOPs, covering complex text formats such as embedded tables and images.
    • expected_answer – These are manually labeled answers by Amazon Finance Operations SMEs.
    • generated_answer – This is the answer generated by the bot.
  • NLP scores – We used a test dataset to calculate the ROUGE score and METEOR score. Because these scores merely use word-matching algorithms and ignore the semantic meaning of the text, they aren’t aligned with the SME scores. Based on our analysis, the variance was approximately 30% compared to human evaluations.
  • LLM-based score – We used an FM offered by Amazon Bedrock to score the RAG performance. We designed specialized LLM prompts to evaluate the RAG performance by comparing the generated answer with the expected answer. We generated a set of LLM-based metrics, including accuracy, acceptability, and factualness, and the citation representing the evaluation reasoning. The variance of this approach was approximately 5% compared to human analysis, so we decided to stick to this approach of evaluation. If your RAG system is built on Amazon Bedrock Knowledge Bases, you can use the new RAG evaluation for Amazon Bedrock Knowledge Bases tool to evaluate the retrieve or the retrieve and generate functionality with an LLM as a judge. It provides retrieval evaluation metrics such as context relevance and context coverage. It also provides retrieve and generate evaluation metrics such as correctness, completeness, and helpfulness, as well as responsible AI metrics such as harmfulness and answer refusal.
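The comparison between automated metrics and SME judgments amounts to measuring disagreement rates. The following sketch uses invented verdicts chosen only to illustrate the computation; the real evaluation used 100 questions and graded scores rather than binary labels:

```python
# Illustrative comparison of automated metrics against SME (human) judgments.
# The verdicts below are invented for demonstration (1 = answer judged correct).
sme        = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # SME ground-truth verdicts
rouge_like = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]  # word-matching metric, thresholded
llm_judge  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # LLM-as-judge verdicts

def disagreement(pred, truth):
    # Fraction of examples where the automated metric disagrees with the SME.
    return sum(p != t for p, t in zip(pred, truth)) / len(truth)

print(f"ROUGE-style vs SME: {disagreement(rouge_like, sme):.0%}")
print(f"LLM judge vs SME:  {disagreement(llm_judge, sme):.0%}")
```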

Improve the accuracy of RAG pipeline

Based on the aforementioned evaluation techniques, we focused on the following areas in the RAG pipeline to improve the overall accuracy.

Add document semantic chunking to improve accuracy from 49% to 64%

Upon diagnosing incorrect responses in the RAG pipeline, we identified 14% of the inaccuracy was due to incomplete contexts sent to the LLM. These incomplete contexts were originally generated by the segmentation algorithm based on a fixed chunk size (for example, 512 tokens or 384 words), which doesn’t consider document boundaries such as sections and paragraphs.

To address this problem, we designed a new document segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service, using the following steps:

  1. Convert the unstructured text to a structured HTML document using QUILL Editor. In this way, the HTML document preserves the document formatting that divides the contents into logical chunks.
  2. Identify the logical structure of the HTML document and insert divider strings based on HTML tags for document segmentation.
  3. Use an embedding model to generate semantic vector representation of document chunks.
  4. Assign tags based on important keywords in the section to identify the logical boundaries between sections.
  5. Insert the embedding vectors of the segmented documents to the OpenSearch Service vector store.
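Step 2 of this workflow can be sketched as follows. This is a simplified illustration, assuming sections are delimited by h2 heading tags; the document content is invented and the real pipeline handles a richer tag structure:

```python
import re

# Simplified sketch of document segmentation: split a structured HTML document
# at section boundaries (heading tags) instead of at a fixed token count.
html = """
<h2>Invoice Approval</h2><p>Invoices above the threshold require approval.</p>
<h2>Payment Terms</h2><p>Standard terms are net 30 days.</p>
"""

# Insert a divider string before each section heading, then split on it.
DIVIDER = "\n<<<SECTION>>>\n"
chunks = [c.strip() for c in re.sub(r"(?=<h2>)", DIVIDER, html).split(DIVIDER) if c.strip()]

for chunk in chunks:
    title = re.search(r"<h2>(.*?)</h2>", chunk)
    print(title.group(1) if title else "(preamble)", "->", len(chunk), "chars")
```

Each resulting chunk is a complete logical section, so the embedding model in step 3 never sees a context truncated mid-paragraph.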

The following diagram illustrates the document retriever splitting workflow.

When processing the document, we follow specific rules:

  • Extract the start and end of a section of a document precisely
  • Extract the titles of the section and pair them with section content accurately
  • Assign tags based on important keywords from the sections
  • Persist the markdown information from the policy while indexing
  • Exclude images and tables from the processing in the initial release

With this approach, we can improve RAG accuracy from 49% to 64%.

Use prompt engineering to improve accuracy from 64% to 76%

Prompt engineering is a crucial technique to improve the performance of LLMs. We learned from our project that there is no one-size-fits-all prompt engineering approach; it’s a best practice to design task-specific prompts. We adopted the following approach to enhance the effectiveness of the prompt-to-RAG generator:

  • In approximately 14% of cases, we identified that the LLM generated responses even when no relevant context was retrieved from the RAG, leading to hallucinations. In this case, we engineered prompts and asked the LLM not to generate any response when there is no relevant context provided.
  • In approximately 13% of cases, we received user feedback that the response from the LLM was too brief, lacking complete context. We engineered prompts that encouraged the LLM to be more comprehensive.
  • We engineered prompts to enable the capability to generate both concise and detailed answers for the users.
  • We used LLM prompts for generation of citations to properly attribute our source used to generate the answer. In the UI, the citations are listed with hyperlinks following the LLM response, and users can use these citations to validate the LLM performance.
  • We improved our prompts to introduce better chain-of-thought (CoT) reasoning:
    • The LLM’s unique characteristic of using internally generated reasoning contributes to improved performance and aligns responses with humanlike coherence. Because of this interplay between prompt quality, reasoning requests, and the model’s inherent capabilities, we could optimize performance.
    • Encouraging CoT reasoning prompts the LLM to consider the context of the conversation, making it less prone to hallucinations.
    • By building upon the established context, the model is more likely to generate responses that logically follow the conversation’s narrative, reducing the chances of providing inaccurate or hallucinated answers.
    • We added examples of previously answered questions to establish a pattern for the LLM, encouraging CoT.

We then used meta-prompting using an FM offered by Amazon Bedrock to craft a prompt that caters to the aforementioned requirements.

The following example is a prompt for generating a quick summary and a detailed answer:

You are an AI assistant that helps answer questions based on provided text context. I will give you some passages from a document, followed by a question. Your task is to provide the best possible answer to the question using only the information from the given context. Here is the context:

<context>
{}
</context>

And here is the question:
<question>
{}
</question>

Think carefully about how the context can be used to answer the question.
<thinkingprocess>
- Carefully read the provided context and analyze what information it contains
- Identify the key pieces of information in the context that are relevant to answering the question
- Determine if the context provides enough information to answer the question satisfactorily
- If not, simply state "I don't know, I don't have the complete context needed to answer this
question"
- If so, synthesize the relevant information into a concise summary answer
- Expand the summary into a more detailed answer, utilizing Markdown formatting to make it clear and
readable
</thinkingprocess>

If you don't have enough context to answer the question, provide your response in the following
format:
I don't know, I don't have the complete context needed to answer this question.

If you do have enough context to answer the question, provide your response in the following format:
#### Quick Summary:
Your concise 1-2 sentence summary goes here.
#### Detailed Answer:
Your expanded answer goes here, using Markdown formatting like **bold**, *italics*, and Bullet points to improve readability.

Remember, the ultimate goal is to provide an informative, clear and readable answer to the question
using only the context provided. Let's begin!

The following example is a prompt for generating citations based on the generated answers and retrieved contexts:

You are an AI assistant that specializes in attributing generated answers to specific sections within provided documents. Your task is to determine which sections from the given documents were most likely used to generate the provided answer. If you cannot find exact matches, suggest sections that are closely related to the content of the answer.

Here is the generated answer to analyze:
<generated_answer>
{}
</generated_answer>

And here are the sections from various documents to consider:
<sections>
{}
</sections>

Please carefully read through the generated answer and the provided sections. In the scratchpad space below, brainstorm and reason about which sections are most relevant to the answer:
<scratchpad>
</scratchpad>

After identifying the relevant sections, provide your output in the following format:
**Document Name:** <document name>
**Document Link:** <document link>
**Relevant Sections:**
- <section name 1>
- <section name 2>
- <section name 3>

Do not include any additional explanations or reasoning in your final output. Simply list the document name, link, and relevant section names in the specified format above.

Assistant:

By implementing the prompt engineering approaches, we improved RAG accuracy from 64% to 76%.

Use an Amazon Titan Text Embeddings model to improve accuracy from 76% to 86%

After implementing the document segmentation approach, we still saw lower relevance scores for retrieved contexts (55–65%), and the incorrect contexts were in the top ranks for more than 50% of cases. This indicated that there was still room for improvement.

We experimented with multiple embedding models, including first-party and third-party models. For example, contextual embedding models such as bge-base-en-v1.5 performed better for context retrieval compared to other top embedding models such as all-mpnet-base-v2. We found that using the Amazon Titan Embeddings G1 model increased the likelihood of retrieving the correct contexts from approximately 55–65% to 75–80%, and 80% of the retrieved contexts ranked higher than before.
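Under the hood, comparisons like these come down to nearest-neighbor search over embedding vectors. As a minimal, self-contained sketch (the toy 3-dimensional vectors below stand in for real model embeddings):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k document vectors closest to the query."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy embeddings for illustration only; real models emit hundreds of dimensions
query = [1.0, 0.2, 0.0]
docs = [[0.9, 0.1, 0.1], [0.0, 1.0, 0.0], [1.0, 0.3, 0.0]]
print(top_k(query, docs, k=2))
```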

Finally, by adopting the Amazon Titan Text Embeddings G1 model, we improved the overall accuracy from 76% to 86%.

Conclusion

We achieved remarkable progress in developing a generative AI Q&A chat assistant for Amazon Finance Automation by using a RAG pipeline and LLMs on Amazon Bedrock. Through continual evaluation and iterative improvement, we have addressed challenges of hallucinations, document ingestion issues, and context retrieval inaccuracies. Our results have shown a significant improvement in RAG accuracy from 49% to 86%.

You can follow our journey and adopt a similar solution to address challenges in your RAG application and improve overall performance.


About the Authors

Soheb Moin is a Software Development Engineer at Amazon, who led the development of the generative AI chatbot. He specializes in leveraging generative AI and big data analytics to design, develop, and implement secure, scalable, innovative solutions that empower Finance Operations with better productivity and automation. Outside of work, Soheb enjoys traveling, playing badminton, and engaging in chess tournaments.

Nitin Arora is a Sr. Software Development Manager for Finance Automation at Amazon. He has over 19 years of experience building business-critical, scalable, high-performance software. Nitin leads data services, communication, work management, and several generative AI initiatives within Finance. In his spare time, he enjoys listening to music and reading.

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling, and learning the strategic challenges of chess.

Mohak Chugh is a Software Development Engineer at Amazon, with over 3 years of experience in developing products leveraging generative AI and big data on AWS. His work encompasses a range of areas, including RAG-based generative AI chatbots and high-performance data reconciliation. Beyond work, he finds joy in playing the piano and performing with his music band.

Parth Bavishi is a Senior Product Manager at Amazon with over 10 years of experience in building impactful products. He currently leads the development of generative AI capabilities for Amazon’s Finance Automation, driving innovation and efficiency within the organization. A dedicated mentor, Parth enjoys sharing his product management knowledge and finds satisfaction in activities like volleyball and reading.

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API

We are excited to announce the availability of Cohere’s advanced reranking model Rerank 3.5 through our new Rerank API in Amazon Bedrock. This powerful reranking model enables AWS customers to significantly improve their search relevance and content ranking capabilities. This model is also available for Amazon Bedrock Knowledge Base users. By incorporating Cohere’s Rerank 3.5 in Amazon Bedrock, we’re making enterprise-grade search technology more accessible and empowering organizations to enhance their information retrieval systems with minimal infrastructure management.

In this post, we discuss the need for Reranking, the capabilities of Cohere’s Rerank 3.5, and how to get started using it on Amazon Bedrock.

Reranking for advanced retrieval

Reranking is a vital enhancement to Retrieval Augmented Generation (RAG) systems that adds a sophisticated second layer of analysis to improve search result relevance beyond what traditional vector search can achieve. Unlike embedding models that rely on pre-computed static vectors, rerankers perform dynamic query-time analysis of document relevance, enabling more nuanced and contextual matching. This capability allows RAG systems to effectively balance between broad document retrieval and precise context selection, ultimately leading to more accurate and reliable outputs from language models while reducing the likelihood of hallucinations.
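To make the two-stage idea concrete, here is a toy sketch in Python. The keyword-overlap scorers below are crude stand-ins for a vector index (first stage) and a cross-encoder such as Rerank 3.5 (second stage); they illustrate the pipeline shape only, not real model behavior.

```python
# Toy two-stage retrieval pipeline: a cheap, recall-oriented first stage
# narrows the candidate set, then a second pass re-scores each
# (query, document) pair. cross_score is a crude stand-in for a cross-encoder.
def first_stage_retrieve(query, docs, k=3):
    """Recall-oriented pass: rank by raw keyword overlap, keep the top k."""
    q = set(query.lower().split())
    ranked = sorted(range(len(docs)),
                    key=lambda i: len(q & set(docs[i].lower().split())),
                    reverse=True)
    return ranked[:k]

def rerank(query, docs, candidate_ids):
    """Precision-oriented second pass over the shortlisted candidates only."""
    def cross_score(q, d):
        q_words = set(q.lower().split())
        d_words = d.lower().split()
        # Overlap weighted by brevity: a crude per-pair relevance signal
        return len(q_words & set(d_words)) / (1 + len(d_words))
    return sorted(candidate_ids,
                  key=lambda i: cross_score(query, docs[i]),
                  reverse=True)

docs = [
    "password reset instructions for your account",
    "our return policy covers defective items for 30 days",
    "how to return an item you recently purchased",
]
candidates = first_stage_retrieve("how do i return an item", docs, k=2)
print(rerank("how do i return an item", docs, candidates))
```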

Existing search systems significantly benefit from reranking technology by providing more contextually relevant results that directly impact user satisfaction and business outcomes. Unlike traditional keyword matching or basic vector search, reranking performs an intelligent second-pass analysis that considers multiple factors, including semantic meaning, user intent, and business rules to optimize search result ordering. In ecommerce specifically, reranking helps surface the most relevant products by understanding nuanced relationships between search queries and product attributes, while also incorporating crucial business metrics like conversion rates and inventory levels. This advanced relevance optimization leads to improved product discovery, higher conversion rates, and enhanced customer satisfaction across digital commerce platforms, making reranking an essential component for any modern enterprise search infrastructure.

Introducing Cohere Rerank 3.5

Cohere’s Rerank 3.5 is designed to enhance search and RAG systems. This intelligent cross-encoding model takes a query and a list of potentially relevant documents as input, then returns the documents sorted by semantic similarity to the query. Cohere Rerank 3.5 excels in understanding complex information requiring reasoning and is able to understand the meaning behind enterprise data and user questions. Its ability to comprehend and analyze enterprise data and user questions across over 100 languages including Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish, makes it particularly valuable for global organizations in sectors such as finance, healthcare, hospitality, energy, government, and manufacturing.

One of the key advantages of Cohere Rerank 3.5 is its ease of implementation. Through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale, whether keyword-based or semantic. Reranking strictly improves first-stage retrievals on standard text retrieval benchmarks.

Cohere Rerank 3.5 is state of the art in the financial domain, as illustrated in the following figure.

Cohere Rerank 3.5 is also state of the art in the ecommerce domain, as illustrated in the following figure. Cohere’s ecommerce benchmarks revolve around retrieval on various products, including fashion, electronics, food, and more.

Products were structured as strings in a key-value pair format such as the following:

"Title": "Title"
"Description": "Long-form description"
"Type": <Some categorical data>
etc.

Cohere Rerank 3.5 also excels in hospitality, as shown in the following figure. Hospitality benchmarks revolve around retrieval on hospitality experiences and lodging options.

Documents were structured as strings in a key-value pair format such as the following:

"Listing Title": "Rental unit in Toronto"
"Location": "171 John Street, Toronto, Ontario, Canada"
"Description": "Escape to our serene villa with stunning downtown views...."

We see noticeable gains in project management performance across all types of issue tracking tasks, as illustrated in the following figure.

Cohere’s project management benchmarks span a variety of retrieval tasks, such as:

  • Search through engineering tickets from various project management and issue tracking software tools
  • Search through GitHub issues on popular open source repos

Get started with Cohere Rerank 3.5

To start using Cohere Rerank 3.5 with the Rerank API and Amazon Bedrock Knowledge Bases, navigate to the Amazon Bedrock console and choose Model access in the left-hand pane. Choose Modify access, select Cohere Rerank 3.5, choose Next, and then choose Submit.

Get Started with Amazon Bedrock Rerank API

The Cohere Rerank 3.5 model, powered by the Amazon Bedrock Rerank API, allows you to rerank input documents directly based on their semantic relevance to a user query – without requiring a pre-configured knowledge base. This flexibility makes it a powerful tool for various use cases.

To begin, set up your environment by importing the necessary libraries and initializing Boto3 clients:

import boto3
import json
region = boto3.Session().region_name

bedrock_agent_runtime = boto3.client('bedrock-agent-runtime',region_name=region)

modelId = "cohere.rerank-v3-5:0"
model_package_arn = f"arn:aws:bedrock:{region}::foundation-model/{modelId}"

Next, define a main function that reorders a list of text documents by computing relevance scores based on the user query:

def rerank_text(text_query, text_sources, num_results, model_package_arn):
    response = bedrock_agent_runtime.rerank(
        queries=[
            {
                "type": "TEXT",
                "textQuery": {
                    "text": text_query
                }
            }
        ],
        sources=text_sources,
        rerankingConfiguration={
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": num_results,
                "modelConfiguration": {
                    "modelArn": model_package_arn,
                }
            }
        }
    )
    return response['results']

For instance, imagine a scenario where you need to identify emails related to returning items from a multilingual dataset. The example below demonstrates this process:

example_query = "What emails have been about returning items?"

documents = [
    "Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?",
    "Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?",
    "مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب",
    "Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?",
    "Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.",
    "Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?",
    "Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective.",
    "早上好,关于我最近的订单,我有一个问题。我收到了错误的商品",
    "Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."
]

Now, prepare the list of text sources that will be passed into the rerank_text() function:

text_sources = []
for text in documents:
    text_sources.append({
        "type": "INLINE",
        "inlineDocumentSource": {
            "type": "TEXT",
            "textDocument": {
                "text": text,
            }
        }
    })

You can then invoke rerank_text() by specifying the user query, the text resources, the desired number of top-ranked results, and the model ARN:

response = rerank_text(example_query, text_sources, 3, model_package_arn)
print(response)

The output generated by the Amazon Bedrock Rerank API with Cohere Rerank 3.5 for this query is:

[{'index': 4, 'relevanceScore': 0.1122397780418396},
 {'index': 8, 'relevanceScore': 0.07777658104896545},
 {'index': 2, 'relevanceScore': 0.0770234540104866}]

The relevance scores provided by the API are normalized to a range of [0, 1], with higher scores indicating higher relevance to the query. Here, the fifth document (index 4) is the most relevant. (Translated from German to English: Hello, I have a question about my last order. I received the wrong item and need to return it.)
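Because the API returns indices rather than the documents themselves, a common follow-up step is mapping the results back to the input list. A minimal sketch, reusing the shape of the sample output above:

```python
# Mapping the Rerank API response back to the input documents. The 'results'
# list mirrors the structure of the sample output above; each 'index' refers
# to the document's position in the list that was passed to the API.
documents = ["doc 1", "doc 2", "doc 3", "doc 4", "doc 5"]
results = [
    {"index": 4, "relevanceScore": 0.1122},
    {"index": 2, "relevanceScore": 0.0770},
]
top_documents = [(documents[r["index"]], r["relevanceScore"]) for r in results]
print(top_documents[0])  # the most relevant document and its score
```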

You can also get started using Cohere Rerank 3.5 with Amazon Bedrock Knowledge Bases by completing the following steps:

  1. In the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
  2. Choose Create knowledge base.
  3. Provide your knowledge base details, such as name, permissions, and data source.
  4. To configure your data source, specify the location of your data.
  5. Select an embedding model to convert the data into vector embeddings, and have Amazon Bedrock create a vector store in your account to store the vector data.

When you select this option (available only in the Amazon Bedrock console), Amazon Bedrock creates a vector index in Amazon OpenSearch Serverless (by default) in your account, removing the need to manage anything yourself.

  6. Review your settings and create your knowledge base.
  7. In the Amazon Bedrock console, choose your knowledge base and choose Test knowledge base.
  8. Choose the icon for additional configuration options for testing your knowledge base.
  9. Choose your model (for this post, Cohere Rerank 3.5) and choose Apply.

The configuration pane shows the new Reranking section with additional configuration options. The number of reranked source chunks setting returns the specified number of most relevant chunks.

Conclusion

In this post, we explored how to use Cohere’s Rerank 3.5 model in Amazon Bedrock, demonstrating its powerful reranking capabilities for enterprise applications and how it enhances search relevance, user experience, and information retrieval workflows. Start improving your search relevance today with Cohere’s Rerank model on Amazon Bedrock.

Cohere Rerank 3.5 in Amazon Bedrock is available in the following AWS Regions: us-west-2 (US West – Oregon), ca-central-1 (Canada – Central), eu-central-1 (Europe – Frankfurt), and ap-northeast-1 (Asia Pacific – Tokyo).

Share your feedback on AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

To learn more about Cohere Rerank 3.5’s features and capabilities, view the Cohere in Amazon Bedrock product page.


About the Authors

Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a master’s in science in Electrical Engineering from Northwestern University and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.

James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

AWS DeepRacer: How to master physical racing?

As developers gear up for re:Invent 2024, they again face the unique challenges of physical racing. What are the obstacles? Let’s have a look.

In this blog post, I will look at what makes physical AWS DeepRacer racing—a real car on a real track—different from racing in the virtual world—a model in a simulated 3D environment. I will cover the basics, the differences between virtual and physical, and the steps I have taken to get a deeper understanding of the challenge.

The AWS DeepRacer League is wrapping up. In two days, 32 racers will face off in Las Vegas for one last time. This year, the qualification has been all-virtual, so the transition from virtual to physical racing will be a challenge.

The basics

AWS DeepRacer relies on the racer training a model within the simulator, a 3D environment built around ROS and Gazebo, originally built on AWS RoboMaker.

The trained model is subsequently used for either virtual or physical races. The model comprises a convolutional neural network (CNN) and an action space translating class labels into steering and speed values. In the basic scenario involving a single camera, a 160 x 120 pixel, 8-bit grayscale image (similar to the following figure) is captured 15 times per second, passed through the neural network, and the action with the highest weight (probability) is executed.
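The per-frame decision can be sketched in a few lines: the network emits one weight per action, and the highest-weighted action is executed. The action space below is illustrative, not an actual DeepRacer configuration:

```python
# Sketch of the per-frame decision loop: the network outputs one weight
# (probability) per action class, and the highest-weighted action is executed.
# This action space is illustrative only.
action_space = [
    {"steering_angle": -30.0, "speed": 1.0},
    {"steering_angle": 0.0,   "speed": 3.0},
    {"steering_angle": 30.0,  "speed": 1.0},
]

def select_action(network_output):
    """Pick the action whose class weight is highest (argmax)."""
    best = max(range(len(network_output)), key=lambda i: network_output[i])
    return action_space[best]

print(select_action([0.1, 0.7, 0.2]))  # → straight ahead at full speed
```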

The small piece of AI magic is that during model evaluation (racing) there’s no context; each image is processed independently of the image before it, and without knowledge of the state of the car itself. If you process the images in reverse order the results remain the same!

Virtual compared to physical

The virtual worlds are 3D worlds created in Gazebo, and the software is written in Python and C++ using ROS as the framework. As shown in the following image, the 3D simulation is fairly flat, with basic textures and surfaces. There are few or no reflections or shine, and the environment is as visually clean as you make it. Input images are captured 15 times per second.

Within this world a small car is simulated. Compared to a real car, the model is very basic and lacks quite a few of the things that make a real car work: There is no suspension, the tires are rigid cylinders, there is no Ackermann steering, and there are no differentials. It’s almost surprising that this car can drive at all. On the positive side the camera is perfect; irrespective of lighting conditions you get crisp clear pictures with no motion blur.

A typical virtual car drives at speeds between 0.5 and 4.0 meters per second, depending on the shape of the track. If you go too fast, it will often oversteer and spin out of the turn because of the relatively low grip.

In contrast, the real world is less perfect. Simulation-to-real gap #1 is visual noise created by light, reflections (if the track is printed on reflective material), and background noise (such as when the barriers around the track are too low and the car sees people and objects in the background). Input images are captured 30 times per second.

The car itself—based on the readily available WLToys A979—has all the things the model car doesn’t: proper tires, suspension, and differential. One problem is that the car is heavy—around 1.5 kg—and the placement of some components causes the center of gravity to be very high. This causes simulation-to-real gap #2: Roll and pitch during corners at high speeds cause the camera to rotate, confusing the neural network as the horizon moves.

Gap #3 comes from motion blur when the light is too dim; the blur can cause the dashed centerline to look like a solid line, making it hard to distinguish the centerline from the solid inner and outer lines, as shown in the following figure.

The steering geometry, the differentials, the lack of engineering precision of the A979, and the corresponding difficulty in calibrating it, causes gap #4. Even if the model wants to go straight, the car still pulls left or right, needing constant correction to stay on track. This is most noticeable when the car is unable to drive down the straights in a straight line.

The original AWS DeepRacer, without modifications, has a smaller speed range, topping out at about 2 meters per second. It has better grip but suffers from the previously mentioned roll movements. If you go too fast, it will understeer and potentially roll over. Since 2023, the AWS pit crews operate their fleets of AWS DeepRacers with shock spacers to stiffen the suspension, reduce the roll, and increase the maximum effective speed.

Four questions

Looking at the sim-to-real gaps there are four questions that we want to explore:

  • How can we train the model to better handle the real world? This includes altering the simulator to close some of the gaps, combined with adapting reward function, action space, and training methodology to make better use of this simulator.
  • How can we better evaluate what the car does, and why? In the virtual world, we can perform log analysis to investigate; in the real world this has not yet been possible.
  • How can we evaluate our newly trained models? A standard AWS DeepRacer track, with its size of 8 meters x 6 meters, is prohibitively large. Is it possible to downscale the track to fit in a home?
  • Will a modified car perform better? Upgrade my AWS DeepRacer with better shocks? Add ball bearings and shims to improve steering precision? Or build a new lighter car based on a Raspberry Pi?

Solutions

To answer these questions, some solutions are required to support the experiments. The following assumes that you’re using Deepracer-for-Cloud to run the training locally or in an Amazon Elastic Compute Cloud (Amazon EC2) instance. We won’t go into the details but provide references that will enable you to try things out on your own.

Customized simulator

The first thing to look at is how you can alter the simulator. The simulator code is available, and modifying it doesn’t require too many skills. You can alter the car and the physics of the world or adjust the visual environment.

Change the environment

Changing the environments means altering the 3D world. This can be done by altering the features in a pre-existing track by adding or removing track parts (such as lines), changing lighting, adding background features (such as walls or buildings), swapping out textures, and so on. Making changes to the world will require building a new Docker image, which can take quite some time, but there are ways to speed that up. Going a step further, it’s also possible to make the world programmatically (command line or code) alterable during run-time.

The starting point is the track COLLADA (.dae) file found in the meshes folder. You can import it into Blender (shown in the following figure), make your changes, and export the file again. Note that lights and camera positions from Blender aren’t considered by Gazebo. To alter the lighting conditions, you will have to edit the .world file in worlds—the files are XML files in sdformat.

See Custom Tracks for some examples of tuned tracks.

Car and physics

The competition cars owned by AWS can’t be altered, so the objective of tuning the car in the simulator is to make it behave more like the real one. Trained neural networks have an embedded expectation of what will happen next: the simulated car has learned that by taking a specific action, it gets a turn of a given radius. If the simulated car steers more or less than the physical one in a given situation, the outcome becomes unpredictable.

The simulated car lacks Ackermann steering and differentials, but its wheels can deflect up to 30 degrees (real wheels only go to a bit more than 20 degrees outwards, and less than that inwards). My experience is that the real car, surprisingly enough, still has a shorter turning radius than the virtual one.

The car models are found in the urdf folder. There are three different cars, relating to the different versions of physics, which you configure in your actions space (model_metadata.json). Today, only the deepracer (v3 and v4 physics) and deepracer_kinematics (v5 physics) models are relevant. There are variant models for single camera and for stereo camera, both with and without the LIDAR.

Each physics version is different; the big question is what impact, if any, each version has on the behavior of the physical car.

  • Version 3: Steering and throttle is managed through a PID controller, making speed and steering changes smooth (and slow). The simulation environment runs at all times—including during image processing and inference—leading to a higher latency between image capture and action taking effect.
  • Version 4: Steering and throttle is managed through a PID controller, but the world is put on hold during inference, reducing the latency.
  • Version 5: Steering and throttle is managed through a position and velocity controller, and the world is put on hold during inference, almost eliminating latency. (This is very unnatural; the car can take alternating 30 degree left and right turns and will go almost straight ahead.)

The PID controller for v3 and v4 can be changed in the racecar control file. By changing the P, I, and D values, you can tune how fast or how slow the car accelerates and steers.
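For intuition about what the three gains do, here is a generic PID controller sketch driving a trivial speed model toward a setpoint. The gains and the toy plant are illustrative only, not the simulator's actual racecar control values:

```python
# Generic PID controller sketch; gains and the toy speed model are
# illustrative, not the simulator's actual racecar control values.
class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt                  # accumulated error (I)
        derivative = (error - self.prev_error) / dt  # rate of change (D)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Drive a trivial speed model toward 2.0 m/s, treating the controller
# output as acceleration. Raising kp speeds up the response; raising kd
# damps it; ki removes steady-state error.
pid = PID(kp=1.2, ki=0.3, kd=0.05)
speed, dt = 0.0, 0.1
for _ in range(500):
    speed += pid.step(2.0, speed, dt) * dt
print(round(speed, 2))  # settles near 2.0
```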

You can also tune the friction. In our simulator, friction is defined for the wheels, not the surfaces that the car drives on. The values (called mu and mu2) are found in racecar.gazebo; increasing them (once per tire!) will allow the car to drive faster without spinning.

Finally, I implemented an experimental version of the Ackermann steering geometry, including differentials. Why? When turning, a car’s wheels follow two circles with the same center point, the inner one having a smaller radius than the outer one. In short, the inner wheels have to steer more (larger curvature) but rotate slower (smaller circumference) than the outer wheels.
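The geometry itself is straightforward to sketch. For a turn of radius R measured at the rear-axle center, the two front-wheel angles follow from the wheelbase and track width (the dimensions below are rough, illustrative values for a small-scale car):

```python
import math

# Ackermann steering sketch: for a turn of radius R (measured at the
# rear-axle center), the inner wheel must steer more than the outer one.
# Dimensions are rough, illustrative values for a small-scale car (meters).
WHEELBASE = 0.164   # front-to-rear axle distance
TRACK = 0.16        # left-to-right wheel separation

def ackermann_angles(turn_radius):
    """Return (inner, outer) front-wheel steering angles in degrees."""
    inner = math.degrees(math.atan(WHEELBASE / (turn_radius - TRACK / 2)))
    outer = math.degrees(math.atan(WHEELBASE / (turn_radius + TRACK / 2)))
    return inner, outer

inner, outer = ackermann_angles(0.6)
print(round(inner, 1), round(outer, 1))  # inner wheel steers noticeably more
```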

Customized car software

The initial work to create an altered software stack for the original AWS DeepRacer started in 2022. The first experiments included operating the AWS DeepRacer with an R/C controller and capturing the camera images and IMU data to create an in-car video. There was a lot to learn about ROS2, including creating a custom node for publishing IMU sensor data and capturing and creating videos on the fly. During the Berlin Summit in 2022, I also got to give my modified car a spin on the track!

In the context of physical racing, the motivation for customizing the car software is to obtain more information—what does the car do, and why. Watching the following video, you can clearly see the rolling movement in the turns, and the blurring of certain parts of the image discussed earlier.

The work triggered a need to alter several of the open source AWS DeepRacer packages, and included work such as optimizing the performance from camera to inference through compressing images and enabling GPU and compute stick acceleration of the inference. This turned into several scripts comprising all the changes to the different nodes and creating an upgraded software package that could be installed on an original AWS DeepRacer car.

The work evolved, and a logging mechanism using ROS Bag allowed us to analyze not only pictures, but also the actions that the car took. Using the deepracer-viz library of Jochem Lugtenburg, a fellow AWS DeepRacer community leader, I added a GradCam overlay on the video feed (shown in the following video), which gives a better understanding of what’s going on.

The outcome of this has evolved into the community AWS DeepRacer Custom Car repository, which allows anyone to upgrade their AWS DeepRacer with improved software with two commands and without having to compile the modules themselves!

Benefits are:

  • Performance improvement by using compressed image transport for the main processing pipeline.
  • Inference using OpenVINO with Intel GPU (original AWS DeepRacer), OpenVino with Myriad Neural Compute Stick (NCS2), or TensorFlow Lite.
  • Model Optimizer caching, speeding up switching of models.
  • Capture in-car camera and inference results to a ROS Bag for logfile analysis.
  • UI tweaks and fixes.
  • Support for Raspberry Pi4, enabling us to create the DeepRacer Pi!

Testing on a custom track

Capturing data is great, but you need a way to test it all—bringing models trained in a customized environment onto a track to see what works and what doesn’t.

The question turned out to be: How hard is it to make a track that has the same design as the official tracks, but that takes up less space than the 8m x 6m of the re:Invent 2018 track? After re:Invent 2023, I started to investigate. The goal was to create a custom track that would fit in my garage with a theoretical maximum size of 5.5m x 4.5m. The track should be printable on vinyl in addition to being available in the Simulator for virtual testing.

After some trial and error, it proved to be quite straightforward, even if it requires multiple steps, starting in a Jupyter Notebook, moving into a vector drawing program (Inkscape), and finalizing in Blender (to create the simulator meshes).

The trapezoid track shown in the following two figures (center line and final sketch) is a good example of how to create a brand new track. The notebook starts with eight points in an array and builds out the track step by step, adding the outer line, center line, and color.
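The core geometric step, offsetting the centerline to produce the outer and inner lines, can be sketched as follows. This is a simplified per-vertex offset (the notebook's actual approach may differ) and assumes a counterclockwise closed centerline:

```python
import math

# Simplified per-vertex offset of a closed centerline: each point moves
# outward along a normal estimated from its two neighbors. Illustrative
# only; the notebook's actual construction may differ.
def offset_closed_polyline(points, distance):
    n = len(points)
    result = []
    for i in range(n):
        (x0, y0) = points[i - 1]
        (x1, y1) = points[(i + 1) % n]
        dx, dy = x1 - x0, y1 - y0
        length = math.hypot(dx, dy)
        nx, ny = dy / length, -dx / length   # outward normal for a CCW loop
        px, py = points[i]
        result.append((px + nx * distance, py + ny * distance))
    return result

# Illustrative square "centerline"; offsetting it yields the outer border
center = [(0, 0), (4, 0), (4, 4), (0, 4)]
outer = offset_closed_polyline(center, 0.5)
```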

In the end I chose to print a narrower version of Trapezoid—Trapezoid Narrow, shown in the following figure—to fit behind my garage, with dimensions of 5.20m x 2.85m including the green borders around the track. I printed it on PVC with a weight of 500 grams per square meter. The comparatively heavy material was a good choice: it prevents folds and wrinkles and generally ensures that the track stays in place even when you walk on it.

Around the track, I added a boundary of mesh PVC mounted on 20 x 20 centimeter aluminum poles. This wasn’t entirely a success, because light shone through, and I needed to add a lining of black fleece. The following image shows the completed track before the addition of the black fleece.

Experiments and conclusions

re:Invent is just days away. Experiments are still running, and because I need to fight my way through the Wildcard race, this is not the time to include all the details. Let’s just say that things aren’t always as straightforward as expected.

As a preview of what’s going on, I’ll end this post with the latest iteration of the in-car video, showing an AWS DeepRacer Pi doing laps in the garage. Check back after re:Invent for the big reveal!


About the author

Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture where he helps clients to build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.

Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference

The new efficient multi-adapter inference feature of Amazon SageMaker unlocks exciting possibilities for customers using fine-tuned models. This capability integrates with SageMaker inference components to allow you to deploy and manage hundreds of fine-tuned Low-Rank Adaptation (LoRA) adapters through SageMaker APIs. Multi-adapter inference handles the registration of fine-tuned adapters with a base model and dynamically loads them from GPU memory, CPU memory, or local disk in milliseconds, based on the request. This feature provides atomic operations for adding, deleting, or updating individual adapters across a SageMaker endpoint’s running instances without affecting performance or requiring a redeployment of the endpoint.
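While SageMaker's internal mechanism isn't described here, the general pattern behind millisecond adapter switching is a tiered cache: keep the hottest adapters in fast memory and evict the least recently used ones back to slower tiers. The following is an illustration of that caching pattern only, not SageMaker's implementation; the class and adapter names are hypothetical:

```python
from collections import OrderedDict

# Illustrative LRU cache for fine-tuned adapters: keep the hottest adapters
# in fast memory and evict the least recently used. This sketches the
# general pattern only, not SageMaker's actual implementation.
class AdapterCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, adapter_id, load_fn):
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)   # mark as most recently used
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)   # evict least recently used
            self._cache[adapter_id] = load_fn(adapter_id)  # e.g., load from disk
        return self._cache[adapter_id]

cache = AdapterCache(capacity=2)
cache.get("customer-a", lambda a: f"weights:{a}")
cache.get("customer-b", lambda a: f"weights:{a}")
cache.get("customer-a", lambda a: f"weights:{a}")  # cache hit
cache.get("customer-c", lambda a: f"weights:{a}")  # evicts customer-b
print(list(cache._cache))  # adapters still resident, oldest first
```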

The efficiency of LoRA adapters allows for a wide range of hyper-personalization and task-based customization which had previously been too resource-intensive and costly to be feasible. For example, marketing and software as a service (SaaS) companies can personalize artificial intelligence and machine learning (AI/ML) applications using each of their customer’s images, art style, communication style, and documents to create campaigns and artifacts that represent them. Similarly, enterprises in industries like healthcare or financial services can reuse a common base model with task-based adapters to efficiently tackle a variety of specialized AI tasks. Whether it’s diagnosing medical conditions, assessing loan applications, understanding complex documents, or detecting financial fraud, you can simply swap in the appropriate fine-tuned LoRA adapter for each use case at runtime. This flexibility and efficiency unlocks new opportunities to deploy powerful, customized AI across your organization. With this new efficient multi-adapter inference capability, SageMaker reduces the complexity of deploying and managing the adapters that power these applications.

In this post, we show how to use the new efficient multi-adapter inference feature in SageMaker.

Problem statement

You can use powerful pre-trained foundation models (FMs) without needing to build your own complex models from scratch. However, these general-purpose models might not always align with your specific needs or your unique data. To make these models work for you, you can use Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA.

The benefit of PEFT and LoRA is that it lets you fine-tune models quickly and cost-effectively. These methods are based on the idea that only a small part of a large FM needs updating to adapt it to new tasks or domains. By freezing the base model and just updating a few extra adapter layers, you can fine-tune models much faster and cheaper, while still maintaining high performance. This flexibility means you can quickly customize pre-trained models at low cost to meet different requirements. When inferencing, the LoRA adapters can be loaded dynamically at runtime to augment the results from the base model for best performance. You can create a library of task-specific, customer-specific, or domain-specific adapters that can be swapped in as needed for maximum efficiency. This allows you to build AI tailored exactly to your business.
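The parameter savings behind LoRA are easy to quantify. The following is a minimal sketch in plain Python, with a hypothetical layer size chosen for illustration (roughly the shape of an attention projection in an 8B-parameter model):

```python
def lora_param_counts(d_in: int, d_out: int, rank: int) -> tuple[int, int]:
    """Compare trainable parameters of a full weight update vs. a LoRA update.

    Full fine-tuning updates the entire d_in x d_out weight matrix W.
    LoRA freezes W and learns a low-rank update B @ A, where
    A is (rank x d_in) and B is (d_out x rank).
    """
    full = d_in * d_out
    lora = rank * d_in + d_out * rank
    return full, lora

# Hypothetical 4096 x 4096 projection layer with a rank-16 adapter
full, lora = lora_param_counts(d_in=4096, d_out=4096, rank=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x")
# The rank-16 update trains 128x fewer parameters for this layer.
```

This is why a library of adapters is cheap to store and swap: each adapter is a small fraction of the frozen base model’s size.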

Although fine-tuned LoRA adapters can effectively address targeted use cases, managing these adapters can be challenging at scale. You can use open-source libraries, or the AWS managed Large Model Inference (LMI) deep learning container (DLC) to dynamically load and unload adapter weights. Current deployment methods use fixed adapters or Amazon Simple Storage Service (Amazon S3) locations, making post-deployment changes impossible without updating the model endpoint and adding unnecessary complexity. This deployment method also makes it impossible to collect per-adapter metrics, making the evaluation of their health and performance a challenge.

Solution overview

In this solution, we show how to use efficient multi-adapter inference in SageMaker to host and manage multiple LoRA adapters with a common base model. The approach is based on an existing SageMaker capability, inference components, where you can have multiple containers or models on the same endpoint and allocate a certain amount of compute to each container. With inference components, you can create and scale multiple copies of the model, each of which retains the compute that you have allocated. With inference components, deploying multiple models that have specific hardware requirements becomes a much simpler process, allowing for the scaling and hosting of multiple FMs. An example deployment would look like the following figure.

This feature extends inference components to a new type of component, inference component adapters, which you can use to allow SageMaker to manage your individual LoRA adapters at scale while having a common inference component for the base model that you’re deploying. In this post, we show how to create, update, and delete inference component adapters and how to call them for inference. You can envision this architecture as the following figure.

IC and Adapters

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the base model and multiple adapters on the same SageMaker endpoint, so you will use an ml.g5.12xlarge SageMaker hosting instance.

In this example, you learn how to deploy a base model (Meta Llama 3.1 8B Instruct) and LoRA adapters on a SageMaker real-time endpoint using inference components. You can find the example notebook in the GitHub repository.

import sagemaker
import boto3
import json

role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket() # bucket to house artifacts
region = sess._region_name

sm_client = boto3.client(service_name='sagemaker')
sm_rt_client = boto3.client(service_name='sagemaker-runtime')

Download the base model from the Hugging Face model hub. Because Meta Llama 3.1 8B Instruct is a gated model, you need a Hugging Face access token and must submit a request for model access on the model page. For more details, see Accessing Private/Gated Models.

from huggingface_hub import snapshot_download

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

HF_TOKEN = "<<YOUR_HF_TOKEN>>"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model_id_pathsafe = model_id.replace("/","-")
local_model_path = f"./models/{model_id_pathsafe}"
s3_model_path = f"s3://{bucket}/models/{model_id_pathsafe}"

snapshot_download(repo_id=model_id, use_auth_token=HF_TOKEN, local_dir=local_model_path, allow_patterns=["*.json", "*.safetensors"])

Copy your model artifact to Amazon S3 to improve model load time during deployment:

!aws s3 cp --recursive {local_model_path} {s3_model_path}

Select one of the available LMI container images for hosting. Efficient adapter inference capability is available in 0.31.0-lmi13.0.0 and higher.

inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

Create a container environment for the hosting container. LMI container parameters can be found in the LMI Backend User Guides.

The parameters OPTION_MAX_LORAS and OPTION_MAX_CPU_LORAS control how adapters move between GPU memory, CPU memory, and disk. OPTION_MAX_LORAS sets a limit on the number of adapters concurrently stored in GPU memory, with excess adapters offloaded to CPU memory. OPTION_MAX_CPU_LORAS determines how many adapters are staged in CPU memory, offloading excess adapters to local SSD storage.

In the following example, 30 adapters can live in GPU memory and 70 adapters in CPU memory before going to local storage.
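The GPU/CPU/disk tiering can be pictured as two nested least-recently-used caches. The following toy model is an illustration of that eviction behavior only, not the LMI container's actual implementation:

```python
from collections import OrderedDict

class AdapterTiers:
    """Toy model of tiered adapter placement: a GPU tier backed by a CPU
    tier; adapters evicted from CPU fall back to disk. Illustration only --
    the LMI container manages this internally."""

    def __init__(self, max_gpu: int, max_cpu: int):
        self.gpu = OrderedDict()   # adapter name -> True, in LRU order
        self.cpu = OrderedDict()
        self.max_gpu, self.max_cpu = max_gpu, max_cpu

    def request(self, name: str) -> str:
        """Bring `name` into GPU memory; return the tier it was found in."""
        if name in self.gpu:
            self.gpu.move_to_end(name)
            return "gpu"
        found = "cpu" if self.cpu.pop(name, None) else "disk"
        self.gpu[name] = True
        if len(self.gpu) > self.max_gpu:          # spill LRU adapter to CPU
            evicted, _ = self.gpu.popitem(last=False)
            self.cpu[evicted] = True
            if len(self.cpu) > self.max_cpu:      # spill LRU adapter to disk
                self.cpu.popitem(last=False)
        return found

# 101 distinct adapters through the 30-GPU / 70-CPU configuration above
tiers = AdapterTiers(max_gpu=30, max_cpu=70)
for i in range(101):
    tiers.request(f"adapter-{i}")
print(len(tiers.gpu), len(tiers.cpu))  # 30 70 -- one adapter spilled to disk
```

A subsequent request for a CPU-resident adapter promotes it back to the GPU tier, which is why frequently used adapters tend to stay hot.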

env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

With your container image and environment defined, you can create a SageMaker model object that you will use to create an inference component later:

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image_uri,
        "Environment": env,
    },
)

Set up a SageMaker endpoint

To create a SageMaker endpoint, you need an endpoint configuration. When using inference components, you don’t specify a model in the endpoint configuration. You load the model as a component later on.

endpoint_config_name = f"{model_name}"
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 900
container_startup_health_check_timeout_in_seconds = 900

initial_instance_count = 1

sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ]
)

Create the SageMaker endpoint with the following code:

endpoint_name = f"{model_name}" # reuse the model name as the endpoint name

create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)

With your endpoint created, you can now create the inference component for the base model. This will be the base component that the adapter components you create later will depend on.

A notable parameter here is ComputeResourceRequirements, a component-level configuration that determines the amount of resources the component needs (memory, vCPUs, accelerators). The adapters will share these resources with the base component.

base_inference_component_name = f"base-{model_name}"

variant_name = "AllTraffic"

initial_copy_count = 1
min_memory_required_in_mb = 32000
number_of_accelerator_devices_required = 4

sm_client.create_inference_component(
    InferenceComponentName = base_inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)

In this example, you create a single adapter, but you could host up to hundreds of them per endpoint. They need to be compressed and uploaded to Amazon S3.

The adapter package has the following files at the root of the archive with no sub-folders.

Adapter Files

For this example, an adapter was fine-tuned using QLoRA and Fully Sharded Data Parallel (FSDP) on the training split of the ECTSum dataset. Training took 21 minutes on an ml.p4d.24xlarge and cost approximately $13 using current on-demand pricing.

For each adapter you are going to deploy, you need to specify an InferenceComponentName, an ArtifactUrl with the S3 location of the adapter archive, and a BaseInferenceComponentName to create the connection between the base model inference component and the new adapter inference components. You repeat this process for each additional adapter.

adapter_ic1_name = f"adapter-ectsum-{base_inference_component_name}"
adapter_s3_uri = "<<S3_PATH_FOR_YOUR_ADAPTER>>"

sm_client.create_inference_component(
    InferenceComponentName = adapter_ic1_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": adapter_s3_uri
        },
    },
)

Use the deployed adapter

First, you build a prompt to invoke the model for earnings summarization, filling in the source text with a random item from the ECTSum dataset. Then you store the ground truth summary from the item for comparison later.

from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, trust_remote_code=True, split="test")

test_item = test_dataset.shuffle().select(range(1))[0] # a single random example as a dict

prompt =f"""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an AI assistant trained to summarize earnings calls.
    Provide a concise summary of the call, capturing the key points and overall context.
    Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Summarize the following earnings call:

    {test_item["text"]}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

ground_truth_response = test_item["summary"]

To test the base model, specify the EndpointName for the endpoint you created earlier and the name of the base inference component as InferenceComponentName, along with your prompt and other inference parameters in the Body parameter:

component_to_invoke = base_inference_component_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

base_model_response = json.loads(response_model["Body"].read())["generated_text"]

To invoke the adapter, use the adapter inference component name in your invoke_endpoint call:

component_to_invoke = adapter_ic1_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read())["generated_text"]

Compare outputs

Compare the outputs of the base model and adapter to ground truth. While the base model might appear subjectively better in this test, the adapter’s response is actually much closer to the ground truth response. This will be proven with metrics in the next section.

Ground Truth:

q3 non-gaap earnings per share $3.71.
q3 sales rose 15.4 percent to $747 million.
bio rad laboratories - now anticipates 2021 non-gaap currency-neutral revenue growth between 12 to 13 percent.
sees 2021 estimated non-gaap operating margin of about 19.5 percent.
qtrly gaap income per share $129.96.

----------------------------------

Base Model Response:

Here is a summary of the earnings call:

**Key Points:**

* Revenue: $747 million, up 15.4% year-over-year (13.8% on a currency-neutral basis)
* Earnings per share: $129.96, up from $3 per share in Q3 2020
* Gross margin: 58.6% on a GAAP basis, 57.9% on a non-GAAP basis
* Operating income: $156.8 million, up from $109.6 million in Q3 2020
* Net income: $3.928

----------------------------------

Adapter Model Response:

                Here is a concise summary of the call:

                q3 revenue $747.6 million versus refinitiv ibes estimate of $753.9 million.
q3 earnings per share $3.71.
sees fy earnings per share $11.85 to $12.05.
sees fy 2021 non-gaap revenue growth to be 12% to 13%.
sees fy 2021 non-gaap gross margin to be 57.5% to 57.8%.
sees fy 2021 non-gaap operating margin to be 19.5%.

To validate the true adapter performance, you can use a tool like fmeval to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BertScore metrics for the adapter vs. the base model. Doing so against the test split of ECTSum yields the following results.
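fmeval computes these metrics for you. To make the ROUGE comparison concrete, here is a minimal bigram-overlap F1 (the idea behind ROUGE-2) in plain Python; this is a simplified sketch with basic whitespace tokenization, not the fmeval implementation:

```python
from collections import Counter

def rouge2_f1(candidate: str, reference: str) -> float:
    """Bigram-overlap F1 between a candidate and a reference summary.
    Simplified: lowercase whitespace tokenization, no stemming."""
    def bigrams(text: str) -> Counter:
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())   # clipped bigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge2_f1("q3 sales rose 15.4 percent",
                "q3 sales rose 15.4 percent to $747 million"))  # ≈ 0.73
```

Because the ground truth summaries in ECTSum are terse and formulaic, a fine-tuned adapter that mimics that style scores much higher on bigram overlap than a verbose base model response, which is what the results below show.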

Testing Score Text

The fine-tuned adapter shows a 59% increase in METEOR score, 159% increase in ROUGE score, and 8.6% increase in BertScore.

The following diagram shows the frequency distribution of scores for the different metrics, with the adapter scoring higher more often across all metrics.

Testing Scores

We observed an end-to-end latency difference of up to 10% between base model invocation and the adapter in our tests. If the adapter is loaded from CPU memory or disk, it incurs an additional cold start delay on its first load to GPU. Depending on your container configuration and chosen instance type, these values may vary.
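If you want to measure this for your own workload, a simple timing harness around the invocation is enough. The following is a sketch; the `invoke` helper in the usage comments is a hypothetical wrapper around your `invoke_endpoint` call, not a SageMaker API:

```python
import time

def latency_percentiles(fn, n: int = 20, percentiles=(50, 90)) -> dict:
    """Time `fn` n times and report latency percentiles in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    # Nearest-rank percentile: simple and adequate for a quick comparison.
    return {p: samples[min(len(samples) - 1, int(len(samples) * p / 100))]
            for p in percentiles}

# Usage: compare the base component and an adapter under identical prompts.
# base_stats    = latency_percentiles(lambda: invoke(base_inference_component_name))
# adapter_stats = latency_percentiles(lambda: invoke(adapter_ic1_name))
```

Run a few warm-up invocations first so that adapter cold starts don't dominate the percentiles unless cold-start latency is what you want to measure.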

Update an existing adapter

Because adapters are managed as inference components, you can update them on a running endpoint. SageMaker handles the unloading and deregistering of the old adapter and loading and registering of the new adapter onto every base inference component on all the instances that it is running on for this endpoint. To update an adapter inference component, use the update_inference_component API and supply the existing inference component name and the Amazon S3 path to the new compressed adapter archive.

You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = adapter_ic1_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_adapter_s3_uri
        },
    },
)

Remove adapters

If you need to delete an adapter, call the delete_inference_component API with the inference component name to remove it:

sess = sagemaker.session.Session()
sess.delete_inference_component(adapter_ic1_name, wait = True)

Deleting the base model inference component automatically deletes it along with any associated adapter inference components:

sess.delete_inference_component(base_inference_component_name, wait = True)

Pricing

SageMaker multi-adapter inference is generally available in AWS Regions US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo), and is available at no extra cost.

Conclusion

The new efficient multi-adapter inference feature in SageMaker opens up exciting possibilities for customers with fine-tuning use cases. By allowing the dynamic loading of fine-tuned LoRA adapters, you can quickly and cost-effectively customize AI models to your specific needs. This flexibility unlocks new opportunities to deploy powerful, customized AI across organizations in industries like marketing, healthcare, and finance. The ability to manage these adapters at scale through SageMaker inference components makes it effortless to build tailored generative AI solutions.


About the Authors

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock

Prompt engineering refers to the practice of writing instructions to get the desired responses from foundation models (FMs). You might have to spend months experimenting and iterating on your prompts, following the best practices for each model, to achieve your desired output. Furthermore, these prompts are specific to a model and task, and performance isn’t guaranteed when they are used with a different FM. This manual effort required for prompt engineering can slow down your ability to test different models.

Today, we are excited to announce the availability of Prompt Optimization on Amazon Bedrock. With this capability, you can now optimize your prompts for several use cases with a single API call or a click of a button on the Amazon Bedrock console.

In this post, we discuss how you can get started with this new feature using an example use case in addition to discussing some performance benchmarks.

Solution overview

At the time of writing, Prompt Optimization on Amazon Bedrock supports Anthropic’s Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, and Claude 3.5 Sonnet models; Meta’s Llama 3 70B and Llama 3.1 70B models; Mistral’s Large model; and Amazon’s Titan Text Premier model. Prompt Optimization can result in significant improvements on generative AI tasks; we share example performance benchmarks for several tasks later in this post.

In the following sections, we demonstrate how to use the Prompt Optimization feature. For our use case, we want to optimize a prompt that looks at a call or chat transcript, and classifies the next best action.

Use automatic prompt optimization

To get started with this feature, complete the following steps:

  1. On the Amazon Bedrock console, choose Prompt management in the navigation pane.
  2. Choose Create prompt.
  3. Enter a name and optional description for your prompt, then choose Create.
  4. For User message, enter the prompt template that you want to optimize.

For example, we want to optimize a prompt that looks at a call or chat transcript and classifies the next best action as one of the following:

  • Wait for customer input
  • Assign agent
  • Escalate

The following screenshot shows what our prompt looks like in the prompt builder.

  5. In the Configurations pane, for Generative AI resource, choose Models and choose your preferred model. For this example, we use Anthropic’s Claude 3.5 Sonnet.
  6. Choose Optimize.

A pop-up appears that indicates that your prompt is being optimized.

When optimization is complete, you should see a side-by-side view of the original and the optimized prompt for your use case.

  7. Add values to your test variables (in this case, transcript) and choose Run.

You can then see the output from the model in the desired format.

As this example shows, the optimized prompt is more explicit, with clear instructions on how to process the original transcript provided as a variable. This results in the correct classification, in the required output format. After a prompt has been optimized, you can deploy it into an application by creating a version, which takes a snapshot of its configuration. You can store multiple versions to switch between different use-case prompt configurations. See prompt management for more details on prompt version control and deployment.
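The console flow above can also be driven programmatically with a single API call. The following is a hedged sketch assuming the `optimize_prompt` operation on the `bedrock-agent-runtime` boto3 client; verify the operation name and field shapes against the current API reference before relying on them. The request payload is built by a helper so it can be inspected without AWS credentials:

```python
def build_optimize_request(prompt: str, target_model_id: str) -> dict:
    """Assemble the request for Bedrock prompt optimization.
    Field names assume the OptimizePrompt API shape; verify against
    the current boto3 documentation."""
    return {
        "input": {"textPrompt": {"text": prompt}},
        "targetModelId": target_model_id,
    }

request = build_optimize_request(
    prompt="Classify the next best action for this transcript: {{transcript}}",
    target_model_id="anthropic.claude-3-5-sonnet-20240620-v1:0",
)

# Hypothetical invocation -- requires AWS credentials and a supported Region:
# import boto3
# client = boto3.client("bedrock-agent-runtime")
# response = client.optimize_prompt(**request)
# for event in response["optimizedPrompt"]:  # results arrive as an event stream
#     print(event)
```

Keeping the payload construction separate also makes it easy to optimize the same template against several target models and compare the results.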

Performance benchmarks

We ran the Prompt Optimization feature on several open source datasets. We are excited to share the improvements seen in a few important and common use cases that we see our customers working with:

  • Summarization (XSUM)
  • RAG-based dialog continuation (DSTC)
  • Function calling (GLAIVE)

To measure performance improvement with respect to the baseline prompts, we use ROUGE-2 F1 for the summarization use case, HELM-F1 for the dialog continuation use case, and HELM-F1 and JSON matching for function calling. We saw a performance improvement of 18% on the summarization use case, 8% on dialog completion, and 22% on function calling benchmarks. The following table contains the detailed results.

Use Case | Original Prompt | Optimized Prompt | Performance Improvement
Summarization First, please read the article below.
{context}
 Now, can you write me an extremely short abstract for it?
<task>
Your task is to provide a concise 1-2 sentence summary of the given text that captures the main points or key information.
</task><context>
{context}
</context><instructions>
Please read the provided text carefully and thoroughly to understand its content. Then, generate a brief summary in your own words that is much shorter than the original text while still preserving the core ideas and essential details. The summary should be concise yet informative, capturing the essence of the text in just 1-2 sentences.
</instructions><result_format>
Summary: [WRITE YOUR 1-2 SENTENCE SUMMARY HERE]
</result_format>
18.04%
Dialog continuation Functions available:
{available_functions}
Examples of calling functions:
Input:
Functions: [{"name": "calculate_area", "description": "Calculate the area of a shape", "parameters": {"type": "object", "properties": {"shape": {"type": "string", "description": "The type of shape (e.g. rectangle, triangle, circle)"}, "dimensions": {"type": "object", "properties": {"length": {"type": "number", "description": "The length of the shape"}, "width": {"type": "number", "description": "The width of the shape"}, "base": {"type": "number", "description": "The base of the shape"}, "height": {"type": "number", "description": "The height of the shape"}, "radius": {"type": "number", "description": "The radius of the shape"}}}}, "required": ["shape", "dimensions"]}}]
Conversation history: USER: Can you calculate the area of a rectangle with a length of 5 and width of 3?
Output:
{"name": "calculate_area", "arguments": {"shape": "rectangle", "dimensions": {"length": 5, "width": 3}}}Input:
Functions: [{"name": "search_books", "description": "Search for books based on title or author", "parameters": {"type": "object", "properties": {"search_query": {"type": "string", "description": "The title or author to search for"}}, "required": ["search_query"]}}]
Conversation history: USER: I am looking for books by J.K. Rowling. Can you help me find them?
Output:
{"name": "search_books", "arguments": {"search_query": "J.K. Rowling"}}Input:
Functions: [{"name": "calculate_age", "description": "Calculate the age based on the birthdate", "parameters": {"type": "object", "properties": {"birthdate": {"type": "string", "format": "date", "description": "The birthdate"}}, "required": ["birthdate"]}}]
Conversation history: USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today?
Output:
{"name": "calculate_age", "arguments": {"birthdate": "1990-05-15"}}
Current chat history:
{conversation_history}
Respond to the last message. Call a function if necessary.

Task: Respond to the user's message in the given conversation by calling appropriate functions if necessary.

Instructions:
1. Review the list of available functions:
<available_functions>
{available_functions}
</available_functions>

2. Study the examples of how to call these functions:
<fewshot_examples>

<example>
H:
<context>Functions: [{"name": "calculate_area", "description": "Calculate the area of a shape", "parameters": {"type": "object", "properties": {"shape": {"type": "string", "description": "The type of shape (e.g. rectangle, triangle, circle)"}, "dimensions": {"type": "object", "properties": {"length": {"type": "number", "description": "The length of the shape"}, "width": {"type": "number", "description": "The width of the shape"}, "base": {"type": "number", "description": "The base of the shape"}, "height": {"type": "number", "description": "The height of the shape"}, "radius": {"type": "number", "description": "The radius of the shape"}}}}, "required": ["shape", "dimensions"]}}]</context>
<question>USER: Can you calculate the area of a rectangle with a length of 5 and width of 3?</question>
A:
<output>{"name": "calculate_area", "arguments": {"shape": "rectangle", "dimensions": {"length": 5, "width": 3}}}</output>
</example>

<example>
H:
<context>Functions: [{"name": "search_books", "description": "Search for books based on title or author", "parameters": {"type": "object", "properties": {"search_query": {"type": "string", "description": "The title or author to search for"}}, "required": ["search_query"]}}]</context>
<question>USER: I am looking for books by J.K. Rowling. Can you help me find them?</question>
A:
<output>{"name": "search_books", "arguments": {"search_query": "J.K. Rowling"}}</output>
</example>

<example>
H:
<context>Functions: [{"name": "calculate_age", "description": "Calculate the age based on the birthdate", "parameters": {"type": "object", "properties": {"birthdate": {"type": "string", "format": "date", "description": "The birthdate"}}, "required": ["birthdate"]}}]</context>
<question>USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today?</question>
A:
<output>{"name": "calculate_age", "arguments": {"birthdate": "1990-05-15"}}</output>
</example>

</fewshot_examples>

3. Carefully read the current conversation history:
<conversation_history>
{conversation_history}
</conversation_history>

4. Analyze the last message from the user and determine if any of the available functions need to be called to provide an appropriate response.

5. If a function call is necessary, follow the format demonstrated in the examples to invoke the relevant function with the required arguments.

6. If no function call is needed, provide a direct response to the user's message.

7. Your response should be concise, relevant, and tailored to the specific context of the conversation.

8. Enclose your final response in <response></response> tags, without any additional preamble or explanation.

Provide your response immediately after these instructions, following the specified format.

8.23%
Function Calling

## Retrieved knowledge snippets

Domain: Restaurant
Entity: THE COPPER KETTLE
Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon..
Highlight: It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio.

Domain: Restaurant
Entity: THE COPPER KETTLE
Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions.
Highlight: We enjoyed a nice view from inside of the restaurant.

## Dialog

user: Howdy, I'm looking for a British restaurant for breakfast.

agent: There are several British restaurants available. Would you prefer a moderate or expensive price range?

user: Moderate price range please.

agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer?

user: I would like the Center of town please.

agent: How about The Copper Kettle?

user: Do they offer a good view?

agent:

Given the above information from a retrieval augmented generation (RAG) system, respond with the appropriate answer to the question from the user.

<task_description>
You are an advanced question-answering system that utilizes information from a retrieval augmented generation (RAG) system to provide accurate and relevant responses to user queries.
</task_description>
<instructions>
1. Carefully review the provided context information:
<context>
Domain: Restaurant
Entity: THE COPPER KETTLE
Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon..
Highlight: It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio.

Domain: Restaurant
Entity: THE COPPER KETTLE
Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions.
Highlight: We enjoyed a nice view from inside of the restaurant.
</context>

2. Analyze the user's question:
<question>
user: Howdy, I'm looking for a British restaurant for breakfast.
agent: There are several British restaurants available. Would you prefer a moderate or expensive price range?
user: Moderate price range please.
agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer?
user: I would like the Center of town please.
agent: How about The Copper Kettle?
user: Do they offer a good view?

agent:
</question>

3. Leverage the context information and your knowledge to generate a concise and accurate answer to the user's question.

4. Ensure your response directly addresses the specific query while incorporating relevant details from the context.

5. Provide your answer in a clear and easy-to-understand manner, without any unnecessary preamble or explanation.
</instructions>

<output_format>
Answer: [Insert your concise answer here]
</output_format>

<example>
Context:
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair, it was initially criticized by some of France's leading artists and intellectuals for its design, but it has become a global cultural icon of France and one of the most recognizable structures in the world.

Question: What is the Eiffel Tower?

Answer: The Eiffel Tower is a wrought-iron lattice tower in Paris, France, named after its designer Gustave Eiffel, and constructed as the centerpiece of the 1889 World's Fair.
</example>

Improvement: 22.03%
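The optimized prompt above follows a consistent XML-tagged template: a task description, numbered instructions, and tagged `<context>` and `<question>` sections. As a rough illustration, such a template can be assembled programmatically from retrieved snippets and the dialog history. The helper below is hypothetical and not part of any Amazon Bedrock API; it only sketches the template structure shown above.

```python
def build_rag_prompt(snippets: list[str], dialog: list[str]) -> str:
    """Assemble an XML-tagged RAG prompt (hypothetical helper, for
    illustration only) from retrieved snippets and dialog turns."""
    context = "\n".join(snippets)
    # The model is asked to continue the dialog, so we end with "agent:".
    question = "\n".join(dialog) + "\nagent:"
    return (
        "<task_description>\n"
        "You are an advanced question-answering system that utilizes "
        "information from a retrieval augmented generation (RAG) system "
        "to provide accurate and relevant responses to user queries.\n"
        "</task_description>\n"
        "<instructions>\n"
        "1. Carefully review the provided context information:\n"
        f"<context>\n{context}\n</context>\n"
        "2. Analyze the user's question:\n"
        f"<question>\n{question}\n</question>\n"
        "3. Leverage the context information and your knowledge to "
        "generate a concise and accurate answer.\n"
        "</instructions>"
    )
```

Keeping the template in one place like this makes it straightforward to reuse the same optimized structure across different retrieval results and conversations.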

The consistent improvements across different tasks highlight the robustness and effectiveness of Prompt Optimization in enhancing prompt performance for various natural language processing (NLP) tasks. This shows that Prompt Optimization can save you considerable time and effort while achieving better outcomes, by letting you test models with optimized prompts that implement the best practices for each model.

Conclusion

Prompt Optimization on Amazon Bedrock empowers you to effortlessly enhance your prompt's performance across a wide range of use cases with just a single API call or a few clicks on the Amazon Bedrock console. The substantial improvements demonstrated on open-source benchmarks for tasks like summarization, dialog continuation, and function calling underscore this new feature's ability to significantly streamline the prompt engineering process. Prompt Optimization on Amazon Bedrock makes it straightforward to test many different models for your generative AI application, following the best prompt engineering practices for each model. The reduced manual effort will greatly accelerate the development of generative AI applications in your organization.
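The single API call can be sketched with the AWS SDK for Python (boto3). The request payload shape and event-stream field names below reflect the OptimizePrompt API as announced and should be verified against the current boto3 documentation; treat this as a sketch rather than a definitive implementation.

```python
def build_optimize_request(prompt_text: str, target_model_id: str) -> dict:
    """Assemble the request payload for the OptimizePrompt API call."""
    return {
        "input": {"textPrompt": {"text": prompt_text}},
        "targetModelId": target_model_id,
    }


def optimize(prompt_text: str, target_model_id: str) -> str:
    """Call Prompt Optimization and collect the optimized prompt text from
    the streamed response. Requires AWS credentials and access to the
    bedrock-agent-runtime endpoint in your Region."""
    import boto3  # imported here so the payload builder stays dependency-free

    client = boto3.client("bedrock-agent-runtime")
    response = client.optimize_prompt(
        **build_optimize_request(prompt_text, target_model_id)
    )
    parts = []
    # The response is an event stream; optimizedPromptEvent carries the result.
    for event in response["optimizedPrompt"]:
        if "optimizedPromptEvent" in event:
            parts.append(
                event["optimizedPromptEvent"]["optimizedPrompt"]["textPrompt"]["text"]
            )
    return "".join(parts)
```

For example, `optimize("Summarize the following dialog: ...", "anthropic.claude-3-haiku-20240307-v1:0")` would return the rewritten prompt tailored to the target model, assuming that model ID is available in your account.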

We encourage you to try out Prompt Optimization with your own use cases and reach out to us for feedback and collaboration.


About the Authors

Shreyas Subramanian is a Principal Data Scientist who helps customers solve their business challenges with generative AI and deep learning on AWS services. Shreyas has a background in large-scale optimization and ML, and in the use of ML and reinforcement learning for accelerating optimization tasks.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.
