Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – part 1

The generative AI landscape has been rapidly evolving, with large language models (LLMs) at the forefront of this transformation. These models have grown exponentially in size and complexity, with some now containing hundreds of billions of parameters and requiring hundreds of gigabytes of memory. As LLMs continue to expand, AI engineers face increasing challenges in deploying and scaling these models efficiently for inference. One of the primary bottlenecks in the inference deployment process has been the time required to load these massive models onto accelerators. With LLMs now reaching hundreds of gigabytes in size, it has become increasingly difficult for many users to address bursty traffic patterns and scale quickly. For LLMs that often require high-throughput, low-latency inference, this loading process can add significant overhead to the total deployment and scaling time, potentially impacting application performance during traffic spikes. SageMaker Large Model Inference (LMI) is a deep learning container that helps customers quickly get started with LLM deployments on SageMaker Inference.

Today at AWS re:Invent 2024, we are excited to announce a new capability in Amazon SageMaker Inference that significantly reduces the time required to deploy and scale LLMs for inference using LMI: Fast Model Loader. This capability lets you scale your models faster, with up to a 19% reduction in end-to-end latency when scaling a new model copy onto a new instance for inference. It represents a substantial leap forward in loading large models efficiently. Fast Model Loader introduces a novel approach by streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, enabling faster model loading.

In our internal testing, we observed that Fast Model Loader can load large models up to 15 times faster compared to traditional loading methods. This dramatic improvement in loading speed opens up new possibilities for responsive AI systems, potentially enabling faster scaling and more dynamic applications that can adapt quickly to changing demands. During our performance testing, we were able to load the Meta Llama 3.1 70B model on an ml.p4d.24xlarge instance in just 1 minute. This model, with its 70 billion parameters, typically requires over 140 GB of memory in full precision, underscoring the magnitude of the loading challenge that Fast Model Loader addresses.

Fast Model Loader is designed to tackle scaling challenges, potentially leading to improved resource utilization on GPU instances and more efficient scaling during autoscaling events. This feature aims to provide you with a powerful new option for managing the deployment and scaling of your LLMs on SageMaker inference, whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM-based services.

This post is Part 1 of a series exploring Fast Model Loader. In this post, we delve into the technical details of Fast Model Loader, explore its integration with existing SageMaker workflows, discuss how you can get started with this powerful new feature, and share customer success stories. In Part 2, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments.

Challenges in deploying LLMs for inference

As LLMs and their respective hosting containers continue to grow in size and complexity, AI and ML engineers face increasing challenges in deploying and scaling these models efficiently for inference. The rapid evolution of LLMs, with some models now using hundreds of billions of parameters, has led to a significant increase in the computational resources and sophisticated infrastructure required to run them effectively.

One of the primary bottlenecks in the deployment process is the time required to download and load containers when scaling up endpoints or launching new instances. This challenge is particularly acute in dynamic environments where rapid scaling is crucial to maintain service quality. The sheer size of these containers, often ranging from several gigabytes to tens of gigabytes, can lead to substantial delays in the scaling process.

When a scale-up event occurs, several actions take place, each contributing to the total time between triggering a scale-up event and serving traffic from the newly added instances. These actions typically include:

  • Provisioning new compute instances
  • Downloading the container image
  • Loading the container image
  • Downloading the model artifacts from Amazon S3 to disk
  • Loading the model artifacts on the host (using CPU and memory)
  • Preparing the model to be loaded on GPU (quantization, model sharding, and so on)
  • Loading the final model artifacts on the GPU

The cumulative time for these steps can take up to tens of minutes, depending on the model size, runtime used by the model, and infrastructure capabilities. This delay can lead to suboptimal user experiences and potential service degradation during scaling activities, making it a critical area for optimization in the field of AI inference infrastructure.

To reduce the time it takes to download and load the container image, SageMaker now supports container caching. To learn more about this new feature, refer to Supercharge your auto scaling for generative AI inference: Introducing Container Caching in SageMaker Inference.

For model loading, a typical deployment follows the steps described in this section, which can lead to non-ideal deployment latency. As a result, requests can sit in the queue waiting to be processed while the deployment completes, or can be dropped when timeouts are exceeded, as shown in the following diagrams.

One way to optimize deployment is ahead-of-time (AOT) compilation of your model. This requires you to create (or use existing) pre-sharded models so the model doesn't need to be processed during runtime deployment. By taking on the cost of pre-creating these artifacts and referencing them as persisted objects, you incur that latency ahead of time rather than during scaling. This can significantly reduce the time it takes to scale up a model, especially for larger models.

The benefits of this approach are particularly noticeable for larger models:

  • Reduced scaling time – Pre-sharded models can be loaded more quickly, decreasing the time required to bring new instances online during scaling events
  • Improved resource utilization – By offloading the compilation and sharding process, more computational resources are available for inference tasks during runtime
  • Consistency – Pre-compiled artifacts provide consistent performance across deployments

Although there is an upfront cost in creating these artifacts, the long-term savings in reduced scaling times and improved resource utilization can be substantial, especially for models that are frequently deployed or require rapid scaling. This approach can significantly reduce the time it takes to scale up a model, particularly for larger models, leading to more responsive and efficient AI systems. The following figures illustrate the proposed way to load models.

Additionally, disk becomes a bottleneck during model loading due to its limited I/O bandwidth. Traditional storage systems struggle with the high throughput required for large-scale model loading, like Meta Llama 3.1 70B. Disk read/write speeds are often much slower than network or GPU memory bandwidths, creating delays in transferring model weights. This issue can be alleviated by streaming data directly from Amazon S3 to GPU memory, bypassing disk entirely.

With Fast Model Loader, we can now take a significant step further by eliminating the work done on host resources and the sequential steps between downloading the model artifacts and loading them onto the GPU.

Weight streaming

Fast Model Loader streams weights directly from Amazon S3 to GPUs, cutting out the intermediary steps: the bytes representing model weights are downloaded into CPU memory and immediately copied over to the GPU using Direct Memory Access (DMA). This simplifies the model loading workflow and makes it straightforward to maximize model loading throughput. It presents the following key advantages (a conceptual sketch of the streaming pipeline follows this list):

  • No waiting – In the traditional approach, each step in the loading process (download, load to the host's CPU memory, copy to GPU) needs to complete for a tensor or a layer before the next step can begin. This creates synchronous bottlenecks, where components sit idle while waiting for the previous step to finish. Fast Model Loader's direct streaming approach eliminates these synchronous blocking operations, allowing all components to operate at their maximum potential concurrently.
  • No accumulation – Instead of downloading the entire model to disk or CPU memory before processing, Fast Model Loader streams the model weights in small chunks directly to the GPU. This avoids the need to accumulate the full model in system storage or memory, reducing the overall resource requirements and footprint.
  • Maximum throughput – By simplifying the model loading workflow and eliminating intermediate steps, Fast Model Loader can more effectively take advantage of the high-throughput capabilities of Amazon S3 and the generous network bandwidth available on the large instances typically used for hosting LLMs. This allows the model loading process to achieve maximum throughput and minimize latency.
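
To make the streaming pipeline concrete, the following is a minimal conceptual sketch, not the SageMaker implementation: it downloads fixed-size byte ranges from Amazon S3 on a thread pool and hands each buffer to the GPU as soon as it arrives. The bucket, key, chunk size, and copy function are illustrative assumptions.

# Conceptual sketch only (not the SageMaker implementation): stream fixed-size byte
# ranges from Amazon S3 and hand each buffer to the GPU as soon as it arrives, instead
# of downloading the whole model to disk first.
import concurrent.futures

import boto3

BUCKET = "my-model-bucket"           # hypothetical bucket
KEY = "llama-3-1-70b/shard-0.bin"    # hypothetical key
CHUNK_SIZE = 8 * 1024 * 1024         # 8 MB, matching the uniform chunk size described later

s3 = boto3.client("s3")
object_size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]

def fetch_chunk(offset):
    # Each byte range is independent, so chunks can be fetched concurrently
    end = min(offset + CHUNK_SIZE, object_size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={offset}-{end}")["Body"]
    return offset, body.read()

def copy_to_gpu(offset, data):
    # Placeholder for the host-to-device copy; in practice this is a DMA copy from a
    # pinned CPU buffer to the precomputed GPU offset for this chunk
    pass

with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    futures = [pool.submit(fetch_chunk, off) for off in range(0, object_size, CHUNK_SIZE)]
    for future in concurrent.futures.as_completed(futures):   # no waiting, no accumulation
        offset, data = future.result()
        copy_to_gpu(offset, data)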

The following figure compares model load times for sequential vs. parallel processes.

Model sharding for streaming

The weight streaming paradigm described in the previous section requires the model weights to be prepared appropriately before streaming. To stream the model weights in small chunks, they must be stored in a format designed for that access pattern.

The traditional approach to storing and distributing LLM weights often relies on the SafeTensors format. Although SafeTensors provides a standardized way to package and distribute model weights, it presents some challenges for the weight streaming paradigm used by Fast Model Loader. In the SafeTensors format, the fundamental unit of storage is the tensor. Tensors are multi-dimensional arrays that represent the various weights and parameters of a machine learning model. However, the size of these tensors can vary significantly, ranging from a few megabytes to several gigabytes, depending on the complexity and scale of the model. This non-uniform distribution of tensor sizes poses a problem for Fast Model Loader's weight streaming approach: the variable tensor sizes make it difficult to achieve consistent throughput. Larger tensors require more time and resources to load, whereas smaller tensors leave resources underutilized, leading to inefficiencies in the overall loading process.

The following figure illustrates loading SafeTensors weights of various sizes.

Fast Model Loader introduces a new model format with the following key advantages:

  • Pre-sharding – The explosion in model sizes has seen them outgrow GPUs. The largest models available today exceed 1 TB in size, whereas the largest GPUs fall short of 200 GB of memory. This has led to the adoption of distributed inference strategies such as tensor parallelism, which splits a model into portions (shards) and distributes them across multiple GPUs. However, this involves quite a few computations to decide how to split the model at every layer and to calculate offsets based on tensor size and available GPU memory. Fast Model Loader performs this optimization before deployment, which avoids the overhead during scaling activities. The preparation only happens one time, and the model can then be deployed to any number of instances with the same distributed inference strategy. The following figure provides an overview of pre-sharding.

  • Uniform size distribution – The model weights are stored in uniform 8 MB chunks, which are simpler to parallelize for concurrent processing. The following figure illustrates uniform chunks being parallelized across cores.

  • Out-of-order processing – Objects in Amazon S3 are typically read in order: to reach the middle of an object, the read starts at the beginning and continues until it gets to the middle. This forces model weights to be downloaded synchronously, which runs contrary to the fast model loading paradigm. Storing model weights in uniform 8 MB chunks allows any piece of the model to be accessed at any time without synchronization. The following figures illustrate how breaking tensors into chunks allows for asynchronous, out-of-order retrieval; a minimal sketch of this chunk layout follows this list.
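
The following is a minimal sketch of the chunk layout idea, with an assumed layout (not the actual Fast Model Loader format): variable-size tensors are flattened into one byte stream and indexed in fixed 8 MB chunks, so any chunk maps directly to an independent S3 byte-range request.

# Illustrative chunk index over flattened tensors (an assumed layout, not the actual
# Fast Model Loader format): uniform 8 MB chunks mean any piece of the model can be
# addressed and fetched independently with an S3 byte-range request.
CHUNK_SIZE = 8 * 1024 * 1024

# Hypothetical tensor sizes in bytes (variable, as in SafeTensors)
tensors = {
    "embed_tokens.weight": 525_336_576,
    "layers.0.attn.q_proj.weight": 134_217_728,
}

# Lay the tensors out back to back in one byte stream and record their offsets
layout, cursor = {}, 0
for name, size in tensors.items():
    layout[name] = (cursor, cursor + size)   # [start, end) in the flattened stream
    cursor += size

def chunks_for_tensor(name):
    # Return the (start, end) byte ranges of the fixed-size chunks covering a tensor;
    # each range maps directly to an S3 "Range: bytes=start-end" request
    start, end = layout[name]
    first = (start // CHUNK_SIZE) * CHUNK_SIZE
    return [(off, min(off + CHUNK_SIZE, cursor) - 1) for off in range(first, end, CHUNK_SIZE)]

# Any tensor's chunks can now be fetched concurrently and out of order
print(chunks_for_tensor("layers.0.attn.q_proj.weight")[:3])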

Performance testing

The implementation of Fast Model Loader demonstrates a significant improvement in end-to-end (E2E) scaling time for large language models. Across five simulations performed with Meta Llama 3.1 70B, we observed remarkably consistent results, reinforcing the reliability of our findings. For these tests, container caching was enabled. When using CUDA graphs, the average scaling time was reduced from 407 seconds (6.78 minutes) to 334.6 seconds (5.58 minutes), a substantial 21.64% improvement. Similarly, without CUDA graphs, the average scaling time decreased from 379 seconds (6.32 minutes) to 306 seconds (5.10 minutes), a 19.26% reduction. By cutting scaling times by approximately one-fifth in all observed cases, this feature enables more responsive scaling and better handling of dynamic workloads, ultimately leading to improved performance and resource utilization in AI inference.

To run this benchmark, we use sub-minute metrics to detect the need for scaling. For more details, see Amazon SageMaker inference launches faster auto scaling for generative AI models and Container Caching.
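
For reference, the sub-minute scaling trigger can be configured as a target-tracking policy with Application Auto Scaling. The following is a minimal sketch: the inference component name, capacity limits, and target value are assumptions, and the high-resolution predefined metric name should be verified against the faster auto scaling documentation for your Region.

import boto3

autoscaling = boto3.client("application-autoscaling")
# Hypothetical inference component name; adapt to your endpoint
resource_id = "inference-component/my-llama-ic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="llama-concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "PredefinedMetricSpecification": {
            # High-resolution (sub-minute) concurrency metric; verify the exact
            # name for your Region before using it
            "PredefinedMetricType": "SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution"
        },
        "TargetValue": 5.0,   # target concurrent requests per copy (assumption)
    },
)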

The following table summarizes our setup.

Region | CMH (us-east-2)
Instance type | ml.p4d.24xlarge
Container | LMI
Container image | 763104351884.dkr.ecr.us-east-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124
Model | Meta Llama 3.1 70B

For this scenario, we illustrate scaling the model by adding a new instance.
The following table summarizes the results when the model is not sharded. All numbers are in seconds.

Meta Llama 3.1 70B

Trial | Time to Detect Need for Scaling | Time to Spin Up an Instance | Time to Instantiate a New Model Copy | CUDA Graphs Capture Overhead | E2E Scaling Latency (With CUDA Graphs) | E2E Scaling Latency (Without CUDA Graphs)
1 | 40 | 185 | 173 | 28 | 398 | 370
2 | 40 | 175 | 188 | 29 | 403 | 374
3 | 40 | 164 | 208 | 29 | 412 | 383
4 | 40 | 185 | 187 | 30 | 412 | 382
5 | 40 | 185 | 187 | 28 | 412 | 384
Average | – | 179 | 189 | 29 | 407 | 379

The following table summarizes the results after the model is sharded. All numbers are in seconds.

Meta Llama 3.1 70B

Trial | Time to Detect Need for Scaling | Time to Spin Up an Instance | Time to Instantiate a New Model Copy | CUDA Graphs Capture Overhead | E2E Scaling Latency (With CUDA Graphs) | E2E Scaling Latency (Without CUDA Graphs)
1 | 40 | 185 | 119 | 28 | 344 | 316
2 | 40 | 175 | 119 | 30 | 334 | 304
3 | 40 | 169 | 119 | 28 | 328 | 300
4 | 40 | 169 | 120 | 28 | 329 | 301
5 | 40 | 179 | 119 | 29 | 338 | 309
Average | – | 175.4 | 119.2 | 28.6 | 334.6 | 306

The following table summarizes the impact on E2E scaling time. All numbers are in seconds.

Scenario | Before | After | % Improvement
Scaling with CUDA Graphs | 407 | 334.6 | 21.64%
Scaling without CUDA Graphs | 379 | 306 | 19.26%

Note: Customers using On-Demand Capacity Reservations (ODCRs) for GPUs may experience lower instance spin-up times compared to On-Demand capacity, depending on the instance type.

Impact on ready-to-serve time

The following benchmarks show that SageMaker Fast Model Loader can load large models significantly faster than traditional approaches. For the Meta Llama 3.1 70B model on an ml.p4d.24xlarge instance, we compared the download and load times against two traditional methods: downloading the model from the Hugging Face Hub using Transformers, and downloading the model from Amazon S3 using vLLM's default downloader. In both cases, we used vLLM's default loader to load the model after the download.

All numbers are in seconds.

Method | Download | Load | % Improvement with Fast Model Loader | Speedup with Fast Model Loader
Transformers downloader + vLLM model loader | 602 | 138 | 93.24% | 15x
vLLM downloader + vLLM model loader | 127 | 138 | 81.13% | 5x
Fast Model Loader | 50 (download and load combined) | – | – | –

The load time here indicates the time taken to get the model fully ready to serve, including time taken to initialize the KV cache.

How to get started

You can start using Fast Model Loader now through the Amazon SageMaker Studio console using Amazon SageMaker JumpStart or programmatically using the SageMaker Python SDK.

From the SageMaker Studio JumpStart hub, you can pick a model and choose Optimize to run the inference optimization job, and then deploy the optimized model to a SageMaker endpoint. For more detailed instructions, refer to Part 2 of this post.

Though SageMaker Studio provides a user-friendly interface for model optimization through SageMaker JumpStart, you can also achieve the same functionality programmatically using the SageMaker Python SDK. The ModelBuilder class offers a streamlined way to optimize and deploy large models, requiring just a few lines of code to prepare your model for fast loading and inference. The following code snippet shows the core implementation to use ModelBuilder to prepare and optimize the model for Fast Model Loader. You can find an end-to-end example notebook in the following GitHub repo.

# Create a model builder object
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test")
)

# Run the model optimization job
model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)
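
After the optimization job completes, you build and deploy the optimized model, which Part 2 covers in detail; a minimal sketch looks like the following (the memory and accelerator values are illustrative):

from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements

# Build the deployable artifacts for the model server
final_model = model_builder.build()

# Deploy to a SageMaker endpoint; memory (in MB) and accelerator count are illustrative
final_model.deploy(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    endpoint_logging=False,
    resources=ResourceRequirements(requests={"memory": 204800, "num_accelerators": 8}),
)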

Customer testimonials

The introduction of Fast Model Loader in SageMaker has generated significant excitement among our customers, particularly those working with LLMs. We’ve collected early feedback from customers that have had the opportunity to preview this new capability. Their responses underscore the potential of Fast Model Loader to transform the deployment and scaling of AI models, especially in scenarios requiring rapid response to changing demands.

Atomicwork is a modern ITSM and ESM solution that revolutionizes internal support for organizations through AI-powered chat interfaces, replacing traditional ticketing portals.

“Amazon SageMaker Fast Model Loader is a game changer for our AI-driven enterprise workflows. It significantly accelerates the deployment and scaling of the large language models, which are critical for providing responsive, chat-based IT support, HR processes, and customer service operations. We look forward to adopting this feature that allows us to optimize our computational resources while maintaining the agility our enterprise customers expect, helping us deliver a truly intelligent service management platform.”

– Kiran Darisi, Co-founder and CTO of Atomicwork.

Conclusion

In this post, we discussed how loading large model artifacts can be the bottleneck in loading and scaling foundation models (FMs). SageMaker has launched a new feature called Fast Model Loader to address challenges in deploying and scaling FMs for inference. Fast Model Loader can load large models up to 15 times faster by streaming model weights directly from Amazon S3 to the accelerator, reducing scaling and deployment times significantly.

In Part 2 of this post, we demonstrate how you can try out this new feature through either the SageMaker Python SDK or SageMaker Studio console.


About the Authors

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

Marc Karp is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers design, deploy, and manage ML workloads at scale. In his spare time, he enjoys traveling and exploring new places.

Anisha Kolla is a Software Development Engineer on the SageMaker Inference team with over 10 years of industry experience. She is passionate about building scalable and efficient solutions that empower customers to deploy and manage machine learning applications seamlessly. Anisha thrives on tackling complex technical challenges and contributing to innovative AI capabilities. Outside of work, she enjoys exploring fusion cuisines, traveling, and spending time with family and friends.

Introducing Fast Model Loader in SageMaker Inference: Accelerate autoscaling for your Large Language Models (LLMs) – Part 2

In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models onto accelerators. By streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader can achieve up to 15 times faster loading times compared to traditional methods.

As the AI landscape continues to evolve and models grow even larger, innovations like Fast Model Loader become increasingly crucial. By significantly reducing model loading times, this feature has the potential to transform the way you deploy and scale your LLMs, enabling more responsive and efficient AI applications across a wide range of use cases.

In this post, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments. We explore two approaches: using the SageMaker Python SDK for programmatic implementation, and using the Amazon SageMaker Studio UI for a more visual, interactive experience. Whether you’re a developer who prefers working with code or someone who favors a graphical interface, you’ll learn how to take advantage of this powerful feature to accelerate your LLM deployments.

Solution overview

Fast Model Loader is currently integrated with SageMaker Large Model Inference (LMI) containers (starting with v13) for GPU instances. It introduces two key techniques to enable lightning-fast model loads:

  • Weight streaming
  • Model sharding for streaming

Use Fast Model Loader with the SageMaker Python SDK

In this section, we show how to use the SageMaker Python SDK to use this new feature. You can find the example notebook in the following GitHub repo. Complete the following steps:

  1. First, use ModelBuilder to prepare and package the model inference components.

To learn more about the ModelBuilder class, refer to Package and deploy classical ML and LLMs easily with Amazon SageMaker, part 1: PySDK Improvements. In this example, you deploy the Meta Llama 3.1 70B model with the model name meta-textgeneration-llama-3-1-70b in Amazon SageMaker JumpStart.

The SchemaBuilder parameter is used to infer the serialization and deserialization methods for the model. For more information on SchemaBuilder, refer to Define serialization and deserialization methods.

You can choose to specify OPTION_TENSOR_PARALLEL_DEGREE as a ModelBuilder environment variable as shown in the following commented lines, or in the next step as part of the ModelBuilder sharding_config:

from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging

# Define sample input and output for the model
prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."
# Create the input schema structure
sample_input = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 32}
}
# Define the expected output format
sample_output = [{"generated_text": response}]

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
    #env_vars={
    #   "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
)
  2. Next, use the optimize() function to prepare the model shards for deployment.

The optimize() function will start a model optimization job and will take a few minutes to finish. The tensor parallel degree should be set to how many GPUs you want each inference component to have access to. You can find the model shards at the output_path S3 location under a folder starting with sagemaker-fast-model-loader-xxx.

model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            # The value must be equal to the number of GPUs that will be used for each IC.
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)
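
To confirm that the shards were written, you can list the output location; the following is a minimal sketch that assumes output_path is an s3://bucket/prefix URI:

import boto3

# Assumes output_path looks like "s3://my-bucket/some/prefix"; adjust for your setup
bucket, _, prefix = output_path.removeprefix("s3://").partition("/")

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in response.get("Contents", []):
    # The shards are written under a folder starting with sagemaker-fast-model-loader-
    print(obj["Key"], obj["Size"])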

You can reuse the sharded model that was generated by previous optimization jobs. The following code sample demonstrates how to use model_metadata to overwrite the model path, which needs to point to the Amazon S3 location of the existing model shards:

model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    model_metadata={
        "CUSTOM_MODEL_PATH": output_path,
    },
    schema_builder=SchemaBuilder(sample_input="Test", sample_output="Test"),
    role_arn=role,
    instance_type="ml.p4d.24xlarge",
)
  3. When the model optimization job is complete, you can use the build() function to generate the artifacts according to the model server:
    # use the build() function to generate the artifacts according to the model server
    final_model = model_builder.build()

  4. If you're using existing model shards without running an optimization job, you need to make sure _is_sharded_model is set to True and _enable_network_isolation is set to False, because Fast Model Loader requires network access:
    # You only need to set the values if you are using existing sharded models
    if not final_model._is_sharded_model:
        final_model._is_sharded_model = True
    if final_model._enable_network_isolation:
        final_model._enable_network_isolation = False

  5. Use the deploy() function to deploy the model to an endpoint, where you can specify the required resources, such as GPU memory and number of accelerators:
    from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
    
    resources_required = ResourceRequirements(
        requests={
            "memory" : 204800,
            "num_accelerators": 8
        }
    )
    
    # deploy the optimized model to an endpoint
    final_model.deploy(
        instance_type="ml.p4d.24xlarge", 
        accept_eula=True, 
        endpoint_logging=False, 
        resources=resources_required
    )

  6. After the endpoint is up and running, you can test the endpoint using the following code example:
    from sagemaker.predictor import retrieve_default 
    endpoint_name = final_model.endpoint_name 
    predictor = retrieve_default(endpoint_name) 
    payload = { "inputs": "I believe the meaning of life is", 
                "parameters": { 
                    "max_new_tokens": 64, 
                    "top_p": 0.9, 
                    "temperature": 0.6 
                } 
            }
    response = predictor.predict(payload) 
    print(response)

  7. To clean up, run the following code cell to delete the resources created for the endpoint:
    predictor.delete_predictor()
    predictor.delete_endpoint()

Use Fast Model Loader with SageMaker Studio

In this section, we show how to use the faster model loading feature through the SageMaker Studio UI. Complete the following steps:

  1. On the SageMaker Studio console, choose JumpStart in the navigation pane.
  2. Choose your model.
  3. On the model details page, choose Optimize.
  4. Accept the EULA and proceed to the optimization configurations.
  5. Select Fast model loading and set the OPTION_TENSOR_PARALLEL_DEGREE to 8, because this example uses an ml.p4d.24xlarge instance that has 8 GPUs. If you’re using an instance with a different number of GPUs, set the value to match the instance.
  6. Set the output path to the Amazon S3 path where the sharded model will be stored.
  7. Choose Create job.

After the inference optimization job starts, you can check its status on the Inference optimization page. Each job has tags associated with it indicating which optimization configuration was used.

  8. View the details of the job by choosing the job ID.
  9. Deploy the optimized model by choosing Deploy on the optimized job page.
  10. Verify the endpoint settings and choose Deploy to initiate a SageMaker endpoint deployment.

You will get a notification on the SageMaker Studio UI, and the status will change to In service when the endpoint creation is complete.

You can now send a sample inference request to test the model.

After the test, you can delete the endpoint from the SageMaker Studio console to clean up the resources created in this example.

Conclusion

Fast Model Loader represents a significant advancement in how you can deploy and scale LLMs on SageMaker. In this post, we walked through the step-by-step process of implementing this feature through both the SageMaker Python SDK and SageMaker Studio UI. By using weight streaming and model sharding techniques, you can now achieve dramatically faster model loading times, enabling more responsive scaling for your LLM-based applications.

The integration with SageMaker LMI containers (starting from LMI v13) makes it straightforward to adopt this feature in your existing workflows. Whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM services, Fast Model Loader provides the tools you need to optimize your model deployment pipeline.

Try out Fast Model Loader for your own use case, and leave your feedback and questions in the comments.


About the Authors

Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.

James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.

Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.

Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.

Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon

Chronos-Bolt is the newest addition to AutoGluon-TimeSeries, delivering accurate zero-shot forecasting up to 250 times faster than the original Chronos models [1].

Time series forecasting plays a vital role in guiding key business decisions across industries such as retail, energy, finance, and healthcare. Traditionally, forecasting has relied on statistical models [2] like ETS and ARIMA, which remain strong baselines, particularly when training data is limited. Over the past decade, advancements in deep learning have spurred a shift toward so-called global models such as DeepAR [3] and PatchTST [4]. These approaches train a single deep learning model across multiple time series in a dataset—for example, sales across a broad e-commerce catalog or observability metrics for thousands of customers.

Foundation models (FMs) such as Chronos [1] have taken the idea of training a single model across multiple time series a significant step further. These models are pretrained on a vast corpus of real and synthetic time series data, covering diverse domains, frequencies, and history lengths. As a result, they enable zero-shot forecasting—delivering accurate predictions on unseen time series datasets. This lowers the entry barrier to forecasting and greatly simplifies forecasting pipelines by providing accurate forecasts without the need for training. Chronos models have been downloaded over 120 million times from Hugging Face and are available for Amazon SageMaker customers through AutoGluon-TimeSeries and Amazon SageMaker JumpStart.

In this post, we introduce Chronos-Bolt, our latest FM for forecasting that has been integrated into AutoGluon-TimeSeries.

Introducing Chronos-Bolt

Chronos-Bolt is based on the T5 encoder-decoder architecture [5] and has been trained on nearly 100 billion time series observations. It chunks the historical time series context into patches of multiple observations, which are then input into the encoder. The decoder then uses these representations to directly generate quantile forecasts across multiple future steps—a method known as direct multi-step forecasting. This differs from the original Chronos models that rely on autoregressive decoding. The chunking of time series and direct multi-step forecasting makes Chronos-Bolt up to 250 times faster and 20 times more memory-efficient than the original Chronos models.

The following plot compares the inference time of Chronos-Bolt against the original Chronos models for forecasting 1024 time series with a context length of 512 observations and a prediction horizon of 64 steps.

Inference speed comparison between Chronos and Chronos-Bolt

Chronos-Bolt models are not only significantly faster, but also more accurate than the original Chronos models. The following plot reports the probabilistic and point forecasting performance of Chronos-Bolt in terms of the Weighted Quantile Loss (WQL) and the Mean Absolute Scaled Error (MASE), respectively, aggregated over 27 datasets (see [1] for dataset details). Remarkably, despite having no prior exposure to these datasets during training, the zero-shot Chronos-Bolt models outperform commonly used statistical models and deep learning models that have been trained on these datasets (highlighted by *). Furthermore, they also perform better than other FMs, denoted by a +, which indicates that these models were pretrained on certain datasets in our benchmark and are not entirely zero-shot. Notably, Chronos-Bolt (Base) also surpasses the original Chronos (Large) model in terms of the forecasting accuracy while being over 600 times faster.

Zero-shot benchmark for Chronos-Bolt

Chronos-Bolt models are now available on Hugging Face in four sizes—Tiny (9M), Mini (21M), Small (48M), and Base (205M)—and can also be used on the CPU.

Solution overview

In this post, we showcase how to use Chronos-Bolt models using the familiar interface of AutoGluon-TimeSeries. AutoGluon-TimeSeries enables SageMaker customers to build and deploy models for time series forecasting, including FMs such as Chronos-Bolt and other global models, and effortlessly ensemble them with statistical models to maximize accuracy.

Perform zero-shot forecasting with Chronos-Bolt

To get started, you need to install AutoGluon v1.2 by running the following command in an Amazon SageMaker Studio notebook or in the terminal:

pip install autogluon.timeseries~=1.2.0

AutoGluon-TimeSeries uses the TimeSeriesDataFrame to work with time series datasets. The TimeSeriesDataFrame expects data in the long dataframe format with at least three columns: an ID column denoting the IDs of individual time series in the dataset, a timestamp column, and a target column that contains the raw time series values. The timestamps must be uniformly spaced, with missing observations denoted by NaN; Chronos-Bolt handles missing values appropriately. The following snippet loads the Australian Electricity dataset [6], which contains electricity demand data at 30-minute intervals for five Australian states, into a TimeSeriesDataFrame:

from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

The next step involves fitting a TimeSeriesPredictor on this data:

predictor = TimeSeriesPredictor(prediction_length=48).fit(train_data, presets="bolt_base")

We have specified that the TimeSeriesPredictor should produce forecasts for the next 48 steps, or 1 day in this case. AutoGluon-TimeSeries offers various presets that can be used when fitting the predictor. The bolt_base preset, used in this example, employs the Base (205M) variant of Chronos-Bolt for zero-shot inference. Because no model fitting is required for zero-shot inference, the call to fit() returns almost instantaneously. The predictor is now ready to generate zero-shot forecasts, which can be done through the predict method:

predictions = predictor.predict(train_data)

AutoGluon-TimeSeries generates both point and probabilistic (quantile) forecasts for the target value. The probabilistic forecast captures the uncertainty of the target value, which is essential for many planning tasks.
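
A quick way to inspect both is to look at the returned columns; the column names shown here ("mean" plus quantile levels such as "0.1" and "0.9") are the AutoGluon-TimeSeries defaults, and the item ID is the one plotted later in this post:

# Inspect the point forecast and selected quantiles for one series; column names
# ("mean", "0.1", ..., "0.9") are the AutoGluon-TimeSeries defaults
print(predictions.columns.tolist())
print(predictions.loc["T000002", ["mean", "0.1", "0.9"]].head())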

We can also visualize the predictions and compare them against the ground truth target value over the forecast horizon:

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

predictor.plot(test_data, predictions, max_history_length=200, item_ids=["T000002"])

Chronos-Bolt generates an accurate zero-shot forecast, as shown in the following plot illustrating point forecasts and the 80% prediction intervals.

Forecasts Qualitative

Fine-tune Chronos-Bolt with AutoGluon

So far, we have used Chronos-Bolt in inference-only mode for zero-shot forecasting. However, AutoGluon-TimeSeries also allows you to fine-tune Chronos-Bolt on your specific datasets. We recommend using a GPU instance such as g5.2xlarge for fine-tuning. The following snippet specifies two settings for the Chronos-Bolt (Small, 48M) model: zero-shot and fine-tuned. AutoGluon-TimeSeries will perform a lightweight fine-tuning of the pretrained model on the provided training data. We add name suffixes to identify the zero-shot and fine-tuned versions of the model.

predictor = TimeSeriesPredictor(prediction_length=48, eval_metric="MASE").fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {"model_path": "bolt_small", "fine_tune": True, "ag_args": {"name_suffix": "FineTuned"}},
        ]
    },
    enable_ensemble=False,
    time_limit=600,
)

The predictor will be fitted for at most 10 minutes, as specified by the time_limit. After fitting, we can evaluate the two model variants on the test data and generate a leaderboard:

predictor.leaderboard(test_data)

Fine-tuning Leaderboard

Fine-tuning resulted in a significantly improved forecast accuracy, as shown by the test MASE scores. All AutoGluon-TimeSeries models report scores in a “higher is better” format, meaning that most forecasting error metrics like MASE are multiplied by -1 when reported.

Augment Chronos-Bolt with exogenous information

Chronos-Bolt is a univariate model, meaning it relies solely on the historical data of the target time series for making predictions. However, in real-world scenarios, additional exogenous information related to the target series (such as holidays or promotions) is often available. Using this information when making predictions can improve forecast accuracy. AutoGluon-TimeSeries now features covariate regressors, which can be combined with univariate models like Chronos-Bolt to incorporate exogenous information. A covariate regressor in AutoGluon-TimeSeries is a tabular regression model that is fit on the known covariates and static features to predict the target column at each time step. The predictions of the covariate regressor are subtracted from the target column, and the univariate model then forecasts the residuals.
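
To illustrate the mechanism (this is a conceptual sketch, not AutoGluon's internal implementation), the following example fits a generic tabular regressor on synthetic covariates, hands the residual to a naive univariate stand-in, and adds the two parts back together at prediction time:

# Conceptual sketch of a covariate regressor, not AutoGluon's internal code: a tabular
# model predicts the target from known covariates, a univariate model forecasts the
# residual, and the final forecast is the sum of the two parts.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n, horizon = 200, 7
covariates = rng.normal(size=(n + horizon, 2))             # e.g., scaled_price, promotion flags
target = 10 + 3 * covariates[:n, 0] + rng.normal(size=n)   # synthetic unit_sales history

# 1) Fit the covariate regressor on the historical covariates
regressor = GradientBoostingRegressor().fit(covariates[:n], target)

# 2) The univariate model sees only the residuals (a naive last-value forecast stands in here)
residuals = target - regressor.predict(covariates[:n])
residual_forecast = np.full(horizon, residuals[-1])

# 3) Final forecast = regressor prediction on the future covariates + residual forecast
final_forecast = regressor.predict(covariates[n:]) + residual_forecast
print(final_forecast)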

We use a grocery sales dataset to demonstrate how Chronos-Bolt can be combined with a covariate regressor. This dataset includes three known covariates: scaled_price, promotion_email, and promotion_homepage, and the task is to forecast the unit_sales:

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

Grocery Sales DataFrame

The following code fits a TimeSeriesPredictor to forecast unit_sales for the next 7 weeks. We specify the target column we are interested in forecasting and the names of the known covariates while constructing the TimeSeriesPredictor. Two configurations are defined for Chronos-Bolt: a zero-shot setting, which uses only the historical context of unit_sales without considering the known covariates, and a covariate regressor setting, which employs a CatBoost model as the covariate_regressor. We also use a target_scaler, which ensures the time series have a comparable scale before training and typically results in better accuracy.

predictor = TimeSeriesPredictor(
    prediction_length=7,
    eval_metric="MASE",
    target="unit_sales",
    known_covariates_names=["scaled_price", "promotion_email", "promotion_homepage"],
).fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {
                "model_path": "bolt_small",
                "covariate_regressor": "CAT",
                "target_scaler": "standard",
                "ag_args": {"name_suffix": "WithRegressor"},
            },
        ],
    },
    time_limit=600,
    enable_ensemble=False,
)

After the predictor has been fit, we can evaluate it on the test dataset and generate the leaderboard. Using the covariate regressor with Chronos-Bolt improves over its univariate zero-shot performance considerably.

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)
predictor.leaderboard(test_data)

Covariate Regressor Results

The covariates might not always be useful—for some datasets, the zero-shot model might achieve better accuracy. Therefore, it’s important to try multiple models and select the one that achieves the best accuracy on held-out data.

Conclusion

Chronos-Bolt models empower practitioners to generate high-quality forecasts rapidly in a zero-shot manner. AutoGluon-TimeSeries enhances this capability by enabling users to fine-tune Chronos-Bolt models effortlessly, integrate them with covariate regressors, and ensemble them with a diverse range of forecasting models. For advanced users, it provides a comprehensive set of features to customize forecasting models beyond what was demonstrated in this post. AutoGluon predictors can be seamlessly deployed to SageMaker using AutoGluon-Cloud and the official Deep Learning Containers.

To learn more about using AutoGluon-TimeSeries to build accurate and robust forecasting models, explore our tutorials. Stay updated by following AutoGluon on X (formerly Twitter) and starring us on GitHub!

References

[1] Ansari, Abdul Fatir, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, et al. “Chronos: Learning the language of time series.” Transactions on Machine Learning Research (2024).
[2] Hyndman, R. J., and G. Athanasopoulos. “Forecasting: principles and practice 3rd Ed.” O Texts (2018).
[3] Salinas, David, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. “DeepAR: Probabilistic forecasting with autoregressive recurrent networks.” International Journal of Forecasting 36, no. 3 (2020): 1181-1191.
[4] Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: long-term forecasting with transformers.” In The Eleventh International Conference on Learning Representations (2023).
[5] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21, no. 140 (2020): 1-67.
[6] Godahewa, Rakshitha, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. “Monash time series forecasting archive.” In NeurIPS Track on Datasets and Benchmarks (2021).


About the Authors

Abdul Fatir Ansari is a Senior Applied Scientist at Amazon Web Services, specializing in machine learning and forecasting, with a focus on foundation models for structured data, such as time series. He received his PhD from the National University of Singapore, where his research centered on deep generative models for images and time series.

Caner Turkmen is a Senior Applied Scientist at Amazon Web Services, where he works on research problems at the intersection of machine learning and forecasting. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications sectors. He holds a PhD in Computer Engineering from Bogazici University in Istanbul.

Oleksandr Shchur is a Senior Applied Scientist at Amazon Web Services, where he works on time series forecasting in AutoGluon. Before joining AWS, he completed a PhD in Machine Learning at the Technical University of Munich, Germany, doing research on probabilistic models for event data. His research interests include machine learning for temporal data and generative modeling.

Lorenzo Stella is a Senior Applied Scientist at Amazon Web Services, working on machine learning, forecasting, and generative AI for analytics and decision-making. He holds a PhD in Computer Science and Electrical Engineering from IMT Lucca (Italy) and KU Leuven (Belgium), where his research focused on numerical optimization algorithms for machine learning and optimal control applications.

How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock

Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and going through multiple policy documents containing standard operating procedures (SOPs) relevant to the query. This back-and-forth communication often takes hours to days, primarily because analysts, especially new hires, don't have immediate access to the necessary information. They spend hours consulting SMEs and reviewing extensive policy documents.

To address this challenge, Amazon Finance Automation developed a large language model (LLM)-based question-answer chat assistant on Amazon Bedrock. This solution empowers analysts to rapidly retrieve answers to customer queries, generating prompt responses within the same communication thread. As a result, it drastically reduces the time required to address customer queries.

In this post, we share how Amazon Finance Automation built this generative AI Q&A chat assistant using Amazon Bedrock.

Solution overview

The solution is based on a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock, as shown in the following diagram. When a user submits a query, RAG works by first retrieving relevant documents from a knowledge base, then generating a response with the LLM from the retrieved documents.

The solution consists of the following key components (a minimal code sketch of the retrieve-then-generate flow follows the list):

  1. Knowledge base – We used Amazon OpenSearch Service as the vector store for embedding documents. For performance evaluation, we processed and indexed multiple Amazon finance policy documents into the knowledge base. Alternatively, Amazon Bedrock Knowledge Bases provides fully managed support for end-to-end RAG workflows. We’re planning to migrate to Amazon Bedrock Knowledge Bases to eliminate cluster management and add extensibility to our pipeline.
  2. Embedding model – At the time of writing, we’re using the Amazon Titan Multimodal Embeddings G1 model on Amazon Bedrock. The model is pre-trained on large and unique datasets and corpora from Amazon and provides accuracy that is higher than or comparable to other embedding models on the market based on our comparative analysis.
  3. Generator model – We used a foundation model (FM) provided by Amazon Bedrock for its balanced ability to deliver highly accurate answers quickly.
  4. Diversity ranker – It's responsible for rearranging the results obtained from the vector index to avoid skewness or bias towards any specific document or section.
  5. Lost in the middle ranker – It’s responsible for efficiently distributing the most relevant results towards the top and bottom of the prompt, maximizing the impact of the prompt’s content.
  6. Guardrails – We used Amazon Bedrock Guardrails to detect personal identifiable information (PII) and safeguard against prompt injection attacks.
  7. Validation engine – Removes PII from the response and checks whether the generated answer aligns with the retrieved context. If not, it returns a hardcoded “I don’t know” response to prevent hallucinations.
  8. Chat assistant UI – We developed the UI using Streamlit, an open source Python library for web-based application development on machine learning (ML) use cases.
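
The following is a minimal sketch of the core retrieve-then-generate flow, not the production pipeline: it omits the rankers, guardrails, and validation engine, and the OpenSearch domain, index and field names, and Amazon Bedrock model IDs are assumptions to adapt to your own setup.

import json

import boto3
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
# Hypothetical OpenSearch Service domain; add authentication for your cluster
opensearch = OpenSearch(hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}], use_ssl=True)

def embed(text):
    # Titan Multimodal Embeddings G1; verify the model ID enabled in your account
    response = bedrock.invoke_model(modelId="amazon.titan-embed-image-v1",
                                    body=json.dumps({"inputText": text}))
    return json.loads(response["body"].read())["embedding"]

def retrieve(question, k=5):
    # k-NN search over the policy-document embeddings stored in OpenSearch Service
    hits = opensearch.search(
        index="finance-policies",   # hypothetical index and field names
        body={"size": k, "query": {"knn": {"embedding": {"vector": embed(question), "k": k}}}},
    )
    return [hit["_source"]["text"] for hit in hits["hits"]["hits"]]

def answer(question):
    context = "\n\n".join(retrieve(question))
    prompt = ("Answer the question using only the context below. "
              "If the context is not relevant, say 'I don't know'.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    # Any Amazon Bedrock FM can be used here; the model ID is an assumption
    response = bedrock.converse(modelId="anthropic.claude-3-sonnet-20240229-v1:0",
                                messages=[{"role": "user", "content": [{"text": prompt}]}])
    return response["output"]["message"]["content"][0]["text"]

print(answer("What is the SOP for handling a duplicate vendor invoice?"))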

Evaluate RAG performance

The accuracy of the chat assistant is the most critical performance metric to Amazon Finance Operations. After we built the first version of the chat assistant, we measured the bot response accuracy by submitting questions to the chat assistant. The SMEs manually evaluated the RAG responses one by one, and found only 49% of the responses were correct. This was far below the expectation, and the solution needed improvement.

However, manually evaluating the RAG responses isn't sustainable; it requires hours of effort from the finance operations and engineering teams. Therefore, we adopted the following automated performance evaluation approach (an illustrative sketch of the LLM-based scoring step follows the list):

  • Prepare testing data – We constructed a test dataset with three data fields:
    • question – This consists of 100 questions from policy documents where answers reside in a variety of sources, such as policy documents and engineering SOPs, covering complex text formats such as embedded tables and images.
    • expected_answer – These are manually labeled answers by Amazon Finance Operations SMEs.
    • generated_answer – This is the answer generated by the bot.
  • NLP scores – We used a test dataset to calculate the ROUGE score and METEOR score. Because these scores merely use word-matching algorithms and ignore the semantic meaning of the text, they aren’t aligned with the SME scores. Based on our analysis, the variance was approximately 30% compared to human evaluations.
  • LLM-based score – We used an FM offered by Amazon Bedrock to score the RAG performance. We designed specialized LLM prompts to evaluate the RAG performance by comparing the generated answer with the expected answer. We generated a set of LLM-based metrics, including accuracy, acceptability, and factualness, and the citation representing the evaluation reasoning. The variance of this approach was approximately 5% compared to human analysis, so we decided to stick to this approach of evaluation. If your RAG system is built on Amazon Bedrock Knowledge Bases, you can use the new RAG evaluation for Amazon Bedrock Knowledge Bases tool to evaluate the retrieve or the retrieve and generate functionality with an LLM as a judge. It provides retrieval evaluation metrics such as context relevance and context coverage. It also provides retrieve and generate evaluation metrics such as correctness, completeness, and helpfulness, as well as responsible AI metrics such as harmfulness and answer refusal.
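
The following is an illustrative sketch of the LLM-based scoring step; the judge prompt wording, output fields, and model ID are assumptions, not the team's actual evaluation prompt.

import json

import boto3

bedrock = boto3.client("bedrock-runtime")

JUDGE_PROMPT = """You are grading a RAG system.
Question: {question}
Expected answer: {expected_answer}
Generated answer: {generated_answer}

Return only a JSON object with the fields "accuracy" (0-1), "acceptability" (0-1),
"factualness" (0-1), and "citation" explaining your reasoning."""

def judge(question, expected_answer, generated_answer):
    prompt = JUDGE_PROMPT.format(question=question,
                                 expected_answer=expected_answer,
                                 generated_answer=generated_answer)
    # Any Amazon Bedrock FM can act as the judge; the model ID is an assumption, and
    # this sketch assumes the judge returns valid JSON
    response = bedrock.converse(modelId="anthropic.claude-3-sonnet-20240229-v1:0",
                                messages=[{"role": "user", "content": [{"text": prompt}]}])
    return json.loads(response["output"]["message"]["content"][0]["text"])

# Average the scores over the 100-question test dataset to track RAG accuracy over time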

Improve the accuracy of RAG pipeline

Based on the aforementioned evaluation techniques, we focused on the following areas in the RAG pipeline to improve the overall accuracy.

Add document semantic chunking to improve accuracy from 49% to 64%

Upon diagnosing incorrect responses in the RAG pipeline, we identified that 14% of the inaccuracies were due to incomplete contexts sent to the LLM. These incomplete contexts were originally generated by the segmentation algorithm based on a fixed chunk size (for example, 512 tokens or 384 words), which doesn't consider document boundaries such as sections and paragraphs.

To address this problem, we designed a new document segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service, with the following steps (a minimal sketch follows the list):

  1. Convert the unstructured text to a structured HTML document using QUILL Editor. In this way, the HTML document preserves the document formatting that divides the contents into logical chunks.
  2. Identify the logical structure of the HTML document and insert divider strings based on HTML tags for document segmentation.
  3. Use an embedding model to generate semantic vector representation of document chunks.
  4. Assign tags based on important keywords in the section to identify the logical boundaries between sections.
  5. Insert the embedding vectors of the segmented documents to the OpenSearch Service vector store.
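
The following is a rough sketch of steps 2 and 4, assuming the HTML produced by QUILL Editor uses standard heading tags. The divider string, keyword-to-tag mapping, and helper names are illustrative, not the production implementation:

from bs4 import BeautifulSoup

DIVIDER = "<<<SECTION_BREAK>>>"  # illustrative divider string inserted at logical boundaries
KEYWORD_TO_TAG = {"reimbursement": "expense-policy", "invoice": "billing"}  # hypothetical mapping

def split_html_into_sections(html_doc):
    # Insert divider strings before each heading tag, then split the text into logical chunks
    soup = BeautifulSoup(html_doc, "html.parser")
    for heading in soup.find_all(["h1", "h2", "h3"]):
        heading.insert_before(DIVIDER)
    text = soup.get_text(separator="\n")
    return [section.strip() for section in text.split(DIVIDER) if section.strip()]

def assign_tags(section):
    # Assign tags based on important keywords found in the section
    return [tag for keyword, tag in KEYWORD_TO_TAG.items() if keyword in section.lower()]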

The following diagram illustrates the document retriever splitting workflow.

When processing the document, we follow specific rules:

  • Extract the start and end of a section of a document precisely
  • Extract the titles of the section and pair them with section content accurately
  • Assign tags based on important keywords from the sections
  • Persist the markdown information from the policy while indexing
  • Exclude images and tables from the processing in the initial release

With this approach, we improved RAG accuracy from 49% to 64%.

Use prompt engineering to improve accuracy from 64% to 76%

Prompt engineering is a crucial technique to improve the performance of LLMs. We learned from our project that there is no one-size-fits-all prompt engineering approach; it’s a best practice to design task-specific prompts. We adopted the following approach to enhance the effectiveness of the prompts used by the RAG generator:

  • In approximately 14% of cases, we identified that the LLM generated responses even when no relevant context was retrieved from the RAG pipeline, leading to hallucinations. We engineered prompts that instruct the LLM not to generate a response when no relevant context is provided.
  • In approximately 13% of cases, we received user feedback that the response from the LLM was too brief, lacking complete context. We engineered prompts that encouraged the LLM to be more comprehensive.
  • We engineered prompts to enable the capability to generate both concise and detailed answers for the users.
  • We used LLM prompts to generate citations that properly attribute the sources used to produce the answer. In the UI, the citations are listed with hyperlinks following the LLM response, and users can use these citations to validate the LLM’s performance.
  • We improved our prompts to introduce better chain-of-thought (CoT) reasoning:
    • Asking the LLM to reason explicitly before answering improves performance and aligns responses with humanlike coherence. Because of this interplay between prompt quality, reasoning requests, and the model’s inherent capabilities, we could optimize performance.
    • Encouraging CoT reasoning prompts the LLM to consider the context of the conversation, making it less prone to hallucinations.
    • By building upon the established context, the model is more likely to generate responses that logically follow the conversation’s narrative, reducing the chances of providing inaccurate or hallucinated answers.
    • We added examples of previously answered questions to establish a pattern for the LLM, encouraging CoT.

We then used meta-prompting using an FM offered by Amazon Bedrock to craft a prompt that caters to the aforementioned requirements.

The following example is a prompt for generating a quick summary and a detailed answer:

You are an AI assistant that helps answer questions based on provided text context. I will give you some passages from a document, followed by a question. Your task is to provide the best possible answer to the question using only the information from the given context. Here is the context:

<context>
{}
</context>

And here is the question:
<question>
{}
</question>

Think carefully about how the context can be used to answer the question.
<thinkingprocess>
- Carefully read the provided context and analyze what information it contains
- Identify the key pieces of information in the context that are relevant to answering the question
- Determine if the context provides enough information to answer the question satisfactorily
- If not, simply state "I don't know, I don't have the complete context needed to answer this question"
- If so, synthesize the relevant information into a concise summary answer
- Expand the summary into a more detailed answer, utilizing Markdown formatting to make it clear and readable
</thinkingprocess>

If you don't have enough context to answer the question, provide your response in the following format:
I don't know, I don't have the complete context needed to answer this question.

If you do have enough context to answer the question, provide your response in the following format:
#### Quick Summary:
Your concise 1-2 sentence summary goes here.
#### Detailed Answer:
Your expanded answer goes here, using Markdown formatting like **bold**, *italics*, and bullet points to improve readability.

Remember, the ultimate goal is to provide an informative, clear and readable answer to the question using only the context provided. Let's begin!

The following example is a prompt for generating citations based on the generated answers and retrieved contexts:

You are an AI assistant that specializes in attributing generated answers to specific sections within provided documents. Your task is to determine which sections from the given documents were most likely used to generate the provided answer. If you cannot find exact matches, suggest sections that are closely related to the content of the answer.

Here is the generated answer to analyze:
<generated_answer>
{}
</generated_answer>

And here are the sections from various documents to consider:
<sections>
{}
</sections>

Please carefully read through the generated answer and the provided sections. In the scratchpad space below, brainstorm and reason about which sections are most relevant to the answer:
<scratchpad>
</scratchpad>

After identifying the relevant sections, provide your output in the following format:
**Document Name:** <document name> \n
**Document Link:** <document link> \n
**Relevant Sections:** \n
- <section name 1>
- <section name 2>
- <section name 3>

Do not include any additional explanations or reasoning in your final output. Simply list the document name, link, and relevant section names in the specified format above.

Assistant:

By implementing the prompt engineering approaches, we improved RAG accuracy from 64% to 76%.

Use an Amazon Titan Text Embeddings model to improve accuracy from 76% to 86%

After implementing the document segmentation approach, we still saw lower relevance scores for retrieved contexts (55–65%), and the incorrect contexts were in the top ranks for more than 50% of cases. This indicated that there was still room for improvement.

We experimented with multiple embedding models, including first-party and third-party models. For example, contextual embedding models such as bge-base-en-v1.5 performed better for context retrieval compared to other popular embedding models such as all-mpnet-base-v2. We found that using the Amazon Titan Embeddings G1 model increased the relevance of retrieved contexts from approximately 55–65% to 75–80%, and 80% of the retrieved contexts ranked higher than before.
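
For reference, generating an embedding for a document chunk through the Amazon Bedrock runtime takes only a few lines. The following sketch uses the amazon.titan-embed-text-v1 model ID and a made-up example sentence, and can be adapted to whichever embedding model you are comparing:

import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

def embed_text(text, model_id="amazon.titan-embed-text-v1"):
    # Generate a semantic vector for a document chunk with Amazon Titan Text Embeddings
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=json.dumps({"inputText": text}),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]

# The resulting vector can then be indexed into the OpenSearch Service vector store
vector = embed_text("Employees must submit expense reports within 30 days.")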

Finally, by adopting the Amazon Titan Text Embeddings G1 model, we improved the overall accuracy from 76% to 86%.

Conclusion

We achieved remarkable progress in developing a generative AI Q&A chat assistant for Amazon Finance Automation by using a RAG pipeline and LLMs on Amazon Bedrock. Through continual evaluation and iterative improvement, we have addressed challenges of hallucinations, document ingestion issues, and context retrieval inaccuracies. Our results have shown a significant improvement in RAG accuracy from 49% to 86%.

You can follow our journey and adopt a similar solution to address challenges in your RAG application and improve overall performance.


About the Authors

Soheb Moin is a Software Development Engineer at Amazon, who led the development of the Generative AI chatbot. He specializes in leveraging generative AI and Big Data analytics to design, develop, and implement secure, scalable, innovative solutions that empower Finance Operations with better productivity and automation. Outside of work, Soheb enjoys traveling, playing badminton, and engaging in chess tournaments.

Nitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 19 years of experience building business-critical, scalable, high-performance software. Nitin leads data services, communication, work management and several Generative AI initiatives within Finance. In his spare time, he enjoys listening to music and reading.

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, for providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling and learning strategic challenges of chess.

Mohak Chugh is a Software Development Engineer at Amazon, with over 3 years of experience in developing products leveraging Generative AI and Big Data on AWS. His work encompasses a range of areas, including RAG based GenAI chatbots and high performance data reconciliation. Beyond work, he finds joy in playing the piano and performing with his music band.

Parth Bavishi is a Senior Product Manager at Amazon with over 10 years of experience in building impactful products. He currently leads the development of generative AI capabilities for Amazon’s Finance Automation, driving innovation and efficiency within the organization. A dedicated mentor, Parth enjoys sharing his product management knowledge and finds satisfaction in activities like volleyball and reading.

Read More

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API


We are excited to announce the availability of Cohere’s advanced reranking model Rerank 3.5 through our new Rerank API in Amazon Bedrock. This powerful reranking model enables AWS customers to significantly improve their search relevance and content ranking capabilities. This model is also available for Amazon Bedrock Knowledge Base users. By incorporating Cohere’s Rerank 3.5 in Amazon Bedrock, we’re making enterprise-grade search technology more accessible and empowering organizations to enhance their information retrieval systems with minimal infrastructure management.

In this post, we discuss the need for Reranking, the capabilities of Cohere’s Rerank 3.5, and how to get started using it on Amazon Bedrock.

Reranking for advanced retrieval

Reranking is a vital enhancement to Retrieval Augmented Generation (RAG) systems that adds a sophisticated second layer of analysis to improve search result relevance beyond what traditional vector search can achieve. Unlike embedding models that rely on pre-computed static vectors, rerankers perform dynamic query-time analysis of document relevance, enabling more nuanced and contextual matching. This capability allows RAG systems to effectively balance between broad document retrieval and precise context selection, ultimately leading to more accurate and reliable outputs from language models while reducing the likelihood of hallucinations.

Existing search systems significantly benefit from reranking technology by providing more contextually relevant results that directly impact user satisfaction and business outcomes. Unlike traditional keyword matching or basic vector search, reranking performs an intelligent second-pass analysis that considers multiple factors, including semantic meaning, user intent, and business rules to optimize search result ordering. In ecommerce specifically, reranking helps surface the most relevant products by understanding nuanced relationships between search queries and product attributes, while also incorporating crucial business metrics like conversion rates and inventory levels. This advanced relevance optimization leads to improved product discovery, higher conversion rates, and enhanced customer satisfaction across digital commerce platforms, making reranking an essential component for any modern enterprise search infrastructure.

Introducing Cohere Rerank 3.5

Cohere’s Rerank 3.5 is designed to enhance search and RAG systems. This intelligent cross-encoding model takes a query and a list of potentially relevant documents as input, then returns the documents sorted by semantic similarity to the query. Cohere Rerank 3.5 excels in understanding complex information requiring reasoning and is able to understand the meaning behind enterprise data and user questions. Its ability to comprehend and analyze enterprise data and user questions across over 100 languages including Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish, makes it particularly valuable for global organizations in sectors such as finance, healthcare, hospitality, energy, government, and manufacturing.

One of the key advantages of Cohere Rerank 3.5 is its ease of implementation. Through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale, whether keyword-based or semantic. Reranking strictly improves first-stage retrievals on standard text retrieval benchmarks.

Cohere Rerank 3.5 is state of the art in the financial domain, as illustrated in the following figure.

Cohere Rerank 3.5 is also state of the art in the ecommerce domain, as illustrated in the following figure. Cohere’s ecommerce benchmarks revolve around retrieval on various products, including fashion, electronics, food, and more.

Products were structured as strings in a key-value pair format such as the following:

"Title": "Title"
"Description": "Long-form description"
"Type": <Some categorical data>
etc.

Cohere Rerank 3.5 also excels in hospitality, as shown in the following figure. Hospitality benchmarks revolve around retrieval on hospitality experiences and lodging options.

Documents were structured as strings in a key-value pairs format such as the following:

"Listing Title": "Rental unit in Toronto"
"Location": "171 John Street, Toronto, Ontario, Canada"
"Description": "Escape to our serene villa with stunning downtown views...."

We see noticeable gains in project management performance across all types of issue tracking tasks, as illustrated in the following figure.

Cohere’s project management benchmarks span a variety of retrieval tasks, such as:

  • Search through engineering tickets from various project management and issue tracking software tools
  • Search through GitHub issues on popular open source repos

Get started with Cohere Rerank 3.5

To start using Cohere Rerank 3.5 with the Rerank API and Amazon Bedrock Knowledge Bases, navigate to the Amazon Bedrock console and choose Model Access in the left-hand pane. Choose Modify Access, select Cohere Rerank 3.5, choose Next, and then choose Submit.

Get Started with Amazon Bedrock Rerank API

The Cohere Rerank 3.5 model, powered by the Amazon Bedrock Rerank API, allows you to rerank input documents directly based on their semantic relevance to a user query – without requiring a pre-configured knowledge base. This flexibility makes it a powerful tool for various use cases.

To begin, set up your environment by importing the necessary libraries and initializing Boto3 clients:

import boto3
import json

# Use the AWS Region of the current session
region = boto3.Session().region_name

# The Rerank API is exposed through the Amazon Bedrock agent runtime client
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name=region)

# Cohere Rerank 3.5 model ID and the corresponding foundation model ARN
modelId = "cohere.rerank-v3-5:0"
model_package_arn = f"arn:aws:bedrock:{region}::foundation-model/{modelId}"

Next, define a main function that reorders a list of text documents by computing relevance scores based on the user query:

def rerank_text(text_query, text_sources, num_results, model_package_arn):
    response = bedrock_agent_runtime.rerank(
        queries=[
            {
                "type": "TEXT",
                "textQuery": {
                    "text": text_query
                }
            }
        ],
        sources=text_sources,
        rerankingConfiguration={
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": num_results,
                "modelConfiguration": {
                    "modelArn": model_package_arn,
                }
            }
        }
    )
    return response['results']

For instance, imagine a scenario where you need to identify emails related to returning items from a multilingual dataset. The example below demonstrates this process:

example_query = "What emails have been about returning items?"

documents = [
    "Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?",
    "Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?",
    "مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب",
    "Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?",
    "Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.",
    "Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?",
    "Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective.",
    "早上好,关于我最近的订单,我有一个问题。我收到了错误的商品",
    "Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."
]

Now, prepare the list of text sources that will be passed into the rerank_text() function:

text_sources = []
for text in documents:
    text_sources.append({
        "type": "INLINE",
        "inlineDocumentSource": {
            "type": "TEXT",
            "textDocument": {
                "text": text,
            }
        }
    })

You can then invoke rerank_text() by specifying the user query, the text resources, the desired number of top-ranked results, and the model ARN:

response = rerank_text(example_query, text_sources, 3, model_package_arn)
print(response)

The output generated by the Amazon Bedrock Rerank API with Cohere Rerank 3.5 for this query is:

[{'index': 4, 'relevanceScore': 0.1122397780418396},
 {'index': 8, 'relevanceScore': 0.07777658104896545},
 {'index': 2, 'relevanceScore': 0.0770234540104866}]

The relevance scores provided by the API are normalized to a range of [0, 1], with higher scores indicating higher relevance to the query. Here, the fifth document (index 4) is the most relevant. (Translated from German to English: Hello, I have a question about my last order. I received the wrong item and need to return it.)
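
To make the output easier to interpret, you can map the returned indices back to the original documents, for example:

# Print each ranked result alongside the document text it refers to
for result in response:
    print(f"{result['relevanceScore']:.4f}  {documents[result['index']]}")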

You can also get started using Cohere Rerank 3.5 with Amazon Bedrock Knowledge Bases by completing the following steps:

  1. In the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
  2. Choose Create knowledge base.
  3. Provide your knowledge base details, such as name, permissions, and data source.
  4. To configure your data source, specify the location of your data.
  5. Select an embedding model to convert the data into vector embeddings, and have Amazon Bedrock create a vector store in your account to store the vector data.

When you select this option (available only in the Amazon Bedrock console), Amazon Bedrock creates a vector index in Amazon OpenSearch Serverless (by default) in your account, removing the need to manage anything yourself.

  6. Review your settings and create your knowledge base.
  7. In the Amazon Bedrock console, choose your knowledge base and choose Test knowledge base.
  8. Choose the icon for additional configuration options for testing your knowledge base.
  9. Choose your model (for this post, Cohere Rerank 3.5) and choose Apply.

The configuration pane shows a new Reranking section with additional configuration options. The number of reranked source chunks setting returns the specified number of most relevant chunks.

Conclusion

In this post, we explored how to use Cohere’s Rerank 3.5 model in Amazon Bedrock, demonstrating its powerful reranking capabilities for enterprise applications, enhancing search relevance, improving user experience, and optimizing information retrieval workflows. Start improving your search relevance today with Cohere’s Rerank model on Amazon Bedrock.

Cohere Rerank 3.5 in Amazon Bedrock is available in the following AWS Regions: us-west-2 (US West – Oregon), ca-central-1 (Canada – Central), eu-central-1 (Europe – Frankfurt), and ap-northeast-1 (Asia Pacific – Tokyo).

Share your feedback on AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

To learn more about Cohere Rerank 3.5’s features and capabilities, view the Cohere in Amazon Bedrock product page.


About the Authors

Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry-specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University, a Master of Science in Electrical Engineering from Northwestern University, and is currently an MBA candidate at the Haas School of Business at University of California, Berkeley.

James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.

Read More

AWS DeepRacer: How to master physical racing?


As developers gear up for re:Invent 2024, they again face the unique challenges of physical racing. What are the obstacles? Let’s have a look.

In this blog post, I will look at what makes physical AWS DeepRacer racing—a real car on a real track—different from racing in the virtual world—a model in a simulated 3D environment. I will cover the basics, the differences between virtual and physical racing, and the steps I have taken to get a deeper understanding of the challenge.

The AWS DeepRacer League is wrapping up. In two days, 32 racers will face off in Las Vegas for one last time. This year, the qualification has been all-virtual, so the transition from virtual to physical racing will be a challenge.

The basics

AWS DeepRacer relies on the racer training a model within the simulator, a 3D environment built around ROS and Gazebo, originally built on AWS RoboMaker.

The trained model is subsequently used for either virtual or physical races. The model comprises a convolutional neural network (CNN) and an action space that translates class labels into steering and throttle commands. In the basic scenario involving a single camera, a 160 x 120 pixel, 8-bit grayscale image (similar to the following figure) is captured 15 times per second, passed through the neural network, and the action with the highest weight (probability) is executed.

The small piece of AI magic is that during model evaluation (racing) there’s no context; each image is processed independently of the image before it, and without knowledge of the state of the car itself. If you process the images in reverse order the results remain the same!
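
To illustrate the mechanics (this is not the actual device code), the per-frame decision reduces to an argmax over the network's class probabilities, mapped onto the action space defined in model_metadata.json. The action values below are made-up placeholders:

import numpy as np

# Hypothetical 3-action space: (steering angle in degrees, speed in m/s)
ACTION_SPACE = [(-30.0, 1.2), (0.0, 2.5), (30.0, 1.2)]

def choose_action(class_probabilities):
    # Each 160 x 120 grayscale frame is scored independently; no state is carried between frames
    best = int(np.argmax(class_probabilities))
    return ACTION_SPACE[best]

print(choose_action(np.array([0.1, 0.7, 0.2])))  # -> (0.0, 2.5): go straight, fast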

Virtual compared to physical

The virtual worlds are 3D worlds created in Gazebo, and the software is written in Python and C++ using ROS as the framework. As shown in the following image, the 3D simulation is fairly flat, with basic textures and surfaces. There is little or no reflections or shine, and the environment is as visually clean as you make it. Input images are captured 15 times per second.

Within this world a small car is simulated. Compared to a real car, the model is very basic and lacks quite a few of the things that make a real car work: there is no suspension, the tires are rigid cylinders, there is no Ackermann steering, and there are no differentials. It’s almost surprising that this car can drive at all. On the positive side, the camera is perfect: irrespective of lighting conditions, you get crisp, clear pictures with no motion blur.

A typical virtual car drives at speeds between 0.5 and 4.0 meters per second, depending on the shape of the track. If you go too fast, it will often oversteer and spin out of the turn because of the relatively low grip.

In contrast, the real world is less perfect—simulation-to-real gap #1 comes from visual noise created by light, reflections (if the track is printed on reflective material), and background noise (for example, if the barriers around the track are too low, the car sees people and objects in the background). Input images are captured 30 times per second.

The car itself—based on the readily available WLToys A979—has all the things the model car doesn’t: proper tires, suspension, and differential. One problem is that the car is heavy—around 1.5 kg—and the placement of some components causes the center of gravity to be very high. This causes simulation-to-real gap #2: Roll and pitch during corners at high speeds cause the camera to rotate, confusing the neural network as the horizon moves.

Gap #3 comes from motion blur when the light is too dim; the blur can cause the dashed centerline to look like a solid line, making it hard to distinguish the centerline from the solid inner and outer lines, as shown in the following figure.

The steering geometry, the differentials, the lack of engineering precision of the A979, and the corresponding difficulty in calibrating it cause gap #4. Even if the model wants to go straight, the car still pulls left or right, needing constant correction to stay on track. This is most noticeable when the car is unable to drive down the straights in a straight line.

The original AWS DeepRacer, without modifications, has a smaller speed range of about 2 meters per second. It has a better grip but suffers from the previously mentioned roll movements. If you go too fast, it will understeer and potentially roll over. Since 2023, the AWS pit-crews operate their fleets of AWS DeepRacers with shock spacers to stiffen the suspension, reduce the roll, and increase the max effective speed.

Four questions

Looking at the sim-to-real gaps there are four questions that we want to explore:

  • How can we train the model to better handle the real world? This includes altering the simulator to close some of the gaps, combined with adapting reward function, action space, and training methodology to make better use of this simulator.
  • How can we better evaluate what the car does, and why? In the virtual world, we can perform log analysis to investigate; in the real world this has not yet been possible.
  • How can we evaluate our newly trained models? A standard AWS DeepRacer track, with its size of 8 meters x 6 meters, is prohibitively large. Is it possible to downscale the track to fit in a home?
  • Will a modified car perform better? Upgrade my AWS DeepRacer with better shocks? Add ball bearings and shims to improve steering precision? Or build a new lighter car based on a Raspberry Pi?

Solutions

To answer these questions, some solutions are required to support the experiments. The following assumes that you’re using Deepracer-for-Cloud to run the training locally or in an Amazon Elastic Compute Cloud (Amazon EC2) instance. We won’t go into the details but provide references that will enable you to try things out on your own.

Customized simulator

The first thing to look at is how you can alter the simulator. The simulator code is available, and modifying it doesn’t require too many skills. You can alter the car and the physics of the world or adjust the visual environment.

Change the environment

Changing the environments means altering the 3D world. This can be done by altering the features in a pre-existing track by adding or removing track parts (such as lines), changing lighting, adding background features (such as walls or buildings), swapping out textures, and so on. Making changes to the world will require building a new Docker image, which can take quite some time, but there are ways to speed that up. Going a step further, it’s also possible to make the world programmatically (command line or code) alterable during run-time.

The starting point is the track COLLADA (.dae) file found in the meshes folder. You can import it into Blender (shown in the following figure), make your changes, and export the file again. Note that lights and camera positions from Blender aren’t considered by Gazebo. To alter the lighting conditions, you will have to alter the .world file in worlds—the files are XML files in SDFormat.

See Custom Tracks for some examples of tuned tracks.

Car and physics

The competition cars owned by AWS can’t be altered, so the objective of tuning the car in the simulator is to make it behave in ways more similar to the real one. Trained neural networks have an embedded expectation of what will happen next, which means that the simulated car learned that by taking a specific action, it would get a turn of a given radius. If the simulated car steers more or less than the physical one in a given situation, the outcome becomes unpredictable.

The simulated car lacks Ackermann steering and differentials, but its wheels can deflect up to 30 degrees—real wheels only reach a bit more than 20 degrees outward and less than that inward. My experience is that the real car, surprisingly enough, still has a shorter turning radius than the virtual one.

The car models are found in the urdf folder. There are three different cars, relating to the different versions of physics, which you configure in your actions space (model_metadata.json). Today, only the deepracer (v3 and v4 physics) and deepracer_kinematics (v5 physics) models are relevant. There are variant models for single camera and for stereo camera, both with and without the LIDAR.

Each physics version is different; the big question is what impact, if any, each version has on the behavior of the physical car.

  • Version 3: Steering and throttle is managed through a PID controller, making speed and steering changes smooth (and slow). The simulation environment runs at all times—including during image processing and inference—leading to a higher latency between image capture and action taking effect.
  • Version 4: Steering and throttle is managed through a PID controller, but the world is put on hold during inference, reducing the latency.
  • Version 5: Steering and throttle is managed through a position and velocity controller, and the world is put on hold during inference, almost eliminating latency. (This is very unnatural; the car can take alternating 30 degree left and right turns and will go almost straight ahead.)

The PID controller for v3 and v4 can be changed in the racecar control file. By changing the P, I, and D values, you can tune how fast or how slow the car accelerates and steers.

You can also tune the friction. In our simulator, friction is defined for the wheels, not the surfaces that the car drives on. The values (called mu and mu2) are found in racecar.gazebo; increasing them (once per tire!) will allow the car to drive faster without spinning.

Finally, I implemented an experimental version of the Ackermann steering geometry, including differentials. Why? When turning, a car’s wheels follow two circles with the same center point, the inner one having a smaller radius than the outer one. In short, the inner wheels have to steer more (larger curvature) but rotate slower (smaller circumference) than the outer wheels.
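
To make the geometry concrete, the following small calculation (an illustration, not the simulator patch) derives the inner and outer steering angles for a given turn radius using the standard Ackermann approximation. The wheelbase and track width values are placeholders:

import math

def ackermann_angles(turn_radius_m, wheelbase_m=0.16, track_width_m=0.16):
    # The inner wheel follows a tighter circle, so it must steer more than the outer wheel
    inner = math.degrees(math.atan(wheelbase_m / (turn_radius_m - track_width_m / 2)))
    outer = math.degrees(math.atan(wheelbase_m / (turn_radius_m + track_width_m / 2)))
    return inner, outer

print(ackermann_angles(0.8))  # roughly 12.5 and 10.3 degrees: the inner wheel steers more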

Customized car software

The initial work to create an altered software stack for the original AWS DeepRacer started in 2022. The first experiments included operating the AWS DeepRacer with an R/C controller and capturing the camera images and IMU data to create an in-car video. There was a lot to learn about ROS2, including creating a custom node for publishing IMU sensor data and capturing and creating videos on the fly. During the Berlin Summit in 2022, I also got to give my modified car a spin on the track!

In the context of physical racing, the motivation for customizing the car software is to obtain more information—what does the car do, and why. Watching the following video, you can clearly see the rolling movement in the turns, and the blurring of certain parts of the image discussed earlier.

The work triggered a need to alter several of the open source AWS DeepRacer packages, and included work such as optimizing the performance from camera to inference through compressing images and enabling GPU and compute stick acceleration of the inference. This turned into several scripts comprising all the changes to the different nodes and creating an upgraded software package that could be installed on an original AWS DeepRacer car.

The work evolved, and a logging mechanism using ROS Bag allowed us to analyze not only pictures, but also the actions that the car took. Using the deepracer-viz library of Jochem Lugtenburg, a fellow AWS DeepRacer community leader, I added a GradCam overlay on the video feed (shown in the following video), which gives a better understanding of what’s going on.

The outcome of this has evolved into the community AWS DeepRacer Custom Car repository, which allows anyone to upgrade their AWS DeepRacer with improved software with two commands and without having to compile the modules themselves!

Benefits are:

  • Performance improvement by using compressed image transport for the main processing pipeline.
  • Inference using OpenVINO with Intel GPU (original AWS DeepRacer), OpenVino with Myriad Neural Compute Stick (NCS2), or TensorFlow Lite.
  • Model Optimizer caching, speeding up switching of models.
  • Capture in-car camera and inference results to a ROS Bag for logfile analysis.
  • UI tweaks and fixes.
  • Support for Raspberry Pi4, enabling us to create the DeepRacer Pi!

Testing on a custom track

Capturing data is great, but you need a way to test it all—bringing models trained in a customized environment onto a track to see what works and what doesn’t.

The question turned out to be: How hard is it to make a track that has the same design as the official tracks, but that takes up less space than the 8m x 6m of the re:Invent 2018 track? After re:Invent 2023, I started to investigate. The goal was to create a custom track that would fit in my garage with a theoretical maximum size of 5.5m x 4.5m. The track should be printable on vinyl in addition to being available in the Simulator for virtual testing.

After some trial and error, it proved to be quite straightforward, even if it requires multiple steps, starting in a Jupyter Notebook, moving into a vector drawing program (Inkscape), and finalizing in Blender (to create the simulator meshes).

The trapezoid track shown in the following two figures (center line and final sketch) is a good example of how to create a brand new track. The notebook starts with eight points in an array and builds out the track step by step, adding the outer line, center line, and color.

In the end I chose to print a narrower version of Trapezoid—Trapezoid Narrow, shown in the following figure—to fit behind my garage, with dimensions of 5.20m x 2.85m including the green borders around the track. I printed it on PVC with a weight of 500 grams per square meter. The comparatively heavy material was a good choice: it prevents folds and wrinkles and generally ensures that the track stays in place even when you walk on it.

Around the track, I added a boundary of mesh PVC mounted on 20 x 20 centimeter aluminum poles. This wasn’t entirely a success, because light shone through, and I needed to add a lining of black fleece. The following image shows the completed track before the addition of the black fleece.

Experiments and conclusions

re:Invent is just days away. Experiments are still running, and because I need to fight my way through the Wildcard race, this is not the time to include all the details. Let’s just say that things aren’t always as straightforward as expected.

As a preview of what’s going on, I’ll end this post with the latest iteration of the in-car video, showing an AWS DeepRacer Pi doing laps in the garage. Check back after re:Invent for the big reveal!


About the author

Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture where he helps clients to build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.

Read More

Easily deploy and manage hundreds of LoRA adapters with SageMaker efficient multi-adapter inference


The new efficient multi-adapter inference feature of Amazon SageMaker unlocks exciting possibilities for customers using fine-tuned models. This capability integrates with SageMaker inference components to allow you to deploy and manage hundreds of fine-tuned Low-Rank Adaptation (LoRA) adapters through SageMaker APIs. Multi-adapter inference handles the registration of fine-tuned adapters with a base model and dynamically loads them from GPU memory, CPU memory, or local disk in milliseconds, based on the request. This feature provides atomic operations for adding, deleting, or updating individual adapters across a SageMaker endpoint’s running instances without affecting performance or requiring a redeployment of the endpoint.

The efficiency of LoRA adapters allows for a wide range of hyper-personalization and task-based customization which had previously been too resource-intensive and costly to be feasible. For example, marketing and software as a service (SaaS) companies can personalize artificial intelligence and machine learning (AI/ML) applications using each of their customer’s images, art style, communication style, and documents to create campaigns and artifacts that represent them. Similarly, enterprises in industries like healthcare or financial services can reuse a common base model with task-based adapters to efficiently tackle a variety of specialized AI tasks. Whether it’s diagnosing medical conditions, assessing loan applications, understanding complex documents, or detecting financial fraud, you can simply swap in the appropriate fine-tuned LoRA adapter for each use case at runtime. This flexibility and efficiency unlocks new opportunities to deploy powerful, customized AI across your organization. With this new efficient multi-adapter inference capability, SageMaker reduces the complexity of deploying and managing the adapters that power these applications.

In this post, we show how to use the new efficient multi-adapter inference feature in SageMaker.

Problem statement

You can use powerful pre-trained foundation models (FMs) without needing to build your own complex models from scratch. However, these general-purpose models might not always align with your specific needs or your unique data. To make these models work for you, you can use Parameter-Efficient Fine-Tuning (PEFT) techniques like LoRA.

The benefit of PEFT and LoRA is that it lets you fine-tune models quickly and cost-effectively. These methods are based on the idea that only a small part of a large FM needs updating to adapt it to new tasks or domains. By freezing the base model and just updating a few extra adapter layers, you can fine-tune models much faster and cheaper, while still maintaining high performance. This flexibility means you can quickly customize pre-trained models at low cost to meet different requirements. When inferencing, the LoRA adapters can be loaded dynamically at runtime to augment the results from the base model for best performance. You can create a library of task-specific, customer-specific, or domain-specific adapters that can be swapped in as needed for maximum efficiency. This allows you to build AI tailored exactly to your business.

Although fine-tuned LoRA adapters can effectively address targeted use cases, managing these adapters can be challenging at scale. You can use open-source libraries, or the AWS managed Large Model Inference (LMI) deep learning container (DLC) to dynamically load and unload adapter weights. Current deployment methods use fixed adapters or Amazon Simple Storage Service (Amazon S3) locations, making post-deployment changes impossible without updating the model endpoint and adding unnecessary complexity. This deployment method also makes it impossible to collect per-adapter metrics, making the evaluation of their health and performance a challenge.

Solution overview

In this solution, we show how to use efficient multi-adapter inference in SageMaker to host and manage multiple LoRA adapters with a common base model. The approach is based on an existing SageMaker capability, inference components, where you can have multiple containers or models on the same endpoint and allocate a certain amount of compute to each container. With inference components, you can create and scale multiple copies of the model, each of which retains the compute that you have allocated. With inference components, deploying multiple models that have specific hardware requirements becomes a much simpler process, allowing for the scaling and hosting of multiple FMs. An example deployment would look like the following figure.

This feature extends inference components to a new type of component, inference component adapters, which you can use to allow SageMaker to manage your individual LoRA adapters at scale while having a common inference component for the base model that you’re deploying. In this post, we show how to create, update, and delete inference component adapters and how to call them for inference. You can envision this architecture as the following figure.

IC and Adapters

Prerequisites

To run the example notebooks, you need an AWS account with an AWS Identity and Access Management (IAM) role with permissions to manage resources created. For details, refer to Create an AWS account.

If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain. Additionally, you may need to request a service quota increase for the corresponding SageMaker hosting instances. In this example, you host the base model and multiple adapters on the same SageMaker endpoint, so you will use an ml.g5.12xlarge SageMaker hosting instance.

In this example, you learn how to deploy a base model (Meta Llama 3.1 8B Instruct) and LoRA adapters on a SageMaker real-time endpoint using inference components. You can find the example notebook in the GitHub repository.

import sagemaker
import boto3
import json

role = sagemaker.get_execution_role() # execution role for the endpoint
sess = sagemaker.session.Session() # sagemaker session for interacting with different AWS APIs
bucket = sess.default_bucket() # bucket to house artifacts
region = sess.boto_region_name # AWS Region of the SageMaker session

sm_client = boto3.client(service_name='sagemaker')
sm_rt_client = boto3.client(service_name='sagemaker-runtime')

Download the base model from the Hugging Face model hub. Because Meta Llama 3.1 8B Instruct is a gated model, you will need a Hugging Face access token and to submit a request for model access on the model page. For more details, see Accessing Private/Gated Models.

from huggingface_hub import snapshot_download

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

HF_TOKEN = "<<YOUR_HF_TOKEN>>"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model_id_pathsafe = model_id.replace("/","-")
local_model_path = f"./models/{model_id_pathsafe}"
s3_model_path = f"s3://{bucket}/models/{model_id_pathsafe}"

# Download only the JSON config files and safetensors weights
snapshot_download(repo_id=model_id, use_auth_token=HF_TOKEN, local_dir=local_model_path, allow_patterns=["*.json", "*.safetensors"])

Copy your model artifact to Amazon S3 to improve model load time during deployment:

!aws s3 cp --recursive {local_model_path} {s3_model_path}

Select one of the available LMI container images for hosting. Efficient adapter inference capability is available in 0.31.0-lmi13.0.0 and higher.

inference_image_uri = "763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.31.0-lmi13.0.0-cu124"

Create a container environment for the hosting container. LMI container parameters can be found in the LMI Backend User Guides.

The parameters OPTION_MAX_LORAS and OPTION_MAX_CPU_LORAS control how adapters move between GPU, CPU, and disk. OPTION_MAX_LORAS sets a limit on the number of adapters concurrently stored in GPU memory, with excess adapters offloaded to CPU memory.  OPTION_MAX_CPU_LORAS determines how many adapters are staged in CPU memory, offloading excess adapters to local SSD storage.

In the following example, 30 adapters can live in GPU memory and 70 adapters in CPU memory before going to local storage.

env = {
    "HF_MODEL_ID": f"{s3_model_path}",
    "OPTION_ROLLING_BATCH": "lmi-dist",
    "OPTION_MAX_ROLLING_BATCH_SIZE": "16",
    "OPTION_TENSOR_PARALLEL_DEGREE": "max",
    "OPTION_ENABLE_LORA": "true",
    "OPTION_MAX_LORAS": "30",
    "OPTION_MAX_CPU_LORAS": "70",
    "OPTION_DTYPE": "fp16",
    "OPTION_MAX_MODEL_LEN": "6000"
}

With your container image and environment defined, you can create a SageMaker model object that you will use to create an inference component later:

model_name = sagemaker.utils.name_from_base("llama-3-1-8b-instruct")

create_model_response = sm_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = role,
    PrimaryContainer = {
        "Image": inference_image_uri,
        "Environment": env,
    },
)

Set up a SageMaker endpoint

To create a SageMaker endpoint, you need an endpoint configuration. When using inference components, you don’t specify a model in the endpoint configuration. You load the model as a component later on.

endpoint_config_name = f"{model_name}"
endpoint_name = f"{model_name}" # endpoint name referenced by later create_endpoint and invoke calls
variant_name = "AllTraffic"
instance_type = "ml.g5.12xlarge"
model_data_download_timeout_in_seconds = 900
container_startup_health_check_timeout_in_seconds = 900

initial_instance_count = 1

sm_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = role,
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": instance_type,
            "InitialInstanceCount": initial_instance_count,
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ]
)

Create the SageMaker endpoint with the following code:

create_endpoint_response = sm_client.create_endpoint(
    EndpointName = endpoint_name, EndpointConfigName = endpoint_config_name
)
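
Endpoint creation is asynchronous. Before you create inference components, you can wait for the endpoint to reach the InService status, for example:

# Block until the endpoint is in service, then print its status
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)
print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])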

With your endpoint created, you can now create the inference component for the base model. This will be the base component that the adapter components you create later will depend on.

Notable parameters here are ComputeResourceRequirements. These are a component-level configuration that determine the amount of resources that the component needs (memory, vCPUs, accelerators). The adapters will share these resources with the base component.

base_inference_component_name = f"base-{model_name}"

variant_name = "AllTraffic"

initial_copy_count = 1
min_memory_required_in_mb = 32000
number_of_accelerator_devices_required = 4

sm_client.create_inference_component(
    InferenceComponentName = base_inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification={
        "ModelName": model_name,
        "StartupParameters": {
            "ModelDataDownloadTimeoutInSeconds": model_data_download_timeout_in_seconds,
            "ContainerStartupHealthCheckTimeoutInSeconds": container_startup_health_check_timeout_in_seconds,
        },
        "ComputeResourceRequirements": {
            "MinMemoryRequiredInMb": min_memory_required_in_mb,
            "NumberOfAcceleratorDevicesRequired": number_of_accelerator_devices_required,
        },
    },
    RuntimeConfig={
        "CopyCount": initial_copy_count,
    },
)
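
The base inference component also takes a few minutes to come online. A simple polling loop might look like the following (this assumes the InferenceComponentStatus field returned by the DescribeInferenceComponent API):

import time

while True:
    status = sm_client.describe_inference_component(
        InferenceComponentName=base_inference_component_name
    )["InferenceComponentStatus"]
    print(status)
    if status in ("InService", "Failed"):
        break
    time.sleep(30)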

 In this example, you create a single adapter, but you could host up to hundreds of them per endpoint. They will need to be compressed and uploaded to Amazon S3.

The adapter package has the following files at the root of the archive with no sub-folders.

Adapter Files

For this example, an adapter was fine-tuned using QLoRA and Fully Sharded Data Parallel (FSDP) on the training split of the ECTSum dataset. Training took 21 minutes on an ml.p4d.24xlarge and cost approximately $13 using current on-demand pricing.
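
For reference, a minimal packaging sketch (the local directory name is a placeholder) that keeps every file at the root of the archive and uploads it to Amazon S3 could look like the following; the resulting URI is what you pass as the ArtifactUrl below:

import os
import tarfile

local_adapter_dir = "./ectsum-adapter"        # placeholder path to your fine-tuned adapter files
adapter_archive = "ectsum-adapter.tar.gz"

with tarfile.open(adapter_archive, "w:gz") as tar:
    for file_name in os.listdir(local_adapter_dir):
        # arcname keeps each file at the root of the archive, with no sub-folders
        tar.add(os.path.join(local_adapter_dir, file_name), arcname=file_name)

uploaded_adapter_uri = sess.upload_data(adapter_archive, bucket=bucket, key_prefix="lora-adapters/ectsum")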

For each adapter you are going to deploy, you need to specify an InferenceComponentName, an ArtifactUrl with the S3 location of the adapter archive, and a BaseInferenceComponentName to create the connection between the base model inference component and the new adapter inference components. You repeat this process for each additional adapter.

adapter_ic1_name = f"adapter-ectsum-{base_inference_component_name}"
adapter_s3_uri = "<<S3_PATH_FOR_YOUR_ADAPTER>>"

sm_client.create_inference_component(
    InferenceComponentName = adapter_ic1_name,
    EndpointName = endpoint_name,
    Specification={
        # Link this adapter to the base model inference component created earlier
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": adapter_s3_uri
        },
    },
)

Use the deployed adapter

First, you build a prompt to invoke the model for earnings summarization, filling in the source text with a random item from the ECTSum dataset. Then you store the ground truth summary from the item for comparison later.

from datasets import load_dataset
dataset_name = "mrSoul7766/ECTSum"

test_dataset = load_dataset(dataset_name, trust_remote_code=True, split="test")

test_item = test_dataset.shuffle().select(range(1))[0] # take a single random item as a dict so test_item["text"] is a string

prompt =f"""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are an AI assistant trained to summarize earnings calls.
    Provide a concise summary of the call, capturing the key points and overall context.
    Focus on quarter over quarter revenue, earnings per share, changes in debt, highlighted risks, and growth opportunities.
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    Summarize the following earnings call:

    {test_item["text"]}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>"""

ground_truth_response = test_item["summary"]

To test the base model, specify the EndpointName for the endpoint you created earlier and the name of the base inference component as InferenceComponentName, along with your prompt and other inference parameters in the Body parameter:

component_to_invoke = base_inference_component_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

base_model_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

To invoke the adapter, use the adapter inference component name in your invoke_endpoint call:

component_to_invoke = adapter_ic1_name

response_model = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = component_to_invoke,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response_model["Body"].read().decode("utf8"))["generated_text"]

Compare outputs

Compare the outputs of the base model and adapter to ground truth. While the base model might appear subjectively better in this test, the adapter’s response is actually much closer to the ground truth response. This will be proven with metrics in the next section.

Ground Truth:

q3 non-gaap earnings per share $3.71.
q3 sales rose 15.4 percent to $747 million.
bio rad laboratories - now anticipates 2021 non-gaap currency-neutral revenue growth between 12 to 13 percent.
sees 2021 estimated non-gaap operating margin of about 19.5 percent.
qtrly gaap income per share $129.96.

----------------------------------

Base Model Response:

Here is a summary of the earnings call:

**Key Points:**

* Revenue: $747 million, up 15.4% year-over-year (13.8% on a currency-neutral basis)
* Earnings per share: $129.96, up from $3 per share in Q3 2020
* Gross margin: 58.6% on a GAAP basis, 57.9% on a non-GAAP basis
* Operating income: $156.8 million, up from $109.6 million in Q3 2020
* Net income: $3.928

----------------------------------

Adapter Model Response:

                Here is a concise summary of the call:

                q3 revenue $747.6 million versus refinitiv ibes estimate of $753.9 million.
q3 earnings per share $3.71.
sees fy earnings per share $11.85 to $12.05.
sees fy 2021 non-gaap revenue growth to be 12% to 13%.
sees fy 2021 non-gaap gross margin to be 57.5% to 57.8%.
sees fy 2021 non-gaap operating margin to be 19.5%.

To validate the true adapter performance, you can use a tool like fmeval to run an evaluation of summarization accuracy. This will calculate the METEOR, ROUGE, and BertScore metrics for the adapter vs. the base model. Doing so against the test split of ECTSum yields the following results.

Testing Score Text

The fine-tuned adapter shows a 59% increase in METEOR score, 159% increase in ROUGE score, and 8.6% increase in BertScore.
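
If you only want a quick spot check on the single example above before running a full fmeval evaluation, a lightweight stand-in using the open source rouge_score package (not the fmeval workflow itself) might look like this:

from rouge_score import rouge_scorer

# Compare both responses against the ground truth summary from the dataset
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print("base   :", scorer.score(ground_truth_response, base_model_response))
print("adapter:", scorer.score(ground_truth_response, adapter_response))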

The following diagram shows the frequency distribution of scores for the different metrics, with the adapter consistently scoring better more often in all metrics.

Testing Scores

We observed an end-to-end latency difference of up to 10% between base model invocation and adapter invocation in our tests. If the adapter is loaded from CPU memory or disk, it incurs an additional cold start delay the first time it is loaded to the GPU. Depending on your container configuration and the instance type chosen, these values may vary.

Update an existing adapter

Because adapters are managed as inference components, you can update them on a running endpoint. SageMaker handles the unloading and deregistering of the old adapter and loading and registering of the new adapter onto every base inference component on all the instances that it is running on for this endpoint. To update an adapter inference component, use the update_inference_component API and supply the existing inference component name and the Amazon S3 path to the new compressed adapter archive.

You can train a new adapter, or re-upload the existing adapter artifact to test this functionality.

update_inference_component_response = sm_client.update_inference_component(
    InferenceComponentName = adapter_ic1_name,
    Specification={
        "Container": {
            "ArtifactUrl": new_adapter_s3_uri
        },
    },
)

Remove adapters

If you need to delete an adapter, call the delete_inference_component API with the inference component name to remove it:

sess = sagemaker.session.Session()
sess.delete_inference_component(adapter_ic1_name, wait = True)

Deleting the base model inference component will automatically delete the base inference component and any associated adapter inference components:

sess.delete_inference_component(base_inference_component_name, wait = True)
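To confirm that the components were removed, you can list the inference components still attached to the endpoint. The following is a brief sketch using the list_inference_components API; an empty list indicates that the base component and its adapters are gone. The sm_client and endpoint_name variables are the same ones used earlier in this post.

# List whatever inference components remain on the endpoint after deletion.
remaining = sm_client.list_inference_components(EndpointNameEquals=endpoint_name)
for ic in remaining["InferenceComponents"]:
    print(ic["InferenceComponentName"], ic["InferenceComponentStatus"])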

Pricing

SageMaker multi-adapter inference is generally available in AWS Regions US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Jakarta, Mumbai, Seoul, Singapore, Sydney, Tokyo), Canada (Central), Europe (Frankfurt, Ireland, London, Stockholm), Middle East (UAE), and South America (São Paulo), and is available at no extra cost.

Conclusion

The new efficient multi-adapter inference feature in SageMaker opens up exciting possibilities for customers with fine-tuning use cases. By allowing the dynamic loading of fine-tuned LoRA adapters, you can quickly and cost-effectively customize AI models to your specific needs. This flexibility unlocks new opportunities to deploy powerful, customized AI across organizations in industries like marketing, healthcare, and finance. The ability to manage these adapters at scale through SageMaker inference components makes it effortless to build tailored generative AI solutions.


About the Authors

Dmitry Soldatkin is a Senior Machine Learning Solutions Architect at AWS, helping customers design and build AI/ML solutions. Dmitry’s work covers a wide range of ML use cases, with a primary interest in generative AI, deep learning, and scaling ML across the enterprise. He has helped companies in many industries, including insurance, financial services, utilities, and telecommunications. He has a passion for continuous innovation and using data to drive business outcomes. Prior to joining AWS, Dmitry was an architect, developer, and technology leader in data analytics and machine learning fields in the financial services industry.

Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.

Ram Vegiraju is an ML Architect with the Amazon SageMaker Service team. He focuses on helping customers build and optimize their AI/ML solutions on Amazon SageMaker. In his spare time, he loves traveling and writing.

Read More

Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock

Improve the performance of your Generative AI applications with Prompt Optimization on Amazon Bedrock

Prompt engineering refers to the practice of writing instructions to get the desired responses from foundation models (FMs). You might have to spend months experimenting and iterating on your prompts, following the best practices for each model, to achieve your desired output. Furthermore, these prompts are specific to a model and task, and performance isn’t guaranteed when they are used with a different FM. This manual effort required for prompt engineering can slow down your ability to test different models.

Today, we are excited to announce the availability of Prompt Optimization on Amazon Bedrock. With this capability, you can now optimize your prompts for several use cases with a single API call or a click of a button on the Amazon Bedrock console.

In this post, we discuss how you can get started with this new feature using an example use case in addition to discussing some performance benchmarks.

Solution overview

At the time of writing, Prompt Optimization on Amazon Bedrock supports Anthropic's Claude 3 Haiku, Claude 3 Sonnet, Claude 3 Opus, and Claude 3.5 Sonnet models; Meta's Llama 3 70B and Llama 3.1 70B models; Mistral's Large model; and Amazon's Titan Text Premier model. Prompt Optimization can result in significant improvements for generative AI tasks; we share example performance benchmarks for several tasks later in this post.

In the following sections, we demonstrate how to use the Prompt Optimization feature. For our use case, we want to optimize a prompt that looks at a call or chat transcript, and classifies the next best action.

Use automatic prompt optimization

To get started with this feature, complete the following steps:

  1. On the Amazon Bedrock console, choose Prompt management in the navigation pane.
  2. Choose Create prompt.
  3. Enter a name and optional description for your prompt, then choose Create.

  4. For User message, enter the prompt template that you want to optimize.

For example, we want to optimize a prompt that looks at a call or chat transcript and classifies the next best action as one of the following:

  • Wait for customer input
  • Assign agent
  • Escalate

The following screenshot shows what our prompt looks like in the prompt builder.

  5. In the Configurations pane, for Generative AI resource, choose Models and choose your preferred model. For this example, we use Anthropic's Claude 3.5 Sonnet.
  6. Choose Optimize.

A pop-up appears that indicates that your prompt is being optimized.

When optimization is complete, you should see a side-by-side view of the original and the optimized prompt for your use case.

  7. Add values to your test variables (in this case, transcript) and choose Run.

You can then see the output from the model in the desired format.

As this example shows, the optimized prompt is more explicit, with clear instructions on how to process the original transcript provided as a variable. This results in the correct classification, in the required output format. After a prompt has been optimized, you can deploy it into an application by creating a version, which takes a snapshot of its configuration. Multiple versions can be stored so you can switch between prompt configurations for different use cases. See Prompt management for more details on prompt version control and deployment.
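You can also run the same optimization programmatically. The following is a minimal sketch, assuming the optimize_prompt API on the bedrock-agent-runtime client; the example prompt, target model ID, and the event names in the streamed response are illustrative and may differ slightly in your environment.

import boto3

# Minimal sketch: run Prompt Optimization through the SDK instead of the console.
client = boto3.client("bedrock-agent-runtime")

prompt = "Classify the next best action for this transcript: {{transcript}}"

response = client.optimize_prompt(
    input={"textPrompt": {"text": prompt}},
    targetModelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
)

# The optimized prompt is returned as an event stream.
for event in response["optimizedPrompt"]:
    if "optimizedPromptEvent" in event:
        print(event["optimizedPromptEvent"])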

Performance benchmarks

We ran the Prompt Optimization feature on several open source datasets. We are excited to share the improvements seen in a few important and common use cases that we see our customers working with:

  • Summarization (XSUM)
  • RAG-based dialog continuation (DSTC)
  • Function calling (GLAIVE)

To measure the performance improvement with respect to the baseline prompts, we use ROUGE-2 F1 for the summarization use case, HELM-F1 for the dialog continuation use case, and HELM-F1 and JSON matching for function calling. We saw a performance improvement of 18% on the summarization use case, 8% on dialog continuation, and 22% on function calling benchmarks. The following table contains the detailed results.

Use Case Original Prompt Optimized Prompt Performance Improvement
Summarization First, please read the article below.
{context}
 Now, can you write me an extremely short abstract for it?
<task>
Your task is to provide a concise 1-2 sentence summary of the given text that captures the main points or key information.
</task><context>
{context}
</context><instructions>
Please read the provided text carefully and thoroughly to understand its content. Then, generate a brief summary in your own words that is much shorter than the original text while still preserving the core ideas and essential details. The summary should be concise yet informative, capturing the essence of the text in just 1-2 sentences.
</instructions><result_format>
Summary: [WRITE YOUR 1-2 SENTENCE SUMMARY HERE]
</result_format>
18.04%
Dialog continuation Functions available:
{available_functions}
Examples of calling functions:
Input:
Functions: [{"name": "calculate_area", "description": "Calculate the area of a shape", "parameters": {"type": "object", "properties": {"shape": {"type": "string", "description": "The type of shape (e.g. rectangle, triangle, circle)"}, "dimensions": {"type": "object", "properties": {"length": {"type": "number", "description": "The length of the shape"}, "width": {"type": "number", "description": "The width of the shape"}, "base": {"type": "number", "description": "The base of the shape"}, "height": {"type": "number", "description": "The height of the shape"}, "radius": {"type": "number", "description": "The radius of the shape"}}}}, "required": ["shape", "dimensions"]}}]
Conversation history: USER: Can you calculate the area of a rectangle with a length of 5 and width of 3?
Output:
{"name": "calculate_area", "arguments": {"shape": "rectangle", "dimensions": {"length": 5, "width": 3}}}Input:
Functions: [{"name": "search_books", "description": "Search for books based on title or author", "parameters": {"type": "object", "properties": {"search_query": {"type": "string", "description": "The title or author to search for"}}, "required": ["search_query"]}}]
Conversation history: USER: I am looking for books by J.K. Rowling. Can you help me find them?
Output:
{"name": "search_books", "arguments": {"search_query": "J.K. Rowling"}}Input:
Functions: [{"name": "calculate_age", "description": "Calculate the age based on the birthdate", "parameters": {"type": "object", "properties": {"birthdate": {"type": "string", "format": "date", "description": "The birthdate"}}, "required": ["birthdate"]}}]
Conversation history: USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today?
Output:
{"name": "calculate_age", "arguments": {"birthdate": "1990-05-15"}}
Current chat history:
{conversation_history}
Respond to the last message. Call a function if necessary.

Task: Respond to the user's message in the given conversation by calling appropriate functions if necessary.

Instructions:
1. Review the list of available functions:
<available_functions>
{available_functions}
</available_functions>

2. Study the examples of how to call these functions:
<fewshot_examples>

<example>
H:
<context>Functions: [{"name": "calculate_area", "description": "Calculate the area of a shape", "parameters": {"type": "object", "properties": {"shape": {"type": "string", "description": "The type of shape (e.g. rectangle, triangle, circle)"}, "dimensions": {"type": "object", "properties": {"length": {"type": "number", "description": "The length of the shape"}, "width": {"type": "number", "description": "The width of the shape"}, "base": {"type": "number", "description": "The base of the shape"}, "height": {"type": "number", "description": "The height of the shape"}, "radius": {"type": "number", "description": "The radius of the shape"}}}}, "required": ["shape", "dimensions"]}}]</context>
<question>USER: Can you calculate the area of a rectangle with a length of 5 and width of 3?</question>
A:
<output>{"name": "calculate_area", "arguments": {"shape": "rectangle", "dimensions": {"length": 5, "width": 3}}}</output>
</example>

<example>
H:
<context>Functions: [{"name": "search_books", "description": "Search for books based on title or author", "parameters": {"type": "object", "properties": {"search_query": {"type": "string", "description": "The title or author to search for"}}, "required": ["search_query"]}}]</context>
<question>USER: I am looking for books by J.K. Rowling. Can you help me find them?</question>
A:
<output>{"name": "search_books", "arguments": {"search_query": "J.K. Rowling"}}</output>
</example>

<example>
H:
<context>Functions: [{"name": "calculate_age", "description": "Calculate the age based on the birthdate", "parameters": {"type": "object", "properties": {"birthdate": {"type": "string", "format": "date", "description": "The birthdate"}}, "required": ["birthdate"]}}]</context>
<question>USER: Hi, I was born on 1990-05-15. Can you tell me how old I am today?</question>
A:
<output>{"name": "calculate_age", "arguments": {"birthdate": "1990-05-15"}}</output>
</example>

</fewshot_examples>

3. Carefully read the current conversation history:
<conversation_history>
{conversation_history}
</conversation_history>

4. Analyze the last message from the user and determine if any of the available functions need to be called to provide an appropriate response.

5. If a function call is necessary, follow the format demonstrated in the examples to invoke the relevant function with the required arguments.

6. If no function call is needed, provide a direct response to the user's message.

7. Your response should be concise, relevant, and tailored to the specific context of the conversation.

8. Enclose your final response in <response></response> tags, without any additional preamble or explanation.

Provide your response immediately after these instructions, following the specified format.

8.23%
Function Calling

## Retrieved knowledge snippets

Domain: Restaurant
Entity: THE COPPER KETTLE
Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon..
Highlight: It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio.

Domain: Restaurant
Entity: THE COPPER KETTLE
Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions.
Highlight: We enjoyed a nice view from inside of the restaurant.

## Dialog

user: Howdy, I'm looking for a British restaurant for breakfast.

agent: There are several British restaurants available. Would you prefer a moderate or expensive price range?

user: Moderate price range please.

agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer?

user: I would like the Center of town please.

agent: How about The Copper Kettle?

user: Do they offer a good view?

agent:

Given the above information from a retrieval augmented generation (RAG) system, respond with the appropriate answer to the question from the user.

<task_description>
You are an advanced question-answering system that utilizes information from a retrieval augmented generation (RAG) system to provide accurate and relevant responses to user queries.
</task_description><instructions>
1. Carefully review the provided context information:
<context>
Domain: Restaurant
Entity: THE COPPER KETTLE
Review: My friend Mark took me to the copper kettle to celebrate my promotion. I decided to treat myself to Shepherds Pie. It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio. I may come back to try the strawberries and cream come time for Wimbledon..
Highlight: It was not as flavorful as I'd have liked and the consistency was just runny, but the servers were awesome and I enjoyed the view from the patio.Domain: Restaurant
Entity: THE COPPER KETTLE
Review: Last week, my colleagues and I visited THE COPPER KETTLE that serves British cuisine. We enjoyed a nice view from inside of the restaurant. The atmosphere was enjoyable and the restaurant was located in a nice area. However, the food was mediocre and was served in small portions.
Highlight: We enjoyed a nice view from inside of the restaurant.
</context>2. Analyze the user's question:
<question>
user: Howdy, I'm looking for a British restaurant for breakfast.agent: There are several British restaurants available. Would you prefer a moderate or expensive price range?user: Moderate price range please.agent: Five restaurants match your criteria. Four are in Centre area and one is in the West. Which area would you prefer?user: I would like the Center of town please.agent: How about The Copper Kettle?user: Do they offer a good view?

agent:
</question>

3. Leverage the context information and your knowledge to generate a concise and accurate answer to the user's question.

4. Ensure your response directly addresses the specific query while incorporating relevant details from the context.

5. Provide your answer in a clear and easy-to-understand manner, without any unnecessary preamble or explanation.
</instructions>

<output_format>
Answer: [Insert your concise answer here]
</output_format>

<example>
Context:
The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France. It is named after the engineer Gustave Eiffel, whose company designed and built the tower. Constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair, it was initially criticized by some of France's leading artists and intellectuals for its design, but it has become a global cultural icon of France and one of the most recognizable structures in the world.

Question: What is the Eiffel Tower?

Answer: The Eiffel Tower is a wrought-iron lattice tower in Paris, France, named after its designer Gustave Eiffel, and constructed as the centerpiece of the 1889 World's Fair.
</example>

22.03%

The consistent improvements across different tasks highlight the robustness and effectiveness of Prompt Optimization in enhancing prompt performance for a variety of natural language processing (NLP) tasks. This shows that Prompt Optimization can save you considerable time and effort while achieving better outcomes, by letting you test models with optimized prompts that implement the best practices for each model.

Conclusion

Prompt Optimization on Amazon Bedrock empowers you to effortlessly enhance your prompt's performance across a wide range of use cases with a single API call or a few clicks on the Amazon Bedrock console. The substantial improvements demonstrated on open source benchmarks for tasks like summarization, dialog continuation, and function calling underscore this feature's ability to significantly streamline the prompt engineering process. Prompt Optimization on Amazon Bedrock makes it straightforward to test many different models for your generative AI application, following the best prompt engineering practices for each model. The reduced manual effort will greatly accelerate the development of generative AI applications in your organization.

We encourage you to try out Prompt Optimization with your own use cases and reach out to us for feedback and collaboration.


About the Authors

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Chris Pecora is a Generative AI Data Scientist at Amazon Web Services. He is passionate about building innovative products and solutions while also focusing on customer-obsessed science. When not running experiments and keeping up with the latest developments in generative AI, he loves spending time with his kids.

Zhengyuan Shen is an Applied Scientist at Amazon Bedrock, specializing in foundational models and ML modeling for complex tasks including natural language and structured data understanding. He is passionate about leveraging innovative ML solutions to enhance products or services, thereby simplifying the lives of customers through a seamless blend of science and engineering. Outside work, he enjoys sports and cooking.

Shipra Kanoria is a Principal Product Manager at AWS. She is passionate about helping customers solve their most complex problems with the power of machine learning and artificial intelligence. Before joining AWS, Shipra spent over 4 years at Amazon Alexa, where she launched many productivity-related features on the Alexa voice assistant.

Read More

Search enterprise data assets using LLMs backed by knowledge graphs

Search enterprise data assets using LLMs backed by knowledge graphs

Enterprises are facing challenges in accessing their data assets scattered across various sources because of increasing complexities in managing vast amount of data. Traditional search methods often fail to provide comprehensive and contextual results, particularly for unstructured data or complex queries.

Search solutions in modern big data management must facilitate efficient and accurate search of enterprise data assets that can adapt to the arrival of new assets. Customers want to search through all of the data and applications across their organization, and they want to see the provenance information for all of the documents retrieved. The application needs to search through the catalog and show the metadata information related to all of the data assets that are relevant to the search context. To accomplish all of these goals, the solution should include the following features:

  • Provide connections between related entities and data sources
  • Consolidate fragmented data cataloging systems that contain metadata
  • Provide reasoning behind the search outputs

In this post, we present a generative AI-powered semantic search solution that empowers business users to quickly and accurately find relevant data assets across various enterprise data sources. In this solution, we integrate large language models (LLMs) hosted on Amazon Bedrock backed by a knowledge base that is derived from a knowledge graph built on Amazon Neptune to create a powerful search paradigm that enables natural language-based questions to integrate search across documents stored in Amazon Simple Storage Service (Amazon S3), data lake tables hosted on the AWS Glue Data Catalog, and enterprise assets in Amazon DataZone.

Foundation models (FMs) on Amazon Bedrock provide powerful generative models for text and language tasks. However, FMs lack domain-specific knowledge and reasoning capabilities. Knowledge graphs available on Neptune provide a means to represent interconnected facts and entities with inferencing and reasoning abilities for domains. Equipping FMs with structured reasoning abilities using domain-specific knowledge graphs harnesses the best of both approaches. This allows FMs to retain their inductive abilities while grounding their language understanding and generation in well-structured domain knowledge and logical reasoning. In the context of enterprise data asset search powered by a metadata catalog hosted on services such as Amazon DataZone, AWS Glue, and other third-party catalogs, knowledge graphs can help integrate this linked data and also enable a scalable search paradigm that integrates metadata that evolves over time.

Solution overview

The solution integrates with your existing data catalogs and repositories, creating a unified, scalable semantic layer across the entire data landscape. When users ask questions in plain English, the search is not just for keywords; it comprehends the query’s intent and context, relating it to relevant tables, documents, and datasets across your organization. This semantic understanding enables more accurate, contextual, and insightful search results, making the entire company’s data as accessible and simple to search as using a consumer search engine, but with the depth and specificity your business demands. This significantly enhances decision-making, efficiency, and innovation throughout your organization by unlocking the full potential of your data assets. The following video shows the sample working solution.

Using graph data processing and the integration of natural language-based search on embedded graphs, these hybrid systems can unlock powerful insights from complex data structures.

The solution presented in this post consists of an ingestion pipeline and a search application UI that the user can submit queries to in natural language while searching for data assets.

The following diagram illustrates the end-to-end architecture, consisting of the metadata API layer, ingestion pipeline, embedding generation workflow, and frontend UI.

The ingestion pipeline (3) ingests metadata (1) from services (2), including Amazon DataZone, AWS Glue, and Amazon Athena, to a Neptune database after converting the JSON response from the service APIs into an RDF triple format. The RDF is converted into text and loaded into an S3 bucket, which is accessed by Amazon Bedrock (4) as the source of the knowledge base. You can extend this solution to include metadata from third-party cataloging solutions as well. The end-users access the application, which is hosted on Amazon CloudFront (5).

A state machine in AWS Step Functions defines the workflow of the ingestion process by invoking AWS Lambda functions, as illustrated in the following figure.

The functions perform the following actions:

  1. Read metadata from services (Amazon DataZone, AWS Glue, and Athena) in JSON format. Enhance the JSON format metadata to JSON-LD format by adding context, and load the data to an Amazon Neptune Serverless database as RDF triples. The following is an example of RDF triples in N-triples file format:
    <arn:aws:glue:us-east-1:440577664410:table/default/market_sales_table#sales_qty_sold>
    <http://www.w3.org/2000/01/rdf-schema#label> "sales_qty_sold" .
    <arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#disnt> 
    <http://www.w3.org/2000/01/rdf-schema#label> "disnt" .
    <arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table> 
    <http://www.amazonaws.com/datacatalog/hasColumn> 
    <arn:aws:glue:us-east-1:440577664410:table/sampleenv_pub_db/mkt_sls_table#item_id> .
    <arn:aws:glue:us-east-1:440577664410:table/sampledata_pub_db/raw_customer> 
    <http://www.w3.org/2000/01/rdf-schema#label> "raw_customer" .

For more details about the RDF data format, refer to the W3C documentation. A minimal rdflib sketch of this JSON-to-RDF conversion follows this list.

  2. Run SPARQL queries in the Neptune database to populate additional triples from inference rules. This step enriches the metadata by using the graph inferencing and reasoning capabilities. The following is a SPARQL query that inserts new metadata inferred from existing triples:
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    INSERT
      {
        ?asset <http://www.amazonaws.com/datacatalog/exists_in_aws_account> ?account
      }
    WHERE
      {
        ?asset <http://www.amazonaws.com/datacatalog/isTypeOf> "GlueTableAssetType" .
        ?asset <http://www.amazonaws.com/datacatalog/catalogId> ?account .
      }

  3. Read triples from the Neptune database and convert them into text format using an LLM hosted on Amazon Bedrock. This solution uses Anthropic’s Claude 3 Haiku v1 for RDF-to-text conversion, storing the resulting text files in an S3 bucket.
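As noted in the first step, the JSON metadata returned by the service APIs is converted into RDF triples before being loaded into Neptune. The following is a minimal sketch of that conversion, assuming the rdflib Python package; the account ID, table name, and column name are placeholders that mirror the N-triples example shown earlier.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import RDFS

# Minimal sketch: turn AWS Glue table metadata (already fetched as JSON) into
# RDF triples and serialize them as N-triples. The account ID, table, and
# column names below are placeholders.
HAS_COLUMN = URIRef("http://www.amazonaws.com/datacatalog/hasColumn")

g = Graph()
table_uri = URIRef("arn:aws:glue:us-east-1:111122223333:table/sampleenv_pub_db/mkt_sls_table")
column_uri = URIRef(str(table_uri) + "#item_id")

g.add((table_uri, RDFS.label, Literal("mkt_sls_table")))
g.add((column_uri, RDFS.label, Literal("item_id")))
g.add((table_uri, HAS_COLUMN, column_uri))

# Serialize in N-triples format, matching the example in step 1.
print(g.serialize(format="nt"))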

Amazon Bedrock Knowledge Bases is configured to use the preceding S3 bucket as a data source to create a knowledge base. Amazon Bedrock Knowledge Bases creates vector embeddings from the text files using the Amazon Titan Text Embeddings v2 model.

A Streamlit application is hosted in Amazon Elastic Container Service (Amazon ECS) as a task, which provides a chatbot UI for users to submit queries against the knowledge base in Amazon Bedrock.
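When a user submits a question, the chatbot queries the knowledge base. The following is a minimal sketch of such a query using the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API; the knowledge base ID and model ARN are placeholders.

import boto3

# Minimal sketch: query the knowledge base the way the Streamlit chatbot does.
# The knowledge base ID and model ARN below are placeholders.
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "How to query sales data?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-haiku-20240307-v1:0",
        },
    },
)
print(response["output"]["text"])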

Prerequisites

The following are prerequisites to deploy the solution:

  • Create an Amazon Cognito user pool (or use an existing one), and capture the user pool ID and application client ID; these are required when launching the CloudFormation stack that builds the web application.
  • Create an Amazon Cognito user (for example, username=test_user) for your Amazon Cognito user pool that will be used to log in to the application. An email address must be included while creating the user.

Prepare the test data

A sample dataset is needed for testing the functionalities of the solution. In your AWS account, prepare a table using Amazon DataZone and Athena by completing Step 1 through Step 8 in Amazon DataZone QuickStart with AWS Glue data. This creates a table and captures its metadata in the Data Catalog and Amazon DataZone.

To test how the solution is combining metadata from different data catalogs, create another table only in the Data Catalog, not in Amazon DataZone. On the Athena console, open the query editor and run the following query to create a new table:

CREATE TABLE raw_customer AS SELECT 203 AS cust_id, 'John Doe' AS cust_name

Deploy the application

Complete the following steps to deploy the application:

  1. To launch the CloudFormation template, choose Launch Stack or download the template file (yaml) and launch the CloudFormation stack in your AWS account.
  2. Modify the stack name or leave as default, then choose Next.
  3. In the Parameters section, input the Amazon Cognito user pool ID (CognitoUserPoolId) and application client ID (CognitoAppClientId). This is required for successful deployment of the stacks.
  4. Review and update other AWS CloudFormation parameters if required. You can use the default values for all the parameters and continue with the stack deployment.
    The following table lists the default parameters for the CloudFormation template.

    Parameter Name Description Default Value
    EnvironmentName Unique name to distinguish different web applications in the same AWS account (min length 1 and max length 4). dev
    S3DataPrefixKB S3 object prefix where the knowledge base source documents (metadata files) should be stored. knowledge_base
    Cpu CPU configuration of the ECS task. 512
    Memory Memory configuration of the ECS task. 1024
    ContainerPort Port for the ECS task host and container. 80
    DesiredTaskCount Number of desired ECS task count. 1
    MinContainers Minimum containers for auto scaling. Should be less than or equal to DesiredTaskCount. 1
    MaxContainers Maximum containers for auto scaling. Should be greater than or equal to DesiredTaskCount. 3
    AutoScalingTargetValue CPU utilization target percentage for ECS task auto scaling. 80
  5. Launch the stack.

The CloudFormation stack creates the required resources to launch the application by invoking a series of nested stacks. It deploys the following resources in your AWS account:

  • An S3 bucket to save metadata details from AWS Glue, Athena, and Amazon DataZone, and its corresponding text data
  • An additional S3 bucket to store code, artifacts, and logs related to the deployment
  • A virtual private cloud (VPC), subnets, and network infrastructure
  • An Amazon OpenSearch Serverless index
  • An Amazon Bedrock knowledge base
  • A data source for the knowledge base that connects to the S3 data bucket provisioned, with an event rule to sync the data
  • A Lambda function that watches for objects dropped under the S3 prefix configured as parameter S3DataPrefixKB and starts an ingestion job using Amazon Bedrock Knowledge Bases APIs, which will read data from Amazon S3, chunk it, convert the chunks into embeddings using the Amazon Titan Embeddings model, and store these embeddings in OpenSearch Serverless
  • A serverless Neptune database to store the RDF triples
  • A Step Functions state machine that invokes a series of Lambda functions that read from the different AWS services, generate RDF triples, and convert them to text documents
  • An ECS cluster and service to host the Streamlit web application

After the CloudFormation stack is deployed, a Step Functions workflow runs automatically to orchestrate the metadata extract, transform, and load (ETL) job and store the final results in Amazon S3. You can view the execution status and details of the workflow by fetching the state machine Amazon Resource Name (ARN) from the CloudFormation stack. If AWS Lake Formation is enabled for the AWS Glue databases and tables in the account, complete the following steps after the CloudFormation stack is deployed to update the permissions, extract the metadata details from AWS Glue, and load the metadata into the knowledge base:

  1. Add a role to the AWS Glue Lambda function that grants access to the AWS Glue database.
  2. Fetch the state machine ARN from the CloudFormation stack.
  3. Run the state machine with default input values to extract the metadata details and write to Amazon S3.

You can search for the application stack name <MainStackName>-deploy-<EnvironmentName> (for example, mm-enterprise-search-deploy-dev) on the AWS CloudFormation console. Locate the web application URL in the stack outputs (CloudfrontURL). Launch the web application by choosing the URL link.

Use the application

You can access the application from a web browser using the domain name of the Amazon CloudFront distribution created in the deployment steps. Log in using a user credential that exists in the Amazon Cognito user pool.

Now you can submit a query using a text input. The AWS account used in this example contains sample tables related to sales and marketing. We ask the question, “How to query sales data?” The answer includes metadata on the table mkt_sls_table that was created in the previous steps.

We ask another question: “How to get customer names from sales data?” In the previous steps, we created the raw_customer table, which wasn’t published as a data asset in Amazon DataZone. The table only exists in the Data Catalog. The application returns an answer that combines metadata from Amazon DataZone and AWS Glue.

This powerful solution opens up exciting possibilities for enterprise data discovery and insights. We encourage you to deploy it in your own environment and experiment with different types of queries across your data assets. Try combining information from multiple sources, asking complex questions, and see how the semantic understanding improves your search experience.

Clean up

The total cost of running this setup is less than $10 per day. However, we recommend deleting the CloudFormation stack after use because the deployed resources incur costs. Deleting the main stack also deletes all the nested stacks except the VPC, because of resource dependencies. You also need to delete the VPC from the Amazon VPC console.

Conclusion

In this post, we presented a comprehensive and extendable multimodal search solution of enterprise data assets. The integration of LLMs and knowledge graphs shows that by combining the strengths of these technologies, organizations can unlock new levels of data discovery, reasoning, and insight generation, ultimately driving innovation and progress across a wide range of domains.

To learn more about LLM and knowledge graph use cases, refer to the following resources:


About the Authors

Sudipta Mitra is a Generative AI Specialist Solutions Architect at AWS, who helps customers across North America use the power of data and AI to transform their businesses and solve their most challenging problems. His mission is to enable customers achieve their business goals and create value with data and AI. He helps architect solutions across AI/ML applications, enterprise data platforms, data governance, and unified search in enterprises.

Gi Kim is a Data & ML Engineer with the AWS Professional Services team, helping customers build data analytics solutions and AI/ML applications. With over 20 years of experience in solution design and development, he has a background in multiple technologies, and he works with specialists from different industries to develop new innovative solutions using his skills. When he is not working on solution architecture and development, he enjoys playing with his dogs at a beach under the San Francisco Golden Gate Bridge.

Surendiran Rangaraj is a Data & ML Engineer at AWS who helps customers unlock the power of big data, machine learning, and generative AI applications for their business solutions. He works closely with a diverse range of customers to design and implement tailored strategies that boost efficiency, drive growth, and enhance customer experiences.

Read More

Embodied AI Chess with Amazon Bedrock

Embodied AI Chess with Amazon Bedrock

Generative AI continues to transform numerous industries and activities, with one such application being the enhancement of chess, a traditional human game, with sophisticated AI and large language models (LLMs). Using the Custom Model Import feature in Amazon Bedrock, you can now create engaging matches between foundation models (FMs) fine-tuned for chess gameplay, combining classical strategy with generative AI capabilities.


Amazon Bedrock provides managed access to leading FMs from Anthropic, Meta, Mistral AI, AI21 Labs, Cohere, Stability AI, and Amazon, enabling developers to build sophisticated AI-powered applications. These models demonstrate remarkable capabilities in understanding complex game patterns, strategic decision-making, and adaptive learning. With the Custom Model Import feature, you can now seamlessly deploy your customized chess models fine-tuned on specific gameplay styles or historical matches, eliminating the need to manage infrastructure while enabling serverless, on-demand inference. This capability allows you to experiment with fascinating matchups between:

  • Base FMs vs. custom fine-tuned models
  • Custom fine-tuned models trained on distinct grandmaster playing styles

In this post, we demonstrate Embodied AI Chess with Amazon Bedrock, bringing a new dimension to traditional chess through generative AI capabilities. Our setup features a smart chess board that can detect moves in real time, paired with two robotic arms executing those moves. Each arm is controlled by different FMs—base or custom. This physical implementation allows you to observe and experiment with how different generative AI models approach complex gaming strategies in real-world chess matches.

Solution overview

The chess demo uses a broad spectrum of AWS services to create an interactive and engaging gaming experience. The following architecture diagram illustrates the service integration and data flow in the demo.

Connected Edge Intelligence Chess with Amazon Bedrock - Architecture

On the frontend, AWS Amplify hosts a responsive React TypeScript application while providing secure user authentication through Amazon Cognito using the Amplify SDK. This authentication layer connects users to backend services through GraphQL APIs, managed by AWS AppSync, allowing for real-time data synchronization and game state management.

The application’s core backend functionality is handled by a combination of Unit and Pipeline Resolvers. Whereas Unit Resolvers manage lightweight operations such as game state management, creation, and deletion, the critical move-making processes are orchestrated through Pipeline Resolvers. These resolvers queue moves for processing by AWS Step Functions, providing reliable and scalable game flow management.

For generative AI-powered gameplay, Amazon Bedrock integration enables access to both FMs and custom fine-tuned models. The FMs fine-tuned using Amazon SageMaker are then imported into Amazon Bedrock through the Custom Model Import feature, making them available alongside FMs for on-demand access during gameplay. More details on fine-tuning and importing a fine-tuned FM into Amazon Bedrock can be found in the blog post Import a question answering fine-tuned model into Amazon Bedrock as a custom model.

The execution of chess moves on the board is coordinated by a custom component called Chess Game Manager, running on AWS IoT Greengrass. This component bridges the gap between the cloud infrastructure and the physical hardware.

When processing a move, the Step Functions workflow publishes a move request to an AWS IoT Core topic and pauses, awaiting confirmation. The Chess Game Manager component consumes the message and implements a three-phase validation system to make sure moves are executed accurately. First, it validates the intended move with the smart chessboard, which can detect piece positions. Second, it sends requests to the two robotic arms to physically move the chess pieces. Finally, it confirms with the smart chessboard that the pieces are in their correct positions after the move. This third-phase validation embodies the "trust but verify" principle in Embodied AI: the physical state of something may differ from what a dashboard shows, so the workflow only proceeds after the move's state is confirmed on the board. Once the move is confirmed, the component publishes a response message back to AWS IoT Core on a separate topic, which signals the Step Functions workflow to continue.
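A common way to implement this publish-and-pause pattern is the Step Functions task token callback. The following is a minimal sketch under that assumption; the topic names, payload fields, and helper functions are hypothetical, and the actual implementation is available in the GitHub repo.

import json
import boto3

# Minimal sketch of the wait-for-callback pattern described above.
# Topic names and payload fields are hypothetical.
iot_data = boto3.client("iot-data")
sfn = boto3.client("stepfunctions")

def publish_move_request(task_token: str, move: str) -> None:
    # Called from a Step Functions task of type waitForTaskToken:
    # publish the move request and let the workflow pause until confirmation.
    iot_data.publish(
        topic="chess/move/request",
        qos=1,
        payload=json.dumps({"move": move, "taskToken": task_token}),
    )

def on_move_confirmed(message: dict) -> None:
    # Invoked when the Chess Game Manager publishes to the response topic:
    # return the task token so the Step Functions workflow resumes.
    sfn.send_task_success(
        taskToken=message["taskToken"],
        output=json.dumps({"status": "MOVE_CONFIRMED"}),
    )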

The demo offers a few gameplay options. Players can choose from the following list of opponents:

  • Generative AI models available on Amazon Bedrock
  • Custom fine-tuned models deployed to Amazon Bedrock
  • Chess engines
  • Human opponents
  • Random moves

An infrastructure as code (IaC) approach was taken when constructing this project. You will use the AWS Cloud Development Kit (AWS CDK) to build the components for deployment into any AWS account. After you download the code base, you can deploy the project by following the instructions outlined in the GitHub repo.

Prerequisites

This post assumes you have the following:

Chess with fine-tuned models

Traditional approaches to chess AI have focused on handcrafted rules and search algorithms. These methods, though effective, often struggle to capture the nuanced decision-making and long-term strategic thinking characteristic of human grandmasters. More recently, reinforcement learning (RL) has shown promise in mastering chess by allowing AI agents to learn through self-play and trial and error. RL models can discover strategies and evaluate board positions, but they often require extensive computational resources and training time—typically several weeks to months of continuous learning to reach grandmaster-level play.

Fine-tuning generative AI FMs offers a compelling alternative by learning the underlying patterns and principles of chess in just a few days using standard GPU instances, making it a more resource-efficient approach for developing specialized chess AI. The fine-tuning process significantly reduces the time and computational resources needed because the model already understands basic patterns and structures, allowing it to focus on learning chess-specific strategies and tactics.

Prepare the dataset

This section dives into the process of preparing a high-quality dataset for fine-tuning a chess-playing model, focusing on extracting valuable insights from games played by grandmasters and world championship games.

At the heart of our dataset lies the Portable Game Notation (PGN), a standard chess format that records every aspect of a chess game. PGN includes Forsyth–Edwards Notation (FEN), which captures the exact position of pieces on the board at any given moment. Together, these formats store both the moves played and important game details like player names and dates, giving our model comprehensive data to learn from.

Dataset preparation consists of the following key steps:

  • Data acquisition – We begin by downloading a collection of games in PGN format from publicly available PGN files on the PGN mentor program website. We used the games played by Magnus Carlsen, a renowned chess grandmaster. You can download a similar dataset using the following commands:
# Download games zip file to the target directory - You may choose a different set of games – replace filename with the name of the file you want to download
curl -o /data/filename.zip https://www.pgnmentor.com/players/filename.zip

# Unzip the file in the target directory 
unzip filename.zip
  • Filtering for success – To train a model focused on winning strategies, we filter the games to include only games where the player emerged victorious. This allows the model to learn from successful games.
  • PGN to FEN conversion – Each move in a PGN file represents a transition in the chessboard state. To capture these states effectively, we convert PGN notation to FEN format. This conversion involves iterating through the moves in the PGN, updating the board state accordingly, and generating the corresponding FEN for each move (a minimal conversion sketch appears at the end of this section).

The following is a sample game in a PGN file:

[Event "Titled Tue DDth MMM Late"]
[Site "chess.com INT"]
[Date "YYYY.MM.DD"]
[Round "10"]
[White "Player 1 last name, Player 1 first name"]
[Black "Player 2 last name, Player 2 first name"]
[Result "0-1"]
[WhiteElo "2xxx"]
[BlackElo "2xxx"]
[ECO "A00"]

1.e4 c5 2.d4 cxd4 3.c3 Nc6 4.cxd4 d5 5.exd5 Qxd5 6.Nf3 e5 7.Nc3 Bb4 8.Bd2 Bxc3 9.Bxc3 e4 10.Nd2 Nf6 11.Bc4 Qg5 12.Qb3 O-O 13.O-O-O Bg4 14.h4 Bxd1 15.Rxd1 Qf5 16.g4 Nxg4 17.Rg1 Nxf2 18.d5 Ne5 19.Rg5 Qd7 20.Bxe5 f5 21.d6+  1-0

The following are sample JSON records with FEN, capturing the next move and the next color to move. We followed two approaches for creating the JSON records. For models that have a good understanding of the FEN format, we used a more concise record:

{
    "move": "d4",
    "fen": "rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2",
    "nxt_color": "WHITE"
}

For models with limited understanding of FEN format, we used a more detailed record:

{
    "move": "d4",
    "fen": "rnbqkbnr/pp1ppppp/8/2p5/4P3/8/PPPP1PPP/RNBQKBNR w KQkq - 0 2",
    "nxt_color": "WHITE",
    "move_history": "e4, c5"
}

The records include the following parameters:

  • move – A valid next move for the given FEN state.
  • fen – The current board position in FEN.
  • nxt_color – Which color has the next turn to move.
  • move_history – The history of game moves performed until the current board state.

For each game in the PGN file, multiple records similar to the preceding examples are created to capture the FEN, next move, and next move color.

  • Move validation – We validate the legality of each move captured in the records in the preceding format. This step maintains data integrity and prevents the model from learning incorrect or impossible chess moves.
  • Dataset splitting – We split the processed dataset into two parts: a training set and an evaluation set. The training set is used to train the model, and the evaluation set is used to assess the model’s performance on unseen data. This splitting helps us understand how well the model generalizes to new chess positions.

By following these steps, we create a comprehensive and refined dataset that enables our chess AI to learn from successful games, understand legal moves, and grasp the nuances of strategic chess play. This approach to data preparation creates the foundation for fine-tuning a model that can play chess at a high level.
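The following is a minimal sketch of the PGN-to-FEN conversion and JSON record creation described above, assuming the python-chess library; the input and output file names are placeholders, and filtering games by result is omitted for brevity.

import json
import chess.pgn

def pgn_to_records(pgn_path):
    # For each move in each game, record the board FEN before the move,
    # the move in SAN, and which color moves next. Filtering for games the
    # player won is omitted for brevity.
    records = []
    with open(pgn_path) as pgn_file:
        while (game := chess.pgn.read_game(pgn_file)) is not None:
            board = game.board()
            for move in game.mainline_moves():
                records.append({
                    "move": board.san(move),
                    "fen": board.fen(),
                    "nxt_color": "WHITE" if board.turn else "BLACK",
                })
                board.push(move)
    return records

# Write JSON Lines for the fine-tuning job; the file names are placeholders.
with open("train.jsonl", "w") as out:
    for record in pgn_to_records("Carlsen.pgn"):
        out.write(json.dumps(record) + "\n")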

Fine-tune a model

With our refined dataset prepared from successful games and legal moves, we now proceed to fine-tune a model using Amazon SageMaker JumpStart. The fine-tuning process requires clear instructions through a structured prompt template. Here again, based on the FM, we followed two approaches.

For fine-tuning an FM that understands FEN format, we used a more concise prompt template:

template = {
    "prompt": (
        "<s>[INST] You are a chess engine. Given a chess position in FEN notation and the color to move, provide the next best valid move in SAN (Standard Algebraic Notation) format to progress towards winning the game of chess. Your response must be a single move wrapped in <move></move> tags.\n\n"
        "Chess Position (FEN): {fen}\n"
        "Color to Move: {nxt_color} [/INST]"
    ),
    "completion": " <move>{move}</move> </s>"
}

Alternatively, for models with limited FEN knowledge, we provide a prompt template similar to the following:

template = {
    "prompt": (
        "<s>[INST]\nYou are a chess engine that provides the next best valid move in SAN format based on:\n- FEN position where:\n  Black pieces: p=pawn, r=rook, n=knight, b=bishop, q=queen, k=king (lowercase)\n  White pieces: P=pawn, R=rook, N=knight, B=bishop, Q=queen, K=king (uppercase)\n  Numbers 1-8 indicate consecutive empty squares\n- Color to move\n- Move history\n\nAnalyze these inputs to recommend a legal move that progresses toward winning. Respond with a single move in <move></move> tags.\n\n"
        "Chess Position (FEN): {fen}\n"
        "Color to Move: {nxt_color}\n"
        "Move History: {move_history}\n"
    ),
    "completion": " <move>{move}</move> </s>"
}

The training and evaluation datasets, along with the template.json file created from one of the preceding templates, are then uploaded to an Amazon Simple Storage Service (Amazon S3) bucket so they are ready for the fine-tuning job submitted through SageMaker JumpStart.
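The following is a brief sketch of that upload; the bucket, prefix, and file names are placeholders.

import boto3

# Upload the training data and prompt template to S3 so SageMaker JumpStart
# can read them. Bucket, prefix, and file names are placeholders.
s3 = boto3.client("s3")
bucket = "my-chess-finetuning-bucket"
prefix = "rivchess/train"

for filename in ["train.jsonl", "template.json"]:
    s3.upload_file(filename, bucket, f"{prefix}/{filename}")

train_test_data_location = f"s3://{bucket}/{prefix}"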

Now that the dataset is prepared and our model is selected, we submit a SageMaker training job with the following code:

from sagemaker.jumpstart.estimator import JumpStartEstimator

estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},  
    disable_output_compression=True,
    instance_type="ml.g5.24xlarge"
)
# By default, instruction tuning is set to false. 
estimator.set_hyperparameters(instruction_tuned=True, epoch="3", max_input_length="1024")
estimator.fit({"training": train_test_data_location})

Let’s break down the preceding code, and look at some important sections:

  • estimator – This is the SageMaker object used to accept all training parameters while launching and orchestrating the training job.
  • model_id – This is the SageMaker JumpStart model ID for the LLM that you need to fine-tune.
  • accept_eula – This EULA varies from provider to provider and must be accepted when deploying or fine-tuning models from SageMaker JumpStart.
  • instance_type – This is the compute instance the fine-tuning job will run on. In this case, it's an ml.g5.24xlarge. This instance contains 4 NVIDIA A10G GPUs with 96 GiB of total GPU memory. When deciding on an instance type, select the one that best balances your computational needs with your budget to maximize value.
  • fit – The .fit method is the actual line of code that launches the SageMaker training job. All of the algorithm metrics and instance usage metrics can be viewed in Amazon CloudWatch logs, which are directly integrated with SageMaker.

When the SageMaker training job is complete, the model artifacts will be stored in an S3 bucket specified either by the user or the system default.

The notebook we use for fine-tuning one of the models can be accessed in the following GitHub repo.

Challenges and best practices for fine-tuning

In this section, we discuss common challenges and best practices for fine-tuning.

Automated optimizations with SageMaker JumpStart

Fine-tuning an LLM for chess move prediction using SageMaker presents unique opportunities and challenges. We used SageMaker JumpStart to do the fine-tuning because it provides automated optimizations for different model sizes when fine-tuning for chess applications. SageMaker JumpStart automatically applies appropriate quantization techniques and resource allocations based on model size. For example:

  • 3B–7B models – Enables FSDP with full precision training
  • 13B models – Configures FSDP with optional 8-bit quantization
  • 70B models – Automatically implements 8-bit quantization and disables FSDP for stability

This means if you create a SageMaker JumpStart Estimator without explicitly specifying the int8_quantization parameter, it will automatically use these default values based on the model size you’re working with. This design choice is made because larger models (like 70B) require significant computational resources, so quantization is enabled by default to reduce the memory footprint during training.
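If you prefer to control these settings explicitly rather than rely on the size-based defaults, you can pass them as hyperparameters on the estimator. The following is a hedged sketch; the int8_quantization and enable_fsdp hyperparameter names follow the Llama-family JumpStart fine-tuning scripts and may differ for other model families.

# Hedged sketch: override the size-based defaults explicitly.
# Hyperparameter names follow the Llama-family JumpStart fine-tuning scripts
# and may differ for other model families.
estimator.set_hyperparameters(
    instruction_tuned=True,
    epoch="3",
    max_input_length="1024",
    int8_quantization="False",  # keep full precision for a 7B model
    enable_fsdp="True",         # shard the model across the instance's GPUs
)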

Data preparation and format

Dataset identification and preparation can be a challenge. We used readily available PGN datasets from world championships and grandmaster matches to streamline the data preparation process for chess LLM fine-tuning, significantly reducing the complexity of dataset curation.

Choosing the right chess format that produces optimal results with an LLM is critical for successful results post-fine-tuning. We discovered that Standard Algebraic Notation (SAN) significantly outperforms Universal Chess Interface (UCI) format in terms of training convergence and model performance.
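To make the difference concrete, the same opening move can be rendered in both notations with the chess Python library:

import chess

# The same move in SAN vs. UCI notation from the starting position.
board = chess.Board()
move = chess.Move.from_uci("g1f3")
print(board.san(move))  # Nf3  (SAN - the format we fine-tune on)
print(move.uci())       # g1f3 (UCI)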

Prompt consistency

Using consistent prompt templates during fine-tuning helps the model learn the expected input-output patterns more effectively, and Amazon Bedrock Prompt Management provides robust tools to create and manage these templates systematically. We recommend using the prompt template suggestions provided by the model providers for improved performance.

Model size and resource allocation

Successful LLM fine-tuning requires balancing quality against cost, and instance selection is a primary lever. You can start with the recommended instances in the following table and scale up depending on the quality you need and the time available for training.

Model Size Memory Requirements Recommended Instance and Quantization
3B – 7B 24 GB Fits on g5.2xlarge with QLoRA 4-bit quantization
8B – 13B 48 GB Requires g5.4xlarge with efficient memory management
70B 400 GB Needs g5.48xlarge or p4d.24xlarge with multi-GPU setup

Import the fine-tuned model into Amazon Bedrock

After the model is fine-tuned and the model artifacts are in the designated S3 bucket, it’s time to import it to Amazon Bedrock using Custom Model Import.

The following section outlines two ways to import the model: using the SDK or the Amazon Bedrock console.

The following is a code snippet showing how the model can be imported using the SDK:

import boto3

# Amazon Bedrock control plane client used to submit the import job
br_client = boto3.client("bedrock")

create_model_import_job_resp = br_client.create_model_import_job(
        jobName=rivchess_imp_jb_nm,
        importedModelName=rivchess_model_nm,
        roleArn=role_arn,
        modelDataSource=rivchess_model_src)

In the code snippet, a create model import job is submitted to import the fine-tuned model into Amazon Bedrock. The parameters in the job are as follows:

  • jobName – The name of the import job so it can be identified using the SDK or the Amazon Bedrock console
  • importedModelName – The name of the imported model, which is used to invoke inference through the SDK and to identify the model on the Amazon Bedrock console
  • roleArn – The role with the correct permissions to import a model into Amazon Bedrock
  • modelDataSource – The S3 location where the model artifacts were stored by the completed training job

To use the Amazon Bedrock console, complete the following steps:

  1. On the Amazon Bedrock console, under Foundation models in the navigation pane, choose Imported models.

  2. Choose Import model.

  3. Provide the following information:
    1. For Model name, enter a name for your model.
    2. For Import job name, enter a name for your import job.
    3. For Model import settings, select Amazon S3 bucket and enter your bucket location.
    4. Create an IAM role or use an existing one.
  4. Choose Import.

After the job is submitted, it appears in the queue on the Imported models page.

When the model import job is complete, the model may now be called for inference using the Amazon Bedrock console or SDK.

Test the fine-tuned model to play chess

To test the fine-tuned model that is imported into Amazon Bedrock, we use the AWS SDK for Python (Boto3) library to invoke the imported model. We played the fine-tuned model against the Stockfish library for games of up to 50 moves, or until the game was won by either the fine-tuned model or Stockfish.

The Stockfish Python library requires the appropriate version of the executable to be downloaded from the Stockfish website. We also use the chess Python library to visualize the status of the board. Setting an Elo rating on Stockfish essentially simulates a chess player of a particular strength; an Elo rating represents a player's strength as a numerical value.

The Stockfish and chess Python libraries are GPL-3.0 licensed, and any usage, modification, or distribution of these libraries must comply with the GPL-3.0 license terms. Review the license agreements before using the Stockfish and chess Python libraries.

The first step is to install the chess and Stockfish libraries:

!pip install chess stockfish --upgrade --quiet

We then initialize the Stockfish library. The path to the command line executable needs to be provided:

stockfish = Stockfish(path='/home/sagemaker-user/riv2024-chess/stockfish/stockfish-ubuntu-x86-64-sse41-popcnt')
stockfish.update_engine_parameters({"Hash": 2048, "UCI_Chess960": "true"})
stockfish.set_elo_rating(1350)
fen_state = stockfish.get_fen_position()

We set the Elo rating, using Stockfish API methods (set_elo_rating). Additional configuration can be provided by following the Stockfish Python Library documentation.

We initialize the chess Python library similarly with equivalent code to the Stockfish Python library initialization. Further configuration can be provided to the chess library following the chess Python library documentation.

import chess

# Set up a Chess960 board and share its position with Stockfish
board = chess.Board()
board.reset_board()
board.chess960 = True
stockfish.set_fen_position(board.fen())

Upon initialization, we play the fine-tuned model imported into Amazon Bedrock against the Stockfish library. In the following code, the first move is performed by Stockfish. Then the fine-tuned model is invoked using the Amazon Bedrock invoke_model API, wrapped in a helper function, by providing the current FEN position of the chess board (a sketch of this helper follows the game loop). The two sides alternate until one side wins or a total of 50 moves are played. We check whether each move proposed by the fine-tuned model is legal; if it is not, we re-invoke the fine-tuned model up to five times.

# Initialize move tracking (s is the separator used when printing the move list)
move_count = 0
move_list = []
s = ' '

while True:

    sfish_move = stockfish.get_best_move()
    try:
        move_color = 'WHITE' if board.turn else 'BLACK'
        uci_move = board.push_san(sfish_move).uci()
        stockfish.set_fen_position(board.fen())
        move_count += 1
        move_list.append(f"{sfish_move}")
        print(f'SF Move  - {sfish_move} | {move_color} | Is Move Legal: {stockfish.is_fen_valid(board.fen())} | FEN: {board.fen()} | Move Count: {move_count}')
    except (chess.InvalidMoveError, chess.IllegalMoveError) as e:
        print(f"Stockfish Error for {move_color}: {e}")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list - {s.join(move_list)}')
        break

    if board.is_checkmate():
        print("Stockfish won!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list - {s.join(move_list)}')
        break

    if board.is_stalemate():
        print("Draw!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list - {s.join(move_list)}')
        break

    next_turn = 'WHITE' if board.turn else 'BLACK'
    llm_next_move = get_llm_next_move(board.fen(), next_turn, None)
    if llm_next_move is None:
        print("Failed to get a move from LLM. Ending the game.")
        break

    ill_mov_cnt = 0
    while True:
        try:
            is_llm_move_legal = True
            prev_fen = board.fen()
            uci_move = board.push_san(llm_next_move).uci()
            is_llm_move_legal = stockfish.is_fen_valid(board.fen())
            if is_llm_move_legal:
                print(f'LLM Move - {llm_next_move} | {next_turn} | Is Move Legal: {stockfish.is_fen_valid(board.fen())} | FEN: {board.fen()} | Move Count: {move_count}')
                stockfish.set_fen_position(board.fen())
                move_count += 1
                move_list.append(f"{llm_next_move}")
                break
            else:
                board.pop()
                print('Popping board and retrying LLM Next Move!!!')
                llm_next_move = get_llm_next_move(board.fen(), next_turn, llm_next_move, s.join(move_list))
        except (chess.AmbiguousMoveError, chess.IllegalMoveError, chess.InvalidMoveError) as e:
            print(f"LLM Error #{ill_mov_cnt}: {llm_next_move} for {next_turn} is illegal move!!! for {prev_fen}  | FEN: {board.fen()}")
            if ill_mov_cnt == 5:
                print(f"{ill_mov_cnt} illegal moves so far, exiting....")
                break
            ill_mov_cnt += 1
            llm_next_move = get_llm_next_move(board.fen(), next_turn, llm_next_move)

    # After the LLM's move, check for game-ending conditions
    if board.is_checkmate():
        print("LLM won!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list - {s.join(move_list)}')
        break

    if board.is_stalemate():
        print("Draw!")
        print(f"### Move Count: {move_count} ###")
        print(f'Moves list - {s.join(move_list)}')
        break
    if move_count == 50:
        print("Played 50 moves hence quitting!!!!")
        break
board
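
The game loop relies on a get_llm_next_move helper that wraps the Amazon Bedrock invoke_model call shown earlier. The full implementation is available in the notebook; the following simplified sketch shows its general shape, where the prompt wording, request body fields, and response parsing are assumptions rather than the exact code used.

import json
import boto3

br_runtime = boto3.client("bedrock-runtime")

def get_llm_next_move(fen, turn, illegal_move, move_history=None):
    # Build a prompt from the current position; the wording here is illustrative
    prompt = (
        f"You are a chess grandmaster playing {turn}. "
        f"The current board position in FEN notation is: {fen}. "
        "Reply with only your next move in SAN notation."
    )
    if illegal_move:
        prompt += f" Your previous suggestion {illegal_move} was illegal; choose a different move."
    if move_history:
        prompt += f" Moves played so far: {move_history}."

    body = {"prompt": prompt, "max_gen_len": 16, "temperature": 0.1}
    try:
        response = br_runtime.invoke_model(
            modelId=imported_model_arn,  # ARN of the imported model, as in the earlier snippet
            body=json.dumps(body),
        )
        output = json.loads(response["body"].read())
        # The response field name depends on the model architecture; "generation" is assumed here
        return output.get("generation", "").strip()
    except Exception as err:
        print(f"Model invocation failed: {err}")
        return None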

We observe and measure the effectiveness of the model by counting the number of legal moves it is able to propose.

The notebook we use for testing the fine-tuned model can be accessed from the following GitHub repo.

Deploy the project

You can initiate the deployment of the project using instructions outlined in the GitHub repo, starting with the following command:

pnpm cdk deploy

This command deploys an AWS CloudFormation stack to your AWS account. After the stack is successfully deployed, you can begin setting up user access. Navigate to the newly created Amazon Cognito user pool, where you can create your own user account for logging in to the application. After creating your account, add yourself to the admin group to gain administrative privileges within the application.
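
If you prefer to script this step, the following sketch uses the Boto3 cognito-idp client to add your user to the admin group. The user pool ID and username are placeholders, and the group name is assumed to match the admin group created by the stack.

import boto3

cognito = boto3.client("cognito-idp")

# Replace the placeholders with the user pool ID created by the stack and your username
cognito.admin_add_user_to_group(
    UserPoolId="<your-user-pool-id>",
    Username="<your-username>",
    GroupName="admin",
)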

After you complete the user setup, navigate to Amplify, where your chess application should now be visible. You’ll find a published URL for your hosted demo—simply choose this link to access the application. Use the login credentials you created in the Amazon Cognito user pool to access and explore the application.

After you’re logged in with admin privileges, you’ll be automatically directed to the /admin page. You can perform the following actions on this page:

  • Create a session (game instance) by selecting from various gameplay options.
  • Start the game from the admin panel.
  • Choose the session to load the necessary cookie data.
  • Navigate to the participants screen to view and test the game. The interface is intuitive, but following these steps in order ensures proper game setup and functionality.

Set up the AWS IoT Core resources

Configuring the solution for IoT gameplay follows a similar process to the previous section—you’ll still need to deploy the UI stack. However, this deployment includes an additional IoT flag that signals the stack to deploy the AWS IoT rules in charge of handling game requests and responses. The specific deployment steps are outlined in this section.

Follow the steps from before, but add the following flag when deploying:

pnpm cdk deploy -c iotDevice=true

This will deploy the solution, adding a critical step to the Step Functions workflow, which publishes a move request message to the topic of an AWS IoT rule and then waits for a response.

Users will need to configure an IoT edge device to consume game requests from this topic. This involves setting up a device capable of publishing and subscribing to topics using the MQTT protocol, processing move requests, and sending success messages back to the topic of the AWS IoT rule that is waiting for responses, which then feeds back into the Step Functions workflow. Although the configuration is flexible and can be customized to your needs, we recommend using AWS IoT Greengrass on your edge device. AWS IoT Greengrass is an open source edge runtime and cloud service for building, deploying, and managing device software. This enables secure topic communication between your IoT devices and the AWS Cloud, allowing you to perform edge verifications such as controlling the robotic arms and synchronizing with the physical board before publishing either a success or failure message back to the cloud.
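
As an illustration of what such an edge client could look like, the following sketch uses the AWS IoT Device SDK v2 for Python to subscribe to a move request topic and publish a response. The topic names, payload fields, and the edge-side handling are assumptions; the actual topics and message formats are defined by the deployed AWS IoT rules and your device configuration.

import json
import threading

from awscrt import mqtt
from awsiot import mqtt_connection_builder

# Placeholder topics; use the topics configured by the deployed AWS IoT rules
REQUEST_TOPIC = "chess/move/request"
RESPONSE_TOPIC = "chess/move/response"

connection = mqtt_connection_builder.mtls_from_path(
    endpoint="<your-iot-endpoint>",
    cert_filepath="device.pem.crt",
    pri_key_filepath="private.pem.key",
    ca_filepath="AmazonRootCA1.pem",
    client_id="chess-edge-device",
    clean_session=False,
    keep_alive_secs=30,
)

def on_move_request(topic, payload, **kwargs):
    request = json.loads(payload)
    # Placeholder for edge-side work: drive the robotic arm and verify the physical board
    result = {"requestId": request.get("requestId"), "status": "SUCCESS"}
    connection.publish(topic=RESPONSE_TOPIC, payload=json.dumps(result), qos=mqtt.QoS.AT_LEAST_ONCE)

connection.connect().result()
subscribe_future, _ = connection.subscribe(
    topic=REQUEST_TOPIC, qos=mqtt.QoS.AT_LEAST_ONCE, callback=on_move_request
)
subscribe_future.result()

# Keep the client alive to continue receiving move requests
threading.Event().wait()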

Set up a Greengrass core device and client devices

To set up an AWS IoT Greengrass V2 core device, you can deploy the Chess Game Manager component to it by following the instructions in the GitHub repo for the Greengrass component. The component contains a recipe, where you need to define the configuration required for your IoT devices. The default configuration contains a list of topics used to process game requests and responses, perform board validations and notifications of new moves, and coordinate move requests and responses from the robotic arms. You also need to update the names of the client devices that will connect to the component; these client devices must be registered as AWS IoT things in AWS IoT Core.

You will also need a client application that controls the robotic arms and a client application that fetches information from the smart chess board. Both client applications need to connect and communicate with the Greengrass core device running the Chess Game Manager component. In our demo, we tested with two separate robotic arm client applications: for the first, we used a pair of CR10A arms from Dobot Robotics and communicated with them using the TCP-IP-CR-Python-V4 SDK; for the second, we used a pair of RO1 arms from Standard Bots, using the Standard Bots API. For the smart chess board client application, we used a DGT Smart Board, which comes with a USB cable that allows us to fetch piece move updates over serial communication.

Preventing illegal moves

When using FMs in Amazon Bedrock to generate the next move, the system employs a retry mechanism that makes three distinct attempts with the generative AI model, each providing more context than the last:

  • First attempt – The model is prompted to predict the next best move based on the current board state.
  • Second attempt – If the first move was illegal, the model is informed of its failure and prompted to try again, including the context of why the previous attempt failed.
  • Third attempt – If still unsuccessful, the model is again given the context of its past illegal moves and an explanation of the failures, but this time the prompt also includes a list of all available legal moves. The model is then prompted to select the next logical move from this list.

If all three generative AI attempts fail, the system automatically falls back to a chess engine for a guaranteed valid move, as illustrated in the following sketch.
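
The following sketch illustrates this escalation pattern with the chess and Stockfish libraries used earlier. The invoke_fm function stands in for whatever call returns a move from the Amazon Bedrock model, and the prompt wording is an assumption; only the retry structure and the engine fallback mirror the behavior described above.

import chess

def is_legal_san(board, san_move):
    # Return True if san_move parses to a legal move on the given board
    try:
        return board.parse_san(san_move) in board.legal_moves
    except ValueError:
        return False

def next_move_with_fallback(board, stockfish, invoke_fm):
    fen = board.fen()

    # Attempt 1: current board state only
    first = invoke_fm(f"FEN: {fen}. Suggest the next best move in SAN notation.")
    if first and is_legal_san(board, first):
        return first

    # Attempt 2: explain why the previous attempt failed
    second = invoke_fm(f"FEN: {fen}. Your previous move {first} was illegal. Suggest a different move.")
    if second and is_legal_san(board, second):
        return second

    # Attempt 3: constrain the model to the list of legal moves
    legal = ", ".join(board.san(m) for m in board.legal_moves)
    third = invoke_fm(
        f"FEN: {fen}. Previous suggestions {first} and {second} were illegal. "
        f"Choose one move from this list: {legal}."
    )
    if third and is_legal_san(board, third):
        return third

    # Fallback: the chess engine guarantees a valid move (converted from UCI to SAN)
    stockfish.set_fen_position(fen)
    return board.san(chess.Move.from_uci(stockfish.get_best_move()))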

For the custom fine-tuned models imported into Amazon Bedrock, the system employs a retry mechanism that makes five distinct attempts with the model. If all five attempts fail, the system automatically falls back to a chess engine for a guaranteed move.

During chess evaluation tests, models that underwent fine-tuning with over 100,000 training records demonstrated notable effectiveness. These enhanced models prevailed in 80% of their matches against base versions, and the remaining 20% ended in draws.

Clean up

To clean up and remove all deployed resources, run the following command from the AWS CLI:

pnpm cdk destroy

To clean up the imported models in Amazon Bedrock, use the following code:

aws bedrock delete-imported-model \
   --model-identifier <your-model-name> \
   --region <your aws region>

You can also delete the imported models by going to the Amazon Bedrock console and selecting the imported model on the Imported models page.

To clean up the imported models in the S3 bucket, use the following commands after replacing the values corresponding to your environment:

# Delete a single model file

aws s3 rm s3://bucket-name/path/to/model/file

# Delete multiple model files in a directory

aws s3 rm s3://bucket-name/models/ --recursive

# Delete specific model files using include/exclude patterns

aws s3 rm s3://bucket-name/ --recursive --exclude "*" --include "model*.tar.gz"

These commands use the following parameters:

  • --recursive – Required when deleting multiple files or directories
  • --dryrun – Optionally tests the deletion command without actually removing files

Conclusion

This post demonstrated how you can fine-tune FMs to create Embodied AI Chess, showcasing the seamless integration of cloud services, IoT capabilities, and physical robotics. With the comprehensive suite of AWS services, including Amazon Bedrock Custom Model Import, Amazon S3, AWS Amplify, AWS AppSync, AWS Step Functions, AWS IoT Core, and AWS IoT Greengrass, developers can create immersive chess experiences that bridge the digital and physical realms.

Give this solution a try and let us know your feedback in the comments.


About the Authors

Channa Samynathan is a Senior Worldwide Specialist Solutions Architect for AWS Edge AI & Connected Products, bringing over 28 years of diverse technology industry experience. Having worked in over 26 countries, his extensive career spans design engineering, system testing, operations, business consulting, and product management across multinational telecommunication firms. At AWS, Channa uses his global expertise to design IoT applications from edge to cloud, educate customers on the value proposition of AWS, and contribute to customer-facing publications.

Dwaragha Sivalingam is a Senior Solutions Architect specializing in generative AI at AWS, serving as a trusted advisor to customers on cloud transformation and AI strategy. With seven AWS certifications including ML Specialty, he has helped customers in many industries, including insurance, telecom, utilities, engineering, construction, and real estate. A machine learning enthusiast, he balances his professional life with family time, enjoying road trips, movies, and drone photography.

Daniel Sánchez is a senior generative AI strategist based in Mexico City with over 10 years of experience in cloud computing, specializing in machine learning and data analytics. He has worked with various developer groups across Latin America and is passionate about helping companies accelerate their businesses using the power of data.

Jay Pillai is a Principal Solutions Architect at AWS. In this role, he functions as the Lead Architect, helping partners ideate, build, and launch Partner Solutions. As an Information Technology Leader, Jay specializes in artificial intelligence, generative AI, data integration, business intelligence, and user interface domains. He holds 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Mohammad Tahsin is an AI/ML Specialist Solutions Architect at Amazon Web Services. He lives for staying up to date with the latest technologies in AI/ML and helping guide customers to deploy bespoke solutions on AWS. Outside of work, he loves all things gaming, digital art, and cooking.

Nicolai van der Smagt is a Senior Solutions Architect at AWS. Since joining in 2017, he’s worked with startups and global customers to build innovative solutions using AI on AWS. With a strong focus on real-world impact, he helps customers bring generative AI projects from concept to implementation. Outside of work, Nicolai enjoys boating, running, and exploring hiking trails with his family.

Patrick O’Connor is a WorldWide Prototyping Engineer at AWS, where he assists customers in solving complex business challenges by developing end-to-end prototypes in the cloud. He is a creative problem-solver, adept at adapting to a wide range of technologies, including IoT, serverless tech, HPC, distributed systems, AI/ML, and generative AI.

Paul Vincent is a Principal Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. He works with AWS customers to bring their innovative ideas to life. Outside of work, he loves playing drums and piano, talking with others through Ham radio, all things home automation, and movie nights with the family.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Sam Castro is a Sr. Prototyping Architect on the AWS Prototyping and Cloud Engineering (PACE) team. With a strong background in software delivery, IoT, serverless technologies, and generative AI, he helps AWS customers solve complex challenges and explore innovative solutions. Sam focuses on demystifying technology and demonstrating the art of the possible. In his spare time, he enjoys mountain biking, playing soccer, and spending time with friends and family.

Tamil Jayakumar is a Specialist Solutions Architect & Prototyping Engineer with AWS specializing in IoT, robotics, and generative AI. He has over 14 years of proven experience in software development, creating minimum viable products (MVPs) and end-to-end prototypes. He is a hands-on technologist, passionate about solving technology challenges using innovative solutions both on software and hardware, aligning business needs to IT capabilities.
