In Part 1 of this series, we introduced Amazon SageMaker Fast Model Loader, a new capability in Amazon SageMaker that significantly reduces the time required to deploy and scale large language models (LLMs) for inference. We discussed how this innovation addresses one of the major bottlenecks in LLM deployment: the time required to load massive models onto accelerators. By streaming model weights directly from Amazon Simple Storage Service (Amazon S3) to the accelerator, Fast Model Loader can achieve up to 15 times faster loading times compared to traditional methods.
As the AI landscape continues to evolve and models grow even larger, innovations like Fast Model Loader become increasingly crucial. By significantly reducing model loading times, this feature has the potential to transform the way you deploy and scale your LLMs, enabling more responsive and efficient AI applications across a wide range of use cases.
In this post, we provide a detailed, hands-on guide to implementing Fast Model Loader in your LLM deployments. We explore two approaches: using the SageMaker Python SDK for programmatic implementation, and using the Amazon SageMaker Studio UI for a more visual, interactive experience. Whether you’re a developer who prefers working with code or someone who favors a graphical interface, you’ll learn how to take advantage of this powerful feature to accelerate your LLM deployments.
Solution overview
Fast Model Loader is currently integrated with SageMaker Large Model Inference (LMI) containers (starting with v13) for GPU instances. It introduces two key techniques to enable lightning-fast model loads:
Weight streaming
Model sharding for streaming
Use Fast Model Loader with the SageMaker Python SDK
In this section, we show how to use this new feature with the SageMaker Python SDK. You can find the example notebook in the following GitHub repo. Complete the following steps:
First, use ModelBuilder to prepare and package the model inference components.
The SchemaBuilder parameter is used to infer the serialization and deserialization methods for the model. For more information on SchemaBuilder, refer to Define serialization and deserialization methods.
You can choose to specify OPTION_TENSOR_PARALLEL_DEGREE as a ModelBuilder environment variable as shown in the following commented lines, or in the next step as part of the ModelBuilder sharding_config:
from sagemaker.serve.builder.model_builder import ModelBuilder
from sagemaker.serve.builder.schema_builder import SchemaBuilder
import logging
# Define sample input and output for the model
prompt = "Falcons are"
response = "Falcons are small to medium-sized birds of prey related to hawks and eagles."
# Create the input schema structure
sample_input = {
    "inputs": prompt,
    "parameters": {"max_new_tokens": 32}
}
# Define the expected output format
sample_output = [{"generated_text": response}]
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
    #env_vars={
    #    "OPTION_TENSOR_PARALLEL_DEGREE": "8",
    #},
)
Next, use the optimize() function to prepare the model shards for deployment.
The optimize() function will start a model optimization job and will take a few minutes to finish. The tensor parallel degree should be set to how many GPUs you want each inference component to have access to. You can find the model shards at the output_path S3 location under a folder starting with sagemaker-fast-model-loader-xxx.
model_builder.optimize(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    output_path=output_path,
    sharding_config={
        "OverrideEnvironment": {
            # The value must equal the number of GPUs that each inference component (IC) will use.
            "OPTION_TENSOR_PARALLEL_DEGREE": "8"
        }
    }
)
You can reuse the sharded model that was generated by previous optimization jobs. The following code sample demonstrates how to use model_metadata to overwrite the model path, which needs to point to the Amazon S3 location of the existing model shards:
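The original snippet isn't reproduced here; the following is a minimal sketch that assumes the CUSTOM_MODEL_PATH key in model_metadata and reuses the role, session, and schema objects defined earlier. Point the path at the S3 prefix that contains the existing model shards:
# Reuse shards produced by a previous optimization job by pointing
# model_metadata at their S3 location (key name assumed here)
model_builder = ModelBuilder(
    model="meta-textgeneration-llama-3-1-70b",
    model_metadata={
        # S3 prefix that contains the sagemaker-fast-model-loader-xxx output
        "CUSTOM_MODEL_PATH": "s3://<your-bucket>/<path-to-existing-model-shards>/",
    },
    role_arn=role,
    sagemaker_session=sess,
    schema_builder=SchemaBuilder(sample_input=sample_input, sample_output=sample_output),
)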
When the model optimization job is complete, you can use the build() function to generate the artifacts according to the model server:
# use the build() function to generate the artifacts according to the model server
final_model = model_builder.build()
If you’re using existing model shards without running an optimization job, you need to make sure the _is_sharded_model value is set to True and EnableNetworkIsolation is set to False, because Fast Model Loader requires network access:
# You only need to set the values if you are using existing sharded models
if not final_model._is_sharded_model:
    final_model._is_sharded_model = True
if final_model._enable_network_isolation:
    final_model._enable_network_isolation = False
Use the deploy() function to deploy the model to an endpoint, where you can specify the required resources, such as GPU memory and number of accelerators:
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
resources_required = ResourceRequirements(
    requests={
        "memory": 204800,
        "num_accelerators": 8
    }
)
# deploy the optimized model to an endpoint
final_model.deploy(
    instance_type="ml.p4d.24xlarge",
    accept_eula=True,
    endpoint_logging=False,
    resources=resources_required
)
After the endpoint is up and running, you can test the endpoint using the following code example:
from sagemaker.predictor import retrieve_default
endpoint_name = final_model.endpoint_name
predictor = retrieve_default(endpoint_name)
payload = { "inputs": "I believe the meaning of life is",
"parameters": {
"max_new_tokens": 64,
"top_p": 0.9,
"temperature": 0.6
}
}
response = predictor.predict(payload)
print(response)
To clean up, run the following code cell to delete the resources created for the endpoint:
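The cleanup cell isn't reproduced here; a minimal sketch using the predictor created earlier could look like the following (adjust to your setup; if you deployed with inference components, those may need to be deleted before the endpoint):
# Sketch: remove the resources created in this example
predictor.delete_model()      # deletes the SageMaker model resource
predictor.delete_endpoint()   # deletes the endpoint and its endpoint configuration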
Use Fast Model Loader with the SageMaker Studio UI
In this section, we show how to use Fast Model Loader through the SageMaker Studio UI. Complete the following steps:
On the SageMaker Studio console, choose JumpStart in the navigation pane.
Choose your model.
On the model details page, choose Optimize.
Accept the EULA and proceed to the optimization configurations.
Select Fast model loading and set the OPTION_TENSOR_PARALLEL_DEGREE to 8, because this example uses an ml.p4d.24xlarge instance that has 8 GPUs. If you’re using an instance with a different number of GPUs, set the value to match the instance.
Set the output path to the Amazon S3 path where the sharded model will be stored.
Choose Create job.
After the inference optimization job starts, you can check the status of the job on the Inference optimization page. Each job has tags indicating which optimization configuration was used.
View the details of the job by choosing the job ID.
Deploy the optimized model by choosing Deploy on the optimized job page.
Verify the endpoint settings and choose Deploy to initiate a SageMaker endpoint deployment.
You will get a notification on the SageMaker Studio UI, and the status will change to In service when the endpoint creation is complete.
You can now send a sample inference request to test the model.
After the test, you can delete the endpoint from the SageMaker Studio console to clean up the resources created in this example.
Conclusion
Fast Model Loader represents a significant advancement in how you can deploy and scale LLMs on SageMaker. In this post, we walked through the step-by-step process of implementing this feature through both the SageMaker Python SDK and SageMaker Studio UI. By using weight streaming and model sharding techniques, you can now achieve dramatically faster model loading times, enabling more responsive scaling for your LLM-based applications.
The integration with SageMaker LMI containers (starting from LMI v13) makes it straightforward to adopt this feature in your existing workflows. Whether you’re dealing with bursty traffic patterns or need to rapidly scale your LLM services, Fast Model Loader provides the tools you need to optimize your model deployment pipeline.
Try out Fast Model Loader for your own use case, and leave your feedback and questions in the comments.
About the Authors
Melanie Li, PhD, is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries.
James Park is a Solutions Architect at Amazon Web Services. He works with Amazon.com to design, build, and deploy technology solutions on AWS, and has a particular interest in AI and machine learning. In his spare time, he enjoys seeking out new cultures, new experiences, and staying up to date with the latest technology trends. You can find him on LinkedIn.
Saurabh Trikande is a Senior Product Manager for Amazon Bedrock and SageMaker Inference. He is passionate about working with customers and partners, motivated by the goal of democratizing AI. He focuses on core challenges related to deploying complex AI applications, inference with multi-tenant models, cost optimizations, and making the deployment of Generative AI models more accessible. In his spare time, Saurabh enjoys hiking, learning about innovative technologies, following TechCrunch, and spending time with his family.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on enhancing efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Raghu Ramesha is an ML Solutions Architect with the Amazon SageMaker Service team. He focuses on helping customers build, deploy, and migrate ML production workloads to SageMaker at scale. He specializes in machine learning, AI, and computer vision domains, and holds a master’s degree in Computer Science from UT Dallas. In his free time, he enjoys traveling and photography.
Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging generative AI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Giuseppe Zappia is a Principal AI/ML Specialist Solutions Architect at AWS, focused on helping large enterprises design and deploy ML solutions on AWS. He has over 20 years of experience as a full stack software engineer, and has spent the past 5 years at AWS focused on the field of machine learning.
Multi-modal large language models (MLLMs) have enabled numerous advances in understanding and reasoning in domains like vision, but we have not yet seen this broad success for time-series. Although prior works on time-series MLLMs have shown promising performance in time-series forecasting, very few works show how an LLM could be used for time-series reasoning in natural language. We propose a novel multi-modal time-series LLM approach that learns generalizable information across various domains with powerful zero-shot performance. First, we train a lightweight time-series encoder on top of an… (Apple Machine Learning Research)
*Equal Contributors
Data from wearable sensors (e.g., heart rate, step count) can be used to model mood patterns. We characterize feature representations and modeling strategies with multi-modal discrete time series data for mood pattern classification with a large dataset with naturalistic missingness (n=116,819 participants) using 12 wearable data streams, with a focus on capturing periodic trends in data. Considering both performance and robustness, periodicity-based aggregate feature representations with gradient boosting models outperformed other representations and architectures… (Apple Machine Learning Research)
Motivated by the phenomenon of strategic agents gaming a recommendation system to maximize the number of times they are recommended to users, we study a strategic variant of the linear contextual bandit problem, where the arms strategically misreport privately observed contexts to the learner. We treat the algorithm design problem as one of mechanism design under uncertainty and propose the Optimistic Grim Trigger Mechanism (OptGTM) that minimizes regret while simultaneously incentivizing the agents to be approximately truthful. We show that… (Apple Machine Learning Research)
Single-cell genomics has significantly advanced our understanding of cellular behavior, catalyzing innovations in treatments and precision medicine. However, single-cell sequencing technologies are inherently destructive and can only measure a limited array of data modalities simultaneously. This limitation underscores the need for new methods capable of realigning cells. Optimal transport (OT) has emerged as a potent solution, but traditional discrete solvers are hampered by scalability, privacy, and out-of-sample estimation issues. These challenges have spurred the development of neural… (Apple Machine Learning Research)
Given a source and a target probability measure supported on $\mathbb{R}^d$, the Monge problem aims for the most efficient way to map one distribution to the other. This efficiency is quantified by defining a cost function between source and target data. Such a cost is often set by default in the machine learning literature to the squared-Euclidean distance, $\ell_2^2(x,y)=\tfrac{1}{2}\|x-y\|_2^2$. The benefits of using elastic costs, defined through a regularizer $\tau$ as $c(x,y)=\ell_2^2(x,y)+\tau(x-y)$, was… (Apple Machine Learning Research)
Chronos-Bolt is the newest addition to AutoGluon-TimeSeries, delivering accurate zero-shot forecasting up to 250 times faster than the original Chronos models [1].
Time series forecasting plays a vital role in guiding key business decisions across industries such as retail, energy, finance, and healthcare. Traditionally, forecasting has relied on statistical models [2] like ETS and ARIMA, which remain strong baselines, particularly when training data is limited. Over the past decade, advancements in deep learning have spurred a shift toward so-called global models such as DeepAR [3] and PatchTST [4]. These approaches train a single deep learning model across multiple time series in a dataset—for example, sales across a broad e-commerce catalog or observability metrics for thousands of customers.
Foundation models (FMs) such as Chronos [1] have taken the idea of training a single model across multiple time series a significant step further. These models are pretrained on a vast corpus of real and synthetic time series data, covering diverse domains, frequencies, and history lengths. As a result, they enable zero-shot forecasting—delivering accurate predictions on unseen time series datasets. This lowers the entry barrier to forecasting and greatly simplifies forecasting pipelines by providing accurate forecasts without the need for training. Chronos models have been downloaded over 120 million times from Hugging Face and are available for Amazon SageMaker customers through AutoGluon-TimeSeries and Amazon SageMaker JumpStart.
In this post, we introduce Chronos-Bolt, our latest FM for forecasting that has been integrated into AutoGluon-TimeSeries.
Introducing Chronos-Bolt
Chronos-Bolt is based on the T5 encoder-decoder architecture [5] and has been trained on nearly 100 billion time series observations. It chunks the historical time series context into patches of multiple observations, which are then input into the encoder. The decoder then uses these representations to directly generate quantile forecasts across multiple future steps—a method known as direct multi-step forecasting. This differs from the original Chronos models that rely on autoregressive decoding. The chunking of time series and direct multi-step forecasting makes Chronos-Bolt up to 250 times faster and 20 times more memory-efficient than the original Chronos models.
The following plot compares the inference time of Chronos-Bolt against the original Chronos models for forecasting 1024 time series with a context length of 512 observations and a prediction horizon of 64 steps.
Chronos-Bolt models are not only significantly faster, but also more accurate than the original Chronos models. The following plot reports the probabilistic and point forecasting performance of Chronos-Bolt in terms of the Weighted Quantile Loss (WQL) and the Mean Absolute Scaled Error (MASE), respectively, aggregated over 27 datasets (see [1] for dataset details). Remarkably, despite having no prior exposure to these datasets during training, the zero-shot Chronos-Bolt models outperform commonly used statistical models and deep learning models that have been trained on these datasets (highlighted by *). Furthermore, they also perform better than other FMs, denoted by a +, which indicates that these models were pretrained on certain datasets in our benchmark and are not entirely zero-shot. Notably, Chronos-Bolt (Base) also surpasses the original Chronos (Large) model in terms of the forecasting accuracy while being over 600 times faster.
Chronos-Bolt models are now available on Hugging Face in four sizes—Tiny (9M), Mini (21M), Small (48M), and Base (205M)—and can also be used on the CPU.
Solution overview
In this post, we showcase how to use Chronos-Bolt models using the familiar interface of AutoGluon-TimeSeries. AutoGluon-TimeSeries enables SageMaker customers to build and deploy models for time series forecasting, including FMs such as Chronos-Bolt and other global models, and effortlessly ensemble them with statistical models to maximize accuracy.
Perform zero-shot forecasting with Chronos-Bolt
To get started, you need to install AutoGluon v1.2 by running the following command in an Amazon SageMaker Studio notebook or in the terminal:
pip install autogluon.timeseries~=1.2.0
AutoGluon-TimeSeries uses the TimeSeriesDataFrame to work with time series datasets. The TimeSeriesDataFrame expects data in the long dataframe format with at least three columns: an ID column denoting the IDs of individual time series in the dataset, a timestamp column, and a target column that contains the raw time series values. The timestamps must be uniformly spaced, with missing observations denoted by NaN; Chronos-Bolt will handle them appropriately. The following snippet loads the Australian Electricity dataset [6], which contains electricity demand data at 30-minute intervals for five Australian states, into a TimeSeriesDataFrame:
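The original snippet isn't reproduced here; the following is a minimal sketch. The file path is a placeholder, the long-format CSV is assumed to have item_id, timestamp, and target columns, and the last day of each series is held out as test data for later evaluation:
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

# Load the long-format data (path is a placeholder) into a TimeSeriesDataFrame;
# the target column is assumed to be named "target" (the TimeSeriesPredictor default)
df = pd.read_csv("<path-to>/australian_electricity_subset.csv")
data = TimeSeriesDataFrame.from_data_frame(
    df, id_column="item_id", timestamp_column="timestamp"
)
# Hold out the last 48 half-hourly steps (1 day) of each series for evaluation
train_data, test_data = data.train_test_split(prediction_length=48)

# Zero-shot forecasting with the Chronos-Bolt (Base) preset
predictor = TimeSeriesPredictor(prediction_length=48).fit(
    train_data,
    presets="bolt_base",
)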
We have specified that the TimeSeriesPredictor should produce forecasts for the next 48 steps, or 1 day in this case. AutoGluon-TimeSeries offers various presets that can be used when fitting the predictor. The bolt_base preset, used in this example, employs the Base (205M) variant of Chronos-Bolt for zero-shot inference. Because no model fitting is required for zero-shot inference, the call to fit() returns almost instantaneously. The predictor is now ready to generate zero-shot forecasts, which can be done through the predict method:
predictions = predictor.predict(train_data)
AutoGluon-TimeSeries generates both point and probabilistic (quantile) forecasts for the target value. The probabilistic forecast captures the uncertainty of the target value, which is essential for many planning tasks.
We can also visualize the predictions and compare them against the ground truth target value over the forecast horizon:
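The plotting code isn't shown here; a minimal sketch using the predictor's built-in plotting helper follows (argument names reflect assumptions about the installed AutoGluon version):
import matplotlib.pyplot as plt

# Plot forecasts against the held-out ground truth contained in test_data
predictor.plot(
    test_data,
    predictions,
    quantile_levels=[0.1, 0.9],   # 80% prediction interval
    max_history_length=200,
)
plt.show()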
Chronos-Bolt generates an accurate zero-shot forecast, as shown in the following plot illustrating point forecasts and the 80% prediction intervals.
Fine-tune Chronos-Bolt with AutoGluon
So far, we have used Chronos-Bolt in inference-only mode for zero-shot forecasting. However, AutoGluon-TimeSeries also allows you to fine-tune Chronos-Bolt on your specific datasets. We recommend using a GPU instance such as g5.2xlarge for fine-tuning. The following snippet specifies two settings for the Chronos-Bolt (Small, 48M) model: zero-shot and fine-tuned. AutoGluon-TimeSeries will perform a lightweight fine-tuning of the pretrained model on the provided training data. We add name suffixes to identify the zero-shot and fine-tuned versions of the model.
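The configuration isn't reproduced here; a minimal sketch follows, assuming the bolt_small model path and the Chronos hyperparameter keys of AutoGluon v1.2:
predictor = TimeSeriesPredictor(prediction_length=48, eval_metric="MASE").fit(
    train_data,
    hyperparameters={
        "Chronos": [
            # Zero-shot: use the pretrained Chronos-Bolt (Small) weights as-is
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            # Fine-tuned: lightly fine-tune the same model on train_data
            {"model_path": "bolt_small", "fine_tune": True,
             "ag_args": {"name_suffix": "FineTuned"}},
        ]
    },
    enable_ensemble=False,
    time_limit=600,  # fit for at most 10 minutes
)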
The predictor will be fitted for at most 10 minutes, as specified by the time_limit. After fitting, we can evaluate the two model variants on the test data and generate a leaderboard:
predictor.leaderboard(test_data)
Fine-tuning resulted in a significantly improved forecast accuracy, as shown by the test MASE scores. All AutoGluon-TimeSeries models report scores in a “higher is better” format, meaning that most forecasting error metrics like MASE are multiplied by -1 when reported.
Augment Chronos-Bolt with exogenous information
Chronos-Bolt is a univariate model, meaning it relies solely on the historical data of the target time series for making predictions. However, in real-world scenarios, additional exogenous information related to the target series (such as holidays or promotions) is often available. Using this information when making predictions can improve forecast accuracy. AutoGluon-TimeSeries now features covariate regressors, which can be combined with univariate models like Chronos-Bolt to incorporate exogenous information. A covariate regressor in AutoGluon-TimeSeries is a tabular regression model that is fit on the known covariates and static features to predict the target column at each time step. The predictions of the covariate regressor are subtracted from the target column, and the univariate model then forecasts the residuals.
We use a grocery sales dataset to demonstrate how Chronos-Bolt can be combined with a covariate regressor. This dataset includes three known covariates: scaled_price, promotion_email, and promotion_homepage, and the task is to forecast the unit_sales:
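A minimal loading sketch (the path is a placeholder; the long-format data is assumed to include the three covariate columns and unit_sales):
import pandas as pd
from autogluon.timeseries import TimeSeriesDataFrame

df = pd.read_csv("<path-to>/grocery_sales.csv")
data = TimeSeriesDataFrame.from_data_frame(
    df, id_column="item_id", timestamp_column="timestamp"
)
# Hold out the final 7 weeks of each series for evaluation
train_data, test_data = data.train_test_split(prediction_length=7)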
The following code fits a TimeSeriesPredictor to forecast unit_sales for the next 7 weeks. We have specified the target column we are interested in forecasting and the names of known covariates while constructing the TimeSeriesPredictor. Two configurations are defined for Chronos-Bolt: a zero-shot setting, which uses only the historical context of unit_sales without considering the known covariates, and a covariate regressor setting, which employs a CatBoost model as the covariate_regressor. We also use the target_scaler, which makes sure the time series have a comparable scale before training and typically results in better accuracy.
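A minimal sketch of this configuration follows; the covariate_regressor and target_scaler values ("CAT" for CatBoost, "standard" scaling) reflect AutoGluon naming conventions and are assumptions about the exact setup:
predictor = TimeSeriesPredictor(
    prediction_length=7,
    eval_metric="MASE",
    target="unit_sales",
    known_covariates_names=["scaled_price", "promotion_email", "promotion_homepage"],
).fit(
    train_data,
    hyperparameters={
        "Chronos": [
            # Zero-shot: only the history of unit_sales is used
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            # Covariate regressor: a CatBoost model consumes the known covariates,
            # and Chronos-Bolt forecasts the residuals of the scaled target
            {
                "model_path": "bolt_small",
                "covariate_regressor": "CAT",
                "target_scaler": "standard",
                "ag_args": {"name_suffix": "WithRegressor"},
            },
        ]
    },
    enable_ensemble=False,
    time_limit=600,
)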
After the predictor has been fit, we can evaluate it on the test dataset and generate the leaderboard. Using the covariate regressor with Chronos-Bolt improves over its univariate zero-shot performance considerably.
The covariates might not always be useful—for some datasets, the zero-shot model might achieve better accuracy. Therefore, it’s important to try multiple models and select the one that achieves the best accuracy on held-out data.
Conclusion
Chronos-Bolt models empower practitioners to generate high-quality forecasts rapidly in a zero-shot manner. AutoGluon-TimeSeries enhances this capability by enabling users to fine-tune Chronos-Bolt models effortlessly, integrate them with covariate regressors, and ensemble them with a diverse range of forecasting models. For advanced users, it provides a comprehensive set of features to customize forecasting models beyond what was demonstrated in this post. AutoGluon predictors can be seamlessly deployed to SageMaker using AutoGluon-Cloud and the official Deep Learning Containers.
To learn more about using AutoGluon-TimeSeries to build accurate and robust forecasting models, explore our tutorials. Stay updated by following AutoGluon on X (formerly Twitter) and starring us on GitHub!
References
[1] Ansari, Abdul Fatir, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, et al. “Chronos: Learning the language of time series.” Transactions on Machine Learning Research (2024).
[2] Hyndman, R. J., and G. Athanasopoulos. “Forecasting: principles and practice, 3rd Ed.” O Texts (2018).
[3] Salinas, David, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. “DeepAR: Probabilistic forecasting with autoregressive recurrent networks.” International Journal of Forecasting 36, no. 3 (2020): 1181-1191.
[4] Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: long-term forecasting with transformers.” In The Eleventh International Conference on Learning Representations (2023).
[5] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21, no. 140 (2020): 1-67.
[6] Godahewa, Rakshitha, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. “Monash time series forecasting archive.” In NeurIPS Track on Datasets and Benchmarks (2021).
About the Authors
Abdul Fatir Ansari is a Senior Applied Scientist at Amazon Web Services, specializing in machine learning and forecasting, with a focus on foundation models for structured data, such as time series. He received his PhD from the National University of Singapore, where his research centered on deep generative models for images and time series.
Caner Turkmen is a Senior Applied Scientist at Amazon Web Services, where he works on research problems at the intersection of machine learning and forecasting. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications sectors. He holds a PhD in Computer Engineering from Bogazici University in Istanbul.
Oleksandr Shchur is a Senior Applied Scientist at Amazon Web Services, where he works on time series forecasting in AutoGluon. Before joining AWS, he completed a PhD in Machine Learning at the Technical University of Munich, Germany, doing research on probabilistic models for event data. His research interests include machine learning for temporal data and generative modeling.
Lorenzo Stella is a Senior Applied Scientist at Amazon Web Services, working on machine learning, forecasting, and generative AI for analytics and decision-making. He holds a PhD in Computer Science and Electrical Engineering from IMT Lucca (Italy) and KU Leuven (Belgium), where his research focused on numerical optimization algorithms for machine learning and optimal control applications.
Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and going through multiple policy documents containing standard operating procedures (SOPs) relevant to the query. This back-and-forth communication process often takes from hours to days, primarily because analysts, especially new hires, don’t have immediate access to the necessary information. They spend hours consulting SMEs and reviewing extensive policy documents.
To address this challenge, Amazon Finance Automation developed a large language model (LLM)-based question-answer chat assistant on Amazon Bedrock. This solution empowers analysts to rapidly retrieve answers to customer queries, generating prompt responses within the same communication thread. As a result, it drastically reduces the time required to address customer queries.
In this post, we share how Amazon Finance Automation built this generative AI Q&A chat assistant using Amazon Bedrock.
Solution overview
The solution is based on a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock, as shown in the following diagram. When a user submits a query, RAG works by first retrieving relevant documents from a knowledge base, then generating a response with the LLM from the retrieved documents.
The solution consists of the following key components:
Knowledge base – We used Amazon OpenSearch Service as the vector store for embedding documents. For performance evaluation, we processed and indexed multiple Amazon finance policy documents into the knowledge base. Alternatively, Amazon Bedrock Knowledge Bases provides fully managed support for end-to-end RAG workflows. We’re planning to migrate to Amazon Bedrock Knowledge Bases to eliminate cluster management and add extensibility to our pipeline.
Embedding model – At the time of writing, we’re using the Amazon Titan Multimodal Embeddings G1 model on Amazon Bedrock. The model is pre-trained on large and unique datasets and corpora from Amazon and provides accuracy that is higher than or comparable to other embedding models on the market based on our comparative analysis.
Generator model – We used a foundation model (FM) provided by Amazon Bedrock for its balanced ability to deliver highly accurate answers quickly.
Diversity ranker – It’s responsible for rearranging the results obtained from the vector index to avoid skewness or bias towards any specific document or section.
Lost in the middle ranker – It’s responsible for efficiently distributing the most relevant results towards the top and bottom of the prompt, maximizing the impact of the prompt’s content (see the sketch after this list).
Guardrails – We used Amazon Bedrock Guardrails to detect personal identifiable information (PII) and safeguard against prompt injection attacks.
Validation engine – Removes PII from the response and checks whether the generated answer aligns with the retrieved context. If not, it returns a hardcoded “I don’t know” response to prevent hallucinations.
Chat assistant UI – We developed the UI using Streamlit, an open source Python library for web-based application development on machine learning (ML) use cases.
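The lost in the middle ranker can be illustrated with a short, hedged sketch; this is not the production implementation, just one common way to interleave ranked chunks so the strongest ones sit at the edges of the prompt:
def lost_in_the_middle_rerank(chunks_by_relevance):
    """Reorder chunks (given most relevant first) so the best chunks land at the
    beginning and end of the prompt and the weakest in the middle.
    Illustrative sketch only, not the production ranker."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate placement: rank 1 -> front, rank 2 -> back, rank 3 -> front, ...
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

# Example: relevance ranks 1..6 are ordered as [1, 3, 5, 6, 4, 2]
print(lost_in_the_middle_rerank([1, 2, 3, 4, 5, 6]))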
Evaluate RAG performance
The accuracy of the chat assistant is the most critical performance metric to Amazon Finance Operations. After we built the first version of the chat assistant, we measured the bot response accuracy by submitting questions to the chat assistant. The SMEs manually evaluated the RAG responses one by one, and found only 49% of the responses were correct. This was far below the expectation, and the solution needed improvement.
However, manually evaluating the RAG isn’t sustainable—it requires hours of effort from finance operations and engineering teams. Therefore, we adopted the following automated performance evaluation approach:
Prepare testing data – We constructed a test dataset with three data fields:
question – This consists of 100 questions from policy documents where answers reside in a variety of sources, such as policy documents and engineering SOPs, covering complex text formats such as embedded tables and images.
expected_answer – These are manually labeled answers by Amazon Finance Operations SMEs.
generated_answer – This is the answer generated by the bot.
NLP scores – We used a test dataset to calculate the ROUGE score and METEOR score. Because these scores merely use word-matching algorithms and ignore the semantic meaning of the text, they aren’t aligned with the SME scores. Based on our analysis, the variance was approximately 30% compared to human evaluations.
LLM-based score – We used an FM offered by Amazon Bedrock to score the RAG performance. We designed specialized LLM prompts to evaluate the RAG performance by comparing the generated answer with the expected answer. We generated a set of LLM-based metrics, including accuracy, acceptability, and factualness, and the citation representing the evaluation reasoning. The variance of this approach was approximately 5% compared to human analysis, so we decided to stick to this approach of evaluation. If your RAG system is built on Amazon Bedrock Knowledge Bases, you can use the new RAG evaluation for Amazon Bedrock Knowledge Bases tool to evaluate the retrieve or the retrieve and generate functionality with an LLM as a judge. It provides retrieval evaluation metrics such as context relevance and context coverage. It also provides retrieve and generate evaluation metrics such as correctness, completeness, and helpfulness, as well as responsible AI metrics such as harmfulness and answer refusal.
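As an illustration of the LLM-based scoring approach, the following hedged sketch uses the Amazon Bedrock Converse API to grade a generated answer against the SME-labeled expected answer; the model ID, prompt wording, and output format are illustrative assumptions, not the prompts we used:
import boto3

bedrock = boto3.client("bedrock-runtime")

def llm_judge_score(question, expected_answer, generated_answer,
                    model_id="anthropic.claude-3-sonnet-20240229-v1:0"):
    """Ask a Bedrock FM to grade a generated answer against the expected answer.
    Prompt and model ID are illustrative; adapt them to your evaluation rubric."""
    prompt = (
        "You are grading a RAG system.\n"
        f"Question: {question}\n"
        f"Expected answer: {expected_answer}\n"
        f"Generated answer: {generated_answer}\n"
        "Rate the generated answer's accuracy from 1 (wrong) to 5 (fully correct) "
        "and briefly explain your reasoning. Respond as: <score>|<reasoning>"
    )
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 256, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]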
Improve the accuracy of RAG pipeline
Based on the aforementioned evaluation techniques, we focused on the following areas in the RAG pipeline to improve the overall accuracy.
Add document semantic chunking to improve accuracy from 49% to 64%
Upon diagnosing incorrect responses in the RAG pipeline, we identified 14% of the inaccuracy was due to incomplete contexts sent to the LLM. These incomplete contexts were originally generated by the segmentation algorithm based on a fixed chunk size (for example, 512 tokens or 384 words), which doesn’t consider document boundaries such as sections and paragraphs.
To address this problem, we designed a new document segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service, using the following steps (a code sketch follows the list):
Convert the unstructured text to a structured HTML document using QUILL Editor. In this way, the HTML document preserves the document formatting that divides the contents into logical chunks.
Identify the logical structure of the HTML document and insert divider strings based on HTML tags for document segmentation.
Use an embedding model to generate semantic vector representation of document chunks.
Assign tags based on important keywords in the section to identify the logical boundaries between sections.
Insert the embedding vectors of the segmented documents to the OpenSearch Service vector store.
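The following hedged sketch illustrates the overall flow of these steps: splitting the QUILL-generated HTML at heading tags, embedding each chunk with a Titan embeddings model on Amazon Bedrock, and indexing the result into OpenSearch Service. The index name, endpoint, policy_html input, and splitting rules are illustrative assumptions, not our production code:
import json
import boto3
from bs4 import BeautifulSoup
from opensearchpy import OpenSearch

bedrock = boto3.client("bedrock-runtime")
# Authentication options are omitted for brevity; endpoint is a placeholder
opensearch = OpenSearch(hosts=[{"host": "<your-opensearch-endpoint>", "port": 443}], use_ssl=True)

def split_html_into_sections(html):
    """Split the QUILL-generated HTML at heading tags so each chunk keeps
    its section title and body together (illustrative logic only)."""
    soup = BeautifulSoup(html, "html.parser")
    sections, title, body = [], "untitled", []
    for el in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        if el.name in ("h1", "h2", "h3"):
            if body:
                sections.append({"title": title, "content": " ".join(body)})
            title, body = el.get_text(strip=True), []
        else:
            body.append(el.get_text(strip=True))
    if body:
        sections.append({"title": title, "content": " ".join(body)})
    return sections

def embed(text, model_id="amazon.titan-embed-text-v1"):
    """Generate an embedding vector with a Titan embeddings model on Bedrock."""
    resp = bedrock.invoke_model(modelId=model_id, body=json.dumps({"inputText": text}))
    return json.loads(resp["body"].read())["embedding"]

# policy_html is the structured HTML produced by QUILL Editor (assumed input);
# index each section into an OpenSearch k-NN index (index name assumed)
for section in split_html_into_sections(policy_html):
    opensearch.index(
        index="finance-policy-chunks",
        body={
            "title": section["title"],
            "content": section["content"],
            "embedding": embed(f'{section["title"]}\n{section["content"]}'),
        },
    )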
The following diagram illustrates the document retriever splitting workflow.
When processing the document, we follow specific rules:
Extract the start and end of a section of a document precisely
Extract the titles of the section and pair them with section content accurately
Assign tags based on important keywords from the sections
Persist the markdown information from the policy while indexing
Exclude images and tables from the processing in the initial release
With this approach, we can improve RAG accuracy from 49% to 64%.
Use prompt engineering to improve accuracy from 64% to 76%
Prompt engineering is a crucial technique to improve the performance of LLMs. We learned from our project that there is no one-size-fits-all prompt engineering approach; it’s a best practice to design task-specific prompts. We adopted the following approach to enhance the effectiveness of the prompt-to-RAG generator:
In approximately 14% of cases, we identified that the LLM generated responses even when no relevant context was retrieved from the RAG, leading to hallucinations. In this case, we engineered prompts and asked the LLM not to generate any response when there is no relevant context provided.
In approximately 13% of cases, we received user feedback that the response from the LLM was too brief, lacking complete context. We engineered prompts that encouraged the LLM to be more comprehensive.
We engineered prompts to enable the capability to generate both concise and detailed answers for the users.
We used LLM prompts for generation of citations to properly attribute our source used to generate the answer. In the UI, the citations are listed with hyperlinks following the LLM response, and users can use these citations to validate the LLM performance.
We improved our prompts to introduce better chain-of-thought (CoT) reasoning:
Asking the LLM to produce its reasoning before the final answer contributes to improved performance and aligns responses with humanlike coherence. Because of this interplay between prompt quality, reasoning requests, and the model’s inherent capabilities, we could optimize performance.
Encouraging CoT reasoning prompts the LLM to consider the context of the conversation, making it less prone to hallucinations.
By building upon the established context, the model is more likely to generate responses that logically follow the conversation’s narrative, reducing the chances of providing inaccurate or hallucinated answers.
We added examples of previously answered questions to establish a pattern for the LLM, encouraging CoT.
We then used meta-prompting using an FM offered by Amazon Bedrock to craft a prompt that caters to the aforementioned requirements.
The following example is a prompt for generating a quick summary and a detailed answer:
You are an AI assistant that helps answer questions based on provided text context. I will give you some passages from a document, followed by a question. Your task is to provide the best possible answer to the question using only the information from the given context. Here is the context:
<context>
{}
</context>
And here is the question:
<question>
{}
</question>
Think carefully about how the context can be used to answer the question.
<thinkingprocess>
- Carefully read the provided context and analyze what information it contains
- Identify the key pieces of information in the context that are relevant to answering the question
- Determine if the context provides enough information to answer the question satisfactorily
- If not, simply state "I don't know, I don't have the complete context needed to answer this
question"
- If so, synthesize the relevant information into a concise summary answer
- Expand the summary into a more detailed answer, utilizing Markdown formatting to make it clear and
readable
</thinkingprocess>
If you don't have enough context to answer the question, provide your response in the following
format:
I don't know, I don't have the complete context needed to answer this question.
If you do have enough context to answer the question, provide your response in the following format:
#### Quick Summary:
Your concise 1-2 sentence summary goes here.
#### Detailed Answer:
Your expanded answer goes here, using Markdown formatting like **bold**, *italics*, and Bullet points to improve readability.
Remember, the ultimate goal is to provide an informative, clear and readable answer to the question
using only the context provided. Let's begin!
The following example is a prompt for generating citations based on the generated answers and retrieved contexts:
You are an AI assistant that specializes in attributing generated answers to specific sections within provided documents. Your task is to determine which sections from the given documents were most likely used to generate the provided answer. If you cannot find exact matches, suggest sections that are closely related to the content of the answer.
Here is the generated answer to analyze:
<generated_answer>
{}
</generated_answer>
And here are the sections from various documents to consider:
<sections>
{}
</sections>
Please carefully read through the generated answer and the provided sections. In the scratchpad space below, brainstorm and reason about which sections are most relevant to the answer:
<scratchpad>
</scratchpad>
After identifying the relevant sections, provide your output in the following format:
**Document Name:** <document name> \n
**Document Link:** <document link> \n
**Relevant Sections:** \n
- <section name 1>
- <section name 2>
- <section name 3>
Do not include any additional explanations or reasoning in your final output. Simply list the document name, link, and relevant section names in the specified format above.
Assistant:
By implementing the prompt engineering approaches, we improved RAG accuracy from 64% to 76%.
Use an Amazon Titan Text Embeddings model to improve accuracy from 76% to 86%
After implementing the document segmentation approach, we still saw lower relevance scores for retrieved contexts (55–65%), and the incorrect contexts were in the top ranks for more than 50% of cases. This indicated that there was still room for improvement.
We experimented with multiple embedding models, including first-party and third-party models. For example, contextual embedding models such as bge-base-en-v1.5 performed better for context retrieval compared to other top embedding models such as all-mpnet-base-v2. We found that using the Amazon Titan Embeddings G1 model increased the relevance of retrieved contexts from approximately 55–65% to 75–80%, and 80% of the retrieved contexts have higher ranks than before.
Conclusion
We achieved remarkable progress in developing a generative AI Q&A chat assistant for Amazon Finance Automation by using a RAG pipeline and LLMs on Amazon Bedrock. Through continual evaluation and iterative improvement, we have addressed challenges of hallucinations, document ingestion issues, and context retrieval inaccuracies. Our results have shown a significant improvement in RAG accuracy from 49% to 86%.
You can follow our journey and adopt a similar solution to address challenges in your RAG application and improve overall performance.
About the Authors
Soheb Moin is a Software Development Engineer at Amazon, who led the development of the Generative AI chatbot. He specializes in leveraging generative AI and Big Data analytics to design, develop, and implement secure, scalable, innovative solutions that empower Finance Operations with better productivity and automation. Outside of work, Soheb enjoys traveling, playing badminton, and engaging in chess tournaments.
Nitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 19 years of experience building business-critical, scalable, high-performance software. Nitin leads data services, communication, work management and several Generative AI initiatives within Finance. In his spare time, he enjoys listening to music and reading.
Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.
Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, for providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling and learning strategic challenges of chess.
Mohak Chugh is a Software Development Engineer at Amazon, with over 3 years of experience in developing products leveraging Generative AI and Big Data on AWS. His work encompasses a range of areas, including RAG based GenAI chatbots and high performance data reconciliation. Beyond work, he finds joy in playing the piano and performing with his music band.
Parth Bavishi is a Senior Product Manager at Amazon with over 10 years of experience in building impactful products. He currently leads the development of generative AI capabilities for Amazon’s Finance Automation, driving innovation and efficiency within the organization. A dedicated mentor, Parth enjoys sharing his product management knowledge and finds satisfaction in activities like volleyball and reading.
Carnegie Mellon University is proud to present 194 papers at the 38th conference on Neural Information Processing Systems (NeurIPS 2024), held from December 10-15 at the Vancouver Convention Center. Here is a quick overview of the areas our researchers are working on:
Here are some of our top collaborator institutions:
Authors: Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica
This paper explores an alternative approach to generating high-fidelity, customized images at reduced costs using fine-tuned adapters instead of simply scaling base models with additional data or parameters. Over time, the open-source community has created a large collection of more than 100,000 adapters—small modules that fine-tune base models for specific tasks. However, many of these adapters are highly customized and lack clear descriptions, making them challenging to use effectively. To address this, the paper introduces Stylus, a system designed to match prompts with relevant adapters and automatically compose them for better image generation. Building on recent research showing the benefits of combining multiple adapters, Stylus uses a three-stage process: summarizing adapters with improved descriptions and embeddings, retrieving relevant adapters, and composing adapters based on prompt keywords to ensure a strong match. The authors also present StylusDocs, a curated dataset of 75,000 adapters with pre-computed embeddings, for evaluation. Testing Stylus on popular Stable Diffusion checkpoints shows that it achieves better CLIP/FID Pareto efficiency and is twice as preferred by human and multimodal evaluators compared to the base model.
This work examines the problem of Federated Q-learning, where multiple agents collaboratively learn the optimal Q-function for an unknown infinite-horizon Markov Decision Process with finite state and action spaces. The focus is on understanding the trade-off between sample complexity (the number of data samples needed for learning) and communication complexity (the amount of data exchanged between agents) for intermittent communication algorithms, a commonly used approach in federated settings.
The authors first establish a fundamental limitation: any Federated Q-learning algorithm that achieves linear speedup in sample complexity relative to the number of agents must incur a communication cost of at least Ω(1/(1−γ)), where γ is the discount factor. They then introduce a new algorithm, Fed-DVR-Q, which is the first to achieve both optimal sample complexity and communication complexity simultaneously. Together, these results provide a comprehensive understanding of the trade-offs between sample and communication efficiency in Federated Q-learning.
Authors: Adam Stooke, Rohit Prabhavalkar, Khe Sim, Pedro Moreno Mengibar
The paper introduces a new transformer-based approach to automatic speech recognition (ASR) that simplifies the alignment process between audio input and text output. Unlike traditional models, the encoder itself aligns audio information internally, reducing the complexity of decoding. The proposed “Aligner-Encoder” model combines efficient training techniques and a lightweight decoder, resulting in significantly faster performance while maintaining competitive accuracy. Notably, the alignment process is evident in the self-attention weights of the model, showcasing its ability to handle the task efficiently.
This work focuses on streaming algorithms for approximating the top eigenvector of a matrix when its rows are presented in a random order. The authors introduce a new algorithm that works efficiently when there is a sufficient gap between the largest and second-largest eigenvalues of the matrix. Their approach uses a small amount of memory, depending on the number of “heavy rows” (rows with large norms), and produces highly accurate results. They also show that using this heavy-row-based parameterization is necessary for achieving high accuracy and improve on prior methods by reducing the gap requirement for random-order streams, though their method assumes the rows are presented in a random order rather than any order.
Recent advancements in unsupervised visual representation learning have highlighted the Joint-Embedding Predictive Architecture (JEPA) as an effective method for extracting visual features from unlabeled images using masking strategies. However, JEPA faces two key challenges: its reliance on Exponential Moving Average (EMA) fails to prevent model collapse, and its predictions struggle to accurately capture the average representation of image patches. To address these issues, this work introduces C-JEPA, a new framework that combines JEPA with a variance-invariance-covariance regularization strategy called VICReg. This approach improves stability, prevents collapse, and ensures better learning of consistent representations. Experiments show that C-JEPA achieves faster convergence and higher performance on standard benchmarks when pre-trained on ImageNet-1K.
Authors: Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang
This work addresses the challenge of enabling humanoid robots to collaborate on tasks like moving large furniture, which require coordination between multiple robots. Existing methods struggle due to a lack of motion capture data for multi-humanoid collaboration and the inefficiency of training multiple agents together. To overcome this, the authors introduce Cooperative Human-Object Interaction (CooHOI), a framework that uses a two-phase learning approach: first, individual humanoids learn object interaction skills from human motion data, and then they learn to work together using multi-agent reinforcement learning. By focusing on shared object dynamics and decentralized execution, the robots achieve coordination through implicit communication. Unlike previous tracking-based methods, CooHOI is efficient, does not rely on multi-humanoid motion data, and can easily scale to more participants and diverse object types.
Authors: Weikang Wan, Ziyu Wang, Yufei Wang, Zackory Erickson, David Held
This paper presents DiffTORI, a framework that uses differentiable trajectory optimization as a policy representation for reinforcement and imitation learning. Trajectory optimization, a common tool in control, is parameterized by a cost and a dynamics function, and recent advances now allow gradients of the loss to be computed with respect to these parameters. This enables DiffTORI to learn cost and dynamics functions end-to-end, addressing the “objective mismatch” in previous model-based RL methods by aligning the dynamics model with task performance. Benchmarking on robotic manipulation tasks with high-dimensional sensory inputs, DiffTORI demonstrates superior performance over prior methods, including feedforward policies, energy-based models, and diffusion models, across a wide range of reinforcement and imitation learning tasks.
Video transformers are notoriously slow to train due to the large number of input tokens, many of which are repeated across frames. Existing methods to remove redundant tokens often introduce significant overhead or require dataset-specific tuning, limiting their practicality. This work introduces Run-Length Tokenization (RLT), a simple and efficient method inspired by run-length encoding, which identifies and removes repeated patches in video frames before inference. By replacing repeated patches with a single token and a positional encoding to reflect its duration, RLT reduces redundancy without requiring tuning or adding significant computational cost. It accelerates training by 30%, maintains baseline performance, and increases throughput by 35% with minimal accuracy loss, while reducing token counts by up to 80% on longer videos.
Authors: Gabriel Sarch, Lawrence Jang, Michael Tarr, William Cohen, Kenneth Marino, Katerina Fragkiadaki
This work introduces In-Context Abstraction Learning (ICAL), a method that enables large-scale language and vision-language models (LLMs and VLMs) to generate high-quality task examples from imperfect demonstrations. ICAL uses a vision-language model to analyze and improve inefficient task trajectories by abstracting key elements like causal relationships, object states, and temporal goals, with iterative refinement through human feedback. These improved examples, when used as prompts, enhance decision-making and reduce reliance on human input over time, making the system more efficient. ICAL outperforms state-of-the-art models in tasks like instruction following, web navigation, and action forecasting, demonstrating its ability to improve performance without heavy manual prompt engineering.
This work focuses on improving the reliability of driving perception systems under challenging and unexpected conditions, particularly with multi-LiDAR setups. Most existing datasets rely on single-LiDAR systems and are collected in ideal conditions, making them insufficient for real-world applications. To address this, the authors introduce Place3D, a comprehensive pipeline that optimizes LiDAR placement, generates data, and evaluates performance. Their approach includes three key contributions: a new metric called the Surrogate Metric of the Semantic Occupancy Grids (M-SOG) for assessing multi-LiDAR configurations, an optimization strategy to improve LiDAR placements based on M-SOG, and the creation of a 280,000-frame dataset capturing both clean and adverse conditions. Experiments show that their optimized placements lead to significant improvements in tasks like semantic segmentation and 3D object detection, even in challenging scenarios with harsh weather or sensor failures.
The paper explores how Large Language Models (LLMs), known for their impressive capabilities but high computational costs, can be made more efficient. It highlights that while activation sparsity—where only some model parameters are used during inference—naturally occurs, current methods fail to maximize its potential during training. The authors propose a novel training algorithm, Learn-To-be-Efficient (LTE), that encourages LLMs to activate fewer neurons, striking a balance between efficiency and performance. Their approach, applicable to models beyond traditional ReLU-based ones, demonstrates improved results across various tasks and reduces inference latency by 25% for LLaMA2-7B at 50% sparsity.
This work explores whether it is possible to understand or replicate a policymaker’s reasoning by analyzing their past decisions. The problem is framed as learning social welfare functions from the family of power mean functions. Two learning tasks are considered: one uses utility vectors of actions and their corresponding social welfare values, while the other uses pairwise comparisons of welfares for different utility vectors. The authors demonstrate that power mean functions can be learned efficiently, even when the social welfare data is noisy. They also propose practical algorithms for these tasks and evaluate their effectiveness.
Authors: Timothy Chu, Josh Alman, Gary L. Miller, Shyam Narayanan, Mark Sellke, Zhao Song
The authors introduce a linear-algebraic tool based on group representation theory to solve three important problems in machine learning. First, they investigate fast attention algorithms for large language models and prove that only low-degree polynomials can produce the low-rank matrices required for subquadratic attention, thereby showing that polynomial-based approximations are essential. Second, they extend the classification of positive definite kernels from Euclidean distances to Manhattan distances, offering a broader foundation for kernel methods. Finally, they classify all functions that transform Manhattan distances into Manhattan distances, generalizing earlier work on Euclidean metrics and introducing new results about stable-rank-preserving functions with potential applications in algorithm design.
This work examines the problem of learning mixtures of Gaussians while ensuring approximate differential privacy. The authors demonstrate that it is possible to learn a mixture of k arbitrary d-dimensional Gaussians with significantly fewer samples than previous methods, achieving optimal performance when the dimensionality d is much larger than the number of components k. For univariate Gaussians, they establish the first optimal bound, showing that the sample complexity scales linearly with k, improving upon earlier methods that required a quadratic dependence on k. Their approach leverages advanced techniques, including the inverse sensitivity mechanism, sample compression for distributions, and volume bounding methods, to achieve these results.
Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yu-hsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen
As the use of large language models (LLMs) increases, serving them quickly and efficiently has become a critical challenge. Speculative decoding offers a promising solution, but existing methods struggle to scale with larger workloads or adapt to different settings. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. By employing a dynamic programming algorithm, Sequoia optimizes the tree structure for speculated tokens, improving scalability. It also introduces a novel sampling and verification method that enhances robustness across various decoding temperatures. Sequoia achieves significant speedups, improving decoding speed on models like Llama2-7B, Llama2-13B, and Vicuna-33B by up to 4.04x, 3.73x, and 2.27x, respectively, and reducing per-token latency for Llama3-70B-Instruct on a single GPU by 9.5x compared to DeepSpeed-Zero-Inference.
Diffusion models have demonstrated impressive capabilities in generating high-quality images, audio, and videos, largely due to pre-training on large datasets that pair data with conditions, such as image-text or image-class pairs. However, even with careful filtering, these datasets often include corrupted pairs where the conditions do not accurately represent the data. This paper provides the first comprehensive study of how such corruption affects diffusion model training. By synthetically corrupting datasets like ImageNet-1K and CC3M, the authors show that slight corruption in pre-training data can surprisingly enhance image quality, diversity, and fidelity across various models. They also provide theoretical insights, demonstrating that slight condition corruption increases entropy and reduces the 2-Wasserstein distance to the ground truth distribution. Building on these findings, the authors propose a method called condition embedding perturbations, which improves diffusion model performance during both pre-training and downstream tasks, offering new insights into the training process.
Authors: Sanae Lotfi, Yilun Kuang, Marc Finzi, Brandon Amos, Micah Goldblum, Andrew Wilson
Large language models (LLMs) with billions of parameters are highly effective at predicting the next token in a sequence. While recent research has computed generalization bounds for these models using compression-based techniques, these bounds often fail to apply to billion-parameter models or rely on restrictive methods that produce low-quality text. Existing approaches also tie the tightness of bounds to the number of independent documents in the training set, ignoring the larger number of dependent tokens, which could offer better bounds. This work uses properties of martingales to derive generalization bounds that leverage the vast number of tokens in LLM training sets. By using more flexible compression techniques like Monarch matrices, Kronecker factorizations, and post-training quantization, the authors achieve meaningful generalization bounds for large-scale models, including LLaMA2-70B, marking the first successful bounds for practical, high-quality text-generating models.
Authors: Gwanghyun Kim, Alonso Martinez, Yu-chuan Su, Brendan Jou, Jose Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Se Young Chun, Krishna Somandepalli
Authors: Sean Mcleish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein
Authors: Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Charlie Chen, Micah Goldblum, C. Bayan Bruss, Christopher De Sa, Andrew Wilson
Miscellaneous Aspects Of Machine Learning (Supervised Learning)