Fast and accurate zero-shot forecasting with Chronos-Bolt and AutoGluon


Chronos-Bolt is the newest addition to AutoGluon-TimeSeries, delivering accurate zero-shot forecasting up to 250 times faster than the original Chronos models [1].

Time series forecasting plays a vital role in guiding key business decisions across industries such as retail, energy, finance, and healthcare. Traditionally, forecasting has relied on statistical models [2] like ETS and ARIMA, which remain strong baselines, particularly when training data is limited. Over the past decade, advancements in deep learning have spurred a shift toward so-called global models such as DeepAR [3] and PatchTST [4]. These approaches train a single deep learning model across multiple time series in a dataset—for example, sales across a broad e-commerce catalog or observability metrics for thousands of customers.

Foundation models (FMs) such as Chronos [1] have taken the idea of training a single model across multiple time series a significant step further. These models are pretrained on a vast corpus of real and synthetic time series data, covering diverse domains, frequencies, and history lengths. As a result, they enable zero-shot forecasting—delivering accurate predictions on unseen time series datasets. This lowers the entry barrier to forecasting and greatly simplifies forecasting pipelines by providing accurate forecasts without the need for training. Chronos models have been downloaded over 120 million times from Hugging Face and are available for Amazon SageMaker customers through AutoGluon-TimeSeries and Amazon SageMaker JumpStart.

In this post, we introduce Chronos-Bolt, our latest FM for forecasting that has been integrated into AutoGluon-TimeSeries.

Introducing Chronos-Bolt

Chronos-Bolt is based on the T5 encoder-decoder architecture [5] and has been trained on nearly 100 billion time series observations. It chunks the historical time series context into patches of multiple observations, which are then input into the encoder. The decoder then uses these representations to directly generate quantile forecasts across multiple future steps, a method known as direct multi-step forecasting. This differs from the original Chronos models, which rely on autoregressive decoding. Together, the chunking of time series and direct multi-step forecasting make Chronos-Bolt up to 250 times faster and 20 times more memory-efficient than the original Chronos models.
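
To make the patching idea concrete, the following minimal sketch splits a context window into fixed-size patches and counts decoder passes under the two decoding schemes. It is an illustration only: the patch length of 16 and the helper name to_patches are assumptions for this example, not the model's actual configuration or code.

```python
import math

def to_patches(context, patch_len):
    """Split a context into consecutive patches, left-padding with NaN if needed."""
    pad = (-len(context)) % patch_len
    padded = [math.nan] * pad + list(context)
    return [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]

context = [float(t) for t in range(512)]     # 512 observations, as in the benchmark setup
patches = to_patches(context, patch_len=16)  # patch_len=16 is an assumed value

# The encoder receives one input per patch instead of one per observation.
encoder_inputs_patched = len(patches)
encoder_inputs_unpatched = len(context)

# Autoregressive decoding needs one decoder pass per future step,
# whereas direct multi-step forecasting emits the whole horizon in one pass.
horizon = 64
decoder_passes_autoregressive = horizon
decoder_passes_direct = 1
```

With these assumed values, the encoder processes 32 patches rather than 512 individual observations, and the decoder runs once rather than 64 times.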

The following plot compares the inference time of Chronos-Bolt against the original Chronos models for forecasting 1024 time series with a context length of 512 observations and a prediction horizon of 64 steps.

Inference speed comparison between Chronos and Chronos-Bolt

Chronos-Bolt models are not only significantly faster, but also more accurate than the original Chronos models. The following plot reports the probabilistic and point forecasting performance of Chronos-Bolt in terms of the Weighted Quantile Loss (WQL) and the Mean Absolute Scaled Error (MASE), respectively, aggregated over 27 datasets (see [1] for dataset details). Remarkably, despite having no prior exposure to these datasets during training, the zero-shot Chronos-Bolt models outperform commonly used statistical models and deep learning models that have been trained on these datasets (highlighted by *). Furthermore, they also perform better than other FMs, denoted by a +, which indicates that these models were pretrained on certain datasets in our benchmark and are not entirely zero-shot. Notably, Chronos-Bolt (Base) also surpasses the original Chronos (Large) model in terms of the forecasting accuracy while being over 600 times faster.

Zero-shot benchmark for Chronos-Bolt
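
For readers unfamiliar with the two metrics, here is a minimal single-series sketch of how MASE and WQL can be computed. It assumes the common definitions (absolute error scaled by the in-sample seasonal-naive error, and pinball loss normalized by the sum of absolute target values, averaged over quantile levels). The function names are hypothetical, and the benchmark's exact aggregation across series and datasets may differ.

```python
def mase(y_true, y_pred, y_train, season=1):
    """Mean absolute error scaled by the in-sample seasonal-naive error."""
    mae = sum(abs(a - f) for a, f in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(y_train[t] - y_train[t - season]) for t in range(season, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale

def quantile_loss(y, f, q):
    """Pinball loss of forecast f at quantile level q."""
    return q * (y - f) if y >= f else (1 - q) * (f - y)

def wql(y_true, quantile_forecasts):
    """quantile_forecasts maps each quantile level to its list of forecasts."""
    denom = sum(abs(y) for y in y_true)
    per_level = [
        2 * sum(quantile_loss(y, f, q) for y, f in zip(y_true, forecasts)) / denom
        for q, forecasts in quantile_forecasts.items()
    ]
    return sum(per_level) / len(per_level)

# Toy values for illustration only.
point_score = mase(y_true=[5.0, 6.0], y_pred=[5.0, 7.0], y_train=[1.0, 2.0, 3.0, 4.0])
prob_score = wql(y_true=[10.0, 20.0], quantile_forecasts={0.5: [12.0, 18.0]})
```

Lower is better for both metrics. A MASE below 1 means the forecast beats the in-sample seasonal-naive baseline on average.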

Chronos-Bolt models are now available on Hugging Face in four sizes—Tiny (9M), Mini (21M), Small (48M), and Base (205M)—and can also be used on the CPU.

Solution overview

In this post, we show how to use Chronos-Bolt models through the familiar interface of AutoGluon-TimeSeries. AutoGluon-TimeSeries enables SageMaker customers to build and deploy models for time series forecasting, including FMs such as Chronos-Bolt and other global models, and effortlessly ensemble them with statistical models to maximize accuracy.

Perform zero-shot forecasting with Chronos-Bolt

To get started, you need to install AutoGluon v1.2 by running the following command in an Amazon SageMaker Studio notebook or in the terminal:

pip install autogluon.timeseries~=1.2.0

AutoGluon-TimeSeries uses the TimeSeriesDataFrame to work with time series datasets. The TimeSeriesDataFrame expects data in the long dataframe format with at least three columns: an ID column denoting the IDs of individual time series in the dataset, a timestamp column, and a target column that contains the raw time series values. The timestamps must be uniformly spaced, and missing observations should be denoted by NaN, which Chronos-Bolt handles appropriately. The following snippet loads the Australian Electricity dataset [6], which contains electricity demand data at 30-minute intervals for five Australian states, into a TimeSeriesDataFrame:

from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)
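
If your own data has gaps, it needs to be placed on a regular time grid before it matches the format described above. The following standard-library sketch shows one way to think about that step: it takes long-format rows (item ID, timestamp, target) and inserts NaN for every missing 30-minute step. The helper fill_gaps and the sample rows are made up for illustration and are not part of AutoGluon.

```python
import math
from datetime import datetime, timedelta

def fill_gaps(rows, step):
    """Return long-format rows on a regular grid per item, using NaN for missing steps."""
    by_item = {}
    for item_id, ts, value in rows:
        by_item.setdefault(item_id, {})[ts] = value
    filled = []
    for item_id, observed in by_item.items():
        ts, end = min(observed), max(observed)
        while ts <= end:
            filled.append((item_id, ts, observed.get(ts, math.nan)))
            ts += step
    return filled

# Hypothetical readings; the 00:30 observation of the first series is missing.
rows = [
    ("series_a", datetime(2024, 1, 1, 0, 0), 5.2),
    ("series_a", datetime(2024, 1, 1, 1, 0), 4.9),
    ("series_b", datetime(2024, 1, 1, 0, 0), 7.1),
]
regular = fill_gaps(rows, step=timedelta(minutes=30))
```

The result has one row per item and time step, so each series is uniformly spaced and the gap is an explicit NaN rather than a missing row.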

The next step involves fitting a TimeSeriesPredictor on this data:

predictor = TimeSeriesPredictor(prediction_length=48).fit(train_data, presets="bolt_base")

We have specified that the TimeSeriesPredictor should produce forecasts for the next 48 steps, or 1 day in this case. AutoGluon-TimeSeries offers various presets that can be used when fitting the predictor. The bolt_base preset, used in this example, employs the Base (205M) variant of Chronos-Bolt for zero-shot inference. Because no model fitting is required for zero-shot inference, the call to fit() returns almost instantaneously. The predictor is now ready to generate zero-shot forecasts, which can be done through the predict method:

predictions = predictor.predict(train_data)

AutoGluon-TimeSeries generates both point and probabilistic (quantile) forecasts for the target value. The probabilistic forecast captures the uncertainty of the target value, which is essential for many planning tasks.

We can also visualize the predictions and compare them against the ground truth target value over the forecast horizon:

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/australian_electricity_subset/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

predictor.plot(test_data, predictions, max_history_length=200, item_ids=["T000002"])

Chronos-Bolt generates an accurate zero-shot forecast, as shown in the following plot illustrating point forecasts and the 80% prediction intervals.

Forecasts Qualitative
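
A central 80% prediction interval spans the 0.1 and 0.9 quantile forecasts. One simple sanity check, sketched below with made-up numbers, is to measure how often the actual values fall inside that interval. The function name interval_coverage is hypothetical, and this check is an addition for illustration rather than part of the workflow above.

```python
def interval_coverage(actuals, lower, upper):
    """Fraction of actual values that fall inside [lower, upper]."""
    inside = sum(1 for y, lo, hi in zip(actuals, lower, upper) if lo <= y <= hi)
    return inside / len(actuals)

# Made-up numbers for illustration only.
actuals = [102.0, 98.0, 110.0, 95.0, 101.0]
q10 = [95.0, 94.0, 96.0, 96.0, 97.0]       # 0.1 quantile forecast
q90 = [108.0, 106.0, 107.0, 105.0, 109.0]  # 0.9 quantile forecast

coverage = interval_coverage(actuals, q10, q90)
```

For a well-calibrated forecast, the coverage over a sufficiently long evaluation window should be close to 0.8.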

Fine-tune Chronos-Bolt with AutoGluon

So far, we have used Chronos-Bolt in inference-only mode for zero-shot forecasting. However, AutoGluon-TimeSeries also allows you to fine-tune Chronos-Bolt on your specific datasets. We recommend using a GPU instance such as g5.2xlarge for fine-tuning. The following snippet specifies two settings for the Chronos-Bolt (Small, 48M) model: zero-shot and fine-tuned. AutoGluon-TimeSeries will perform a lightweight fine-tuning of the pretrained model on the provided training data. We add name suffixes to identify the zero-shot and fine-tuned versions of the model.

predictor = TimeSeriesPredictor(prediction_length=48, eval_metric="MASE").fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {"model_path": "bolt_small", "fine_tune": True, "ag_args": {"name_suffix": "FineTuned"}},
        ]
    },
    enable_ensemble=False,
    time_limit=600,
)

The predictor will be fitted for at most 10 minutes, as specified by the time_limit. After fitting, we can evaluate the two model variants on the test data and generate a leaderboard:

predictor.leaderboard(test_data)

Fine-tuning Leaderboard

Fine-tuning significantly improved forecast accuracy, as shown by the test MASE scores. All AutoGluon-TimeSeries models report scores in a “higher is better” format, meaning that most forecasting error metrics like MASE are multiplied by -1 when reported.

Augment Chronos-Bolt with exogenous information

Chronos-Bolt is a univariate model, meaning it relies solely on the historical data of the target time series for making predictions. However, in real-world scenarios, additional exogenous information related to the target series (such as holidays or promotions) is often available. Using this information when making predictions can improve forecast accuracy. AutoGluon-TimeSeries now features covariate regressors, which can be combined with univariate models like Chronos-Bolt to incorporate exogenous information. A covariate regressor in AutoGluon-TimeSeries is a tabular regression model that is fit on the known covariates and static features to predict the target column at each time step. The predictions of the covariate regressor are subtracted from the target column, and the univariate model then forecasts the residuals.
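
The residual scheme described above can be sketched in a few lines. In this illustration, a one-variable least-squares fit stands in for the tabular regressor and a repeat-last-value forecaster stands in for the univariate model. Both are deliberate simplifications, all function names are hypothetical, and we assume the regressor's predictions for the future covariates are added back to the residual forecast.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = a + b * x (stand-in for a tabular regressor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / var_x
    a = mean_y - b * mean_x
    return lambda xi: a + b * xi

def naive_forecast(history, horizon):
    """Repeat the last value (stand-in for the univariate forecaster)."""
    return [history[-1]] * horizon

def forecast_with_covariates(target, covariate, future_covariate):
    regressor = fit_linear(covariate, target)
    # Subtract the regressor's predictions from the target to obtain residuals.
    residuals = [y - regressor(x) for x, y in zip(covariate, target)]
    # The univariate model forecasts only the residuals.
    residual_forecast = naive_forecast(residuals, len(future_covariate))
    # Add the covariate effect back for the future time steps.
    return [regressor(x) + r for x, r in zip(future_covariate, residual_forecast)]

# Toy data: sales jump by 5 units whenever a promotion is active.
promotion = [0, 1, 0, 1, 0, 0]
unit_sales = [10.0, 15.0, 10.0, 15.0, 10.0, 10.0]
forecast = forecast_with_covariates(unit_sales, promotion, future_covariate=[1, 0])
```

In this toy case the regressor explains the promotion effect entirely, so the residuals are zero and the forecast is approximately 15 for the promoted week and 10 for the other.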

We use a grocery sales dataset to demonstrate how Chronos-Bolt can be combined with a covariate regressor. This dataset includes three known covariates: scaled_price, promotion_email, and promotion_homepage, and the task is to forecast the unit_sales:

train_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/train.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)

Grocery Sales DataFrame

The following code fits a TimeSeriesPredictor to forecast unit_sales for the next 7 weeks. We have specified the target column we are interested in forecasting and the names of known covariates while constructing the TimeSeriesPredictor. Two configurations are defined for Chronos-Bolt: a zero-shot setting, which uses only the historical context of unit_sales without considering the known covariates, and a covariate regressor setting, which employs a CatBoost model as the covariate_regressor. We also set the target_scaler, which brings the time series to a comparable scale before training and typically results in better accuracy.

predictor = TimeSeriesPredictor(
    prediction_length=7,
    eval_metric="MASE",
    target="unit_sales",
    known_covariates_names=["scaled_price", "promotion_email", "promotion_homepage"],
).fit(
    train_data,
    hyperparameters={
        "Chronos": [
            {"model_path": "bolt_small", "ag_args": {"name_suffix": "ZeroShot"}},
            {
                "model_path": "bolt_small",
                "covariate_regressor": "CAT",
                "target_scaler": "standard",
                "ag_args": {"name_suffix": "WithRegressor"},
            },
        ],
    },
    time_limit=600,
    enable_ensemble=False,
)
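
To illustrate what a "standard" target scaler does conceptually, the sketch below standardizes one series to zero mean and unit variance and then maps values back to the original scale. It is a simplified assumption about the transformation, not AutoGluon's implementation, and the helper names are hypothetical.

```python
import statistics

def standard_scale(series):
    """Scale a series to zero mean and unit variance; return the scaled values and parameters."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series) or 1.0  # guard against constant series
    return [(v - mean) / std for v in series], (mean, std)

def inverse_scale(values, params):
    """Map scaled values (for example, forecasts) back to the original scale."""
    mean, std = params
    return [v * std + mean for v in values]

weekly_sales = [2.0, 4.0, 6.0]  # toy values
scaled, params = standard_scale(weekly_sales)
restored = inverse_scale(scaled, params)
```

Scaling each series separately lets the downstream model see items with very different sales volumes on a comparable scale, and the inverse transform returns forecasts in the original units.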

After the predictor has been fit, we can evaluate it on the test dataset and generate the leaderboard. Using the covariate regressor with Chronos-Bolt improves over its univariate zero-shot performance considerably.

test_data = TimeSeriesDataFrame.from_path(
    "https://autogluon.s3.amazonaws.com/datasets/timeseries/grocery_sales/test.csv",
    id_column="item_id",
    timestamp_column="timestamp",
)
predictor.leaderboard(test_data)

Covariate Regressor Results

The covariates might not always be useful—for some datasets, the zero-shot model might achieve better accuracy. Therefore, it’s important to try multiple models and select the one that achieves the best accuracy on held-out data.

Conclusion

Chronos-Bolt models empower practitioners to generate high-quality forecasts rapidly in a zero-shot manner. AutoGluon-TimeSeries enhances this capability by enabling users to fine-tune Chronos-Bolt models effortlessly, integrate them with covariate regressors, and ensemble them with a diverse range of forecasting models. For advanced users, it provides a comprehensive set of features to customize forecasting models beyond what was demonstrated in this post. AutoGluon predictors can be seamlessly deployed to SageMaker using AutoGluon-Cloud and the official Deep Learning Containers.

To learn more about using AutoGluon-TimeSeries to build accurate and robust forecasting models, explore our tutorials. Stay updated by following AutoGluon on X (formerly Twitter) and starring us on GitHub!

References

[1] Ansari, Abdul Fatir, Lorenzo Stella, Ali Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, et al. “Chronos: Learning the language of time series.” Transactions on Machine Learning Research (2024).
[2] Hyndman, R. J., and G. Athanasopoulos. “Forecasting: principles and practice, 3rd ed.” OTexts (2018).
[3] Salinas, David, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. “DeepAR: Probabilistic forecasting with autoregressive recurrent networks.” International Journal of Forecasting 36, no. 3 (2020): 1181-1191.
[4] Nie, Yuqi, Nam H. Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. “A time series is worth 64 words: long-term forecasting with transformers.” In The Eleventh International Conference on Learning Representations (2023).
[5] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of Machine Learning Research 21, no. 140 (2020): 1-67.
[6] Godahewa, Rakshitha, Christoph Bergmeir, Geoffrey I. Webb, Rob J. Hyndman, and Pablo Montero-Manso. “Monash time series forecasting archive.” In NeurIPS Track on Datasets and Benchmarks (2021).


About the Authors

Abdul Fatir Ansari is a Senior Applied Scientist at Amazon Web Services, specializing in machine learning and forecasting, with a focus on foundation models for structured data, such as time series. He received his PhD from the National University of Singapore, where his research centered on deep generative models for images and time series.

Caner Turkmen is a Senior Applied Scientist at Amazon Web Services, where he works on research problems at the intersection of machine learning and forecasting. Before joining AWS, he worked in the management consulting industry as a data scientist, serving the financial services and telecommunications sectors. He holds a PhD in Computer Engineering from Bogazici University in Istanbul.

Oleksandr Shchur is a Senior Applied Scientist at Amazon Web Services, where he works on time series forecasting in AutoGluon. Before joining AWS, he completed a PhD in Machine Learning at the Technical University of Munich, Germany, doing research on probabilistic models for event data. His research interests include machine learning for temporal data and generative modeling.

Lorenzo Stella is a Senior Applied Scientist at Amazon Web Services, working on machine learning, forecasting, and generative AI for analytics and decision-making. He holds a PhD in Computer Science and Electrical Engineering from IMT Lucca (Italy) and KU Leuven (Belgium), where his research focused on numerical optimization algorithms for machine learning and optimal control applications.


How Amazon Finance Automation built a generative AI Q&A chat assistant using Amazon Bedrock


Today, the Accounts Payable (AP) and Accounts Receivable (AR) analysts in Amazon Finance operations receive queries from customers through email, cases, internal tools, or phone. When a query arises, analysts must engage in a time-consuming process of reaching out to subject matter experts (SMEs) and going through multiple policy documents containing standard operating procedures (SOPs) relevant to the query. This back-and-forth communication process often takes from hours to days, primarily because analysts, especially new hires, don’t have immediate access to the necessary information. They spend hours consulting SMEs and reviewing extensive policy documents.

To address this challenge, Amazon Finance Automation developed a large language model (LLM)-based question-answer chat assistant on Amazon Bedrock. This solution empowers analysts to rapidly retrieve answers to customer queries, generating prompt responses within the same communication thread. As a result, it drastically reduces the time required to address customer queries.

In this post, we share how Amazon Finance Automation built this generative AI Q&A chat assistant using Amazon Bedrock.

Solution overview

The solution is based on a Retrieval Augmented Generation (RAG) pipeline running on Amazon Bedrock, as shown in the following diagram. When a user submits a query, RAG works by first retrieving relevant documents from a knowledge base, then generating a response with the LLM from the retrieved documents.
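
The retrieve-then-generate flow can be sketched as follows. This is a toy illustration under stated assumptions: a bag-of-words counter stands in for the embedding model, an in-memory list stands in for the vector store, and the final prompt would be sent to an LLM in a real pipeline. The sample documents and all function names are invented for this example.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, contexts):
    """Assemble the retrieved contexts and the question into a single prompt."""
    joined = "\n".join(contexts)
    return f"<context>\n{joined}\n</context>\n<question>\n{query}\n</question>"

# Invented policy snippets for illustration only.
documents = [
    "Invoices are paid within 30 days of receipt.",
    "Expense reports require manager approval.",
    "Late invoices accrue a penalty fee.",
]
top = retrieve("When are invoices paid?", documents, k=1)
prompt = build_prompt("When are invoices paid?", top)
```

The generation step then passes the prompt to the LLM, which answers using only the retrieved context.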

The solution consists of the following key components:

  1. Knowledge base – We used Amazon OpenSearch Service as the vector store for embedding documents. For performance evaluation, we processed and indexed multiple Amazon finance policy documents into the knowledge base. Alternatively, Amazon Bedrock Knowledge Bases provides fully managed support for end-to-end RAG workflows. We’re planning to migrate to Amazon Bedrock Knowledge Bases to eliminate cluster management and add extensibility to our pipeline.
  2. Embedding model – At the time of writing, we’re using the Amazon Titan Multimodal Embeddings G1 model on Amazon Bedrock. The model is pre-trained on large and unique datasets and corpora from Amazon and provides accuracy that is higher than or comparable to other embedding models on the market based on our comparative analysis.
  3. Generator model – We used a foundation model (FM) provided by Amazon Bedrock for its balanced ability to deliver highly accurate answers quickly.
  4. Diversity ranker – It’s responsible for rearranging the results obtained from vector index to avoid skewness or bias towards any specific document or section.
  5. Lost in the middle ranker – It’s responsible for efficiently distributing the most relevant results towards the top and bottom of the prompt, maximizing the impact of the prompt’s content.
  6. Guardrails – We used Amazon Bedrock Guardrails to detect personally identifiable information (PII) and safeguard against prompt injection attacks.
  7. Validation engine – Removes PII from the response and checks whether the generated answer aligns with the retrieved context. If not, it returns a hardcoded “I don’t know” response to prevent hallucinations.
  8. Chat assistant UI – We developed the UI using Streamlit, an open source Python library for web-based application development on machine learning (ML) use cases.
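
The post does not describe how the two rankers are implemented, so the following is only one plausible interpretation. The diversity ranker is sketched as a round-robin interleaving of chunks across source documents, and the lost in the middle ranker as a reordering that places the strongest results at the two ends of the list. The function names and sample data are hypothetical.

```python
def diversify(results):
    """results: (doc_id, chunk) pairs sorted best-first; interleave chunks across documents."""
    groups = {}
    for doc_id, chunk in results:
        groups.setdefault(doc_id, []).append((doc_id, chunk))
    queues = list(groups.values())
    interleaved = []
    while any(queues):
        for queue in queues:
            if queue:
                interleaved.append(queue.pop(0))
    return interleaved

def lost_in_the_middle(ranked):
    """Given items sorted best-first, put the best at the edges and the weakest in the middle."""
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]

results = [("A", "a1"), ("A", "a2"), ("A", "a3"), ("B", "b1"), ("C", "c1")]
diverse = diversify(results)
reordered = lost_in_the_middle([1, 2, 3, 4, 5])  # 1 = most relevant
```

In this sketch, the interleaving keeps a single document from filling all the top positions, and the reordering moves the least relevant items to the middle of the prompt, where long-context models tend to pay the least attention.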

Evaluate RAG performance

The accuracy of the chat assistant is the most critical performance metric to Amazon Finance Operations. After we built the first version of the chat assistant, we measured the bot response accuracy by submitting questions to the chat assistant. The SMEs manually evaluated the RAG responses one by one, and found only 49% of the responses were correct. This was far below the expectation, and the solution needed improvement.

However, manually evaluating the RAG isn’t sustainable—it requires hours of effort from finance operations and engineering teams. Therefore, we adopted the following automated performance evaluation approach:

  • Prepare testing data – We constructed a test dataset with three data fields:
    • question – This consists of 100 questions from policy documents where answers reside in a variety of sources, such as policy documents and engineering SOPs, covering complex text formats such as embedded tables and images.
    • expected_answer – These are manually labeled answers by Amazon Finance Operations SMEs.
    • generated_answer – This is the answer generated by the bot.
  • NLP scores – We used the test dataset to calculate the ROUGE score and METEOR score. Because these scores rely on word-matching algorithms and ignore the semantic meaning of the text, they aren’t aligned with the SME scores. Based on our analysis, the variance was approximately 30% compared to human evaluations.
  • LLM-based score – We used an FM offered by Amazon Bedrock to score the RAG performance. We designed specialized LLM prompts to evaluate the RAG performance by comparing the generated answer with the expected answer. We generated a set of LLM-based metrics, including accuracy, acceptability, and factualness, and the citation representing the evaluation reasoning. The variance of this approach was approximately 5% compared to human analysis, so we decided to stick to this approach of evaluation. If your RAG system is built on Amazon Bedrock Knowledge Bases, you can use the new RAG evaluation for Amazon Bedrock Knowledge Bases tool to evaluate the retrieve or the retrieve and generate functionality with an LLM as a judge. It provides retrieval evaluation metrics such as context relevance and context coverage. It also provides retrieve and generate evaluation metrics such as correctness, completeness, and helpfulness, as well as responsible AI metrics such as harmfulness and answer refusal.
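
The weakness of word-matching metrics is easy to reproduce. The sketch below implements a simplified unigram-overlap F1 score in the spirit of ROUGE-1 (no stemming or other preprocessing, so it is not the official metric) and applies it to invented sentences: a correct paraphrase and a factually wrong answer.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate answer."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Invented example sentences.
expected = "payment is due within 30 days"
paraphrase = "invoices must be settled in one month"  # same meaning, no shared words
wrong = "payment is due within 90 days"               # wrong fact, most words shared

score_paraphrase = rouge1_f1(expected, paraphrase)
score_wrong = rouge1_f1(expected, wrong)
```

The correct paraphrase scores 0, while the factually wrong answer scores about 0.83, which illustrates why such scores can diverge from SME judgments and why a semantic, LLM-based evaluation can track human evaluation more closely.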

Improve the accuracy of RAG pipeline

Based on the aforementioned evaluation techniques, we focused on the following areas in the RAG pipeline to improve the overall accuracy.

Add document semantic chunking to improve accuracy from 49% to 64%

Upon diagnosing incorrect responses in the RAG pipeline, we identified 14% of the inaccuracy was due to incomplete contexts sent to the LLM. These incomplete contexts were originally generated by the segmentation algorithm based on a fixed chunk size (for example, 512 tokens or 384 words), which doesn’t consider document boundaries such as sections and paragraphs.

To address this problem, we designed a new document segmentation approach using QUILL Editor, Amazon Titan Text Embeddings, and OpenSearch Service, using the following steps:

  1. Convert the unstructured text to a structured HTML document using QUILL Editor. In this way, the HTML document preserves the document formatting that divides the contents into logical chunks.
  2. Identify the logical structure of the HTML document and insert divider strings based on HTML tags for document segmentation.
  3. Use an embedding model to generate semantic vector representation of document chunks.
  4. Assign tags based on important keywords in the section to identify the logical boundaries between sections.
  5. Insert the embedding vectors of the segmented documents to the OpenSearch Service vector store.

The following diagram illustrates the document retriever splitting workflow.

When processing the document, we follow specific rules:

  • Extract the start and end of a section of a document precisely
  • Extract the titles of the section and pair them with section content accurately
  • Assign tags based on important keywords from the sections
  • Persist the markdown information from the policy while indexing
  • Exclude images and tables from the processing in the initial release

With this approach, we improved RAG accuracy from 49% to 64%.
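
As an illustration of splitting on document structure rather than a fixed chunk size, the following sketch uses Python's built-in HTML parser to start a new chunk at every heading and to pair each section title with its content. It is a simplified stand-in for the pipeline described above: the heading levels, class name, and sample HTML are assumptions, and tagging and embedding are omitted.

```python
from html.parser import HTMLParser

class SectionSplitter(HTMLParser):
    """Start a new chunk at every heading tag and pair the title with its content."""
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.sections = []        # list of {"title": str, "content": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append({"title": "", "content": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.sections:
            return  # skip whitespace and any text before the first heading
        key = "title" if self._in_heading else "content"
        section = self.sections[-1]
        section[key] = (section[key] + " " + text).strip()

# Invented policy text for illustration only.
html = (
    "<h2>Refund policy</h2><p>Refunds are issued within 10 days.</p>"
    "<p>Exceptions need approval.</p>"
    "<h2>Late fees</h2><p>A fee applies after 30 days.</p>"
)
splitter = SectionSplitter()
splitter.feed(html)
splitter.close()
```

Each resulting chunk corresponds to one complete section, so a paragraph is never cut in half the way it can be with a fixed token window.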

Use prompt engineering to improve accuracy from 64% to 76%

Prompt engineering is a crucial technique to improve the performance of LLMs. We learned from our project that there is no one-size-fits-all prompt engineering approach; it’s a best practice to design task-specific prompts. We adopted the following approach to enhance the effectiveness of the prompt-to-RAG generator:

  • In approximately 14% of cases, we identified that the LLM generated responses even when no relevant context was retrieved from the RAG, leading to hallucinations. In this case, we engineered prompts and asked the LLM not to generate any response when there is no relevant context provided.
  • In approximately 13% of cases, we received user feedback that the response from the LLM was too brief, lacking complete context. We engineered prompts that encouraged the LLM to be more comprehensive.
  • We engineered prompts to enable the capability to generate both concise and detailed answers for the users.
  • We used LLM prompts for generation of citations to properly attribute our source used to generate the answer. In the UI, the citations are listed with hyperlinks following the LLM response, and users can use these citations to validate the LLM performance.
  • We improved our prompts to introduce better chain-of-thought (CoT) reasoning:
    • The LLM’s unique characteristic of using internally generated reasoning contributes to improved performance and aligns responses with humanlike coherence. Because of this interplay between prompt quality, reasoning requests, and the model’s inherent capabilities, we could optimize performance.
    • Encouraging CoT reasoning prompts the LLM to consider the context of the conversation, making it less prone to hallucinations.
    • By building upon the established context, the model is more likely to generate responses that logically follow the conversation’s narrative, reducing the chances of providing inaccurate or hallucinated answers.
    • We added examples of previously answered questions to establish a pattern for the LLM, encouraging CoT.

We then used meta-prompting using an FM offered by Amazon Bedrock to craft a prompt that caters to the aforementioned requirements.

The following example is a prompt for generating a quick summary and a detailed answer:

You are an AI assistant that helps answer questions based on provided text context. I will give you some passages from a document, followed by a question. Your task is to provide the best possible answer to the question using only the information from the given context. Here is the context:

<context>
{}
</context>

And here is the question:
<question>
{}
</question>

Think carefully about how the context can be used to answer the question.
<thinkingprocess>
- Carefully read the provided context and analyze what information it contains
- Identify the key pieces of information in the context that are relevant to answering the question
- Determine if the context provides enough information to answer the question satisfactorily
- If not, simply state "I don't know, I don't have the complete context needed to answer this
question"
- If so, synthesize the relevant information into a concise summary answer
- Expand the summary into a more detailed answer, utilizing Markdown formatting to make it clear and
readable
</thinkingprocess>

If you don't have enough context to answer the question, provide your response in the following
format:
I don't know, I don't have the complete context needed to answer this question.

If you do have enough context to answer the question, provide your response in the following format:
#### Quick Summary:
Your concise 1-2 sentence summary goes here.
#### Detailed Answer:
Your expanded answer goes here, using Markdown formatting like **bold**, *italics*, and Bullet points to improve readability.

Remember, the ultimate goal is to provide an informative, clear and readable answer to the question
using only the context provided. Let's begin!
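
The two {} placeholders in the prompt are filled with the retrieved context and the user question, and the fixed output format makes the response straightforward to post-process. The sketch below shows one way to do both with plain string operations. The template is an abbreviated stand-in for the full prompt, and the helper names are hypothetical.

```python
# Abbreviated stand-in for the full prompt shown above.
PROMPT_TEMPLATE = (
    "Here is the context:\n<context>\n{}\n</context>\n\n"
    "And here is the question:\n<question>\n{}\n</question>\n"
)

FALLBACK = "I don't know, I don't have the complete context needed to answer this question."

def build_prompt(context, question):
    """Fill the two placeholders in order: context first, then question."""
    return PROMPT_TEMPLATE.format(context, question)

def parse_response(text):
    """Split a model response into summary and detail, or flag the fallback answer."""
    text = text.strip()
    if text.startswith("I don't know"):
        return {"answered": False, "summary": None, "detail": None}
    _, _, rest = text.partition("#### Quick Summary:")
    summary, _, detail = rest.partition("#### Detailed Answer:")
    return {"answered": True, "summary": summary.strip(), "detail": detail.strip()}

prompt = build_prompt("Invoices are paid within 30 days.", "When are invoices paid?")
parsed = parse_response(
    "#### Quick Summary:\nPaid in 30 days.\n#### Detailed Answer:\n**Invoices** are paid in 30 days."
)
```

Separating the summary from the detailed answer in this way lets a UI show the concise version first and expand to the detailed one on request.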

The following example is a prompt for generating citations based on the generated answers and retrieved contexts:

You are an AI assistant that specializes in attributing generated answers to specific sections within provided documents. Your task is to determine which sections from the given documents were most likely used to generate the provided answer. If you cannot find exact matches, suggest sections that are closely related to the content of the answer.

Here is the generated answer to analyze:
<generated_answer>
{}
</generated_answer>

And here are the sections from various documents to consider:
<sections>
{}
</sections>

Please carefully read through the generated answer and the provided sections. In the scratchpad space below, brainstorm and reason about which sections are most relevant to the answer:
<scratchpad>
</scratchpad>

After identifying the relevant sections, provide your output in the following format:
**Document Name:** <document name> \n
**Document Link:** <document link> \n
**Relevant Sections:** \n
- <section name 1>
- <section name 2>
- <section name 3>

Do not include any additional explanations or reasoning in your final output. Simply list the document name, link, and relevant section names in the specified format above.

Assistant:

By implementing the prompt engineering approaches, we improved RAG accuracy from 64% to 76%.

Use an Amazon Titan Text Embeddings model to improve accuracy from 76% to 86%

After implementing the document segmentation approach, we still saw lower relevance scores for retrieved contexts (55–65%), and the incorrect contexts were in the top ranks for more than 50% of cases. This indicated that there was still room for improvement.

We experimented with multiple embedding models, including first-party and third-party models. For example, contextual embedding models such as bge-base-en-v1.5 performed better for context retrieval than other top embedding models such as all-mpnet-base-v2. We found that using the Amazon Titan Embeddings G1 model increased the relevance scores of retrieved contexts from approximately 55–65% to 75–80%, and 80% of the retrieved contexts ranked higher than before.

Finally, by adopting the Amazon Titan Text Embeddings G1 model, we improved the overall accuracy from 76% to 86%.

Conclusion

We achieved remarkable progress in developing a generative AI Q&A chat assistant for Amazon Finance Automation by using a RAG pipeline and LLMs on Amazon Bedrock. Through continual evaluation and iterative improvement, we have addressed challenges of hallucinations, document ingestion issues, and context retrieval inaccuracies. Our results have shown a significant improvement in RAG accuracy from 49% to 86%.

You can follow our journey and adopt a similar solution to address challenges in your RAG application and improve overall performance.


About the Authors

Soheb Moin is a Software Development Engineer at Amazon, who led the development of the Generative AI chatbot. He specializes in leveraging generative AI and Big Data analytics to design, develop, and implement secure, scalable, innovative solutions that empower Finance Operations with better productivity and automation. Outside of work, Soheb enjoys traveling, playing badminton, and engaging in chess tournaments.

Nitin Arora is a Sr. Software Development Manager for Finance Automation in Amazon. He has over 19 years of experience building business critical, scalable, high-performance software. Nitin leads data services, communication, work management and several Generative AI initiatives within Finance. In his spare time, he enjoys listening to music and reading.

Yunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Kumar Satyen Gaurav is an experienced Software Development Manager at Amazon, with over 16 years of expertise in big data analytics and software development. He leads a team of engineers to build products and services using AWS big data technologies, providing key business insights for Amazon Finance Operations across diverse business verticals. Beyond work, he finds joy in reading, traveling, and learning the strategic challenges of chess.

Mohak Chugh is a Software Development Engineer at Amazon, with over 3 years of experience in developing products leveraging generative AI and big data on AWS. His work encompasses a range of areas, including RAG-based generative AI chatbots and high-performance data reconciliation. Beyond work, he finds joy in playing the piano and performing with his music band.

Parth Bavishi is a Senior Product Manager at Amazon with over 10 years of experience in building impactful products. He currently leads the development of generative AI capabilities for Amazon’s Finance Automation, driving innovation and efficiency within the organization. A dedicated mentor, Parth enjoys sharing his product management knowledge and finds satisfaction in activities like volleyball and reading.


Carnegie Mellon University at NeurIPS 2024

Carnegie Mellon University is proud to present 194 papers at the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), held from December 10-15 at the Vancouver Convention Center.

[Figure: overview of the areas our researchers are working on]

[Figure: our top collaborator institutions]

Oral Papers

Stylus: Automatic Adapter Selection for Diffusion Models

Authors: Michael Luo, Justin Wong, Brandon Trabucco, Yanping Huang, Joseph Gonzalez, Zhifeng Chen, Ruslan Salakhutdinov, Ion Stoica

This paper explores an alternative approach to generating high-fidelity, customized images at reduced costs using fine-tuned adapters instead of simply scaling base models with additional data or parameters. Over time, the open-source community has created a large collection of more than 100,000 adapters—small modules that fine-tune base models for specific tasks. However, many of these adapters are highly customized and lack clear descriptions, making them challenging to use effectively. To address this, the paper introduces Stylus, a system designed to match prompts with relevant adapters and automatically compose them for better image generation. Building on recent research showing the benefits of combining multiple adapters, Stylus uses a three-stage process: summarizing adapters with improved descriptions and embeddings, retrieving relevant adapters, and composing adapters based on prompt keywords to ensure a strong match. The authors also present StylusDocs, a curated dataset of 75,000 adapters with pre-computed embeddings, for evaluation. Testing Stylus on popular Stable Diffusion checkpoints shows that it achieves better CLIP/FID Pareto efficiency and is twice as preferred by human and multimodal evaluators compared to the base model.

The Sample-Communication Complexity Trade-off in Federated Q-Learning

Authors: Sudeep Salgia, Yuejie Chi

This work examines the problem of Federated Q-learning, where multiple agents collaboratively learn the optimal Q-function for an unknown infinite-horizon Markov Decision Process with finite state and action spaces. The focus is on understanding the trade-off between sample complexity (the number of data samples needed for learning) and communication complexity (the amount of data exchanged between agents) for intermittent communication algorithms, a commonly used approach in federated settings.

The authors first establish a fundamental limitation: any Federated Q-learning algorithm that achieves linear speedup in sample complexity relative to the number of agents must incur a communication cost of at least Ω(1/(1−γ)), where γ is the discount factor. They then introduce a new algorithm, Fed-DVR-Q, which is the first to achieve both optimal sample complexity and communication complexity simultaneously. Together, these results provide a comprehensive understanding of the trade-offs between sample and communication efficiency in Federated Q-learning.

Spotlight Papers

Aligner Encoders: Self-Attention Transformers Can Be Self-Transducers

Authors: Adam Stooke, Rohit Prabhavalkar, Khe Sim, Pedro Moreno Mengibar

The paper introduces a new transformer-based approach to automatic speech recognition (ASR) that simplifies the alignment process between audio input and text output. Unlike traditional models, the encoder itself aligns audio information internally, reducing the complexity of decoding. The proposed “Aligner-Encoder” model combines efficient training techniques and a lightweight decoder, resulting in significantly faster performance while maintaining competitive accuracy. Notably, the alignment process is evident in the self-attention weights of the model, showcasing its ability to handle the task efficiently.

Approximating the Top Eigenvector in Random Order Streams

Authors: Praneeth Kacham, David Woodruff

This work focuses on streaming algorithms for approximating the top eigenvector of a matrix when its rows are presented in a random order. The authors introduce a new algorithm that works efficiently when there is a sufficient gap between the largest and second-largest eigenvalues of the matrix. Their approach uses a small amount of memory, depending on the number of “heavy rows” (rows with large norms), and produces highly accurate results. They also show that using this heavy-row-based parameterization is necessary for achieving high accuracy and improve on prior methods by reducing the gap requirement for random-order streams, though their method assumes the rows are presented in a random order rather than any order.

Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Authors: Shentong Mo, Peter Tong

Recent advancements in unsupervised visual representation learning have highlighted the Joint-Embedding Predictive Architecture (JEPA) as an effective method for extracting visual features from unlabeled images using masking strategies. However, JEPA faces two key challenges: its reliance on Exponential Moving Average (EMA) fails to prevent model collapse, and its predictions struggle to accurately capture the average representation of image patches. To address these issues, this work introduces C-JEPA, a new framework that combines JEPA with a variance-invariance-covariance regularization strategy called VICReg. This approach improves stability, prevents collapse, and ensures better learning of consistent representations. Experiments show that C-JEPA achieves faster convergence and higher performance on standard benchmarks when pre-trained on ImageNet-1K.

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

Authors: Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang

This work addresses the challenge of enabling humanoid robots to collaborate on tasks like moving large furniture, which require coordination between multiple robots. Existing methods struggle due to a lack of motion capture data for multi-humanoid collaboration and the inefficiency of training multiple agents together. To overcome this, the authors introduce Cooperative Human-Object Interaction (CooHOI), a framework that uses a two-phase learning approach: first, individual humanoids learn object interaction skills from human motion data, and then they learn to work together using multi-agent reinforcement learning. By focusing on shared object dynamics and decentralized execution, the robots achieve coordination through implicit communication. Unlike previous tracking-based methods, CooHOI is efficient, does not rely on multi-humanoid motion data, and can easily scale to more participants and diverse object types.

DiffTORI: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning

Authors: Weikang Wan, Ziyu Wang, Yufei Wang, Zackory Erickson, David Held

This paper presents DiffTORI, a framework that uses differentiable trajectory optimization as a policy representation for reinforcement and imitation learning. Trajectory optimization, a common tool in control, is parameterized by a cost and a dynamics function, and recent advances now allow gradients of the loss to be computed with respect to these parameters. This enables DiffTORI to learn cost and dynamics functions end-to-end, addressing the “objective mismatch” in previous model-based RL methods by aligning the dynamics model with task performance. Benchmarking on robotic manipulation tasks with high-dimensional sensory inputs, DiffTORI demonstrates superior performance over prior methods, including feedforward policies, energy-based models, and diffusion models, across a wide range of reinforcement and imitation learning tasks.

Don’t Look Twice: Faster Video Transformers with Run-Length Tokenization

Authors: Rohan Choudhury, Guanglei Zhu, Sihan Liu, Koichiro Niinuma, Kris Kitani, László Jeni

Video transformers are notoriously slow to train due to the large number of input tokens, many of which are repeated across frames. Existing methods to remove redundant tokens often introduce significant overhead or require dataset-specific tuning, limiting their practicality. This work introduces Run-Length Tokenization (RLT), a simple and efficient method inspired by run-length encoding, which identifies and removes repeated patches in video frames before inference. By replacing repeated patches with a single token and a positional encoding to reflect its duration, RLT reduces redundancy without requiring tuning or adding significant computational cost. It accelerates training by 30%, maintains baseline performance, and increases throughput by 35% with minimal accuracy loss, while reducing token counts by up to 80% on longer videos.
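The core idea of collapsing repeated patches into one token plus a duration can be illustrated with a small sketch. This is not the authors' implementation: the function name `run_length_tokenize`, the `(frames, patches, values)` array layout, and the mean-absolute-difference `threshold` test are all assumptions made for illustration.

```python
import numpy as np

def run_length_tokenize(patches, threshold=0.0):
    """Collapse temporally repeated patches into single tokens.

    patches: array of shape (T, N, D) with T frames, N patch positions
             per frame, and D values per patch (hypothetical layout).
    Returns the kept tokens, their (frame, position) starts, and the
    number of consecutive frames each token stands for.
    """
    T, N, _ = patches.shape
    tokens, starts, lengths = [], [], []
    for n in range(N):
        start = 0
        for t in range(1, T + 1):
            # Close the current run at the end of the video, or when the
            # patch differs from the same position in the previous frame.
            changed = t == T or np.abs(patches[t, n] - patches[t - 1, n]).mean() > threshold
            if changed:
                tokens.append(patches[start, n])
                starts.append((start, n))
                lengths.append(t - start)
                start = t
    return np.stack(tokens), starts, lengths

# Toy video: patch position 0 never changes, position 1 changes every frame.
video = np.zeros((4, 2, 3))
video[:, 1, :] = np.arange(4)[:, None]
tokens, starts, lengths = run_length_tokenize(video)
print(len(tokens), "tokens instead of", video.shape[0] * video.shape[1])
```

In this toy input the static patch becomes one token with length 4, while the changing patch keeps one token per frame. In a real model the lengths would feed the positional encoding that reflects each token's duration, as the summary describes.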

ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights

Authors: Gabriel Sarch, Lawrence Jang, Michael Tarr, William Cohen, Kenneth Marino, Katerina Fragkiadaki

This work introduces In-Context Abstraction Learning (ICAL), a method that enables large-scale language and vision-language models (LLMs and VLMs) to generate high-quality task examples from imperfect demonstrations. ICAL uses a vision-language model to analyze and improve inefficient task trajectories by abstracting key elements like causal relationships, object states, and temporal goals, with iterative refinement through human feedback. These improved examples, when used as prompts, enhance decision-making and reduce reliance on human input over time, making the system more efficient. ICAL outperforms state-of-the-art models in tasks like instruction following, web navigation, and action forecasting, demonstrating its ability to improve performance without heavy manual prompt engineering.

Is Your LiDAR Placement Optimized for 3D Scene Understanding?

Authors: Ye Li, Lingdong Kong, Hanjiang Hu, Xiaohao Xu, Xiaonan Huang

This work focuses on improving the reliability of driving perception systems under challenging and unexpected conditions, particularly with multi-LiDAR setups. Most existing datasets rely on single-LiDAR systems and are collected in ideal conditions, making them insufficient for real-world applications. To address this, the authors introduce Place3D, a comprehensive pipeline that optimizes LiDAR placement, generates data, and evaluates performance. Their approach includes three key contributions: a new metric called the Surrogate Metric of the Semantic Occupancy Grids (M-SOG) for assessing multi-LiDAR configurations, an optimization strategy to improve LiDAR placements based on M-SOG, and the creation of a 280,000-frame dataset capturing both clean and adverse conditions. Experiments show that their optimized placements lead to significant improvements in tasks like semantic segmentation and 3D object detection, even in challenging scenarios with harsh weather or sensor failures.

Learn To be Efficient: Build Structured Sparsity in Large Language Models

Authors: Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Zhuoqing Morley Mao, Beidi Chen, Fan Lai, Atul Prakash

The paper explores how Large Language Models (LLMs), known for their impressive capabilities but high computational costs, can be made more efficient. It highlights that while activation sparsity—where only some model parameters are used during inference—naturally occurs, current methods fail to maximize its potential during training. The authors propose a novel training algorithm, Learn-To-be-Efficient (LTE), that encourages LLMs to activate fewer neurons, striking a balance between efficiency and performance. Their approach, applicable to models beyond traditional ReLU-based ones, demonstrates improved results across various tasks and reduces inference latency by 25% for LLaMA2-7B at 50% sparsity.

Learning Social Welfare Functions

Authors: Kanad Pardeshi, Itai Shapira, Ariel Procaccia, Aarti Singh

This work explores whether it is possible to understand or replicate a policymaker’s reasoning by analyzing their past decisions. The problem is framed as learning social welfare functions from the family of power mean functions. Two learning tasks are considered: one uses utility vectors of actions and their corresponding social welfare values, while the other uses pairwise comparisons of welfares for different utility vectors. The authors demonstrate that power mean functions can be learned efficiently, even when the social welfare data is noisy. They also propose practical algorithms for these tasks and evaluate their effectiveness.
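For readers unfamiliar with the power mean family, the following sketch shows the standard definition for positive utilities: the parameter p interpolates between the minimum (p → −∞), the geometric mean (p = 0), the arithmetic mean (p = 1), and the maximum (p → +∞). The helper name `power_mean` is an assumption for illustration, and the paper's learning algorithms are not reproduced here.

```python
import math

def power_mean(utilities, p):
    """Power mean of strictly positive utilities with exponent p."""
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be a non-empty list of positive numbers")
    n = len(utilities)
    if p == 0:
        # Limit p -> 0 is the geometric mean.
        return math.exp(sum(math.log(u) for u in utilities) / n)
    if p == math.inf:
        return max(utilities)   # utilitarian-optimistic extreme
    if p == -math.inf:
        return min(utilities)   # egalitarian extreme
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)

# The same utility vector scored under different values of p.
u = [1.0, 4.0]
for p in (-math.inf, -1, 0, 1, math.inf):
    print(p, power_mean(u, p))
```

Learning a social welfare function from this family then amounts to estimating the single parameter p from observed welfare values or pairwise comparisons.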

Metric Transforms and Low Rank Representations of Kernels

Authors: Timothy Chu, Josh Alman, Gary L. Miller, Shyam Narayanan, Mark Sellke, Zhao Song

The authors introduce a linear-algebraic tool based on group representation theory to solve three important problems in machine learning. First, they investigate fast attention algorithms for large language models and prove that only low-degree polynomials can produce the low-rank matrices required for subquadratic attention, thereby showing that polynomial-based approximations are essential. Second, they extend the classification of positive definite kernels from Euclidean distances to Manhattan distances, offering a broader foundation for kernel methods. Finally, they classify all functions that transform Manhattan distances into Manhattan distances, generalizing earlier work on Euclidean metrics and introducing new results about stable-rank-preserving functions with potential applications in algorithm design.

Sample-Efficient Private Learning of Mixtures of Gaussians

Authors: Hassan Ashtiani, Mahbod Majid, Shyam Narayanan

This work examines the problem of learning mixtures of Gaussians while ensuring approximate differential privacy. The authors demonstrate that it is possible to learn a mixture of k arbitrary d-dimensional Gaussians with significantly fewer samples than previous methods, achieving optimal performance when the dimensionality d is much larger than the number of components k. For univariate Gaussians, they establish the first optimal bound, showing that the sample complexity scales linearly with k, improving upon earlier methods that required a quadratic dependence on k. Their approach leverages advanced techniques, including the inverse sensitivity mechanism, sample compression for distributions, and volume bounding methods, to achieve these results.

Sequoia: Scalable and Robust Speculative Decoding

Authors: Zhuoming Chen, Avner May, Ruslan Svirschevski, Yu-hsun Huang, Max Ryabinin, Zhihao Jia, Beidi Chen

As the use of large language models (LLMs) increases, serving them quickly and efficiently has become a critical challenge. Speculative decoding offers a promising solution, but existing methods struggle to scale with larger workloads or adapt to different settings. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. By employing a dynamic programming algorithm, Sequoia optimizes the tree structure for speculated tokens, improving scalability. It also introduces a novel sampling and verification method that enhances robustness across various decoding temperatures. Sequoia achieves significant speedups, improving decoding speed on models like Llama2-7B, Llama2-13B, and Vicuna-33B by up to 4.04x, 3.73x, and 2.27x, respectively, and reducing per-token latency for Llama3-70B-Instruct on a single GPU by 9.5x compared to DeepSpeed-Zero-Inference.

Slight Corruption in Pre-training Data Makes Better Diffusion Models

Authors: Hao Chen, Yujin Han, Diganta Misra, Xiang Li, Kai Hu, Difan Zou, Masashi Sugiyama, Jindong Wang, Bhiksha Raj

Diffusion models have demonstrated impressive capabilities in generating high-quality images, audio, and videos, largely due to pre-training on large datasets that pair data with conditions, such as image-text or image-class pairs. However, even with careful filtering, these datasets often include corrupted pairs where the conditions do not accurately represent the data. This paper provides the first comprehensive study of how such corruption affects diffusion model training. By synthetically corrupting datasets like ImageNet-1K and CC3M, the authors show that slight corruption in pre-training data can surprisingly enhance image quality, diversity, and fidelity across various models. They also provide theoretical insights, demonstrating that slight condition corruption increases entropy and reduces the 2-Wasserstein distance to the ground truth distribution. Building on these findings, the authors propose a method called condition embedding perturbations, which improves diffusion model performance during both pre-training and downstream tasks, offering new insights into the training process.

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models

Authors: Sanae Lotfi, Yilun Kuang, Marc Finzi, Brandon Amos, Micah Goldblum, Andrew Wilson

Large language models (LLMs) with billions of parameters are highly effective at predicting the next token in a sequence. While recent research has computed generalization bounds for these models using compression-based techniques, these bounds often fail to apply to billion-parameter models or rely on restrictive methods that produce low-quality text. Existing approaches also tie the tightness of bounds to the number of independent documents in the training set, ignoring the larger number of dependent tokens, which could offer better bounds. This work uses properties of martingales to derive generalization bounds that leverage the vast number of tokens in LLM training sets. By using more flexible compression techniques like Monarch matrices, Kronecker factorizations, and post-training quantization, the authors achieve meaningful generalization bounds for large-scale models, including LLaMA2-70B, marking the first successful bounds for practical, high-quality text-generating models.

Poster Papers

Causality

Causal Inference in the Closed-Loop: Marginal Structural Models for Sequential Excursion Effects

Authors: Alexander Levis, Gabriel Loewinger, Francisco Pereira

Causal Temporal Representation Learning with Nonstationary Sparse Transition

Authors: Xiangchen Song, Zijian Li, Guangyi Chen, Yujia Zheng, Yewen Fan, Xinshuai Dong, Kun Zhang

Discovery of the Hidden World with Large Language Models

Authors: Chenxi Liu, Yongqiang Chen, Tongliang Liu, Mingming Gong, James Cheng, Bo Han, Kun Zhang

From Causal to Concept-Based Representation Learning

Authors: Goutham Rajendran, Simon Buchholz, Bryon Aragam, Bernhard Schölkopf, Pradeep Ravikumar

Identifying General Mechanism Shifts in Linear Causal Representations

Authors: Tianyu Chen, Kevin Bello, Francesco Locatello, Bryon Aragam, Pradeep Ravikumar

Identifying Selections for Unsupervised Subtask Discovery

Authors: Yiwen Qiu, Yujia Zheng, Kun Zhang

Interventional Causal Discovery in a Mixture of DAGs

Authors: Burak Varıcı, Dmitriy Katz, Dennis Wei, Prasanna Sattigeri, Ali Tajer

Learning Discrete Concepts in Latent Hierarchical Models

Authors: Lingjing Kong, Guangyi Chen, Biwei Huang, Eric Xing, Yuejie Chi, Kun Zhang

Learning Discrete Latent Variable Structures with Tensor Rank Conditions

Authors: Zhengming Chen, Ruichu Cai, Feng Xie, Jie Qiao, Anpeng Wu, Zijian Li, Zhifeng Hao, Kun Zhang

Likelihood-based differentiable structure learning

Authors: Chang Deng, Kevin Bello, Pradeep Ravikumar, Bryon Aragam

Linear Causal Representation Learning from Unknown Multi-node Interventions

Authors: Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Ali Tajer

Multi-Armed Bandits with Network Interference

Authors: Abhineet Agarwal, Anish Agarwal, Lorenzo Masoero, Justin Whitehouse

Natural Counterfactuals With Necessary Backtracking

Authors: Guang-yuan Hao, Jiji Zhang, Biwei Huang, Hao Wang, Kun Zhang

On Causal Discovery in the Presence of Deterministic Relations

Authors: Loka Li, Haoyue Dai, Hanin Al Ghothani, Biwei Huang, Jiji Zhang, Shahar Harel, Isaac Bentwich, Guangyi Chen, Kun Zhang

Sample Complexity of Interventional Causal Representation Learning

Authors: Emre Acartürk, Burak Varıcı, Karthikeyan Shanmugam, Ali Tajer

Computational Biology

Protein-Nucleic Acid Complex Modeling with Frame Averaging Transformer

Authors: Tinglin Huang, Zhenqiao Song, Rex Ying, Wengong Jin

Computer Vision

Adaptive Visual Scene Understanding: Incremental Scene Graph Generation

Authors: Naitik Khandelwal, Xiao Liu, Mengmi Zhang

Crafting Hierarchical Strand-based Hair Geometry with Frequency-decomposed Representative Guide Curves

Authors: Yunlu Chen, Francisco Vicente Carrasco, Christian Häne, Giljoo Nam, Jean-Charles Bazin, Fernando D De La Torre

DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features

Authors: Letian Wang, Seung Wook Kim, Jiawei Yang, Cunjun Yu, Boris Ivanovic, Steven Waslander, Yue Wang, Sanja Fidler, Marco Pavone, Peter Karkus

EAGLE: Efficient Adaptive Geometry-based Learning in Cross-view Understanding

Authors: Thanh-dat Truong, Utsav Prabhu, Dongyi Wang, Bhiksha Raj, Susan Gauch, Jeyamkondan Subbiah, Khoa Luu

Hamba: Single-view 3D Hand Reconstruction with Graph-guided Bi-Scanning Mamba

Authors: Haoye Dong, Aviral Chharia, Wenbo Gou, Francisco Vicente Carrasco, Fernando D De La Torre

Lexicon3D: Probing Visual Encoding Models for Complex 3D Scene Understanding

Authors: Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liangyan Gui, Yu-Xiong Wang

MGF: Mixed Gaussian Flow for Diverse Trajectory Prediction

Authors: Jiahe Chen, Jinkun Cao, Dahua Lin, Kris Kitani, Jiangmiao Pang

Metric from Human: Zero-shot Monocular Metric Depth Estimation via Test-time Adaptation

Authors: Yizhou Zhao, Hengwei Bian, Kaihua Chen, Pengliang Ji, Liao Qu, Shao-yu Lin, Weichen Yu, Haoran Li, Hao Chen, Jun Shen, Bhiksha Raj, Min Xu

Vision Foundation Model Enables Generalizable Object Pose Estimation

Authors: Kai Chen, Yiyao Ma, Xingyu Lin, Stephen James, Jianshu Zhou, Yun-hui Liu, Pieter Abbeel, Dou Qi

Computer Vision (Image Generation)

Latent Representation Matters: Human-like Sketches in One-shot Drawing Tasks

Authors: Victor Boutin, Rishav Mukherji, Aditya Agrawal, Sabine Muzellec, Thomas Fel, Thomas Serre, Rufin Vanrullen

Computer Vision (Video Generation)

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, László Jeni, Sergey Tulyakov, Hsin-ying Lee

A Versatile Diffusion Transformer with Mixture of Noise Levels for Audiovisual Generation

Authors: Gwanghyun Kim, Alonso Martinez, Yu-chuan Su, Brendan Jou, Jose Lezama, Agrim Gupta, Lijun Yu, Lu Jiang, Aren Jansen, Jacob Walker, Se Young Chun, Krishna Somandepalli

Computer Vision (Video Understanding)

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos

Authors: Wen-hsuan Chu, Lei Ke, Katerina Fragkiadaki

HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model

Authors: Khoa Vo, Thinh Phan, Kashu Yamazaki, Minh Tran, Ngan Le

Data-centric AI

Data Distribution Valuation

Authors: Xinyi Xu, Shuaiqi Wang, Chuan Sheng Foo, Bryan Kian Hsiang Low, Giulia Fanti

Visual Data Diagnosis and Debiasing with Concept Graphs

Authors: Rwiddhi Chakraborty, Yinong O Wang, Jialu Gao, Runkai Zheng, Cheng Zhang, Fernando D De La Torre

Data-centric AI (Data Augmentation)

Turning Indirect Knowledge into Direct Demonstrations for Computer Agents at Scale

Authors: Tianyue Ou, Frank F. Xu, Aman Madaan, Jiarui Liu, Robert Lo, Abishek Sridhar, Sudipta Sengupta, Dan Roth, Graham Neubig, Shuyan Zhou

Data-centric AI (Data-centric AI Methods And Tools)

Deep Learning (Algorithms)

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Authors: Shantanu Jaiswal, Debaditya Roy, Basura Fernando, Cheston Tan

On the Inductive Bias of Stacking Towards Improving Reasoning

Authors: Nikunj Saunshi, Stefani Karp, Shankar Krishnan, Sobhan Miryoosefi, Sashank Jakkam Reddi, Sanjiv Kumar

RGMDT: Return-Gap-Minimizing Decision Tree Extraction in Non-Euclidean Metric Space

Authors: Jingdi Chen, Hanhan Zhou, Yongsheng Mei, Carlee Joe-Wong, Nathaniel Bastian, Tian Lan

Deep Learning (Attention Mechanisms)

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding

Authors: Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang “Atlas” Wang

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

Authors: Xuezhe Ma, Xiaomeng Yang, Wenhan Xiong, Beidi Chen, Lili Yu, Hao Zhang, Jonathan May, Luke Zettlemoyer, Omer Levy, Chunting Zhou

Towards Understanding the Mechanisms of Associative Memory in Transformers

Authors: Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam

Deep Learning (Everything Else)

FLoRA: Federated Fine-Tuning Large Language Models with Heterogeneous Low-Rank Adaptations

Authors: Ziyao Wang, Zheyu Shen, Yexiao He, Guoheng Sun, Hongyi Wang, Lingjuan Lyu, Ang Li

HORSE: Hierarchical Representation for Large-Scale Neural Subset Selection

Authors: Binghui Xie, Yixuan Wang, Yongqiang Chen, Kaiwen Zhou, Yu Li, Wei Meng, James Cheng

Hydra: Bidirectional State Space Models Through Generalized Matrix Mixers

Authors: Sukjun Hwang, Aakash Sunil Lahoti, Ratish Puduppully, Tri Dao, Albert Gu

MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training

Authors: Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, Animashree Anandkumar

Mixture of Nested Experts: Adaptive Processing of Visual Tokens

Authors: Gagan Jain, Nidhi Hegde, Aditya Kusupati, Arsha Nagrani, Shyamal Buch, Prateek Jain, Anurag Arnab, Sujoy Paul

SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning

Authors: Yexiao He, Ziyao Wang, Zheyu Shen, Guoheng Sun, Yucong Dai, Yongkai Wu, Hongyi Wang, Ang Li

Deep Learning (Representation Learning)

Towards Understanding Extrapolation: a Causal Lens

Authors: Lingjing Kong, Guangyi Chen, Petar Stojanov, Haoxuan Li, Eric Xing, Kun Zhang

Who Needs Features? On the Surprising Effectiveness of Attention Transfer for Vision Transformers

Authors: Alex Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

Deep Learning (Robustness)

Achieving Domain-Independent Certified Robustness via Knowledge Continuity

Authors: Alan Sun, Chiyu Ma, Kenneth Ge, Soroush Vosoughi

Predicting the Performance of Foundation Models via Agreement-on-the-Line

Authors: Rahul Saxena, Taeyoun Kim, Aman Mehra, Christina Baek, J. Zico Kolter, Aditi Raghunathan

ProTransformer: Robustify Transformers via Plug-and-Play Paradigm

Authors: Zhichao Hou, Weizhi Gao, Yuchen Shen, Feiyi Wang, Xiaorui Liu

Fairness

Fair Wasserstein Coresets

Authors: Zikai Xiong, Niccolo Dalmasso, Shubham Sharma, Freddy Lecue, Daniele Magazzeni, Vamsi Potluru, Tucker Balch, Manuela Veloso

Mitigating Biases in Blackbox Feature Extractors for Image Classification Tasks

Authors: Abhipsa Basu, Saswat Subhajyoti Mallick, Venkatesh Babu R

On Socially Fair Low-Rank Approximation and Column Subset Selection

Authors: Zhao Song, Ali Vakilian, David Woodruff, Samson Zhou

SureMap: Simultaneous mean estimation for single-task and multi-task disaggregated evaluation

Authors: Misha Khodak, Lester Mackey, Miro Dudik, Alexandra Chouldechova

Generative Models

A Critical Evaluation of AI Feedback for Aligning Large Language Models

Authors: Archit Sharma, Sedrick Scott Keh, Eric Mitchell, Chelsea Finn, Kushal Arora, Thomas Kollar

Data Attribution for Text-to-Image Models by Unlearning Synthesized Images

Authors: Sheng-Yu Wang, Alexei Efros, Aaron Hertzmann, Jun-Yan Zhu, Richard Zhang

Flow Priors for Linear Inverse Problems via Iterative Corrupted Trajectory Matching

Authors: Yasi Zhang, Peiyu Yu, Yaxuan Zhu, Yingshan Chang, Feng Gao, Ying Nian Wu, Oscar Leong

Nearest Neighbor Speculative Decoding for LLM Generation and Attribution

Authors: Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Scott Yih, Victoria Lin

Generative Models (Diffusion Models)

Diffusing Differentiable Representations

Authors: Yash Savani, Marc Finzi, J. Zico Kolter

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Authors: Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, Vincent Sitzmann

Improving the Training of Rectified Flows

Authors: Sangyun Lee, Zinan Lin, Giulia Fanti

Model-based Diffusion for Trajectory Optimization

Authors: Chaoyi Pan, Zeji Yi, Guanya Shi, Guannan Qu

Permutation-Invariant Autoregressive Diffusion for Graph Generation

Authors: Lingxiao Zhao, Xueying Ding, Leman Akoglu

Understanding Hallucinations in Diffusion Models through Mode Interpolation

Authors: Sumukh K Aithal, Pratyush Maini, Zachary Lipton, J. Zico Kolter

Your Diffusion Model is Secretly a Noise Classifier and Benefits from Contrastive Training

Authors: Yunshu Wu, Yingtao Luo, Xianghao Kong, Vagelis Papalexakis, Greg Ver Steeg

Generative Models (In Context Learning)

Can large language models explore in-context?

Authors: Akshay Krishnamurthy, Keegan Harris, Dylan J Foster, Cyril Zhang, Aleksandrs Slivkins

Generative Models (Misc)

Efficient Contextual LLM Cascades through Budget-Constrained Policy Learning

Authors: Xuechen Zhang, Zijian Huang, Ege Onur Taga, Carlee Joe-Wong, Samet Oymak, Jiasi Chen

MixEval: Fast and Dynamic Human Preference Approximation with LLM Benchmark Mixtures

Authors: Jinjie Ni, Fuzhao Xue, Xiang Yue, Yuntian Deng, Mahir Shah, Kabir Jain, Graham Neubig, Yang You

Generative Models (Reasoning)

AutoMix: Automatically Mixing Language Models

Authors: Pranjal Aggarwal, Aman Madaan, Ankit Anand, Srividya Pranavi Potharaju, Swaroop Mishra, Pei Zhou, Aditya Gupta, Dheeraj Rajagopal, Karthik Kappaganthu, Yiming Yang, Shyam Upadhyay, Manaal Faruqui, Mausam

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision

Authors: Zhiqing Sun, Longhui Yu, Yikang Shen, Weiyang Liu, Yiming Yang, Sean Welleck, Chuang Gan

Recursive Introspection: Teaching Foundation Model Agents How to Self-Improve

Authors: Yuxiao Qu, Tianjun Zhang, Naman Garg, Aviral Kumar

Transformers Can Do Arithmetic with the Right Embeddings

Authors: Sean McLeish, Arpit Bansal, Alex Stein, Neel Jain, John Kirchenbauer, Brian Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Jonas Geiping, Avi Schwarzschild, Tom Goldstein

Graph Neural Networks

Even Sparser Graph Transformers

Authors: Hamed Shirzad, Honghao Lin, Balaji Venkatachalam, Ameya Velingker, David Woodruff, Danica J. Sutherland

Human-computer Interaction

Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning

Authors: Zebang Cheng, Zhi-qi Cheng, Jun-yan He, Kai Wang, Yuxiang Lin, Zheng Lian, Xiaojiang Peng, Alexander Hauptmann

Harmonizing Stochasticity and Determinism: Scene-responsive Diverse Human Motion Prediction

Authors: Zhenyu Lou, Qiongjie Cui, Tuo Wang, Zhenbo Song, Luoming Zhang, Cheng Cheng, Haofan Wang, Xu Tang, Huaxia Li, Hong Zhou

Interpretability

Diffusion PID: Interpreting Diffusion via Partial Information Decomposition

Authors: Shaurya Dewan, Rushikesh Zawar, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

Model Lego: Creating Models Like Disassembling and Assembling Building Blocks

Authors: Jiacong Hu, Jing Gao, Jingwen Ye, Yang Gao, Xingen Wang, Zunlei Feng, Mingli Song

Language (Dialogue)

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

Authors: Ruosen Li, Ruochen Li, Barry Wang, Xinya Du

Language (Generation)

Aligning to Thousands of Varying Preferences via System Message Generalization

Authors: Seongyun Lee, Sue Hyun Park, Seungone Kim, Minjoon Seo

Language (Knowledge)

Alignment for Honesty

Authors: Yuqing Yang, Ethan Chern, Xipeng Qiu, Graham Neubig, Pengfei Liu

Learning Theory

Accelerating ERM for data-driven algorithm design using output-sensitive techniques

Authors: Maria-florina Balcan, Christopher Seiler, Dravyansh Sharma

On the Comparison between Multi-modal and Single-modal Contrastive Learning

Authors: Wei Huang, Andi Han, Yongqiang Chen, Yuan Cao, Zhiqiang Xu, Taiji Suzuki

Oracle-Efficient Differentially Private Learning with Public Data

Authors: Adam Block, Mark Bun, Rathin Desai, Abhishek Shetty, Steven Wu

Sample-Efficient Agnostic Boosting

Authors: Udaya Ghai, Karan Singh

Miscellaneous Aspects Of Machine Learning (General Machine Learning Techniques)

Post-Hoc Reversal: Are We Selecting Models Prematurely?

Authors: Rishabh Ranjan, Saurabh Garg, Mrigank Raman, Carlos Guestrin, Zachary Lipton

Searching for Efficient Linear Layers over a Continuous Space of Structured Matrices

Authors: Andres Potapczynski, Shikai Qiu, Marc Finzi, Christopher Ferri, Charlie Chen, Micah Goldblum, C. Bayan Bruss, Christopher De Sa, Andrew Wilson

Miscellaneous Aspects Of Machine Learning (Supervised Learning)

Multimodal Models

Continual Audio-Visual Sound Separation

Authors: Weiguo Pian, Yiyang Nan, Shijian Deng, Shentong Mo, Yunhui Guo, Yapeng Tian

Do CLIP Models Always Generalize Better than ImageNet Models?

Authors: Qizhou Wang, Yong Lin, Yongqiang Chen, Ludwig Schmidt, Bo Han, Tong Zhang

Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

Authors: Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie

FlexCap: Describe Anything in Images in Controllable Detail

Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan Tompson, Andrew Zisserman, Yusuf Aytar

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Authors: Brandon Huang, Chancharik Mitra, Leonid Karlinsky, Assaf Arbelle, Trevor Darrell, Roei Herzig

Neuroscience, Cognitive Science

Divergences between Language Models and Human Brains

Authors: Yuchen Zhou, Emmy Liu, Graham Neubig, Michael Tarr, Leila Wehbe

MiSO: Optimizing brain stimulation to create neural activity states

Authors: Yuki Minai, Joana Soldado-magraner, Matthew Smith, Byron M Yu

Online Learning

Communication Bounds for the Distributed Experts Problem

Authors: Zhihao Jia, Qi Pang, Trung Tran, David Woodruff, Zhihao Zhang, Wenting Zheng

Global Rewards in Restless Multi-Armed Bandits

Authors: Naveen Raman, Zheyuan Shi, Fei Fang

Optimal Top-Two Method for Best Arm Identification and Fluid Analysis

Authors: Agniv Bandyopadhyay, Sandeep Juneja, Shubhada Agrawal

Regret Minimization in Stackelberg Games with Side Information

Authors: Keegan Harris, Steven Wu, Maria-florina Balcan

Optimization

Binary Search Tree with Distributional Predictions

Authors: Michael Dinitz, Sungjin Im, Thomas Lavastida, Ben Moseley, Aidin Niaparast, Sergei Vassilvitskii

SequentialAttention++ for Block Sparsification: Differentiable Pruning Meets Combinatorial Optimization

Authors: Taisuke Yasuda, Kyriakos Axiotis, Gang Fu, Mohammadhossein Bateni, Vahab Mirrokni

Optimization (Convex)

John Ellipsoids via Lazy Updates

Authors: David Woodruff, Taisuke Yasuda

Optimization (Large Scale, Parallel And Distributed)

Efficient Federated Learning against Heterogeneous and Non-stationary Client Unavailability

Authors: Ming Xiang, Stratis Ioannidis, Edmund Yeh, Carlee Joe-wong, Lili Su

LSH-MoE: Communication-efficient MoE Training via Locality-Sensitive Hashing

Authors: Xiaonan Nie, Liu Qibin, Fangcheng Fu, Shenhan Zhu, Xupeng Miao, Xiaoyang Li, Yang Zhang, Shouda Liu, Bin Cui

Optimization (Learning For Optimization)

Warm-starting Push-Relabel

Authors: Sami Davies, Sergei Vassilvitskii, Yuyan Wang

Other

Active, anytime-valid risk controlling prediction sets

Authors: Ziyu Xu, Nikos Karampatziakis, Paul Mineiro

Architect: Generating Vivid and Interactive 3D Scenes with Hierarchical 2D Inpainting

Authors: Yian Wang, Xiaowen Qiu, Jiageng Liu, Zhehuan Chen, Jiting Cai, Yufei Wang, Tsun-hsuan Johnson Wang, Zhou Xian, Chuang Gan

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization

Authors: Kai Hu, Weichen Yu, Tianjun Yao, Xiang Li, Wenhe Liu, Lijun Yu, Yining Li, Kai Chen, Zhiqiang Shen, Matt Fredrikson

Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning

Authors: Tong Yang, Shicong Cen, Yuting Wei, Yuxin Chen, Yuejie Chi

GL-NeRF: Gauss-Laguerre Quadrature Enables Training-Free NeRF Acceleration

Authors: Silong Yong, Yaqi Xie, Simon Stepputtis, Katia Sycara

Hierarchical and Density-based Causal Clustering

Authors: Kwangho Kim, Jisu Kim, Larry Wasserman, Edward Kennedy

Imprecise Label Learning: A Unified Framework for Learning with Various Imprecise Label Configurations

Authors: Hao Chen, Ankit Shah, Jindong Wang, Ran Tao, Yidong Wang, Xiang Li, Xing Xie, Masashi Sugiyama, Rita Singh, Bhiksha Raj

Invisible Image Watermarks Are Provably Removable Using Generative AI

Authors: Xuandong Zhao, Kexun Zhang, Zihao Su, Saastha Vasan, Ilya Grishchenko, Christopher Kruegel, Giovanni Vigna, Yu-xiang Wang, Lei Li

MAmmoTH2: Scaling Instructions from the Web

Authors: Xiang Yue, Tianyu Zheng, Ge Zhang, Wenhu Chen

MergeMinds: Boosting Multilingual Reasoning with the Built-in Capabilities of LLMs

Authors: Zixian Huang, Wenhao Zhu, Gong Cheng, Lei Li, Fei Yuan

Neural Collapse Inspired Feature Alignment for Out-of-Distribution Generalization

Authors: Zhikang Chen, Min Zhang, Sen Cui, Haoxuan Li, Gang Niu, Mingming Gong, Changshui Zhang, Kun Zhang

On the Parameter Identifiability of Partially Observed Linear Causal Models

Authors: Xinshuai Dong, Ignavier Ng, Biwei Huang, Yuewen Sun, Songyao Jin, Roberto Legaspi, Peter Spirtes, Kun Zhang

One-Step Diffusion Distillation through Score Implicit Matching

Authors: Weijian Luo, Zemin Huang, Zhengyang Geng, J. Zico Kolter, Guo-jun Qi

Private and Personalized Frequency Estimation in a Federated Setting

Authors: Amrith Setlur, Vitaly Feldman, Kunal Talwar

S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Authors: Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen

SIRIUS: Contextual Sparsity with Correction for Efficient LLMs

Authors: Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen

Sequential Harmful Shift Detection Without Labels

Authors: Salim I. Amoukou, Tom Bewley, Saumitra Mishra, Freddy Lecue, Daniele Magazzeni, Manuela Veloso

SpecExec: Massively Parallel Speculative Decoding For Interactive LLM Inference on Consumer Devices

Authors: Ruslan Svirschevski, Avner May, Zhuoming Chen, Beidi Chen, Zhihao Jia, Max Ryabinin

Tactile DreamFusion: Exploiting Tactile Sensing for 3D Generation

Authors: Ruihan Gao, Kangle Deng, Gengshan Yang, Wenzhen Yuan, Jun-yan Zhu

Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models

Authors: Aviv Bick, Kevin Li, Eric Xing, J. Zico Kolter, Albert Gu

When and How Does Synthetic Data Improve Reasoning Capabilities of Language Models?

Authors: Amrith Setlur, Saurabh Garg, Naman Garg, Xinyang Geng, Virginia Smith, Aviral Kumar

Privacy

LLM Dataset Inference: Detect Datasets, not Strings

Authors: Pratyush Maini, Hengrui Jia, Nicolas Papernot, Adam Dziedzic

No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Authors: Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith

On the Benefits of Public Representations for Private Transfer Learning under Distribution Shift

Authors: Pratiksha Thaker, Amrith Setlur, Steven Wu, Virginia Smith

Reconstruction Attacks on Machine Unlearning: Simple Models are Vulnerable

Authors: Martin Bertran, Shuai Tang, Michael Kearns, Jamie Morgenstern, Aaron Roth, Steven Wu

Reinforcement Learning (Batch Offline)

Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning

Authors: Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung

BECAUSE: Bilinear Causal Representation for Generalizable Offline Model-based Reinforcement Learning

Authors: Haohong Lin, Wenhao Ding, Jian Chen, Laixi Shi, Jiacheng Zhu, Bo Li, Ding Zhao

OASIS: Conditional Distribution Shaping for Offline Safe Reinforcement Learning

Authors: Yihang Yao, Zhepeng Cen, Wenhao Ding, Haohong Lin, Shiqi Liu, Tingnan Zhang, Wenhao Yu, Ding Zhao

Reinforcement Learning (Everything Else)

Incremental Learning of Retrievable Skills For Efficient Continual Task Adaptation

Authors: Daehee Lee, Minjong Yoo, Woo Kyung Kim, Wonje Choi, Honguk Woo

REBEL: Reinforcement Learning via Regressing Relative Rewards

Authors: Zhaolin Gao, Jonathan Chang, Wenhao Zhan, Owen Oertell, Gokul Swamy, Kianté Brantley, Thorsten Joachims, Drew Bagnell, Jason Lee, Wen Sun

Understanding Preference Learning Through the Lens of Coverage

Authors: Yuda Song, Gokul Swamy, Aarti Singh, J. Bagnell, Wen Sun

Reinforcement Learning (Multi-agent)

Language Grounded Multi-Agent Communication for Ad-hoc Teamwork

Authors: Huao Li, Hossein Nourkhiz Mahjoub, Behdad Chalaki, Vaishnav Tadiparthi, Kwonjoon Lee, Ehsan Moradi Pari, Charles Lewis, Katia Sycara

Multi-Agent Imitation Learning: Value is Easy, Regret is Hard

Authors: Jingwu Tang, Gokul Swamy, Fei Fang, Steven Wu

Reinforcement Learning (Planning)

Identifying Latent State-Transition Processes for Individualized Reinforcement Learning

Authors: Yuewen Sun, Biwei Huang, Yu Yao, Donghuo Zeng, Xinshuai Dong, Songyao Jin, Boyang Sun, Roberto Legaspi, Kazushi Ikeda, Peter Spirtes, Kun Zhang

Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference

Authors: Benjamin Eysenbach, Vivek Myers, Ruslan Salakhutdinov, Sergey Levine

Robotics

BehaviorGPT: Smart Agent Simulation for Autonomous Driving with Next-Patch Prediction

Authors: Zikang Zhou, Hu Haibo, Xinhong Chen, Jianping Wang, Nan Guan, Kui Wu, Yung-hui Li, Yu-kai Huang, Chun Jason Xue

Simulated Humanoid Grasping on Diverse Objects

Authors: Zhengyi Luo, Jinkun Cao, Sammy Christen, Alexander Winkler, Kris Kitani, Weipeng Xu

Theory (Everything Else)

Analytically Computing Partial Information Decomposition

Authors: Chaitanya Goswami, Amanda Merkley

Theory (Game Theory)

Aggregating Quantitative Relative Judgments: From Social Choice to Ranking Prediction

Authors: Yixuan Xu, Hanrui Zhang, Yu Cheng, Vincent Conitzer

Bias Detection via Signaling

Authors: Yiling Chen, Tao Lin, Ariel Procaccia, Aaditya Ramdas, Itai Shapira

Efficient $\Phi$-Regret Minimization with Low-Degree Swap Deviations in Extensive-Form Games

Authors: Brian Zhang, Ioannis Anagnostides, Gabriele Farina, Tuomas Sandholm

The Secretary Problem with Predicted Additive Gap

Authors: Alexander Braun, Sherry Sarkar

Theory (Reinforcement Learning And Planning)

Time Series

Con4m: Context-aware Consistency Learning Framework for Segmented Time Series Classification

Authors: Junru Chen, Tianyu Cao, Jing Xu, Jiahe Li, Zhilong Chen, Tao Xiao, Yang Yang

Trustworthy Machine Learning

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

Authors: Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin

Improving Alignment and Robustness with Short Circuiting

Authors: Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, J. Zico Kolter, Matt Fredrikson, Dan Hendrycks

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

Authors: Weifeng Liu, Tianyi She, Jiawei Liu, Run Wang, Dongyu Yao, 子游 梁, Boheng Li

Rethinking LLM Memorization through the Lens of Adversarial Compression

Authors: Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary Lipton, J. Zico Kolter

Test-Time Adaptation Induces Stronger Accuracy and Agreement-on-the-Line

Authors: Eungyeup Kim, Mingjie Sun, Christina Baek, Aditi Raghunathan, J. Zico Kolter

Towards Calibrated Robust Fine-Tuning of Vision-Language Models

Authors: Changdae Oh, Hyesu Lim, Mijoo Kim, Dongyoon Han, Sangdoo Yun, Jaegul Choo, Alexander Hauptmann, Zhi-qi Cheng, Kyungwoo Song

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Nouha Dziri, Yejin Choi

Siemens Healthineers Adopts MONAI Deploy for Medical Imaging AI

3.6 billion. That’s about how many medical imaging tests are performed annually worldwide to diagnose, monitor and treat various conditions.

Speeding up the processing and evaluation of all these X-rays, CT scans, MRIs and ultrasounds is essential to helping doctors manage their workloads and to improving health outcomes.

That’s why NVIDIA introduced MONAI, which serves as an open-source research and development platform for AI applications used in medical imaging and beyond. MONAI unites doctors with data scientists to unlock the power of medical data to build deep learning models and deployable applications for medical AI workflows.

This week at the annual meeting of RSNA, the Radiological Society of North America, NVIDIA announced that Siemens Healthineers has adopted MONAI Deploy, a module within MONAI that bridges the gap from research to clinical production, to boost the speed and efficiency of integrating AI workflows for medical imaging into clinical deployments.

With over 15,000 installations in medical devices around the world, the Siemens Healthineers Syngo Carbon and syngo.via enterprise imaging platforms help clinicians better read and extract insights from medical images from many sources.

Developers typically use a variety of frameworks when building AI applications. This makes it a challenge to deploy their applications into clinical environments.

With a few lines of code, MONAI Deploy builds AI applications that can run anywhere. It is a tool for developing, packaging, testing, deploying and running medical AI applications in clinical production. Using it streamlines the process of developing and integrating medical imaging AI applications into clinical workflows.

MONAI Deploy on the Siemens Healthineers platform has significantly accelerated the AI integration process, letting users port trained AI models into real-world clinical settings with just a few clicks, compared with what used to take months. This helps researchers, entrepreneurs and startups get their applications into the hands of radiologists more quickly.

“By accelerating AI model deployment, we empower healthcare institutions to harness and benefit from the latest advancements in AI-based medical imaging faster than ever,” said Axel Heitland, head of digital technologies and research at Siemens Healthineers. “With MONAI Deploy, researchers can quickly tailor AI models and transition innovations from the lab to clinical practice, providing thousands of clinical researchers worldwide access to AI-driven advancements directly on their syngo.via and Syngo Carbon imaging platforms.”

Enhanced with MONAI-developed apps, these platforms can significantly streamline AI integration. These apps can be easily provided and used on the Siemens Healthineers Digital Marketplace, where users can browse, select and seamlessly integrate them into their clinical workflows.

MONAI Ecosystem Boosts Innovation and Adoption

Now marking its five-year anniversary, MONAI has seen over 3.5 million downloads, 220 contributors from around the world, acknowledgements in over 3,000 publications, 17 MICCAI challenge wins and use in numerous clinical products.

The latest release of MONAI — v1.4 — includes updates that give researchers and clinicians even more opportunities to take advantage of the innovations of MONAI and contribute to Siemens Healthineers Syngo Carbon, syngo.via and the Siemens Healthineers Digital Marketplace.

The updates in MONAI v1.4 and related NVIDIA products include new foundation models for medical imaging, which can be customized in MONAI and deployed as NVIDIA NIM microservices. The following models are now generally available as NIM microservices:

  • MAISI (Medical AI for Synthetic Imaging) is a latent diffusion generative AI foundation model that can simulate high-resolution, full-format 3D CT images and their anatomic segmentations.
  • VISTA-3D is a foundation model for CT image segmentation that offers accurate out-of-the-box performance covering over 120 major organ classes. It also offers effective adaptation and zero-shot capabilities to learn to segment novel structures.

Alongside MONAI 1.4’s major features, the new MONAI Multi-Modal Model, or M3, is now accessible through MONAI’s VLM GitHub repo. M3 is a framework that extends any multimodal LLM with medical AI experts such as trained AI models from MONAI’s Model Zoo. The power of this new framework is demonstrated by the VILA-M3 foundation model that’s now available on Hugging Face, offering state-of-the-art radiological image copilot performance.

MONAI Bridges Hospitals, Healthcare Startups and Research Institutions

Leading healthcare institutions, academic medical centers, startups and software providers around the world are adopting and advancing MONAI, including:

  • German Cancer Research Center leads MONAI’s benchmark and metrics working group, which provides metrics for measuring AI performance and guidelines for how and when to use those metrics.
  • Nadeem Lab from Memorial Sloan Kettering Cancer Center (MSK) pioneered the cloud-based deployment of multiple AI-assisted annotation pipelines and inference modules for pathology data using MONAI.
  • University of Colorado School of Medicine faculty developed MONAI-based ophthalmology tools for detecting retinal diseases using a variety of imaging modalities. The university also leads some of the original federated learning developments and clinical demonstrations using MONAI.
  • MathWorks has integrated MONAI Label with its Medical Imaging Toolbox, bringing medical imaging AI and AI-assisted annotation capabilities to thousands of MATLAB users engaged in medical and biomedical applications throughout academia and industry.
  • GSK is exploring MONAI foundation models such as VISTA-3D and VISTA-2D for image segmentation.
  • Flywheel offers a platform that includes MONAI for streamlining imaging data management, automating research workflows, and enabling AI development and analysis, and that scales to the needs of research institutions and life sciences organizations.
  • Alara Imaging published its work on integrating MONAI foundation models such as VISTA-3D with LLMs such as Llama 3 at the 2024 Society for Imaging Informatics in Medicine conference.
  • RadImageNet is exploring the use of MONAI’s M3 framework to develop cutting-edge vision language models that utilize expert image AI models from MONAI to generate high-quality radiological reports.
  • Kitware is providing professional software development services surrounding MONAI, helping integrate MONAI into custom workflows for device manufacturers as well as regulatory-approved products.

Researchers and companies are also using MONAI on cloud service providers to run and deploy scalable AI applications. Cloud platforms providing access to MONAI include AWS HealthImaging; Google Cloud; Precision Imaging Network, part of Microsoft Cloud for Healthcare; and Oracle Cloud Infrastructure.

See disclosure statements about syngo.via, Syngo Carbon and products in the Digital Marketplace.

HadaCore: Tensor Core Accelerated Hadamard Transform Kernel

IBM: Krish Agarwal, Rishi Astra, Adnan Hoque, Mudhakar Srivatsa, Raghu Ganti
Meta: Less Wright, Sijia Chen

Quantization is a method for improving model inference speeds by compressing model weights and performing (faster) computation in lower precision data types. However, quantization can result in accuracy loss due to the presence of outliers. Recent works like QuaRot, SpinQuant, and FlashAttention-3 introduce methods to increase the numerical accuracy of INT4, INT8 and FP8 quantization in LLMs. These methods rely on Hadamard Transforms. In this blog, we present HadaCore, a Hadamard Transform CUDA kernel that achieves state-of-the-art performance on NVIDIA A100 and H100 GPUs. Our kernel achieves speedups of 1.1–1.4x on A100 and 1.0–1.3x on H100, with peak gains of 3.5x and 3.6x, respectively, over Dao AI Lab’s Fast Hadamard Transform Kernel. We leverage a hardware-aware work decomposition that benefits from Tensor Core acceleration while maintaining quantization error reduction.

Figure 1: Speedup of HadaCore vs Dao AI Hadamard CUDA kernel. A peak gain of 3.46x on the A100 is achieved using 128 rotation by 8.4M elements.

The HadaCore Kernel is publicly available.

Background

QuaRot and SpinQuant both propose methods to increase the numerical accuracy of INT4 and INT8 quantization in LLMs. Both methods rotate model activations, because a rotation is statistically likely to reduce the magnitude of outliers: it “distributes” extreme values among other (less extreme) dimensions. A rotation is also easy to invert, by applying the inverse of the rotation matrix. These methods can also improve FP8 inference accuracy, such as in FlashAttention-3.
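As a rough illustration of this effect (not code from QuaRot, SpinQuant, or HadaCore), the sketch below rotates a small vector containing one outlier with an orthonormal Hadamard matrix. The helper names `hadamard` and `rotate`, the Sylvester construction, and the sample values are all assumptions made for the example.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def rotate(x, h):
    """Multiply x by h / sqrt(n), which is an orthonormal (rotation-like) matrix."""
    scale = 1.0 / math.sqrt(len(x))
    return [scale * sum(hij * xj for hij, xj in zip(row, x)) for row in h]

H = hadamard(8)
x = [0.1, -0.2, 0.05, 40.0, 0.3, -0.1, 0.2, 0.0]  # hypothetical activations, one outlier

y = rotate(x, H)       # the outlier's energy is spread over all 8 dimensions
x_back = rotate(y, H)  # the Sylvester matrix is symmetric, so H/sqrt(n) is its own inverse

max_before = max(abs(v) for v in x)
max_after = max(abs(v) for v in y)
print(f"largest magnitude before: {max_before:.2f}, after: {max_after:.2f}")
```

The largest magnitude drops by roughly a factor of sqrt(8) while the vector's total energy is unchanged, which is why a quantizer with a fixed number of levels wastes less range on a single extreme value.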

Figure 2. Transformer block showing online (red) and offline rotations (blue) in QuaRot

Applying these rotation matrices introduces model runtime overhead due to the online operations shown in Figure 2. These rotations can be applied through matrix multiplication, but the added overhead would diminish the benefits from quantization. Therefore, QuaRot and SpinQuant opt to use Walsh-Hadamard matrices, a special type of rotation matrix that can be applied faster than matrix multiplication using the Fast Walsh-Hadamard Transform algorithm. HadaCore is an optimized implementation of this algorithm for NVIDIA GPUs that support Tensor Cores.
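To make the cost difference concrete, here is a minimal pure-Python sketch of the textbook Fast Walsh-Hadamard Transform, checked against an explicit matrix-vector product. It is a reference illustration only, not the HadaCore kernel; the function names are hypothetical and the transform is left unnormalized.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def matvec(h, x):
    """Explicit matrix-vector product: O(n^2) multiply-adds."""
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform: O(n log n) additions and subtractions."""
    a = list(x)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0
    half = 1
    while half < n:
        # Butterfly stage: combine elements that are `half` apart within each block.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                u, v = a[i], a[i + half]
                a[i], a[i + half] = u + v, u - v
        half *= 2
    return a

x = [3, -1, 4, 1, -5, 9, 2, -6]
fast = fwht(x)
slow = matvec(hadamard(len(x)), x)
print(fast == slow)
```

For a vector of length n, the butterfly version performs n log2(n) additions instead of n^2 multiply-adds, and it needs no stored matrix.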

Tensor Core Accelerated Hadamard Transform

HadaCore leverages NVIDIA Tensor Cores, which are specialized compute units on NVIDIA GPUs optimized for matrix multiplication. To achieve this, our kernel performs a hardware-aware work decomposition of the Fast Walsh-Hadamard algorithm. This work decomposition ensures that we can utilize the MMA PTX instructions that execute on the Tensor Cores. HadaCore applies a 16×16 Hadamard transform to chunks of the input data. The computation can then be offloaded to the FP16 Tensor Cores using the mma.m16n8k16 instruction. The warp-level parallelism for HadaCore is shown below.

Figure 3: HadaCore Parallelization, 1×256 vectors (rows) being rotated by a size 256 Hadamard.

We process fragments of 256 elements in parallel using warp-level Tensor Core operations to achieve up to a 256-size Hadamard transform. For larger sizes, we shuffle data between warps and repeat.
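One way to see why 16×16 tiles are enough for a size-256 transform is the Kronecker identity H_256 = H_16 ⊗ H_16, which holds for Sylvester-ordered Hadamard matrices. The sketch below checks that identity numerically in plain Python. It is an assumption-laden illustration of the decomposition idea, not the kernel's actual data layout or instruction sequence, and all names are hypothetical.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def matmul(a, b):
    """Plain dense matrix product, standing in for a Tensor Core matrix multiply."""
    b_cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in b_cols] for row in a]

def matvec(h, x):
    return [sum(hij * xj for hij, xj in zip(row, x)) for row in h]

H16 = hadamard(16)

# A 256-element fragment, viewed as a 16 x 16 tile in row-major order.
x = [(7 * i) % 13 - 6 for i in range(256)]
X = [x[16 * r:16 * r + 16] for r in range(16)]

# Two 16 x 16 matrix products (H16 is symmetric, so no transpose is needed).
Y = matmul(matmul(H16, X), H16)
y_tiles = [v for row in Y for v in row]

# Reference: one 256 x 256 matrix-vector product.
y_direct = matvec(hadamard(256), x)
print(y_tiles == y_direct)
```

The two small products do the same work as the full 256×256 transform, but in a shape that maps onto fixed-size matrix-multiply hardware.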

Microbenchmarks

We benchmark HadaCore against the Dao AI Lab Hadamard Kernel on both NVIDIA H100 and A100 GPUs across varying Hadamard and input tensor sizes.

Figure 4: HadaCore Kernel Speedup on NVIDIA A100 over Dao AI Lab Fast Hadamard Kernel

Color coded Speedup Table for NVIDIA A100, Green = Speedup over Baseline

Figure 5: HadaCore Kernel Speedup on NVIDIA H100 over Dao AI Lab Fast Hadamard Kernel

Color coded Speedup Table for NVIDIA H100, Green = Speedup over Baseline

We showcase our speedup as the input tensor size (labeled element count in our charts) increases. Element count is the number of elements in the target matrix we are rotating. For example, in multi-head attention:

The queries (Q), keys (K) and values (V) tensors are 4D tensors of size:

(batch_size, seq_len, n_heads, head_dim)

A Hadamard matrix of size head_dim is applied to these activation tensors, so we refer to this as using a Hadamard size of head_dim with an element count of:

batch_size*seq_len*n_heads*head_dim.

Common element counts for query rotations in an attention block:

Model / Tokens | Prefill | Decoding
Llama-2 70b | 33,554,432 elements, 128 Hadamard size (1 batch * 64 heads * 4096 tokens * 128 dimensional embeddings per head per token) | 8,192 elements, 128 Hadamard size (1 batch * 64 heads * 1 token * 128 dimensional embeddings per head per token)
Llama-3 8b | 33,554,432 elements, 128 Hadamard size (1 batch * 32 heads * 8192 tokens * 128 dimensional embeddings per head per token) | 4,096 elements, 128 Hadamard size (1 batch * 32 heads * 1 token * 128 dimensional embeddings per head per token)
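The element counts above follow directly from the tensor shape. A small sketch of the arithmetic, using a hypothetical helper name and the head counts and token counts listed in the table:

```python
def element_count(batch_size, seq_len, n_heads, head_dim):
    """Number of elements in a (batch_size, seq_len, n_heads, head_dim) activation tensor."""
    return batch_size * seq_len * n_heads * head_dim

# Query rotations with a Hadamard size of head_dim = 128.
llama2_70b_prefill = element_count(batch_size=1, seq_len=4096, n_heads=64, head_dim=128)
llama2_70b_decode = element_count(batch_size=1, seq_len=1, n_heads=64, head_dim=128)
llama3_8b_prefill = element_count(batch_size=1, seq_len=8192, n_heads=32, head_dim=128)
llama3_8b_decode = element_count(batch_size=1, seq_len=1, n_heads=32, head_dim=128)

for name, count in [
    ("Llama-2 70b prefill", llama2_70b_prefill),
    ("Llama-2 70b decoding", llama2_70b_decode),
    ("Llama-3 8b prefill", llama3_8b_prefill),
    ("Llama-3 8b decoding", llama3_8b_decode),
]:
    print(f"{name}: {count:,} elements")
```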

HadaCore achieves 1.1–1.4x speedup on A100 and 1.0–1.3x speedup on H100 over Dao AI Lab’s Fast Hadamard kernel, with peak gains of 3.5x and 3.6x, respectively. For smaller sizes on H100, HadaCore’s gain decreases. For future work, we plan to incorporate Hopper-specific features like TMA and WGMMA for improved H100 performance.

MMLU Benchmarks

We evaluated MMLU scores on a Llama 3.1-8B inference workload where the FlashAttention computation was performed in FP8. Newer generation NVIDIA Hopper GPUs come equipped with FP8 Tensor Cores that deliver substantial compute gain over FP16.

Our results show the benefit of using HadaCore for accuracy preservation when combined with optimizations such as FP8 FlashAttention.

Format | Method | Llama3.1-8B Avg. 5-Shot MMLU Accuracy
Q, K, V: FP16; FlashAttention: FP16 | N/A | 65.38
Q, K, V: FP16; FlashAttention: FP8 | No Hadamard | 64.40
Q, K, V: FP8; FlashAttention: FP8 | HadaCore | 65.09
Q, K, V: FP8; FlashAttention: FP8 | Dao AI Fast Hadamard Kernel | 65.45

Table 1: MMLU scores for Llama3.1 8B with FP16 baseline and FP8 attention using Hadamard transforms, comparing an implementation with explicit Hadamard matrix multiplications vs. HadaCore (higher is better)

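From the above MMLU scores, we note that for Llama3.1-8B inference with FP8 attention, HadaCore reduces the quantization error introduced by computing attention in lower precision: accuracy rises from 64.40 with no Hadamard transform to 65.09 with HadaCore, close to the FP16 baseline of 65.38.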

Conclusion

We showcased the speedups achieved by moving the Fast Walsh-Hadamard algorithm into a CUDA kernel that leverages Tensor Core acceleration, reaching peak speedups of 3.5x and 3.6x over the Dao AI Fast-Hadamard kernel on NVIDIA A100 and H100, respectively.

Further, we showed on the MMLU benchmark that rotating with HadaCore maintains similar quantization error reduction to the Fast-Hadamard kernel, while providing computational acceleration.

Future Work

We plan to implement a Triton version of our kernel and experiment with more advanced techniques such as kernel fusion to support fused Hadamard transform and quantization. Further, we plan to extend our kernel to support BF16 Tensor Core compute.

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling

Diffusion models have emerged as a powerful tool for generating high-quality images from textual descriptions. Despite their successes, these models often exhibit limited diversity in the sampled images, particularly when sampling with a high classifier-free guidance weight. To address this issue, we present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors. Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables, serving as abstract and intermediary representations for… (Apple Machine Learning Research)

Cohere Rerank 3.5 is now available in Amazon Bedrock through Rerank API

We are excited to announce the availability of Cohere’s advanced reranking model Rerank 3.5 through our new Rerank API in Amazon Bedrock. This powerful reranking model enables AWS customers to significantly improve their search relevance and content ranking capabilities. This model is also available for Amazon Bedrock Knowledge Base users. By incorporating Cohere’s Rerank 3.5 in Amazon Bedrock, we’re making enterprise-grade search technology more accessible and empowering organizations to enhance their information retrieval systems with minimal infrastructure management.

In this post, we discuss the need for reranking, the capabilities of Cohere’s Rerank 3.5, and how to get started using it on Amazon Bedrock.

Reranking for advanced retrieval

Reranking is a vital enhancement to Retrieval Augmented Generation (RAG) systems that adds a sophisticated second layer of analysis to improve search result relevance beyond what traditional vector search can achieve. Unlike embedding models that rely on pre-computed static vectors, rerankers perform dynamic query-time analysis of document relevance, enabling more nuanced and contextual matching. This capability allows RAG systems to effectively balance between broad document retrieval and precise context selection, ultimately leading to more accurate and reliable outputs from language models while reducing the likelihood of hallucinations.
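The two-stage pattern described above can be sketched in a few lines of Python. Both scoring functions here are deliberately simple, hypothetical stand-ins: the first imitates a cheap recall-oriented retriever, and the second imitates a reranker that examines the query and each candidate together at query time. Neither reflects how Cohere Rerank 3.5 scores documents.

```python
def first_stage_score(query, doc):
    """Cheap recall-oriented score: how often the query's words occur in the document."""
    doc_words = doc.lower().split()
    return sum(doc_words.count(word) for word in query.lower().split())

def rerank_score(query, doc):
    """Toy stand-in for a reranker: query-word coverage plus an exact-phrase bonus."""
    query_words = query.lower().split()
    doc_lower = doc.lower()
    doc_words = set(doc_lower.split())
    coverage = sum(word in doc_words for word in query_words) / len(query_words)
    phrase_bonus = 1.0 if query.lower() in doc_lower else 0.0
    return coverage + phrase_bonus

def retrieve_then_rerank(query, corpus, recall_k, top_n):
    # Stage 1: broad retrieval keeps the recall_k best candidates.
    candidates = sorted(corpus, key=lambda d: first_stage_score(query, d), reverse=True)[:recall_k]
    # Stage 2: the reranker reorders only those candidates and keeps top_n.
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]

corpus = [
    "password rules: a password must differ from your last password",
    "to reset your password open the account settings page",
    "how to reset a router to factory settings",
    "billing and invoices overview",
]

results = retrieve_then_rerank("reset your password", corpus, recall_k=3, top_n=2)
for rank, doc in enumerate(results, start=1):
    print(rank, doc)
```

In this toy corpus the first stage ranks the password-rules document highest because it repeats a query word, and the second stage moves the document that actually answers the query to the top.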

Existing search systems significantly benefit from reranking technology by providing more contextually relevant results that directly impact user satisfaction and business outcomes. Unlike traditional keyword matching or basic vector search, reranking performs an intelligent second-pass analysis that considers multiple factors, including semantic meaning, user intent, and business rules to optimize search result ordering. In ecommerce specifically, reranking helps surface the most relevant products by understanding nuanced relationships between search queries and product attributes, while also incorporating crucial business metrics like conversion rates and inventory levels. This advanced relevance optimization leads to improved product discovery, higher conversion rates, and enhanced customer satisfaction across digital commerce platforms, making reranking an essential component for any modern enterprise search infrastructure.

Introducing Cohere Rerank 3.5

Cohere’s Rerank 3.5 is designed to enhance search and RAG systems. This intelligent cross-encoding model takes a query and a list of potentially relevant documents as input, then returns the documents sorted by semantic similarity to the query. Cohere Rerank 3.5 excels in understanding complex information requiring reasoning and is able to understand the meaning behind enterprise data and user questions. Its ability to comprehend and analyze enterprise data and user questions across over 100 languages including Arabic, Chinese, English, French, German, Hindi, Japanese, Korean, Portuguese, Russian, and Spanish, makes it particularly valuable for global organizations in sectors such as finance, healthcare, hospitality, energy, government, and manufacturing.

One of the key advantages of Cohere Rerank 3.5 is its ease of implementation. Through a single Rerank API call in Amazon Bedrock, you can integrate Rerank into existing systems at scale, whether keyword-based or semantic. Reranking strictly improves first-stage retrievals on standard text retrieval benchmarks.

Cohere Rerank 3.5 is state of the art in the financial domain, as illustrated in the following figure.

Cohere Rerank 3.5 is also state of the art in the ecommerce domain, as illustrated in the following figure. Cohere’s ecommerce benchmarks revolve around retrieval on various products, including fashion, electronics, food, and more.

Products were structured as strings in a key-value pair format such as the following:

“Title”: “Title”
“Description”: “Long-form description”
“Type”: <Some categorical data>
etc.

Cohere Rerank 3.5 also excels in hospitality, as shown in the following figure. Hospitality benchmarks revolve around retrieval on hospitality experiences and lodging options.

Documents were structured as strings in a key-value pair format such as the following:

“Listing Title”: “Rental unit in Toronto”
“Location”: “171 John Street, Toronto, Ontario, Canada”
“Description”: “Escape to our serene villa with stunning downtown views....”
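As a minimal sketch of how a record might be flattened into such a string before it is sent to a reranker, the following helper writes one key-value pair per line. The function name `to_key_value_string` and the exact quoting and line layout are assumptions for illustration; the benchmark format is only described by the examples above.

```python
def to_key_value_string(record):
    """Render a dict as one '"key": "value"' pair per line (assumed layout)."""
    return "\n".join(f'"{key}": "{value}"' for key, value in record.items())


# Field values taken from the hospitality example above.
listing = {
    "Listing Title": "Rental unit in Toronto",
    "Location": "171 John Street, Toronto, Ontario, Canada",
}
document_text = to_key_value_string(listing)
print(document_text)
```

Keeping the field names in the text gives the reranker the same structural cues that the benchmark documents carry.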

We see noticeable gains in project management performance across all types of issue tracking tasks, as illustrated in the following figure.

Cohere’s project management benchmarks span a variety of retrieval tasks, such as:

  • Search through engineering tickets from various project management and issue tracking software tools
  • Search through GitHub issues on popular open source repos

Get started with Cohere Rerank 3.5

To start using Cohere Rerank 3.5 with the Rerank API and Amazon Bedrock Knowledge Bases, navigate to the Amazon Bedrock console and choose Model Access in the left-hand pane. Choose Modify Access, select Cohere Rerank 3.5, choose Next, and then choose Submit.

Get Started with Amazon Bedrock Rerank API

The Cohere Rerank 3.5 model, powered by the Amazon Bedrock Rerank API, allows you to rerank input documents directly based on their semantic relevance to a user query – without requiring a pre-configured knowledge base. The flexibility makes it a powerful tool for various use cases.

To begin, set up your environment by importing the necessary libraries and initializing Boto3 clients:

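import boto3
import json

# Use the AWS Region configured for the default session.
region = boto3.Session().region_name

# The Rerank API is exposed through the Amazon Bedrock agent runtime client.
bedrock_agent_runtime = boto3.client('bedrock-agent-runtime', region_name=region)

# Build the foundation model ARN for Cohere Rerank 3.5.
modelId = "cohere.rerank-v3-5:0"
model_package_arn = f"arn:aws:bedrock:{region}::foundation-model/{modelId}"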

Next, define a main function that reorders a list of text documents by computing relevance scores based on the user query:

def rerank_text(text_query, text_sources, num_results, model_package_arn):
    response = bedrock_agent_runtime.rerank(
        queries=[
            {
                "type": "TEXT",
                "textQuery": {
                    "text": text_query
                }
            }
        ],
        sources=text_sources,
        rerankingConfiguration={
            "type": "BEDROCK_RERANKING_MODEL",
            "bedrockRerankingConfiguration": {
                "numberOfResults": num_results,
                "modelConfiguration": {
                    "modelArn": model_package_arn,
                }
            }
        }
    )
    return response['results']

For instance, imagine a scenario where you need to identify emails related to returning items from a multilingual dataset. The example below demonstrates this process:

example_query = "What emails have been about returning items?"

documents = [
    "Hola, llevo una hora intentando acceder a mi cuenta y sigue diciendo que mi contraseña es incorrecta. ¿Puede ayudarme, por favor?",
    "Hi, I recently purchased a product from your website but I never received a confirmation email. Can you please look into this for me?",
    "مرحبًا، لدي سؤال حول سياسة إرجاع هذا المنتج. لقد اشتريته قبل بضعة أسابيع وهو معيب",
    "Good morning, I have been trying to reach your customer support team for the past week but I keep getting a busy signal. Can you please help me?",
    "Hallo, ich habe eine Frage zu meiner letzten Bestellung. Ich habe den falschen Artikel erhalten und muss ihn zurückschicken.",
    "Hello, I have been trying to reach your customer support team for the past hour but I keep getting a busy signal. Can you please help me?",
    "Hi, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective.",
    "早上好,关于我最近的订单,我有一个问题。我收到了错误的商品",
    "Hello, I have a question about the return policy for this product. I purchased it a few weeks ago and it is defective."
]

Now, prepare the list of text sources that will be passed into the rerank_text() function:

text_sources = []
for text in documents:
    text_sources.append({
        "type": "INLINE",
        "inlineDocumentSource": {
            "type": "TEXT",
            "textDocument": {
                "text": text,
            }
        }
    })

You can then invoke rerank_text() by specifying the user query, the text resources, the desired number of top-ranked results, and the model ARN:

response = rerank_text(example_query, text_sources, 3, model_package_arn)
print(response)

The output generated by the Amazon Bedrock Rerank API with Cohere Rerank 3.5 for this query is:

[{'index': 4, 'relevanceScore': 0.1122397780418396},
 {'index': 8, 'relevanceScore': 0.07777658104896545},
 {'index': 2, 'relevanceScore': 0.0770234540104866}]

The relevance scores provided by the API are normalized to a range of [0, 1], with higher scores indicating higher relevance to the query. Here the 5th item in the list of documents is the most relevant. (Translated from German to English: Hello, I have a question about my last order. I received the wrong item and need to return it.)
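Because each result carries only an index and a score, it can help to join the results back to the original texts. The following is a minimal sketch; the helper name `attach_documents` and the placeholder document strings are assumptions for illustration, and the result values are copied from the output shown above.

```python
def attach_documents(results, documents):
    """Pair each reranked result with the original text it points to."""
    return [
        (documents[item["index"]], item["relevanceScore"])
        for item in results
    ]


# Placeholder texts standing in for the nine emails in the example.
sample_documents = [f"document {i}" for i in range(9)]

# Result values copied from the example output above.
sample_results = [
    {"index": 4, "relevanceScore": 0.1122397780418396},
    {"index": 8, "relevanceScore": 0.07777658104896545},
    {"index": 2, "relevanceScore": 0.0770234540104866},
]

ranked = attach_documents(sample_results, sample_documents)
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```

The `index` values are zero-based positions in the list of sources that was sent, so index 4 refers to the fifth document.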

You can also get started using Cohere Rerank 3.5 with Amazon Bedrock Knowledge Bases by completing the following steps:

  1. In the Amazon Bedrock console, choose Knowledge bases under Builder tools in the navigation pane.
  2. Choose Create knowledge base.
  3. Provide your knowledge base details, such as name, permissions, and data source.
  4. To configure your data source, specify the location of your data.
  5. Select an embedding model to convert the data into vector embeddings, and have Amazon Bedrock create a vector store in your account to store the vector data.

When you select this option (available only in the Amazon Bedrock console), Amazon Bedrock creates a vector index in Amazon OpenSearch Serverless (by default) in your account, removing the need to manage anything yourself.

  6. Review your settings and create your knowledge base.
  7. In the Amazon Bedrock console, choose your knowledge base and choose Test knowledge base.
  8. Choose the icon for additional configuration options for testing your knowledge base.
  9. Choose your model (for this post, Cohere Rerank 3.5) and choose Apply.

The configuration pane shows the new Reranking section with additional configuration options. The number of reranked source chunks setting controls how many of the most relevant chunks are returned.

Conclusion

In this post, we explored how to use Cohere’s Rerank 3.5 model in Amazon Bedrock and demonstrated how its reranking capabilities can improve search relevance, user experience, and information retrieval workflows in enterprise applications. Start improving your search relevance today with Cohere’s Rerank model on Amazon Bedrock.

Cohere Rerank 3.5 in Amazon Bedrock is available in the following AWS Regions: us-west-2 (US West – Oregon), ca-central-1 (Canada – Central), eu-central-1 (Europe – Frankfurt), and ap-northeast-1 (Asia Pacific – Tokyo).

Share your feedback to AWS re:Post for Amazon Bedrock or through your usual AWS Support contacts.

To learn more about Cohere Rerank 3.5’s features and capabilities, view the Cohere in Amazon Bedrock product page.


About the Authors

Karan Singh is a Generative AI Specialist for third-party models at AWS, where he works with top-tier third-party foundation model (FM) providers to develop and execute joint Go-To-Market strategies, enabling customers to effectively train, deploy, and scale FMs to solve industry-specific challenges. Karan holds a Bachelor of Science in Electrical and Instrumentation Engineering from Manipal University and a Master of Science in Electrical Engineering from Northwestern University, and is currently an MBA Candidate at the Haas School of Business at University of California, Berkeley.

James Yi is a Senior AI/ML Partner Solutions Architect at Amazon Web Services. He spearheads AWS’s strategic partnerships in Emerging Technologies, guiding engineering teams to design and develop cutting-edge joint solutions in generative AI. He enables field and technical teams to seamlessly deploy, operate, secure, and integrate partner solutions on AWS. James collaborates closely with business leaders to define and execute joint Go-To-Market strategies, driving cloud-based business growth. Outside of work, he enjoys playing soccer, traveling, and spending time with his family.


AWS DeepRacer: How to master physical racing?

As developers gear up for re:Invent 2024, they again face the unique challenges of physical racing. What are the obstacles? Let’s have a look.

In this blog post, I will look at what makes physical AWS DeepRacer racing—a real car on a real track—different to racing in the virtual world—a model in a simulated 3D environment. I will cover the basics, the differences in virtual compared to physical, and what steps I have taken to get a deeper understanding of the challenge.

The AWS DeepRacer League is wrapping up. In two days, 32 racers will face off in Las Vegas for one last time. This year, the qualification has been all-virtual, so the transition from virtual to physical racing will be a challenge.

The basics

AWS DeepRacer relies on the racer training a model within the simulator, a 3D environment built around ROS and Gazebo, originally built on AWS RoboMaker.

The trained model is subsequently used for either virtual or physical races. The model comprises a convolutional neural network (CNN) and an action space translating class labels into steering and speed (throttle) movement. In the basic scenario involving a single camera, a 160 x 120 pixel, 8-bit grayscale image (similar to the following figure) is captured 15 times per second, passed through the neural network, and the action with the highest weight (probability) is executed.

The small piece of AI magic is that during model evaluation (racing) there’s no context; each image is processed independently of the image before it, and without knowledge of the state of the car itself. If you process the images in reverse order the results remain the same!
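The per-frame decision can be sketched in a few lines. This is an illustration only, not the AWS DeepRacer inference code: the three-action space, the key names, and the probability values are hypothetical, and the network itself is replaced by precomputed outputs.

```python
def select_action(probabilities, action_space):
    """Return the action whose class has the highest probability."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return action_space[best_index]


# Hypothetical action space: each class label maps to a steering angle and speed.
action_space = [
    {"steering_angle": -30.0, "speed": 1.0},
    {"steering_angle": 0.0, "speed": 2.0},
    {"steering_angle": 30.0, "speed": 1.0},
]

# Hypothetical network outputs for three consecutive frames.
frames = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]

forward = [select_action(p, action_space) for p in frames]
backward = [select_action(p, action_space) for p in reversed(frames)]

# No state is carried between frames, so reversing the input
# only reverses the order of the chosen actions.
print(forward == list(reversed(backward)))
```

Because `select_action` sees only one frame at a time, the action chosen for a given image never depends on what came before it.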

Virtual compared to physical

The virtual worlds are 3D worlds created in Gazebo, and the software is written in Python and C++ using ROS as the framework. As shown in the following image, the 3D simulation is fairly flat, with basic textures and surfaces. There are few or no reflections or shine, and the environment is as visually clean as you make it. Input images are captured 15 times per second.

Within this world a small car is simulated. Compared to a real car, the model is very basic and lacks quite a few of the things that make a real car work: There is no suspension, the tires are rigid cylinders, there is no Ackermann steering, and there are no differentials. It’s almost surprising that this car can drive at all. On the positive side the camera is perfect; irrespective of lighting conditions you get crisp clear pictures with no motion blur.

A typical virtual car drives at speeds between 0.5 and 4.0 meters per second, depending on the shape of the track. If you go too fast, it will often oversteer and spin out of the turn because of the relatively low grip.
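To get a feel for what these speeds mean at 15 images per second, the following back-of-the-envelope sketch computes how far the car travels between two consecutive images. The function name is an assumption for illustration; the speeds and frame rate are the ones stated above.

```python
FRAMES_PER_SECOND = 15  # simulator capture rate stated above


def distance_per_frame(speed_m_per_s, fps=FRAMES_PER_SECOND):
    """Distance in meters covered between two consecutive images."""
    return speed_m_per_s / fps


for speed in (0.5, 4.0):
    print(f"{speed} m/s -> {distance_per_frame(speed):.3f} m per frame")
```

At the top of the range, the car moves roughly a quarter of a meter between two decisions, which leaves little margin in a tight turn.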

In contrast, the real world is less perfect. Simulation-to-real gap #1 is the visual noise created by light, reflections (if the track is printed on reflective material), and background clutter (for example, if the barriers around the track are too low, the car sees people and objects in the background). Input images are captured 30 times per second.

The car itself—based on the readily available WLToys A979—has all the things the model car doesn’t: proper tires, suspension, and differential. One problem is that the car is heavy—around 1.5 kg—and the placement of some components causes the center of gravity to be very high. This causes simulation-to-real gap #2: Roll and pitch during corners at high speeds cause the camera to rotate, confusing the neural network as the horizon moves.

Gap #3 comes from motion blur when the light is too dim; the blur can cause the dashed centerline to look like a solid line, making it hard to distinguish the centerline from the solid inner and outer lines, as shown in the following figure.

The steering geometry, the differentials, the lack of engineering precision of the A979, and the corresponding difficulty in calibrating it cause gap #4. Even if the model wants to go straight, the car still pulls left or right, needing constant correction to stay on track. This is most noticeable when the car is unable to drive down the straights in a straight line.

The original AWS DeepRacer, without modifications, has a smaller speed range of about 2 meters per second. It has a better grip but suffers from the previously mentioned roll movements. If you go too fast, it will understeer and potentially roll over. Since 2023, the AWS pit-crews operate their fleets of AWS DeepRacers with shock spacers to stiffen the suspension, reduce the roll, and increase the max effective speed.

Four questions

Looking at the sim-to-real gaps there are four questions that we want to explore:

  • How can we train the model to better handle the real world? This includes altering the simulator to close some of the gaps, combined with adapting reward function, action space, and training methodology to make better use of this simulator.
  • How can we better evaluate what the car does, and why? In the virtual world, we can perform log analysis to investigate; in the real world this has not yet been possible.
  • How can we evaluate our newly trained models? A standard AWS DeepRacer track, with its size of 8 meters x 6 meters, is prohibitively large. Is it possible to downscale the track to fit in a home?
  • Will a modified car perform better? Upgrade my AWS DeepRacer with better shocks? Add ball bearings and shims to improve steering precision? Or build a new lighter car based on a Raspberry Pi?

Solutions

To answer these questions, some solutions are required to support the experiments. The following assumes that you’re using Deepracer-for-Cloud to run the training locally or in an Amazon Elastic Compute Cloud (Amazon EC2) instance. We won’t go into the details but provide references that will enable you to try things out on your own.

Customized simulator

The first thing to look at is how you can alter the simulator. The simulator code is available, and modifying it doesn’t require too many skills. You can alter the car and the physics of the world or adjust the visual environment.

Change the environment

Changing the environments means altering the 3D world. This can be done by altering the features in a pre-existing track by adding or removing track parts (such as lines), changing lighting, adding background features (such as walls or buildings), swapping out textures, and so on. Making changes to the world will require building a new Docker image, which can take quite some time, but there are ways to speed that up. Going a step further, it’s also possible to make the world programmatically (command line or code) alterable during run-time.

The starting point is the set of track COLLADA (.dae) files found in the meshes folder. You can import one into Blender (shown in the following figure), make your changes, and export the file again. Note that lights and camera positions from Blender aren’t considered by Gazebo. To alter the lighting conditions, you will have to alter the .world file in the worlds folder; these are XML files in SDFormat.

See Custom Tracks for some examples of tuned tracks.

Car and physics

The competition cars owned by AWS can’t be altered, so the objective of tuning the car in the simulator is to make it behave in ways more similar to the real one. Trained neural networks have an embedded expectation of what will happen next; which means that the simulated car learned that by taking a specific action, it would get a turn of a given radius. If the simulator car steers more or less than the physical one in a given situation, the outcome becomes unpredictable.

The simulated car lacks Ackermann steering and differentials, but its wheels can deflect up to 30 degrees, whereas real wheels only go to a bit more than 20 degrees outwards and less than that inwards. My experience is that the real car, surprisingly enough, still has a shorter turning radius than the virtual one.

The car models are found in the urdf folder. There are three different cars, relating to the different versions of physics, which you configure in your action space (model_metadata.json). Today, only the deepracer (v3 and v4 physics) and deepracer_kinematics (v5 physics) models are relevant. There are variant models for single camera and for stereo camera, both with and without the LIDAR.

Each physics version is different; the big question is what impact, if any, each version has on the behavior of the physical car.

  • Version 3: Steering and throttle are managed through a PID controller, making speed and steering changes smooth (and slow). The simulation environment runs at all times, including during image processing and inference, leading to a higher latency between image capture and the action taking effect.
  • Version 4: Steering and throttle are managed through a PID controller, but the world is put on hold during inference, reducing the latency.
  • Version 5: Steering and throttle are managed through a position and velocity controller, and the world is put on hold during inference, almost eliminating latency. (This is very unnatural; the car can take alternating 30 degree left and right turns and will go almost straight ahead.)

The PID controller for v3 and v4 can be changed in the racecar control file. By changing the P, I, and D values, you can tune how fast or how slow the car accelerates and steers.
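To illustrate why the P, I, and D values change how quickly the car responds, here is a generic textbook PID controller driving a toy speed model. It is a sketch under stated assumptions, not the simulator's controller: the class, the `simulate` helper, the gain values, and the model in which the controller output acts as an acceleration are all hypothetical.

```python
class PID:
    """Textbook PID controller (illustrative, not the simulator's code)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, target, measured, dt):
        error = target - measured
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(kp, steps=10, dt=1 / 15, target=2.0):
    """Toy model: the controller output is treated as an acceleration."""
    pid, speed = PID(kp=kp, ki=0.0, kd=0.0), 0.0
    for _ in range(steps):
        speed += pid.update(target, speed, dt) * dt
    return speed


# A larger proportional gain closes the gap to the target speed faster.
print(simulate(kp=1.0), simulate(kp=5.0))
```

In this toy model, the higher gain gets much closer to the 2.0 m/s target within the same ten steps, which mirrors the trade-off between smooth and fast responses described above.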

You can also tune the friction. In our simulator, friction is defined for the wheels, not the surfaces that the car drives on. The values (called mu and mu2) are found in racecar.gazebo; increasing them (once per tire!) will allow the car to drive faster without spinning.

Finally, I implemented an experimental version of the Ackermann steering geometry including differentials. Why? When turning, a car’s wheels follow two circles with the same center point, the inner one having a smaller radius than the outer one. In short, the inner wheels will have to steer more (larger curvature), but rotate slower (smaller circumference) than the outer wheels.
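The geometry can be sketched with the standard Ackermann relations. This is not the experimental simulator implementation; the function name and the car dimensions are hypothetical, and wheel speeds are computed for the rear axle as a simplification.

```python
import math


def ackermann(turn_radius, wheelbase, track_width, speed):
    """Front steering angles (degrees) and rear wheel speeds in a turn.

    turn_radius is measured from the turn center to the middle of the
    rear axle; speed is the speed of that midpoint.
    """
    inner_radius = turn_radius - track_width / 2
    outer_radius = turn_radius + track_width / 2
    inner_angle = math.degrees(math.atan(wheelbase / inner_radius))
    outer_angle = math.degrees(math.atan(wheelbase / outer_radius))
    # Wheel speed scales with the radius of the circle the wheel follows.
    inner_speed = speed * inner_radius / turn_radius
    outer_speed = speed * outer_radius / turn_radius
    return (inner_angle, outer_angle), (inner_speed, outer_speed)


# Hypothetical dimensions in meters; not measured from an actual car.
angles, speeds = ackermann(
    turn_radius=0.5, wheelbase=0.16, track_width=0.16, speed=1.0
)
print(angles, speeds)
```

With these example numbers, the inner wheel steers several degrees more than the outer wheel and turns noticeably slower, which is the difference that a model without Ackermann geometry and differentials ignores.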

Customized car software

The initial work to create an altered software stack for the original AWS DeepRacer started in 2022. The first experiments included operating the AWS DeepRacer with an R/C controller and capturing the camera images and IMU data to create an in-car video. There was a lot to learn about ROS2, including creating a custom node for publishing IMU sensor data and capturing and creating videos on the fly. During the Berlin Summit in 2022, I also got to give my modified car a spin on the track!

In the context of physical racing, the motivation for customizing the car software is to obtain more information—what does the car do, and why. Watching the following video, you can clearly see the rolling movement in the turns, and the blurring of certain parts of the image discussed earlier.

The work triggered a need to alter several of the open source AWS DeepRacer packages, and included work such as optimizing the performance from camera to inference through compressing images and enabling GPU and compute stick acceleration of the inference. This turned into several scripts comprising all the changes to the different nodes and creating an upgraded software package that could be installed on an original AWS DeepRacer car.

The work evolved, and a logging mechanism using ROS Bag allowed us to analyze not only pictures, but also the actions that the car took. Using the deepracer-viz library of Jochem Lugtenburg, a fellow AWS DeepRacer community leader, I added a GradCam overlay on the video feed (shown in the following video), which gives a better understanding of what’s going on.

The outcome of this has evolved into the community AWS DeepRacer Custom Car repository, which allows anyone to upgrade their AWS DeepRacer with improved software with two commands and without having to compile the modules themselves!

Benefits are:

  • Performance improvement by using compressed image transport for the main processing pipeline.
  • Inference using OpenVINO with Intel GPU (original AWS DeepRacer), OpenVINO with Myriad Neural Compute Stick (NCS2), or TensorFlow Lite.
  • Model Optimizer caching, speeding up switching of models.
  • Capture in-car camera and inference results to a ROS Bag for logfile analysis.
  • UI tweaks and fixes.
  • Support for Raspberry Pi4, enabling us to create the DeepRacer Pi!

Testing on a custom track

Capturing data is great, but you need a way to test it all—bringing models trained in a customized environment onto a track to see what works and what doesn’t.

The question turned out to be: How hard is it to make a track that has the same design as the official tracks, but that takes up less space than the 8m x 6m of the re:Invent 2018 track? After re:Invent 2023, I started to investigate. The goal was to create a custom track that would fit in my garage with a theoretical maximum size of 5.5m x 4.5m. The track should be printable on vinyl in addition to being available in the Simulator for virtual testing.

After some trial and error, it proved to be quite straightforward, even if it requires multiple steps, starting in a Jupyter Notebook, moving into a vector drawing program (Inkscape), and finalizing in Blender (to create the simulator meshes).

The trapezoid track shown in the following two figures (center line and final sketch) is a good example of how to create a brand new track. The notebook starts with eight points in an array and builds out the track step by step, adding the outer line, center line, and color.
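As a rough illustration of the step from a center line to the outer and inner lines, the following sketch offsets each point of a closed center line along its normal. It is not the code from the notebook: the function name, the circular example, and the 0.3 m half-width are assumptions, and a real track would need smoothing between the points.

```python
import math


def offset_line(center_line, distance):
    """Offset a closed, counter-clockwise center line outward by `distance`.

    Negative distances offset inward.
    """
    count = len(center_line)
    result = []
    for i, (x, y) in enumerate(center_line):
        prev_x, prev_y = center_line[i - 1]
        next_x, next_y = center_line[(i + 1) % count]
        # Approximate the tangent from the two neighboring points.
        tx, ty = next_x - prev_x, next_y - prev_y
        length = math.hypot(tx, ty)
        # Rotate the tangent by -90 degrees to get the outward normal.
        nx, ny = ty / length, -tx / length
        result.append((x + nx * distance, y + ny * distance))
    return result


# Hypothetical center line: eight points on a circle with a 2 m radius.
center = [
    (2.0 * math.cos(2 * math.pi * k / 8), 2.0 * math.sin(2 * math.pi * k / 8))
    for k in range(8)
]
outer = offset_line(center, 0.3)
inner = offset_line(center, -0.3)
print(outer[0], inner[0])
```

For this symmetric example the normals point straight away from the center, so the outer and inner points land on circles of radius 2.3 m and 1.7 m.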

In the end I chose to print a narrower version of Trapezoid (Trapezoid Narrow, shown in the following figure) to fit behind my garage, with dimensions of 5.20m x 2.85m including the green borders around the track. I printed it on PVC with a weight of 500 grams per square meter. The comparatively heavy material was a good choice. It prevents folds and wrinkles and generally ensures that the track stays in place even when you walk on it.

Around the track, I added a boundary of mesh PVC mounted on some 20 x 20 centimeter aluminum poles. Not entirely a success, because the light shone through and I needed to add a lining of black fleece. The following image shows the completed track before the addition of black fleece.

Experiments and conclusions

re:Invent is just days away. Experiments are still running, and because I need to fight my way through the Wildcard race, this is not the time to include all the details. Let’s just say that things aren’t always as straightforward as expected.

As a preview of what’s going on, I’ll end this post with the latest iteration of the in-car video, showing an AWS DeepRacer Pi doing laps in the garage. Check back after re:Invent for the big reveal!


About the author

Lars Lorentz Ludvigsen is a technology enthusiast who was introduced to AWS DeepRacer in late 2019 and was instantly hooked. Lars works as a Managing Director at Accenture where he helps clients to build the next generation of smart connected products. In addition to his role at Accenture, he’s an AWS Community Builder who focuses on developing and maintaining the AWS DeepRacer community’s software solutions.
