Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

Advanced tracing and evaluation of generative AI agents using LangChain and Amazon SageMaker AI MLFlow

Developing generative AI agents that can tackle real-world tasks is complex, and building production-grade agentic applications requires integrating agents with additional tools such as user interfaces, evaluation frameworks, and continuous improvement mechanisms. Developers often find themselves grappling with unpredictable behaviors, intricate workflows, and a web of complex interactions. The experimentation phase for agents is particularly challenging, often tedious and error prone. Without robust tracking mechanisms, developers face daunting tasks such as identifying bottlenecks, understanding agent reasoning, ensuring seamless coordination across multiple tools, and optimizing performance. These challenges make the process of creating effective and reliable AI agents a formidable undertaking, requiring innovative solutions to streamline development and enhance overall system reliability.

In this context, Amazon SageMaker AI with MLflow offers a powerful solution to streamline generative AI agent experimentation. For this post, I use LangChain’s popular open source LangGraph agent framework to build an agent and show how to enable detailed tracing and evaluation of LangGraph generative AI agents. This post explores how Amazon SageMaker AI with MLflow can help you as a developer and a machine learning (ML) practitioner efficiently experiment, evaluate generative AI agent performance, and optimize their applications for production readiness. I also show you how to introduce advanced evaluation metrics with Retrieval Augmented Generation Assessment (RAGAS) to illustrate MLflow customization to track custom and third-party metrics like with RAGAS.

The need for advanced tracing and evaluation in generative AI agent development

A crucial functionality for experimentation is the ability to observe, record, and analyze the internal execution path of an agent as it processes a request. This is essential for pinpointing errors, evaluating decision-making processes, and improving overall system reliability. Tracing workflows not only aids in debugging but also ensures that agents perform consistently across diverse scenarios.

Further complexity arises from the open-ended nature of tasks that generative AI agents perform, such as text generation, summarization, or question answering. Unlike traditional software testing, evaluating generative AI agents requires new metrics and methodologies that go beyond basic accuracy or latency measures. You must assess multiple dimensions—such as correctness, toxicity, relevance, coherence, tool call, and groundedness—while also tracing execution paths to identify errors or bottlenecks.

Why SageMaker AI with MLflow?

Amazon SageMaker AI, which provides a fully managed version of the popular open source MLflow, offers a robust platform for machine learning experimentation and generative AI management. This combination is particularly powerful for working with generative AI agents. SageMaker AI with MLflow builds on MLflow’s open source legacy as a tool widely adopted for managing machine learning workflows, including experiment tracking, model registry, deployment, and metrics comparison with visualization.

  • Scalability: SageMaker AI allows you to easily scale generative AI agentic experiments, running multiple iterations simultaneously.
  • Integrated tracking: MLflow integration enables efficient management of experiment tracking, versioning, and agentic workflow.
  • Visualization: Monitor and visualize the performance of each experiment run with built-in MLflow capabilities.
  • Continuity for ML Teams: Organizations already using MLflow for classic ML can adopt agents without overhauling their MLOps stack, reducing friction for generative AI adoption.
  • AWS ecosystem advantage: Beyond MLflow, SageMaker AI provides a comprehensive ecosystem for generative AI development, including access to foundation models, many managed services, simplified infrastructure, and integrated security.

This evolution positions SageMaker AI with MLflow as a unified platform for both traditional ML and cutting-edge generative AI agent development.

Key features of SageMaker AI with MLflow

The capabilities of SageMaker AI with MLflow directly address the core challenges of agentic experimentation—tracing agent behavior, evaluating agent performance, and unified governance.

  1. Experiment tracking: Compare different runs of the LangGraph agent and track changes in performance across iterations.
  2. Agent versioning: Keep track of different versions of the agent throughout its development lifecycle to iteratively refine and improve agents.
  3. Unified agent governance: Agents registered in SageMaker AI with MLflow automatically appear in the SageMaker AI with MLflow console, enabling a collaborative approach to management, evaluation, and governance across teams.
  4. Scalable infrastructure: Use the managed infrastructure of SageMaker AI to run large-scale experiments without worrying about resource management.

LangGraph generative AI agents

LangGraph offers a powerful and flexible approach to designing generative AI agents tailored to your company’s specific needs. LangGraph’s controllable agent framework is engineered for production use, providing low-level customization options to craft bespoke solutions.

In this post, I show you how to create a simple finance assistant agent equipped with a tool to retrieve financial data from a datastore, as depicted in the following diagram. This post’s sample agent, along with all necessary code, is available on the GitHub repository, ready for you to replicate and adapt it for your own applications.

Solution code

You can follow and execute the full example code from the aws-samples GitHub repository. I use snippets from the code in the repository to illustrate evaluation and tracking approaches in the reminder of this post.

Prerequisites

Trace generative AI agents with SageMaker AI with MLflow

MLflow’s tracing capabilities are essential for understanding the behavior of your LangGraph agent. The MLflow tracking is an API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code and for later visualizing the results.

MLflow tracing is a feature that enhances observability in your generative AI agent by capturing detailed information about the execution of the agent services, nodes, and tools. Tracing provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, enabling you to easily pinpoint the source of bugs and unexpected behaviors.

The MLfow tracking UI displays the traces exported under the MLflow Traces tab for the selected MLflow experimentation, as shown in the following image.

Furthermore, you can see the detailed trace for an agent input or prompt invocation by selecting the Request ID. Choosing Request ID opens a collapsible view with results captured at each step of the invocation workflow from input to the final output, as shown in the following image.

SageMaker AI with MLflow traces all the nodes in the LangGraph agent and displays the trace in the MLflow UI with detailed inputs, outputs, usage tokens, and multi-sequence messages with origin type (human, tool, AI) for each node. The display also captures the execution time over the entire agentic workflow, providing a per-node breakdown of time. Overall, tracing is crucial for generative AI agents for the following reasons:

  • Performance monitoring: Tracing enables you to oversee the agent’s behavior and make sure that it operates effectively, helping identify malfunctions, inaccuracies, or biased outputs.
  • Timeout management: Tracing with timeouts helps prevent agents from getting stuck in long-running operations or infinite loops, helping to ensure better resource management and responsiveness.
  • Debugging and troubleshooting: For complex agents with multiple steps and varying sequences based on user input, tracing helps pinpoint where issues are introduced in the execution process.
  • Explainability: Tracing provides insights into the agent’s decision-making process, helping you to understand the reasoning behind its actions. For example, you can see what tools are called and the processing type—human, tool, or AI.
  • Optimization: Capturing and propagating an AI system’s execution trace enables end-to-end optimization of AI systems, including optimization of heterogeneous parameters such as prompts and metadata.
  • Compliance and security: Tracing helps in maintaining regulatory compliance and secure operations by providing audit logs and real-time monitoring capabilities.
  • Cost tracking: Tracing can help in analyzing resource usage (input tokens, output tokens) and associated extrapolate costs of running AI agents.
  • Adaptation and learning: Tracing allows for observing how agents interact with prompts and data, providing insights that can be used to improve and adapt the agent’s performance over time.

In the MLflow UI, you can choose the Task name to see details captured at any agent step as it services the input request prompt or invocation, as shown in the following image.

By implementing proper tracing, you can gain deeper insights into your generative AI agents’ behavior, optimize their performance, and make sure that they operate reliably and securely.

Configure tracing for agent

For fine-grained control and flexibility in tracking, you can use MLflow’s tracing decorator APIs. With these APIs, you can add tracing to specific agentic nodes, functions, or code blocks with minimal modifications.

@mlflow.trace(name="assistant", attributes={"workflow": "agent_assistant"}, span_type="graph.py")
def assistant(state: GraphState):
    ...

This configuration allows users to:

  • Pinpoint performance bottlenecks in the LangGraph agent
  • Track decision-making processes
  • Monitor error rates and types
  • Analyze patterns in agent behavior across different scenarios

This approach allows you to specify exactly what you want to track in your experiment. Additionally, MLflow offers out-of-the box tracing comparability with LangChain for basic tracing through MLflow’s autologging feature mlflow.langchain.autolog(). With SageMaker AI with MLflow, you can gain deep insights into the LangGraph agent’s performance and behavior, facilitating easier debugging, optimization, and monitoring, in both development and production environments.

Evaluate with MLflow

You can use MLflow’s evaluation capabilities to help assess the performance of the LangGraph large language model (LLM) agent and objectively measure its effectiveness in various scenarios. The important aspects of evaluation are:

  • Evaluation metrics: MLflow offers many default metrics such as LLM-as-a-Judge, accuracy, and latency metrics that you can specify for evaluation and have the flexibility to define custom LLM-specific metrics tailored to the agent. For instance, you can introduce custom metrics for Correct Financial Advice, Adherence to Regulatory Guidelines, and Usefulness of Tool Invocations.
  • Evaluation dataset: Prepare a dataset for evaluation that reflects real-world queries and scenarios. The dataset should include example questions, expected answers, and relevant context data.
  • Run evaluation using MLflow evaluate library: MLflow’s mlflow.evaluate() returns comprehensive evaluation results, which can be viewed directly in the code or through the SageMaker AI with MLflow UI for a more visual representation.

The following is a snippet for how mlflow.evaluate(), can be used to execute evaluation on agents. You can follow this example by running the code in the same aws-samples GitHub repository.

python
results = mlflow.evaluate(
            agent_responses,  # Agent-generated answers to test queries
            targets="ground_truth",    # Reference "correct" answers for comparison
            model_type="question-answering",  # Predefined metrics for QA tasks
            extra_metrics=metrics   # Evaluation Metrics to include
        )

This code snippet employs MLflow’s evaluate() function to rigorously assess the performance of a LangGraph LLM agent, comparing its responses to a predefined ground truth dataset that’s maintained in the golden_questions_answer.jsonl file in the aws-samples GitHub repository. By specifying “model_type”:”question-answering”, MLflow applies relevant evaluation metrics for question-answering tasks, such as accuracy and coherence. Additionally, the extra_metrics parameter allows you to incorporate custom, domain-specific metrics tailored to the agent’s specific application, enabling a comprehensive and nuanced evaluation beyond standard benchmarks. The results of this evaluation are then logged in MLflow (as shown in the following image), providing a centralized and traceable record of the agent’s performance, facilitating iterative improvement and informed deployment decisions. The MLflow evaluation is captured as part of the MLflow execution run.

You can open the SageMaker AI with MLflow tracking server and see the list of MLflow execution runs for the specified MLflow experiment, as shown in the following image.

The evaluation metrics are captured within the MLflow execution along with model metrics and the accompanying artifacts, as shown in the following image.

Furthermore, the evaluation metrics are also displayed under the Model metrics tab within a selected MLflow execution run, as shown in the following image.

Finally, as shown in the following image, you can compare different variations and versions of the agent during the development phase by selecting the compare checkbox option in the MLflow UI between selected MLflow execution experimentation runs. This can help compare and select the best functioning agent version for deployment or with other decision making processes for agent development.

Register the LangGraph agent

You can use SageMaker AI with MLflow artifacts to register the LangGraph agent along with any other item as required or that you’ve produced. All the artifacts are stored in the SageMaker AI with MLflow tracking server’s configured Amazon Simple Storage Service (Amazon S3) bucket. Registering the LangGraph agent is crucial for governance and lifecycle management. It provides a centralized repository for tracking, versioning, and deploying the agents. Think of it as a catalog of your validated AI assets.

As shown in the following figure, you can see the artifacts captured under the Artifact tab within the MLflow execution run.

MLflow automatically captures and logs agent-related information files such as the evaluation results and the consumed libraries in the requirements.txt file. Furthermore, a successfully logged LangGraph agent as a MLflow model can be loaded and used for inference using mlflow.langchain.load_model(model_uri). Registering the generative AI agent after rigorous evaluation helps ensure that you’re promoting a proven and validated agent to production. This practice helps prevent the deployment of poorly performing or unreliable agents, helping to safeguard the user experience and the integrity of your applications. Post-evaluation registration is critical to make sure that the experiment with the best result is the one that gets promoted to production.

Use MLflow to experiment and evaluate with external libraries (such as RAGAS)

MLflow’s flexibility allows for seamless integration with external libraries, enhancing your ability to experiment and evaluate LangChain LangGraph agents. You can extend SageMaker MLflow to include external evaluation libraries such as  RAGAS for comprehensive LangGraph agent assessment. This integration enables ML practitioners to use RAGAS’s specialized LLM evaluation metrics while benefiting from MLflow’s experiment tracking and visualization capabilities. By logging RAGAS metrics directly to SageMaker AI with MLflow, you can easily compare different versions of the LangGraph agent across multiple runs, gaining deeper insights into its performance.

RAGAS is an open source library that provide tools specifically for evaluation of LLM applications and generative AI agents. RAGAS includes a method ragas.evaluate(), to run evaluations for LLM agents with choice of LLM models (evaluators) for scoring the evaluation, and extensive list of default metrics. To incorporate RAGAS metrics into your MLflow experiments, you can use the following approach.

You can follow this example by running the notebook in the GitHub repository additional_evaluations_with_ragas.ipynb.

from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
evaluation_dataset = EvaluationDataset.from_list(ragas_dataset)
evaluator_llm = LangchainLLMWrapper(llm_for_evaluation)
result = evaluate(
    dataset=evaluation_dataset,
    metrics=metrics_final,
    llm=evaluator_llm,
    embeddings=bedrock_embeddings,
    )
result

The evaluation results using RAGAS metrics from the above code are shown in the following figure.

Subsequently, the computed RAGAS evaluations metrics can be exported and tracked in the SageMaker AI with MLflow tracking server as part of the MLflow experimentation run. See the following code snippet for illustration and the full code can be found in the notebook in the same aws-samples GitHub repository.

with mlflow.start_run(
    experiment_id=get_experiment_id(_MLFLOW_RAGAS_EXPERIMENT_NAME), 
    run_name=timestamp, 
    tags={
        "project": os.getenv('PROJECT'),
        "model": os.getenv('MODELID'),
        "version": os.getenv('VERSION')
    }
):
    # Log the dataset to MLflow
    mlflow.log_input(dataset, context="ragas_eval_results")

    for ragas_metric in [faithfulness, answer_relevancy, answer_correctness]:
        print(ragas_metric.name)
        mean = ragas_result_ds[ragas_metric.name].mean()
        p90 = ragas_result_ds[ragas_metric.name].quantile(0.9)
        variance = ragas_result_ds[ragas_metric.name].var()
        print(mean, p90, variance)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/mean", mean)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/p90", p90)
        mlflow.log_metric(f"ragas_{ragas_metric.name}_score/v1/variance", variance)
mlflow.end_run()

You can view the RAGAS metrics logged by MLflow in the SageMaker AI with MLflow UI on the Model metrics tab, as shown in the following image.

From experimentation to production: Collaborative approval with SageMaker with MLflow tracing and evaluation

In a real-world deployment scenario, MLflow’s tracing and evaluation capabilities with LangGraph agents can significantly streamline the process of moving from experimentation to production.

Imagine a large team of data scientists and ML engineers working on an agentic platform, as shown in the following image. With MLflow, they can create sophisticated agents that can handle complex queries, process returns, and provide product recommendations. During the experimentation phase, the team can use MLflow to log different versions of the agent, tracking performance and evaluation metrics such as response accuracy, latency, and other metrics. MLflow’s tracing feature allows them to analyze the agent’s decision-making process, identifying areas for improvement. The results across numerous experiments are automatically logged to SageMaker AI with MLflow. The team can use the MLflow UI to collaborate, compare, and select the best performing version of the agent and decide on a production-ready version, all informed by the diverse set data logged in SageMaker AI with MLflow.

With this data, the team can present a clear, data-driven case to stakeholders for promoting the agent to production. Managers and compliance officers can review the agent’s performance history, examine specific interaction traces, and verify that the agent meets all necessary criteria. After being approved, the SageMaker AI with MLflow registered agent facilitates a smooth transition to deployment, helping to ensure that the exact version of the agent that passed evaluation is the one that goes live. This collaborative, traceable approach not only accelerates the development cycle but also instills confidence in the reliability and effectiveness of the generative AI agent in production.

Clean up

To avoid incurring unnecessary charges, use the following steps to clean up the resources used in this post:

  1. Remove SageMaker AI with MLflow tracking server:
    • In SageMaker Studio, stop and delete any running MLflow tracking server instances
  2. Revoke Amazon Bedrock model access:
    • Go to the Amazon Bedrock console.
    • Navigate to Model access and remove access to any models you enabled for this project.
  3. Delete the SageMaker domain (If not needed):
    • Open the SageMaker console.
    • Navigate to the Domains section.
    • Select the domain you created for this project.
    • Choose Delete domain and confirm the action.
    • Also delete any associated S3 buckets and IAM roles.

Conclusion

In this post, I showed you how to combine LangChain’s LangGraph, Amazon SageMaker AI, and MLflow to demonstrate a powerful workflow for developing, evaluating, and deploying sophisticated generative AI agents. This integration provides the tools needed to gain deep insights into the generative AI agent’s performance, iterate quickly, and maintain version control throughout the development process.

As the field of AI continues to advance, tools like these will be essential for managing the increasing complexity of generative AI agents and ensuring their effectiveness with the following considerations,

  1. Traceability is paramount: Effective tracing of agent execution paths using SageMaker MLflow is crucial for debugging, optimization, and helping to ensure consistent performance in complex generative AI workflows. Pinpoint issues, understand decision-making, examine interaction traces, and improve overall system reliability through detailed, recorded analysis of agent processes.
  2. Evaluation drives improvement: Standardized and customized evaluation metrics, using MLflow’s evaluate() function and integrations with external libraries like RAGAS, provide quantifiable insights into agent performance, guiding iterative refinement and informed deployment decisions.
  3. Collaboration and governance are essential: Unified governance facilitated by SageMaker AI with MLflow enables seamless collaboration across teams, from data scientists to compliance officers, helping to ensure responsible and reliable deployment of generative AI agents in production environments.

By embracing these principles and using the tools outlined in this post, developers and ML practitioners can confidently navigate the complexities of generative AI agent development and deployment, building robust and reliable applications that deliver real business value. Now, it’s your turn to unlock the potential of advanced tracing, evaluation, and collaboration in your agentic workflows! Dive into the aws-samples GitHub repository and start using the power of LangChain’s LangGraph, Amazon SageMaker AI, and MLflow for your generative AI projects.


About the Author

Sandeep Raveesh is a Generative AI Specialist Solutions Architect at AWS. He works with customers through their AIOps journey across model training, Retrieval Augmented Generation (RAG), generative AI agents, and scaling generative AI use-cases. He also focuses on go-to-market strategies helping AWS build and align products to solve industry challenges in the generative AI space. You can find Sandeep on LinkedIn.

Read More

Apple Workshop on Natural Language Understanding 2024

Progress in natural language processing enables more intuitive ways of interacting with technology. For example, many of Apple’s products and services, including Siri and search, use natural language understanding and generation to enable a fluent and seamless interface experience for users. Natural language is a rapidly moving area of machine learning research, and includes work on large-scale data curation across multiple languages, novel architectures and algorithms, and new evaluation regimes, all of which involve important issues of privacy and security, as well as of performance and…Apple Machine Learning Research

Prompting for the best price-performance

Prompting for the best price-performance

In the drive to remain competitive, businesses today are turning to AI to help them minimize cost and maximize efficiency. It’s incumbent on them to find the most suitable AI model—the one that will help them achieve more while spending less. For many businesses, the migration from OpenAI’s model family to Amazon Nova represents not only a shift in model but a strategic move toward scalability, efficiency, and broader multimodal capabilities.

In this blog, we discuss how to optimize prompting in Amazon Nova for the best price-performance.

Why migrate from OpenAI to Amazon Nova?

OpenAI’s models remain powerful, but their operational costs can be prohibitive when scaled. Consider these figures from Artificial Analysis:

Model Input Token Cost (per Million Tokens) Output Token Cost (per Million Tokens) Context Window Output Speed (Tokens per Second) Latency (Seconds per first token)
GPT-4o ~$2.50 ~$10.00 Up to 128K tokens ~63 ~0.49
GPT-4o Mini ~$0.15 ~$0.60 Up to 128K tokens ~90 ~0.43
Nova Micro ~$0.035 ~$0.14 Up to 128K tokens ~195 ~0.29
Nova Lite ~$0.06 ~$0.24 Up to 300K tokens ~146 ~0.29
Nova Pro ~$0.80 ~$3.20 Up to 300K tokens ~90 ~0.34

For high-volume applications—like global customer support or large-scale document analysis—these cost differences are disruptive. Not only does Amazon Nova Pro offer over three times the cost-efficiency, its longer context window also enables it to handle more extensive and complex inputs.

Breaking down the Amazon Nova suite

Amazon Nova isn’t a single model—it’s a suite designed for various needs:

  • Amazon Nova Pro – A robust multimodal model that can process text, images, and video. It excels at tasks such as document analysis and deep data visualization. Benchmark comparisons show Amazon Nova Pro matching or even surpassing GPT-4o on complex reasoning tasks, according to section 2.1.1 of the Nova technical report and model card.
  • Amazon Nova Lite – Offers a balanced mix of multimodal processing and speed. Amazon Nova Lite is ideal for applications such as document summarization, translation, and even basic visual search, delivering quality outputs at lower latency and cost compared to GPT-4o Mini. You can find these benchmark results in section 2.1.2 of the Nova Technical Report and Model Card.
  • Amazon Nova Micro – A text-only model engineered for ultra-low latency. With output speed of up to 195 tokens per second, Amazon Nova Micro is perfect for real-time applications such as chat-based assistants and automated FAQs. Its token costs are dramatically lower than those of GPT-4o Mini—roughly 4.3 times cheaper on a per-token basis.

The lower per-token costs and higher output per second of Amazon Nova give you the flexibility to simplify prompts for real-time applications so you can balance quality, speed, and cost for your use case.

Understanding the foundations

To make the best decision about which model family fits your needs, it’s important to understand the differences in prompt engineering best practices in both OpenAI and Amazon Nova. Each model family has its own set of strengths, but there are some things that apply to both families. Across both model families, quality accuracy is achieved through clarity of instructions, structured prompts, and iterative refinement. Whether you’re using strong output directives or clearly defined use cases, the goal is to reduce ambiguity and improve response quality.

The OpenAI approach

OpenAI uses a layered messaging system for prompt engineering, where system, developer, and user prompts work in harmony to control tone, safety, and output format. Their approach emphasizes:

  • Hierarchical message roles – Setting the model’s role and behavior using system messages makes sure that the overarching safety and style guidelines (set in system prompts) are preserved
  • Instruction placement and delimiters – Directives are placed at the beginning, with clear separation between context, examples, and queries
  • Selective chain-of-thought – Detailed, step-by-step reasoning is used when it benefits complex tasks
  • Formatting and structure – Using strong directives such as DO, MUST, and DO NOT to provide consistent outputs (for example, in JSON)

The Amazon Nova approach

  • Define the prompt use case
    • Task – What exactly the model should do
    • Role – Which role the model should assume
    • Response style – The structure or tone of the output
    • Instructions – Guidelines the model must follow
  • Chain-of-thought
    • Explicit state response – Provide clear and strong instructions to limit the model’s response
    • Structural thinking – Thinking step-by-step encourages structural thinking
  • Formatting and structure
    • Use delimiters to sectioned your prompts, for example, ##Task##, ##Context##, or ##Example##
    • Specify the output format, for example, JSON, YAML, or Markdown
    • Use strong instructions and caps, such as DO, DO NOT, or MUST
    • Prefill the responses to guide the model, for example, start with “{“ or “json…”

Evolving your prompt strategy: Migrating from OpenAI to Amazon Nova

Transitioning to Amazon Nova isn’t merely a change in API endpoints—it requires retooling your prompt engineering to align with the strengths of Amazon Nova. You need to reframe your use case definition. Begin by breaking down your current GPT-4o or GPT-4o Mini prompt into its core elements of task, role, response style, and instructions and success criteria. Make sure to structure these elements clearly to provide a blueprint for the model.

To understand how to migrate an existing OpenAI prompt to work optimally for Amazon Nova Pro, consider the following example using the meeting notes summarizer. Here is the GPT-4o system prompt:

You will be provided with meeting notes, and your task is to summarize the meeting 
as follows: 
-Overall summary of discussion 
-Action items (what needs to be done and who is doing it) 
-If applicable, a list of topics that need to be discussed more fully in the next meeting.

The user prompt is the meeting notes that need to be summarized:

Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters...

GPT produces this helpful response:

Overall Summary of Discussion: The team discussed the recent mission to Planet Zog, 
the space pirate issue in Sector 7, the annual Intergalactic Bake-Off, planning for 
the upcoming charity fundraiser, and a proposed team-building retreat. The mission to 
Planet Zog was successful, but communication with the Zogians was challenging. The team 
also discussed the need for a better strategy to deal with space pirates. The team placed 
second in the Intergalactic Bake-Off and is planning for the upcoming charity fundraiser. 
A team-building retreat to the Moon Resort and Spa was proposed.

Action Items:
1. Dr. Quasar to start working on a Zogian-English dictionary.
2. Captain Stardust to speak with Admiral Starbeam about increasing patrols in Sector 7.
3. Dr. Quasar to finalize the prototype of the new cloaking technology.
4. All team members to start gathering supplies and preparing games for the Intergalactic 
Charity Bazaar.
5. Captain Stardust to check the budget for the proposed team-building retreat to the Moon 
Resort and Spa.

Topics for Next Meeting:
1. Update on the Zogian-English dictionary.
2. Progress report on the cloaking technology.
3. Results of increased patrols in Sector 7.
4. Final preparations for the Intergalactic Charity Bazaar.

To meet or exceed the quality of the response from GPT-4o, here is what an Amazon Nova Pro prompt might look like. The prompt uses the same best practices discussed in this post, starting with the system prompt. We used a temperature of .2 and a topP of .9 here:

You are an experienced executive assistant skilled in meeting note analysis and 
summarization. Your primary responsibilities include distilling complex discussions
into clear, actionable summaries.
Follow these instructions:

##INSTRUCTIONS##
1. Read understand the meeting notes found in ##NOTES##
2. Put all of your outputs in a section called ##OUTPUTS## in markdown formatting
3. Summarize the meeting notes in 5 sentences or less. Put this in a section called 
"Overall Summary".
4. Numerically list any action items for specific people and what needs to be completed. 
Put this list in a section called "Action Items".
5. If applicable, list the topics that need to be discussed more fully in the next meeting. 
Put this in a section called "Topics for Next Meeting".

Here’s the user prompt, using prefilled responses:

##NOTES##
Meeting Date: March 5th, 2050
Meeting Time: 2:00 PM
Location: Conference Room 3B, Intergalactic Headquarters
Attendees:
- Captain Stardust
- Dr. Quasar
- Lady Nebula
- Sir Supernova
- Ms. Comet
Meeting called to order by Captain Stardust at 2:05 PM
1. Introductions and welcome to our newest team member, Ms. Comet
2. Discussion of our recent mission to Planet Zog
- Captain Stardust: "Overall, a success, but communication with the Zogians was difficult. 
We need to improve our language skills."
- Dr. Quasar: "Agreed. I'll start working on a Zogian-English dictionary right away."
- Lady Nebula: "The Zogian food was out of this world, literally! We should consider having 
a Zogian food night on the ship."
3. Addressing the space pirate issue in Sector 7
- Sir Supernova: "We need a better strategy for dealing with these pirates. They've already 
plundered three cargo ships this month."
- Captain Stardust: "I'll speak with Admiral Starbeam about increasing patrols in that area.
- Dr. Quasar: "I've been working on a new cloaking technology that could help our ships avoid 
detection by the pirates. I'll need a few more weeks to finalize the prototype."
4. Review of the annual Intergalactic Bake-Off
- Lady Nebula: "I'm happy to report that our team placed second in the competition! Our Martian Mud 
Pie was a big hit!"
- Ms. Comet: "Let's aim for first place next year. I have a secret recipe for Jupiter Jello that I 
think could be a winner."
5. Planning for the upcoming charity fundraiser
- Captain Stardust: "We need some creative ideas for our booth at the Intergalactic Charity Bazaar."
- Sir Supernova: "How about a 'Dunk the Alien' game? We can have people throw water balloons at a 
volunteer dressed as an alien."
- Dr. Quasar: "I can set up a 'Name That Star' trivia game with prizes for the winners."
- Lady Nebula: "Great ideas, everyone. Let's start gathering the supplies and preparing the games."
6. Upcoming team-building retreat
- Ms. Comet: "I would like to propose a team-building retreat to the Moon Resort and Spa. It's a great 
opportunity to bond and relax after our recent missions."
- Captain Stardust: "Sounds like a fantastic idea. I'll check the budget and see if we can make it happen."
7. Next meeting agenda items
- Update on the Zogian-English dictionary (Dr. Quasar)
- Progress report on the cloaking technology (Dr. Quasar)
- Results of increased patrols in Sector 7 (Captain Stardust)
- Final preparations for the Intergalactic Charity Bazaar (All)
Meeting adjourned at 3:15 PM. Next meeting scheduled for March 19th, 2050 at 2:00 PM in Conference Room 3B, 
Intergalactic Headquarters.

##OUTPUTS##

The following example shows that the Amazon Nova response meets and exceeds the accuracy of the OpenAI example, formats the output in Markdown, and has found clear owners for each action item:

### Overall Summary
The meeting covered the successful mission to Planet Zog, the ongoing issue with 
space pirates in Sector 7, the results of the annual Intergalactic Bake-Off, plans 
for the upcoming charity fundraiser, and a proposed team-building retreat. Action items 
were assigned, and the next meeting's agenda was set.

### Action Items
1. **Dr. Quasar**: Begin working on a Zogian-English dictionary.
2. **Captain Stardust**: Speak with Admiral Starbeam about increasing patrols in Sector 7.
3. **Dr. Quasar**: Finalize the prototype for the new cloaking technology within a few weeks.
4. **Team**: Start gathering supplies and preparing games for the Intergalactic Charity Bazaar.
5. **Captain Stardust**: Check the budget for the proposed team-building retreat to the Moon Resort and Spa.

### Topics for Next Meeting
1. Update on the Zogian-English dictionary (Dr. Quasar)
2. Progress report on the cloaking technology (Dr. Quasar)
3. Results of increased patrols in Sector 7 (Captain Stardust)
4. Final preparations for the Intergalactic Charity Bazaar (All)

A few updates to the prompt can achieve comparable or better results from Amazon Nova Pro while enjoying a much less expensive cost of inference.

Employ Amazon Nova extended context

Amazon Nova Lite and Amazon Nova Pro can support up to 300,000 input tokens, which means that you can include more context in your prompt if needed. Expand your background data and detailed instructions accordingly—if your original OpenAI prompt was optimized for 128,000 tokens, adjust it to use the Amazon Nova extended window.

Tailor output constraints

If your GPT prompt required strict formatting (for example, “Respond in JSON only”), make sure that your Amazon Nova prompt includes these directives. Additionally, if your task involves multimodal inputs, specify when to include images or video references.

Function calling

The rise of generative AI agents has made function calling, or tool calling, one of the most important abilities of a given large language model (LLM). A model’s ability to correctly pick the right tool for the job, in a low-latency manner, is often the difference between success and failure of an agentic system.

Both OpenAI and Amazon Nova models share similarities in function calling, in particular their support for structured API calls. Both model families support tool selection through defined tool schemas, which we discuss later in this post. They also both provide a mechanism to decide when to invoke these tools or not.

OpenAI’s function calling uses flexible JSON schemas to define and structure API interactions. The models support a wide range of schema configurations, which give developers the ability to quickly implement external function calls through straightforward JSON definitions tied to their API endpoints.

Here is an example of a function:

tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "City and country e.g. Montevideo, Uruguay"
                        }            
                },            
                "required": [                
                    "location"
        ],
        "additionalProperties": False
    },
    "strict": True
    }
}]

completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is the weather like in Punta del Este today?"}],
    tools=tools

Similar to OpenAI’s approach, Amazon Nova can call tools when passed a configuration schema as shown in the following code example. Amazon Nova has made heavy use of Greedy Decoding when calling tools, and it’s advised to set temperature, topP, and topK to 1. This makes sure that the model has the highest accuracy in tool selection. These Greedy Decoding parameters and other great examples of tool use are covered in great detail in Tool use (function calling) with Amazon Nova.

The following is an example of function calling without using additionalModelRequestFields:

tool_config = {
    "tools": [{
        "toolSpec": {
            "name": "get_recipe",
            "description": "Structured recipe generation system",
            "inputSchema": {
                "json": {
                    "type": "object",
                    "properties": {
                        "recipe": {
                            "type": "object",
                            "properties": {
                                "name": {"type": "string"},
                                "ingredients": {
                                    "type": "array",
                                    "items": {
                                        "type": "object",
                                        "properties": {
                                            "item": {"type": "string"},
                                            "amount": {"type": "number"},
                                            "unit": {"type": "string"}
                                        }
                                    }
                                },
                                "instructions": {
                                    "type": "array",
                                    "items": {"type": "string"}
                                }
                            },
                            "required": ["name", "ingredients", "instructions"]
                        }
                    }
                }
            }
        }
    }]
}

# Base configuration without topK=1
input_text = "I need a recipe for chocolate lava cake"
messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

# Inference parameters
inf_params = {"topP": 1, "temperature": 1}

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params
)
# Typically produces less structured or incomplete output
 

The following example shows how function calling accuracy can be improved by using

additionalModelRequestFields:

# Enhanced configuration with topK=1
response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig=inf_params,
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Produces more accurate and structured function call

To maximize Amazon Nova function calling potential and improve accuracy, always use additionalModelRequestFields with topk=1. This forces the model to select the single most probable token and prevents random token selection. This increases deterministic output generation and improves function call precision by about 30–40%.

The following code examples further explain how to conduct tool calling successfully. The first scenario shows recipe generation without an explicit tool. The example doesn’t use topK, which typically results in responses that are less structured:

input_text = """
I'm looking for a decadent chocolate dessert that's quick to prepare. 
Something that looks fancy but isn't complicated to make.
"""

messages = [{
    "role": "user",
    "content": [{"text": input_text}]
}]

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    inferenceConfig={"topP": 1, "temperature": 1}
)
# Generates a conversational recipe description
# Less structured, more narrative-driven response

In this example, the scenario shows recipe generation with a structured tool. We add topK set to 1, which produces a more structured output:

response = client.converse(
    modelId="us.amazon.nova-lite-v1:0",
    messages=messages,
    toolConfig=tool_config,
    inferenceConfig={"topP": 1, "temperature": 1},
    additionalModelRequestFields={"inferenceConfig": {"topK": 1}}
)
# Generates a highly structured, JSON-compliant recipe
# Includes precise ingredient measurements
# Provides step-by-step instructions

Overall, OpenAI offers more flexible, broader schema support. Amazon Nova provides more precise, controlled output generation and is the best choice when working with high-stakes, structured data scenarios, as demonstrated in Amazon Nova’s performance on the IFEval benchmark discussed in section 2.1.1 of the technical report and model card. We recommend using Amazon Nova for applications requiring predictable, structured responses because its function calling methodology provides superior control and accuracy.

Conclusion

The evolution from OpenAI’s models to Amazon Nova represents a significant shift in using AI. It shows a transition toward models that deliver similar or superior performance at a fraction of the cost, with expanded capabilities in multimodal processing and extended context handling.

Whether you’re using the robust, enterprise-ready Amazon Nova Pro, the agile and economical Amazon Nova Lite, or the versatile Amazon Nova Micro, the benefits are clear:

  • Cost savings – With token costs up to four times lower, businesses can scale applications more economically
  • Enhanced response performance – Faster response times (up to 190 tokens per second) make real-time applications more viable
  • Expanded capabilities – A larger context window and multimodal support unlock new applications, from detailed document analysis to integrated visual content

By evolving your prompt strategy—redefining use cases, exploiting the extended context, and iteratively refining instructions—you can smoothly migrate your existing workflows from OpenAI’s o4 and o4-mini models to the innovative world of Amazon Nova.


About the Authors

Claudio Mazzoni is a Sr Specialist Solutions Architect on the Amazon Bedrock GTM team. Claudio exceeds at guiding costumers through their Gen AI journey. Outside of work, Claudio enjoys spending time with family, working in his garden, and cooking Uruguayan food.

Pat Reilly is a Sr. Specialist Solutions Architect on the Amazon Bedrock Go-to-Market team. Pat has spent the last 15 years in analytics and machine learning as a consultant. When he’s not building on AWS, you can find him fumbling around with wood projects.

Read More

Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available

Evaluate models or RAG systems using Amazon Bedrock Evaluations – Now generally available

Organizations deploying generative AI applications need robust ways to evaluate their performance and reliability. When we launched LLM-as-a-judge (LLMaJ) and Retrieval Augmented Generation (RAG) evaluation capabilities in public preview at AWS re:Invent 2024, customers used them to assess their foundation models (FMs) and generative AI applications, but asked for more flexibility beyond Amazon Bedrock models and knowledge bases. Today, we’re excited to announce the general availability of these evaluation features in Amazon Bedrock Evaluations, along with significant enhancements that make them fully environment-agnostic.

The general availability release introduces “bring your own inference responses” capabilities for both RAG evaluation and model evaluation. This means you can now evaluate a RAG system or model—whether running on Amazon Bedrock, other cloud providers, or on premises—as long as you provide evaluation data in the required format. You can even evaluate the final responses of a fully built application. We’ve also added new citation metrics for the already-powerful RAG evaluation suite, including citation precision and citation coverage, to help you better assess how accurately your RAG system uses retrieved information. These metrics provide deeper insights into whether responses are properly grounded in the source documents and if they’re using the most relevant retrieved passages.

In this post, we explore these new features in detail, showing you how to evaluate both RAG systems and models with practical examples. We demonstrate how to use the comparison capabilities to benchmark different implementations and make data-driven decisions about your AI deployments. Whether you’re using Amazon Bedrock or other environments for your generative AI applications, these evaluation tools can help you optimize performance and promote quality across your entire generative AI portfolio.

What’s new: RAG evaluation

In this section, we discuss what’s new with RAG evaluation, including the introduction of Bring Your Own Inference responses for evaluating external RAG systems and new citation metrics that measure precision and coverage of RAG citations.

Bring Your Own Inference responses

The RAG evaluation capabilities have been significantly expanded to allow evaluation of outputs from RAG systems hosted anywhere through the evaluation environment in Amazon Bedrock. With Bring Your Own Inference (BYOI) responses, you can now evaluate retrieval and generation results from a variety of sources, including other FM providers, custom-build RAG systems, or deployed open-weights solutions, by providing the outputs in the required format. Additionally, the introduction of new citation metrics with our previously released quality and responsible AI metrics also provides deeper insights into how well RAG systems use their knowledge bases and source documents.

When evaluating retrieval results from a knowledge base, we allow one knowledge base to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn currently has a prompt and referenceResponses. When using BYOI, you must provide your retrieval results in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). You must provide a knowledgeBaseIdentifier for every output. Additionally, for every retrieval result you bring, you can provide a name and additional metadata in the form of key-value pairs. The following is the input JSONL format for RAG evaluation jobs (Retrieve). Fields marked with ? are optional.

In the public preview period that ended March 20, 2025, the following dataset format contained a key called referenceContexts. This field has now been changed to referenceResponses to align with the Retrieve and Generate evaluation. The content of referenceResponses should be the expected ground truth answer that an end-to-end RAG system would have generated given the prompt, not the expected passages or chunks retrieved from the knowledge base.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
       "output"?: {
        "knowledgeBaseIdentifier": string
        "retrievedResults": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        }
      }
    }
  ]
}

When evaluating retrieval and generation results from a knowledge base or RAG system, we allow one knowledge base (RAG system) to be evaluated per evaluation job. For every dataset entry, you can provide a list of conversation turns. Each conversation turn has a prompt and referenceResponses. When using BYOI, you must provide the generated text, retrieved passages (for example, retrieval results), and citations for the generated text in a new field called output (this field is required for BYOI jobs but optional and not needed for non-BYOI jobs). The citations field will be used by the citation metrics. If you don’t have citations, you can enter dummy data in those fields and make sure to not select the citation precision or citation coverage metrics for the evaluation job. You must also provide a knowledgeBaseIdentifier for every output you bring. This knowledgeBaseIdentifier should be the same for all lines in the dataset and also at the job level. The following is the input JSONL format for RAG evaluation jobs (Retrieve and Generate). Fields marked with ? are optional.

{
  "conversationTurns": [
    {
      "prompt": {
        "content": [
          {
            "text": string
          }
        ]
      },
      "referenceResponses"?: [
        {
          "content": [
            {
              "text": string
            }
          ]
        }
      ],
      "output"?: {
        "text": string
        "modelIdentifier"?: string
        "knowledgeBaseIdentifier": string
        "retrievedPassages": {
          "retrievalResults": [
            {
              "name"?: string
              "content": {
                "text": string
              },
              "metadata"?: {
                [key: string]: string
              }
            }
          ]
        },
        "citations": [
          {
            "generatedResponsePart": {
              "textResponsePart": {
                "span": {
                  "start": number
                  "end": number
                  
                },
                "text": string
              }
            },
            "retrievedReferences": [
              {
                "name"?: string
                "content": {
                  "text": string
                },
                "metadata"?: {
                  [key: string]: string
                }
              }
            ]
          }
        ]
      }
    }
  ]
}

Citation metrics

Citation precision measures how accurately a RAG system cites its sources by evaluating whether the cited passages actually contain information used in the response. This metric helps identify when responses include unnecessary or irrelevant citations—a common issue where models might cite passages that weren’t actually used to generate the response content. The metric produces a score between 0–1, where 1 indicates perfect citation precision (all cited passages were relevant and used in the response), and 0 indicates complete imprecision (none of the cited passages should have been used). This helps teams identify and fix cases where their RAG systems are citing passages indiscriminately rather than selectively citing only the passages that contributed to the response.

Citation coverage evaluates how well a response’s content is supported by its citations, focusing on whether all information derived from retrieved passages has been properly cited. This metric works by comparing the faithfulness of the response to both the cited passages and the full set of retrieved passages, then calculating their ratio. In cases where the response is completely unfaithful to the retrieved passages (for example, hallucination), the metric indicates that citation coverage assessment isn’t applicable. The metric provides a score from 0–1, where scores closer to 0 indicate that the model failed to cite relevant supporting passages, and scores closer to 1 suggest proper citation of information from the retrieved passages. Importantly, a perfect score of 1 doesn’t necessarily mean all information in the response is cited—rather, it indicates that all information that could have been cited (based on the retrieved passages) was properly cited. This helps teams identify when their RAG systems are missing opportunities to cite relevant source material.

What’s new: Model evaluation

In this section, we discuss what’s new with model evaluation, including expanded Bring Your Own Inference capabilities that allow you to evaluate model responses from external providers and deployed solutions through both LLMaaJ and human evaluation workflows.

Bring Your Own Inference responses

The model evaluation capabilities have been significantly expanded to allow evaluation of outputs from the models available through the Amazon Bedrock evaluation environment. With BYOI responses, you can now evaluate model responses from other FM providers or deployed solutions by providing the outputs in the required format. BYOI is available for both LLMaJ and human evaluation workflows.

You don’t have to limit these evaluations to foundation models. Because BYOI evaluation takes in the prompt and the output, it can evaluate the final response of a full application if you choose to bring that into your dataset.

When using LLMaaJ, only one model can be evaluated per evaluation job. Consequently, you must provide just a single entry in the modelResponses list for each evaluation, though you can run multiple evaluation jobs to compare different models. The modelResponses field is required for BYOI jobs, but not needed for non-BYOI jobs. The following is the input JSONL format for LLMaaJ with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
     "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

When using human evaluation, up to two models can be evaluated per evaluation job. Consequently, you can provide up to two entries in the modelResponses list, with each response requiring a unique identifier. The modelResponses field is required for BYOI jobs but not needed for non-BYOI jobs. The following is the input JSONL format for human evaluation with BYOI. Fields marked with ? are optional.

{
    "prompt": string
    "referenceResponse"?: string
    "category"?: string
    "modelResponses"?: [
        {
            "response": string
            "modelIdentifier": string
        },
        {
            "response": string
            "modelIdentifier": string
        }
    ]
}

Feature overview

The LLMaaJ evaluation workflow with BYOI enables you to systematically assess model outputs from your preferred source using Amazon Bedrock model evaluation capabilities. The process follows a straightforward flow, starting with selecting an evaluator model and configuring BYOI, then choosing appropriate metrics, uploading your evaluation datasets, running the assessment, and finally analyzing the detailed results. This comprehensive evaluation pipeline (as illustrated in the following diagram) helps provide consistent quality across your AI implementations, regardless of where they are deployed.

For RAG system evaluation, the workflow incorporates additional components and metrics specific to context retrieval analysis. The process begins with choosing an evaluator model, then selecting between retrieval-only or retrieve-and-generate evaluation modes. With BYOI support, you can evaluate RAG outputs from any source while using powerful LLMaaJ metrics to assess the quality of your retrievals or full end-to-end retrieve-and-generate flow of your RAG system. This end-to-end evaluation framework (see the following diagram) provides deep insights into how effectively your RAG system uses its retrieved context and generates accurate, well-supported responses.

Prerequisites

To use the LLMaaJ model evaluation and RAG evaluation features with BYOI, you must have the following prerequisites:

Dataset description and preparation

For demonstrating an LLMaaJ evaluation job with BYOI, we created a shopping math problems dataset with a third-party model. The dataset contains 30 discount calculation problems, each with a prompt, reference answer, and model response following the required BYOI format. Each record uses the "third-party-model" identifier in the modelResponses array, enabling the LLMaaJ evaluator to assess response quality against your chosen metrics.

The dataset we used for a RAG evaluation job with BYOI was created using Amazon’s 10-K SEC filing information. Each record in the dataset contains questions about Amazon’s corporate information (such as SEC file numbers, operating segments, and fiscal year reporting), reference responses, and third-party model-generated answers with their corresponding retrieved passages and citations. The dataset follows the required BYOI format, with each record using "third-party-RAG" as the knowledge base identifier. This structure enables evaluation of both retrieve and generate quality metrics and citation accuracy, demonstrating how organizations can assess RAG systems regardless of where they’re deployed.

Start an LLMaaJ evaluation job with BYOI using the Amazon Bedrock console

In this first example, we use the prompt and inference responses dataset published in our amazon-bedrock-samples repository. You can use LLMaaJ with BYOI responses on Amazon Bedrock model evaluation to assess model performance through a user-friendly AWS Management Console interface. Follow these steps to start an evaluation job:

  1. On the Amazon Bedrock console, choose Inference and Assessment in the navigation pane and then choose Evaluations.
  2. On the Evaluations page, choose the Models
  3. In the Model evaluations section, choose Create and choose Automatic: Model as a judge.
  4. On the Create automatic evaluation page, enter a name and description and under Evaluation model, select a model. This model will be used as a judge to evaluate your inference responses.
  5. Optionally, choose Tags, and create your own tags.
  6. Under Inference source, choose Bring your own inference responses as the inference source.
  7. For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset.

    {
        "prompt": "If books cost \$10.21 each and you buy 15 of them with a 20% discount, how much will you pay in total?",
        "referenceResponse": "The total price will be \$122.52. Original price: \$153.15 minus 20% discount (\$30.630000000000003)",
        "category": "Shopping Math",
        "modelResponses": [{
            "response": "To determine the total cost of buying 15 books with a 20% discount, we can follow these steps:nn1. Calculate the total cost of the books before the discount.n2. Determine the discount amount.n3. Subtract the discount amount from the total cost before the discount.nn**Step 1: Calculate the total cost of the books before the discount.**nnThe cost of one book is \$10.21. Therefore, the cost of 15 books is:n\[ 15 \times 10.21 = 153.15 \]nn**Step 2: Determine the discount amount.**nnThe discount is 20%, so we need to find 20% of \$153.15. To do this, we multiply \$153.15 by 0.20:n\[ 153.15 \times 0.20 = 30.63 \]nn**Step 3: Subtract the discount amount from the total cost before the discount.**nnThe total cost after the discount is:n\[ 153.15 - 30.63 = 122.52 \]nnTherefore, the total amount you will pay is:n\[ \boxed{122.52} \]",
            "modelIdentifier": "third-party-model"
        }]
    }

  1. Under Metrics, select the metrics you want to use to evaluate the model response (such as helpfulness, correctness, faithfulness, relevance, and harmfulness).
  2. Under Datasets, for Choose a prompt dataset and Evaluation results, enter or choose the Browse S3 option to enter the S3 URI.
  3. Under Amazon Bedrock IAM role – Permissions, select or create an IAM service role with the proper permissions. This includes service access to Amazon Bedrock, the S3 buckets in the evaluation job, and the models being used in the job. If you create a new IAM role in the evaluation setup, the service will automatically give the role the proper permissions for the job.
  4. Specify the folder in the S3 bucket where the results of the model evaluation will be stored and choose Create.

There’s an input dataset validator that validates that the format of the input prompt dataset is correct when you choose Create. It will help you correct any formatting errors.

You will be able to see the evaluation job is In Progress. Wait for the job status to change to Complete. This may take minutes or hours depending on how long your prompts and responses are and if you used only a few prompts or hundreds of prompts.

  1. When the job is complete, choose the job to see its details and metrics summary.
  2. To view generation metrics details, scroll down in the model evaluation report and choose one of the individual metrics (like helpfulness or correctness) to see its detailed breakdown.
  3. To see each record’s prompt input, generation output (from bring your own inference responses), ground truth (which is optional), and individual scores, choose a metric and choose Prompt details.
  4. Hover over an individual score to view its detailed explanation.

Start an LLMaaJ evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating an LLMaaJ model evaluation job with your own inference responses, use the following steps.

  1. Set up the required configurations:
    import boto3
    import json
    import random
    from datetime import datetime
    import botocore
    
    # AWS Configuration
    REGION = "us-east-1"
    ROLE_ARN = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    dataset_custom_name = "<YOUR_BYOI_DATASET_NAME>" # without the ".jsonl extension
    
    # Initialize AWS clients
    bedrock_client = boto3.client('bedrock', region_name=REGION)
    s3_client = boto3.client('s3', region_name=REGION)

  2. Use the following function to create an LLMaaJ model evaluation job with your own inference responses. The precomputedInferenceSource parameter is used when you’re bringing your own pre-generated inference responses. The inferenceSourceIdentifier value must match the model identifier you used in your evaluation dataset. This identifier represents the model that generated the responses you want to evaluate.
    def create_llm_judge_evaluation(
        client,
        job_name: str,
        role_arn: str,
        input_s3_uri: str,
        output_s3_uri: str,
        evaluator_model_id: str,
        dataset_name: str = None,
        task_type: str = "General" # must be General for LLMaaJ
    ):    
        # All available LLM-as-judge metrics
        llm_judge_metrics = [
            "Builtin.Correctness",
            "Builtin.Completeness", 
            "Builtin.Faithfulness",
            "Builtin.Helpfulness",
            "Builtin.Coherence",
            "Builtin.Relevance",
            "Builtin.FollowingInstructions",
            "Builtin.ProfessionalStyleAndTone",
            "Builtin.Harmfulness",
            "Builtin.Stereotyping",
            "Builtin.Refusal"
        ]
    
        # Configure dataset
        dataset_config = {
            "name": dataset_name or "CustomDataset",
            "datasetLocation": {
                "s3Uri": input_s3_uri
            }
        }
    
        try:
            response = client.create_evaluation_job(
                jobName=job_name,
                roleArn=role_arn,
                applicationType="ModelEvaluation",
                evaluationConfig={
                    "automated": {
                        "datasetMetricConfigs": [
                            {
                                "taskType": task_type,
                                "dataset": dataset_config,
                                "metricNames": llm_judge_metrics
                            }
                        ],
                        "evaluatorModelConfig": {
                            "bedrockEvaluatorModels": [
                                {
                                    "modelIdentifier": evaluator_model_id
                                }
                            ]
                        }
                    }
                },
                inferenceConfig={
                    "models": [
                        {
                            "precomputedInferenceSource": {
                                "inferenceSourceIdentifier": "third-party-model"
                            }
                        }
                    ]
                },
                outputDataConfig={
                    "s3Uri": output_s3_uri
                }
            )
            return response
            
        except Exception as e:
            print(f"Error creating evaluation job: {str(e)}")
            raise

  3. Use the following code to create an LLMaaJ evaluation job, specifying the evaluator model and configuring all necessary parameters such as input dataset location, output path, and task type. The job configuration creates an evaluation task that will assess your model inference responses, with results being stored in the designated S3 bucket for later analysis.
    # Job Configuration
    evaluator_model = "anthropic.claude-3-haiku-20240307-v1:0"
    job_name = f"llmaaj-third-party-model-{evaluator_model.split('.')[0]}-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    # S3 Paths
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}"
    
    # Create evaluation job
    try:
        llm_as_judge_response = create_llm_judge_evaluation(
            client=bedrock_client,
            job_name=job_name,
            role_arn=ROLE_ARN,
            input_s3_uri=input_data,
            output_s3_uri=output_path,
            evaluator_model_id=evaluator_model,
            task_type="General"
        )
        print(f"✓ Created evaluation job: {llm_as_judge_response['jobArn']}")
    except Exception as e:
        print(f"✗ Failed to create evaluation job: {str(e)}")
        raise

  4. Monitor the progress of your evaluation job:
    # Get job ARN based on job type
    evaluation_job_arn = llm_as_judge_response['jobArn']
    
    # Check job status
    check_status = bedrock_client.get_evaluation_job(jobIdentifier=evaluation_job_arn) 
    print(f"Job Status: {check_status['status']}")

Start a RAG evaluation job with BYOI using the Amazon Bedrock console

When dealing with RAG systems, it’s important to evaluate the quality of retrieval and retrieval and generation together. We already provide several important metrics for those workflows, and now you can also evaluate the citation quality of your RAG systems. You can include citations when using Amazon Bedrock Knowledge Bases. In this section, we explore how you can use a RAG evaluation job to surface two important citation metrics. If your custom RAG system uses citations, you can also plug your citation information into your input dataset with BYOI and use the citation precision and citation coverage metrics:

Best practice is to use both citation metrics, citation precision and citation coverage, together to give a complete view of the citation quality.

To get started using the console, complete the following steps:

  1. On the Amazon Bedrock console, under Inference and Assessments in the navigation pane, choose Evaluations.
  2. Choose the RAG.
  3. Choose Create. This will take you to a new page where you can set up your RAG evaluation job.
  4. Under Evaluation details, enter a name and description and choose an Evaluator model. In the example, we chose Claude 3 Haiku for demonstration, but we provide a choice of several judge models for flexibility. This model will be used as a judge to evaluate your inference responses.
  5. Under Inference source, choose Bring your own inference responses as the inference source.
  6. For Source name, enter a name that is the same as the modelIdentifier in your prompt and inference response dataset. For example, the following is a snippet of one record in our evaluation dataset:
    {
        "conversationTurns": [{
            "prompt": {
                "content": [{
                    "text": "What is Amazon's SEC file number?"
                }]
            },
            "referenceResponses": [{
                "content": [{
                    "text": "Amazon's SEC file number is 000-22513."
                }]
            }],
            "output": {
                "text": "Amazon's SEC file number is 000-22513.",
                "modelIdentifier": "third-party-model",
                "knowledgeBaseIdentifier": "third-party-RAG",
                "retrievedPassages": {
                    "retrievalResults": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }, {
                        "content": {
                            "text": "AMAZON.COM, INC. (Exact name of registrant as specified in its charter)"
                        }
                    }]
                },
                "citations": [{
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 0,
                                "end": 11
                            },
                            "text": "Amazon's SEC"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "UNITED STATESSECURITIES AND EXCHANGE COMMISSION"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 12,
                                "end": 22
                            },
                            "text": "file number"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }, {
                    "generatedResponsePart": {
                        "textResponsePart": {
                            "span": {
                                "start": 23,
                                "end": 33
                            },
                            "text": "is 000-22513"
                        }
                    },
                    "retrievedReferences": [{
                        "content": {
                            "text": "Commission File No. 000-22513"
                        }
                    }]
                }]
            }
        }]
    }

Therefore, we use the source name third-party-RAG.

  1. Under Metrics, select the citation metrics Citation precision and Citation coverage. You can also select other metrics.
  2. Under Dataset and evaluation results S3 location, choose Browse S3 to enter the S3 URI for the evaluation input file, and specify the output location.
  3. Under Amazon Bedrock IAM role – Permissions, create a new service role or use an existing one.
  4. Choose Create.

If you see an error in the S3 URI, such as, “Your S3 bucket does not have the required CORS settings,” then you might need to edit your CORS setting on the bucket that has your data. For more information, see Required Cross Origin Resource Sharing (CORS) permissions on S3 buckets.

You can monitor your job; the status will be In progress while the job is running.

  1. When the job status is Completed, you can follow the link on a completed job to review results.

The results include a Metric summary section. In this particular example, both citation precision and citation coverage are high.

You can also set metric criteria to make sure the RAG system is performing up to your standards, and view example conversations as shown below each metric.

Start a RAG evaluation job with BYOI using the Python SDK and APIs

To use the Python SDK for creating an RAG evaluation job with your own inference responses, follow these steps (or refer to our example notebook):

  1. Set up the required configurations, which should include your model identifier for the evaluator, IAM role with appropriate permissions, S3 paths for input data containing your inference responses, and output location for results:
    import boto3
    from datetime import datetime
    
    # Configure knowledge base and model settings
    evaluator_model = "<YOUR_EVALUATOR_MODEL>"
    role_arn = "arn:aws:iam::<YOUR_ACCOUNT_ID>:role/<YOUR_IAM_ROLE>"
    BUCKET_NAME = "<YOUR_S3_BUCKET_NAME>"
    PREFIX = "<YOUR_BUCKET_PREFIX>"
    RAG_dataset_custom_name = "<YOUR_RAG_BYOI_DATASET_NAME>" # without the ".jsonl extension
    
    # Specify S3 locations
    input_data = f"s3://{BUCKET_NAME}/{PREFIX}/{RAG_dataset_custom_name}.jsonl"
    output_path = f"s3://{BUCKET_NAME}/{PREFIX}/"
    
    # Create Bedrock client
    bedrock_client = boto3.client('bedrock')

  2. Use the following function to create a RAG evaluation job with your own pre-generated retrieval and generation responses. The precomputedRagSourceConfig parameter is used when you’re bringing your own pre-generated RAG responses for evaluation. The ragSourceIdentifier value must match the identifier you used in your evaluation dataset. This identifier represents the RAG system that produced the responses you want to evaluate. The code sets up an automated evaluation that will assess various dimensions of your RAG system’s performance, including correctness, completeness, helpfulness, logical coherence, faithfulness, and citation quality against the provided dataset.
    retrieve_generate_job_name = f"rag-evaluation-generate-{datetime.now().strftime('%Y-%m-%d-%H-%M-%S')}"
    
    retrieve_generate_job = bedrock_client.create_evaluation_job(
        jobName=retrieve_generate_job_name,
        jobDescription="Evaluate retrieval and generation",
        roleArn=role_arn,
        applicationType="RagEvaluation",
        inferenceConfig={
            "ragConfigs": [
                {
                    "precomputedRagSourceConfig": {
                        "retrieveAndGenerateSourceConfig": {
                            "ragSourceIdentifier": "third-party-RAG"  # Replace with your identifier
                        }
                    }
                }
            ]
        },
        outputDataConfig={
            "s3Uri": output_path
        },
        evaluationConfig={
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",  
                    "dataset": {
                        "name": "RagDataset",
                        "datasetLocation": {
                            "s3Uri": input_data
                        }
                    },
                    "metricNames": [
                        "Builtin.Correctness",
                        "Builtin.Completeness",
                        "Builtin.Helpfulness",
                        "Builtin.LogicalCoherence",
                        "Builtin.Faithfulness",
                        "Builtin.CitationPrecision",
                        "Builtin.CitationCoverage"
                    ]
                }],
                "evaluatorModelConfig": {
                    "bedrockEvaluatorModels": [{
                        "modelIdentifier": evaluator_model
                    }]
                }
            }
        }
    )

  1. After submitting the evaluation job, you can check its status using the get_evaluation_job method and retrieve the results when the job is complete. The output will be stored at the S3 location specified in the output_path parameter, containing detailed metrics on how your RAG system performed across the evaluation dimensions.

Clean up

To avoid incurring future charges, delete the S3 bucket, notebook instances, and other resources that were deployed as part of the post.

Conclusion

The general availability release of LLM-as-a-judge and RAG evaluation on Amazon Bedrock brings environment-agnostic “bring your own inference” capabilities, allowing organizations to evaluate RAG systems or models regardless of where it runs. New citation metrics released as part of the RAG evaluation metrics list enhance information quality assessment by providing measurements of citation precision and coverage, enabling data-driven comparisons across different implementations.

As organizations deploy more generative AI applications, robust evaluation becomes critical to maintaining quality, reliability, and responsible use. We encourage you to explore these new features through the Amazon Bedrock console and the examples provided in the AWS Samples GitHub repository. By implementing regular evaluation workflows, you can continuously improve your models and RAG systems to deliver the highest quality outputs for your specific use cases.

We encourage you to explore these capabilities of Amazon Bedrock Evaluations and discover how systematic evaluation can enhance your generative AI applications.


About the authors

Adewale Akinfaderin is a Sr. Data Scientist–Generative AI, Amazon Bedrock, where he contributes to cutting edge innovations in foundational models and generative AI applications at AWS. His expertise is in reproducible and end-to-end AI/ML methods, practical implementations, and helping global customers formulate and develop scalable solutions to interdisciplinary problems. He has two graduate degrees in physics and a doctorate in engineering.

Shreyas Subramanian is a Principal Data Scientist and helps customers by using generative AI and deep learning to solve their business challenges using AWS services. Shreyas has a background in large-scale optimization and ML and in the use of ML and reinforcement learning for accelerating optimization tasks.

Jesse Manders is a Senior Product Manager on Amazon Bedrock, the AWS Generative AI developer service. He works at the intersection of AI and human interaction with the goal of creating and improving generative AI products and services to meet our needs. Previously, Jesse held engineering team leadership roles at Apple and Lumileds, and was a senior scientist in a Silicon Valley startup. He has an M.S. and Ph.D. from the University of Florida, and an MBA from the University of California, Berkeley, Haas School of Business.

Ishan Singh is a Sr. Generative AI Data Scientist at Amazon Web Services, where he helps customers build innovative and responsible generative AI solutions and products. With a strong background in AI/ML, Ishan specializes in building Generative AI solutions that drive business value. Outside of work, he enjoys playing volleyball, exploring local bike trails, and spending time with his wife and dog, Beau.

Read More

Fine-tune large language models with reinforcement learning from human or AI feedback

Fine-tune large language models with reinforcement learning from human or AI feedback

Large language models (LLMs) can be used to perform natural language processing (NLP) tasks ranging from simple dialogues and information retrieval tasks, to more complex reasoning tasks such as summarization and decision-making. Prompt engineering and supervised fine-tuning, which use instructions and examples demonstrating the desired task, can make LLMs better at following human intents, in particular for a specific use case. However, these methods often result in LLMs expressing unintended behaviors such as making up facts (hallucinations), generating biased or toxic text, or simply not following user instructions. This leads to responses that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users.

Supervised learning can help tune LLMs by using examples demonstrating some desired behaviors, which is called supervised fine-tuning (SFT). But even if the sampled set of demonstrations is representative of some tasks, it’s still often not exhaustive enough to teach the LLM more subtle needs such as ethical, societal, and psychological needs, which are essential but relatively abstract and therefore not easy to demonstrate. For this reason, SFT often leads to many unintended behaviors, such as making up facts or producing biased or even toxic contents.

Instead of fine-tuning an LLM using only supervision and demonstration data, you can collect feedback from humans on a behavior of interest and use this feedback to train a reward model. This reward model can then be used to fine-tune the parameters of the LLM while the LLM explores candidate responses until its behavior aligns with human preferences and values. This method is called reinforcement learning from human feedback (Ouyang et al. 2022). The following diagram illustrates reinforcement learning from human feedback (RLHF) compared to reinforcement learning from AI feedback (RLAIF).

Reinforcement learning from human feedback (RLHF) vs. AI feedback (RLAIF)

Recently, Lee et al. (2023) showed that using direct LLM feedback instead of human feedback is a viable alternative to scale the development of reward models to fine-tune LLMs, in particular because multiple LLMs can be used in combination as shown in the preceding figure, where each LLM is specialized in one particular type of human preference (relevance, conciseness, toxicity, and so on). This allows you to complement, or even bypass, the need for human annotation services, effectively using AI models to fine-tune other AI models. This technique is known as superalignment using RLAIF. Because the LLMs used to generate feedback are typically instructed to follow some human preferences or guiding principles, such as identifying if an utterance is ethical, this method is also called Constitutional AI (Bai et al. 2022). It was also shown that when a preference dataset is available, bypassing reward modeling and exploration altogether can help more directly adjust a LLM’s parameters to the preference dataset, a technique called direct policy optimization (DPO, Rafailov et al. 2024).

Each of these methods—RLHF, RLAIF, and DPO—present a different profile of strengths and weaknesses due to the cost, time, and portability of developing explicit preference datasets with human annotations vs. reward models. The pros and cons of these three methods will be explained in this post to help you decide which one best fits your use case.

In this post, we focus on RLAIF and show how to implement an RLAIF pipeline to fine-tune a pre-trained LLM. This pipeline doesn’t require explicit human annotations to train a reward model and can use different LLM-based reward models. The post Improving your LLMs with RLHF on Amazon SageMaker shows how to build a dataset of human annotations with Amazon SageMaker Ground Truth and train a reward model for RLHF. SageMaker Ground Truth enables you to prepare high-quality, large-scale training datasets to fine-tune foundation models (FMs) and review model outputs to align them with human preferences. The post Align Meta Llama 3 to human preferences with DPO shows how to fine-tune a pre-trained LLM from a dataset of human annotations for DPO.

The RLAIF use case in this post consists of generating next-turn responses within a dialogue dataset publicly available on the Hugging Face Hub (the popular Helpfulness/Harmlessness dataset released by Anthropic in 2023) and fine-tuning the responses of a pre-trained LLM using a red teaming hate speech model also publicly available (the popular Meta RoBERTa toxicity model). The goal of this RLAIF use case is to reduce the level of toxicity in the responses generated by the LLM policy, which you will measure before and after fine-tuning using a hold-out test dataset.

This post has three main sections:

  • Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO
  • Categories of human preference reward models for RLHF/RLAIF
  • Implementation of an RLAIF use case

Fine-tuning an LLM using human preferences: RLHF/RLAIF vs. DPO

RLHF can be used to align LLMs with human preferences and values, by eliciting feedback from humans on the LLM’s current behavior and using this feedback to train a reward model. Once parameterized, this reward model can then be used to fine-tune the LLM by reinforcement learning simulations, which are often much faster and cheaper than using human interactions (Ouyang L. et al., 2022). Moreover, eliciting comparisons of different LLM responses (for example, asking a human which of two responses is better) is generally more straightforward for humans to provide compared to providing absolute scores, and doesn’t require human preferences or intentions to be explicitly defined.

Christiano et al. (2017) provided the first evidence that RLHF could be economically scaled up to practical applications. Since then, RLHF has been shown to help tune LLMs to be more helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).

In RLHF, the alignment can be biased by the group of humans who provide the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, it might never be possible to train a system that is aligned to everyone’s preferences at once, or where everyone would endorse the trade-offs. RLHF has therefore recently been extended to use less and less human feedback, with an ultimate goal to develop automated AI methods that could scale the refinement and supervision of LLM behaviors in the service of complex human values (Bai et al. 2022). Constitutional AI and more generally RLAIF are promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. This post focuses on RLAIF.

In RLAIF, a pre-trained LLM is instructed using natural language to critique and revise another LLM’s responses (or its own) in order to reinforce either some specific needs and human preferences, or some more general principles (ethical values, potential for harmful content, and so on). This LLM feedback provides AI labels that can directly be used as reward signals to fine-tune an LLM by reinforcement learning. Recent results demonstrated that RLAIF achieves comparable or superior performance to RLHF on tasks of summarization, helpful dialogue generation, and harmless dialogue generation.

Both RLHF and RLAIF can be used to steer the model’s behavior in a desired manner, and both techniques require pre-training a reward model. The key difference is how much human feedback is used to train the reward model. Because there are already many open source pre-trained reward models available, and a separate post has already shown how to build a dataset of human annotations and train a reward model, this post focuses on RLAIF with a preexisting reward model. We show you how to fine-tune a pre-trained LLM by reinforcement learning using a preexisting reward model and how to evaluate the results. A separate post has already shown how to use the technique of DPO described in the introduction, which doesn’t use explicit reward models and fine-tunes LLMs directly from preference datasets instead. In contrast, RLAIF, which is the focus of this post, doesn’t use explicit preference datasets and fine-tunes LLMs directly from reward models.

The following diagram illustrates the process of learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO).

Learning from preference feedback directly by policy optimization (DPO) vs. with a reward model to explore and score new responses by RLHF/RLAIF proximal policy optimization (PPO)

To help you choose if DPO or RLAIF best fits your use cases, the following table summarizes the pros and cons of RLAIF from explicit reward models vs. DPO from explicit preference datasets. RLHF uses both and therefore provides an intermediary profile of pros and cons.

In a nutshell, DPO bypasses the distillation of the preference dataset into an intermediary reward model. DPO refines the parameters of an LLM directly from preference datasets by maximizing the margin between the log-likelihood of the chosen responses and the log-likelihood of the rejected ones in the preference datasets (Rafailov et al., 2024). Mathematically, the reward-based RLAIF/RLHF and reward-free DPO formulations have been shown to be equivalent and should in theory lead to the same results when fine-tuning is carried out on identical distributions of prompts. However, in practice, several factors can contribute to lead to different results. The distribution of prompts can vary based on knowledge of the targeted prompts for the desired downstream tasks (such as how relevant the prompts explored during fine-tuning are for the actual or future target distribution of prompts), access to the fine-tuning datasets (a reward model is more portable than the dataset on which it was originally trained), and the quality and size of the fine-tuning datasets. The later factors (access, quality, size) become even more important in cases where using multiple fine-tuning datasets is desired. This implies the following pros and cons.

  RLAIF DPO RLHF
Summary Fine-tune an LLM from explicit reward models on new prompts. Fine-tune an LLM directly from explicit preference datasets. Train reward models from preference datasets, then fine-tune an LLM on new prompts.
Pros Fine-tuning is possible without human annotations.
Most efficient in speed, compute, and engineering if:

  • Reward models or LLM instructor available.
  • Preference data unavailable.
  • Need to explore diverse prompts beyond ones in the original preference datasets.
  • Online learning desired.

Directly scales beyond human supervision.
Most portable and accessible: Knowledge on human preferences parameterized in the form of reward models.

Fine-tuning uses explicit human feedback.
Most efficient in speed, compute, and engineering if:

  • Reward models unavailable.
  • Need to target prompts from available preference datasets.
  • Online learning not needed (would imply repeated cycles of preference in dataset generations).

High quality and fidelity:
Knowledge contained in datasets of human preferences directly distilled into target LLM.

Fine-tuning uses explicit human feedback.
Highest quality and fidelity:
In theory, knowledge on human preferences can be learned most accurately when iteratively generating datasets of such preferences and also generalizing such knowledge to arbitrary prompts by parameterizing reward models. In practice, this is often not the case.
Iterative learning of reward models can be used to scale beyond direct human supervision.
Cons Fine-tuning limited to available model of human preferences.
Inefficient if:

  • Reward models unavailable and preference not clear enough to instruct an LLM.
  • Need to target prompts from available preference datasets.
Fine-tuning requires a lot of human annotations.
Low portability and accessibility: Knowledge on human preferences in its raw form, such as datasets of human annotations.
Inefficient if:

  • Need to explore diverse prompts beyond ones in the original preference datasets.
  • Reward models available or preference clear enough to instruct an LLM.
Fine-tuning requires a lot of human annotations.
Fine-tuning limited to learned models of human preferences.
 
Slow and not portable:
RLHF systematically generates preference datasets and also trains reward models before fine-tuning the LLM.

This table is not exhaustive. In the context of superalignment, RLAIF might have a clear advantage because reward models can be easily tested, efficiently stored and accessed, and also mixed-and-matched to accommodate the multiple facets and preferences of different groups of people. But the overall performance of RLHF, RLAIF, and DPO for general-purpose LLM fine-tuning (assuming everything else is equal, such as access to datasets, target distribution of prompts, and so on) is unclear at the time of writing, with different authors and benchmarks favoring different conclusions. For example, Rafailov et al. (2024) favor DPO whereas Ivison et al. (2024) favor RLHF/RLAIF.

To complement the criteria defined in the table specifically for choosing PPO or DPO, some more general rules to consider when deciding how to fine-tune an LLM are, according to Ivison et al. (2024), in order of importance:

  • The quality of the feedback in the preference dataset if available
  • The choice of the policy optimization algorithm and size of LLMs involved
  • The quality of the reward model if available
  • The expected overlap between the prompts used for fine-tuning vs. future target prompts for which the LLM will be ultimately be used

Categories of human preference reward models for RLHF/RLAIF

In RLHF, the quality of the resulting alignment depends on the nature of the reward models derived from the preference dataset. RLHF can be biased by the group of humans who provides the feedback (beliefs, culture, personal history) and the instructions given to these human labelers. Moreover, effective RLHF tuning typically requires tens of thousands of human preference labels, which is time-consuming and expensive. RLAIF can better scale the alignment of LLMs beyond direct human supervision, called superalignment, by combining multiple LLMs, each instructed differently to specialize on a specific facet of human preferences. For example, as discussed in Lee et al. (2023), you can generate a reward signal for the overall quality of the LLM response, another for its conciseness, another for its coverage, and another for its toxicity. RLAIF is promising to train AI systems that remain helpful, honest, and harmless, even as some AI capabilities reach or exceed human-level performance. RLAIF makes the implementation of an alignment process simpler, and also avoids reinventing the wheel given many reward models have been carefully crafted and made available to the public.

To make the best use of RLAIF, it’s important to carefully choose the reward models that will be used for aligning the target LLM. To evaluate how aligned a model is, we should first clarify what alignment means. As mentioned in Ouyang et al. (2022), the definition of alignment has historically been a vague and confusing topic, with various competing proposals.

By fine-tuning an LLM to act in accordance with our (human) intentions, aligned typically means that it is helpful, honest, and harmless:

  • Helpfulness – The LLM should follow instructions and infer user intent. The intent of a user behind an input prompt is notoriously difficult to infer, and is typically unknown, unclear, or ambiguous. Reward models for helpfulness have typically relied on judgment from human labelers, but new generations of LLMs trained and fine-tuned on such labels are now commonly used to evaluate the overall quality and helpfulness of other LLMs, in particular to distill knowledge by using large LLMs to evaluate smaller or more specialized LLMs.
  • Honesty (fidelity) – The LLM should not make up facts (hallucination). Ideally, it should also recognize when it doesn’t know how to respond. Measuring honesty is also notoriously difficult and LLMs often hallucinate because they lack explicit mechanisms to recognize the limitation of their knowledge. It is often limited to measuring whether the model’s statements about the world are true, which only captures a small part of what is actually meant by honesty. If you would like to dive deeper, the following peer-reviewed articles in workshops at ICML (Curuksu, 2023) and NeurIPS (Curuksu, 2024) propose some original methods to teach LLMs when best to fall back on asking for clarification and align the fidelity of generative retrieval in multi-turn dialogues. Ultimately, this type of alignment aims to improve what we might think of as the “humility” of AI systems.
  • Harmlessness (toxicity) – The LLM should not generate biased or toxic responses. Measuring the harms of language models also poses many challenges because harm from LLMs typically depends on how their outputs are used by users. As mentioned in Ouyang et al. (2022), a model generating toxic outputs could be harmful in the context of a deployed chatbot, but might be helpful if used for red teaming data augmentation to train a more accurate toxicity detection model. Having labelers evaluate whether an output is harmful required lots of Proxy criteria are typically used to evaluate whether an output is inappropriate in the context of a specific use case, or using public benchmark datasets or parameterized models intended to measure bias and toxicity. We illustrate this approach in this post by fine-tuning some LLMs to generate less toxic content in a summarization task using one of Meta’s AI reward models.

In this post, we use a preexisting reward model instead of training our own, and implement an RLAIF algorithm. This will make the implementation simpler, but also avoid reinventing the wheel given that many reward models have been carefully crafted and made available to the public. A key advantage of RLAIF to scale superalignment efforts is the ability to combine multiple sources of reward models (for example, using the average of rewards generated by three different models each specialized on evaluating a particular type of human preferences, such as helpfulness, honesty, or harmlessness).

More generally, RLAIF lets you instruct LLMs in original ways to specialize in specific emerging needs and scale superalignment efforts by recruiting the assistance of AI systems to align other AI systems. The following is an example of a system prompt that can be used as a general template to instruct an LLM to generate a quantitative reward feedback:

“
  You are an AI assistant and your task is to evaluate the following summary generated by an LLM,  
  considering the coherence, accuracy, coverage, and overall quality of the summary.
  Please generate an evaluation score in a decimal number between 1.00 and 5.00.
  Score 5.00 means the summary is the best optimal summary given the input text.
  Score 1.00 means the summary is really bad and irrelevant given the input text.

  Grade the summary based ONLY on the factual accuracy, coherence and coverage. Ignore 
  differences in punctuation and phrasing between the input text and the summary.

  Please also generate a justification statement to explain your evaluation score. 
  Keep the justification statement as concise as possible.

  Here is the input text: (…)
  Here is the summary generated by the LLM: (…)
”

An implementation of Anthropic’s Claude on Amazon Bedrock instructed to evaluate responses generated by another LLM on the Hugging Face Hub (Meta’s Llama 3.1 or Google’s Flan-T5) is shown in the next section.

By using explicit and scalable reward models, RLAIF can condition LLM behaviors on specific groups of users and scale red teaming alignment efforts by making sure LLMs abide by some desired guiding principles.

At a fundamental level, there is a known trade-off between the need to be harmless and the need to be helpful—the more helpful an LLM is, the more potential for harm it tends to have, and vice versa. For example, answering all questions with “I don’t know” is typically harmless, but is also typically useless. RLAIF is particularly useful to address this Pareto frontier—the optimal trade-off between helpfulness and harmlessness. For example, assuming human feedback is collected on the helpfulness of an LLM’s responses, a separate toxicity reward model can be used to scale up automatic red teaming refinements and maintain low toxicity at any given (even if undefined) level of helpfulness. To illustrate this, the use case implemented in the next section uses an LLM already fine-tuned for helpfulness and harmlessness and adjusts the Pareto frontier by further tuning its toxicity using a separate model (either a pre-trained LLM or a general-purpose LLM instructed to evaluate toxicity).

Implementation of an RLAIF use case

As explained earlier in this post, preference datasets are not portable, are not always accessible, and provide only a static set of prompts and responses; in contrast, parametrized reward models are highly portable and can be used to generalize its encoded knowledge by exploring new sets of prompts and responses. To illustrate this, assume we wanted to combine the learning made by companies like Anthropic when they released their human preference HH dataset (the largest human preference dataset publicly available at the time of its release) with LLMs available at that time, for example Google’s Flan-T5 model. Instead of using the explicit human feedback from the HH dataset, RLAIF could be used to let Google’s Flan-T5 explore new responses to the HH dataset prompts, and to fine-tune it using a reward generated by another LLM. This reward LLM could be Anthropic’s Claude itself, or yet another provider such as Meta, who at that same released their red teaming hate speech model, a state-of-the-art RoBERTa toxicity model at the time of its release. A notebook with the complete code for this use case is provided on GitHub.

The goal of this use case and the accompanying code is to give you an end-to-end code pipeline for RLAIF and is mostly illustrative. The dataset of prompts used to fine-tune and test the LLM could be replaced by a different preference dataset that best fits your use case, and the reward model could also be replaced by a different reward model, such as an LLM prompted using the template shown in the previous section to assign a numerical reward based any criteria that best fit your use case (toxicity, coherence, conciseness, fidelity to some reference text, and so on). In this post, we use publicly available datasets and reward models, and fine-tune toxicity as encoded in one of Meta’s reward models, for a given level of helpfulness as defined by the LLM responses preferred by humans in the Anthropic HH dataset. The entire notebook accompanying this post, together with a requirement file, was run on an Amazon SageMaker notebook ml.g5.16xlarge instance.

Import key libraries

To implement an RLAIF algorithm, we use an open source, high-level library from Hugging Face called Transformer RL (TRL). Do not forget to restart your Python kernel after installing the preceding libraries before you import them. See the following code:

from transformers import {
		pipeline, 
		AutoTokenizer, 
		AutoModelForSequenceClassification, 
		AutoModelForSeq2SeqLM, 
		GenerationConfig}
from trl import {
		PPOTrainer, 
		PPOConfig, 
		AutoModelForSeq2SeqLMWithValueHead, 
		AutoModelForCausalLMWithValueHead,
		create_reference_model}
from trl.core import LengthSampler
from datasets import load_dataset
from peft import {
		PeftModel, 
		PeftConfig, 
		LoraConfig, 
		TaskType}
import torch
import torchvision
import evaluate
import numpy as np
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

Load a prompt dataset and a pre-trained LLM, and instruct it to generate a specific type of response

First, let’s load a pre-trained LLM model. This section contains examples showing how to load Meta’s Llama 3.1 (instruct version) and Google’s Flan-T5 models (choose one or the other). When loading the pre-trained LLM, we instantiate it as an RL agent using the Hugging Face TRL library by adding a regression layer to it, which will be used to predict values required to define the policy gradient in PPO. In other words, TRL adds a value head (critic) in addition to the language model head (actor) to the original LLM, thereby defining an actor-critic agent.

Another version of the LLM can be used as reference for regularization during PPO—its parameters will remain frozen during the fine-tuning process, to define the Kullback-Leibler divergence between the tuned vs. original LLM responses. This will restrain the magnitude of potential deviations from the original LLM and avoid catastrophic forgetting or reward hacking; see Ouyang et al. (2022) for details. This regularization approach is in theory optional (and different from the clipping on the probality distribution of output tokens already implemented by default in PPO), but in practice it has been shown to be essential to preserve the capabilities acquired during pre-training. See the following code:

# Load a pre-trained LLM

model = "llama"

if model == "llama":
   # Example to load Meta Llama 3.1 model
   model_name = "meta-llama/Meta-Llama-3.1-8B"
   ppo_llm = AutoModelForCausalLMWithValueHead.from_pretrained(model_name, token=access_token)

elif model == "t5":
   # Example to load Google Flan T5 model:
   model_name= "google/flan-t5-base"
   ppo_llm = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(model_name, token=access_token)

# Instantiate a reference "frozen" version of the LLM model
ref_llm = create_reference_model(ppo_llm)

Then, load the dataset (Anthropic’s Helpfulness/Harmfulness dataset, a sample of which is shown at the end of the post) and prepare instructions for the LLM to generate summaries of the dialogues sampled in this dataset, integrate this system prompt with the dialogues to be summarized, and tokenize the prompts:

# Load Helpfulness/Harmfulness dataset from Anthropic
dataset_name = "Anthropic/hh-rlhf"

# Create a tokenizer based on the chosen LLM
tokenizer = AutoTokenizer.from_pretrained(model_name, token=access_token)
tokenizer.pad_token = tokenizer.eos_token

# Engineer the prompt and build the training/test dataset
dataset = load_dataset(dataset_name, split="train")
dataset = dataset.remove_columns("rejected")
dataset = dataset.rename_column("chosen", "dialogue")
dataset = dataset.filter(lambda x: len(x["dialogue"]) > 100 and
                         len(x["dialogue"]) <= 500, batched=False) # Limit size of dialogues

def tokenize(sample):
    prompt = f"""
    Summarize the following conversation.

    {sample["dialogue"]}

    Summary:
    """
    sample["input_ids"] = tokenizer.encode(prompt)
    sample["query"] = tokenizer.decode(sample["input_ids"]) 
    return sample

# Tokenize dialogues
dataset = dataset.map(tokenize, batched = False)
dataset.set_format(type = "torch")

# Split into training and testing datasets
dataset = dataset.train_test_split(test_size=0.2)

Prepare reward models for RLAIF

In this section, we provide two examples of an AI reward model for RLAIF.

Example of AI reward model for RLAIF: Load a pre-trained LLM tuned to rate toxicity

Instead of asking human labelers to give feedback on the toxicity level of the LLM responses as traditionally done in an RLHF approach, which is time-consuming and expensive, an example of more scalable method for superalignment is to use a reward model already pre-trained by supervised learning specifically to predict this feedback. The acquired generalization abilities of this reward model can scale to new prompts and responses and as such, can be used for RLAIF.

The popular Meta AI’s RoBERTa-based hate speech model publicly available on the Hugging Face Hub will be used here as reward model, to fine-tune the parameters of the PPO agent to decrease the level of toxicity of the dialogue summaries generated by the PPO agent. This model predicts the logits and probabilities across two classes (not_hate = label 0, and hate = label 1). The logits of the output not_hate (positive reward signal) will used for training the PPO agent. You need to create both a reward model and a tokenizer based on this model, so you can test the model:

# Load the reward model and instantiate a Transformer pipeline with it
toxicity_model_name = "facebook/roberta-hate-speech-dynabench-r4-target"
reward_model = pipeline("sentiment-analysis", model=toxicity_model_name)

# Create a tokenizer based on the reward model
toxicity_tokenizer = AutoTokenizer.from_pretrained(toxicity_model_name)

# Create a classifier based on the reward model
toxicity_model = AutoModelForSequenceClassification.from_pretrained(toxicity_model_name)

# Define a function that prints the predicted reward
def reward_model_test(text):
    toxicity_input_ids = toxicity_tokenizer(text, return_tensors="pt").input_ids
    logits = toxicity_model(toxicity_input_ids).logits
    not_hate_reward = (logits[:, 0]).tolist() # 0 is the index for "not hate"
    print(f'nResults for: {text}')
    print(f'Reward (value of "not hate" logit): {not_hate_reward[0]}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_model_test("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_model_test("You are terrible and i hate you.")

OUTPUT:
Results for: I like you.
Reward (value of "not hate" logit): 4.55513858795166

Results for: You are terrible and i hate you.
Reward (value of "not hate" logit): -2.368960380554199

Example of AI reward model for RLAIF: Prompt Anthropic’s Claude v3 to generate a reward

You can use a different LLM able to evaluate the toxicity of the input text instead of the Meta toxicity model used in the previous example. As of 2024, most latest-generation LLMs can be used out of the box and directly prompted to generate a quantitative reward signal. The following code is an example using Anthropic’s Claude v3 Sonnet hosted on Amazon Bedrock:

# Format for Claude v3 Sonnet on Amazon Bedrock
model = 'anthropic.claude-3-sonnet-20240229-v1:0'

def reward_direct_from_Claude(text):
    system_prompt = '''
    Rate the toxicity of the following statement. 
    The rate should be between -5 (high toxicity) and +5 (low toxicity).
    The response should only contain the rate, nothing else.

    Here is the statement:
    '''

    body = json.dumps({
        "system": system_prompt,
        "messages": [
            {
                "role": "user",
                "content": [{"type": "text", "text": text}]
            }
        ],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": 300,
        "anthropic_version": "bedrock-2023-05-31"
    })

    bedrock_runtime = boto3.client(region_name=region, service_name='bedrock-runtime')
    response = bedrock_runtime.invoke_model(body=body, modelId=model)
    response_body = json.loads(response.get('body').read())
    reward = response_body["content"][0]["text"]
    print(f'nResults for: {text}')
    print(f'Reward (directly generated by LLM): {reward}')

# On the text below, the reward model should predict a high reward because it is not toxic
reward_direct_from_Claude("I like you.")

# On the text below, the reward model should predict a low reward because it is more toxic
reward_direct_from_Claude("You are terrible and i hate you.")   

OUTPUT:
Results for: I like you.
Reward (directly generated by LLM): +5

Results for: You are terrible and i hate you.
Reward (directly generated by LLM): -4

You can see the format of the output generated by Anthropic’s Claude v3 out of the box (a scalar number) is identical to the format of the output generated by the previous reward model specifically tuned to rate toxicity. Either reward model can now be used for RLAIF.

Fine-tune the pre-trained LLM by proximal policy optimization (PPO) reinforcement learning

Now that we have a reward model, we can initialize a PPO trainer from the Hugging Face TRL library, then perform the actual RL loop that, at every step, will produce an LLM response for each summary, compute a reward feedback signal for each response, and update the parameters of the tunable LLM.

In this notebook, we iterate for a predefined number of PPO steps to not wait for too long, but in practice we could also track the reward (toxicity score) accumulated across all summaries at each step, which should increase as the LLM is tuned to produce less toxic summaries, and continue the iteration until the LLM is considered aligned based on a threshold in the toxicity score. See the following code:

# HuggingFace TRL PPO trainer configuration
config = PPOConfig(
    model_name = model_name,
    learning_rate = 1.41e-5,
    ppo_epochs = 1,
    mini_batch_size = 4,
    batch_size = 16)

# Instantiate the PPO trainer
ppo_trainer = PPOTrainer(config = config,
                         model = ppo_llm,
                         ref_model = ref_llm,
                         tokenizer = tokenizer,
                         dataset = dataset["train"],
                         data_collator = collator)

# Inference parameters of the LLM generating responses
max_new_tokens = 300 
generation_kwargs = {
    "min_length": 5,
    "top_k": 0.0,
    "top_p": 1.0,
    "do_sample": True,
    "pad_token_id": tokenizer.pad_token_id,
    "max_new_tokens": max_new_tokens}

# Inference parameters of the reward model
reward_kwargs = {
    "top_k": None,  
    "function_to_apply": "none", 
    "batch_size": 16}

# Set number of PPO iterations
max_ppo_steps = 10  # 10 is illustrative; takes <1 min on ml.g4dn.4xlarge EC2 instance

# PPO loop
for step, batch in tqdm(enumerate(ppo_trainer.dataloader)):

    # Stop after predefined number of steps
    if step >= max_ppo_steps:
        break

    # Produce a response for each prompt in the current batch 
    summary_tensors = []
    prompt_tensors = batch["input_ids"]
    for prompt_tensor in prompt_tensors:
        summary = ppo_trainer.generate(prompt_tensor, **generation_kwargs)
        summary_tensors.append(summary.squeeze()[-max_new_tokens:])

    # Prepare the decoded version of the responses for the reward model TRL pipeline 
    batch["response"] = [tokenizer.decode(r.squeeze()) for r in summary_tensors]

    # Compute reward for each pair (prompt, response) in the batch
    query_response_pairs = [q + r for q, r in zip(batch["query"], batch["response"])]
    rewards = reward_model(query_response_pairs, **reward_kwargs)
    reward_tensors = [torch.tensor(reward[0]["score"]) for reward in rewards] 

    # Execute one step of PPO to udpate the parameters of the tunable LLM 
    stats = ppo_trainer.step(prompt_tensors, summary_tensors, reward_tensors)
    ppo_trainer.log_stats(stats, batch, reward_tensors)

    # Print metrics for real-time monitoring 
    print(f'objective/kl: {stats["objective/kl"]}')
    print(f'ppo/returns/mean: {stats["ppo/returns/mean"]}')

If the number of iterations is too small, you might not observe any significant improvements. You might have to experiment, in your particular use case, to find a number of iterations high enough to produce significant improvements.

Evaluate the RL fine-tuning results

To evaluate the results from an RLAIF process quantitatively, we can compute the toxicity of dialogues generated by the original vs. fine-tuned model using prompts from the hold-out test set that was prepared earlier. The code for the function evaluate_toxicity provided with this post uses the same toxicity model as was used to define the reward model, but you could also use a different toxicity from the one used as reward model for evaluating results, which is yet another way that might help scale superalignment efforts in RLAIF. See the following code:

# Compute aggregate toxicity score (mean, std dev) of the original model on the test set
mean_before, std_before = evaluate_toxicity(model=ref_llm,
                                            toxicity_evaluator=toxicity_evaluator,
                                            tokenizer=tokenizer,
                                            dataset=dataset["test"],
                                            num_samples=10)

# Compute aggregate toxicity score (mean, std dev) of the fine-tuned model on the test set
mean_after, std_after = evaluate_toxicity(model = ppo_llm,
                                          toxicity_evaluator=toxicity_evaluator,
                                          tokenizer=tokenizer,
                                          dataset=dataset["test"],
                                          num_samples=10)

# Compare toxicity score of the original vs. fine-tuned models on the test set
mean_improvement = (mean_before - mean_after) / mean_before 
std_improvement = (std_before - std_after) / std_before 

print(f'toxicity [mean, std] after fine tuning: [{mean_after}, {std_after}]')
print(f'Percentage improvement of toxicity score after detoxification:')
print(f'mean: {mean_improvement*100:.2f}%')
print(f'std: {std_improvement*100:.2f}%')

Finally, a more qualitative comparison of generated responses before vs. after RLAIF fine-tuning can complement the preceding quantitative evaluation, by looking at examples of summaries, together with their toxicity scores, again using hold-out prompts from the test set, as done in the accompanying notebook.

Query Response Before Toxicity Fine-Tuning Response After Toxicity Fine-Tuning

Summarize the following conversation.

Human: When can my newborn start eating solid food?

Assistant: Your baby should be starting solid foods around six months.

Wait five weeks before eating your baby’s regular food. Wait until the baby gets to five months.

Summarize the following conversation.

Human: How do I repair patches on my pants?

Assistant: To repair patches on pants, here are several steps: 1) Wash and dry the pants. 2) Cut off any excess fabric around the patch. 3) Use a needle and thread to sew on the patch. 4) Sew on the patch securely, taking care to match the stitching with the original stitching. 5) Trim the thread.

Obtain inexpensive pieces of clothing. How to repair patched pants.

Summarize the following conversation.

Human: How do I fix my cassette tape if the tape has been pulled out?

Assistant: What happens when you put in a cassette?

Human: I can not put in the cassette the tape from inside was pulled out for the most part.

Assistant: Are you talking about the plastic cassette tape or the cassette tape inside the cassette?

Human: the only tape that is in a cassette tape. the one that contains the music.

Assistant: Are you talking about the tape that is in the cassette (…)

The cassette tape is made with two layers of plastic that is broken, and the tape is your music file. The [sex-related word] is inserted into the grooves of the cassette tape, so it cannot mark these rigid spaces. The cassette tape from the store has the only tape I broke, but that caught up to me and I have a record of it.

Summarize the following conversation.

Human: How can I explain to a kid how to skim and scan text

Assistant: What are you looking for exactly?

Human: As I said, how to explain to a child how to skim and scan text

Assistant: Do you want the child to learn how to do it themselves, or do you want to explain to them how to do it?

I heard about this article. What would you like me to do?? Scan and scan text. Start the game by creating an interface online.

Conclusion

In this post, we introduced a state-of-the-art method to fine-tune LLMs by reinforcement learning, reviewed the pros and cons of RLHF vs. RLAIF vs. DPO, and saw how to scale LLM fine-tuning efforts with RLAIF. We also saw how to implement an end-to-end RLAIF pipeline on SageMaker using the Hugging Face Transformer and TRL libraries, and using either off-the-shelf toxicity reward models to align responses during PPO or by directly prompting an LLM to generate quantitative reward feedback during PPO. Finally, we saw how to evaluate results by measuring the toxicity of generated responses before vs. after fine-tuning on a hold-out test set of prompts.

Try this fine-tuning method with your own use cases, and share your thoughts in the comments.

References:

Ouyang L. et al. (2022) Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.

Lee H. et al. (2023) RLAIF: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267.

Bai Y. et al. (2022) Constitutional AI: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.

Rafailov R. et al. (2024) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.

Christiano P. et al. (2017) Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.

Ivison H. et al. (2024) Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback. arXiv preprint arXiv:2406.09279.

Curuksu J. (2023) Optimizing Chatbot Fallback Intent Selections with Reinforcement Learning. ICML 2023 Workshop on The Many Facets of Preference-Based Learning.

Curuksu J. (2024) Policy optimization of language models to align fidelity and efficiency of generative retrieval in multi-turn dialogues. KDD 2024 Workshop on Generative AI for Recommender Systems and Personalization.


About the Author

Jeremy Curuksu is a Senior Applied Scientist in Generative AI at AWS and an Adjunct Faculty at New York University. He holds a MS in Applied Mathematics and a PhD in Computational Biophysics, and was a Research Scientist at Sorbonne University, EPFL, and MIT. He authored the book Data Driven and multiple peer-reviewed articles in computational physics, applied mathematics, and artificial intelligence.

Read More

How Lumi streamlines loan approvals with Amazon SageMaker AI

How Lumi streamlines loan approvals with Amazon SageMaker AI

This post is co-written with Paul Pagnan from Lumi.

Lumi is a leading Australian fintech lender empowering small businesses with fast, flexible, and transparent funding solutions. They use real-time data and machine learning (ML) to offer customized loans that fuel sustainable growth and solve the challenges of accessing capital. Their goal is to provide fast turnaround times— hours instead of days—to set them apart from traditional lenders. This post explores how Lumi uses Amazon SageMaker AI to meet this goal, enhance their transaction processing and classification capabilities, and ultimately grow their business by providing faster processing of loan applications, more accurate credit decisions, and improved customer experience.

Overview: How Lumi uses machine learning for intelligent credit decisions

As part of Lumi’s customer onboarding and loan application process, Lumi needed a robust solution for processing large volumes of business transaction data. The classification process needed to operate with low latency to support Lumi’s market-leading speed-to-decision commitment. It needed to intelligently categorize transactions based on their descriptions and other contextual factors about the business to ensure they are mapped to the appropriate classification. These classified transactions then serve as critical inputs for downstream credit risk AI models, enabling more accurate assessments of a business’s creditworthiness. To achieve this, Lumi developed a classification model based on BERT (Bidirectional Encoder Representations from Transformers), a state-of-the-art natural language processing (NLP) technique. They fine-tuned this model using their proprietary dataset and in-house data science expertise. BERT-based models excel in understanding context and nuances in text, making them particularly effective for:

  • Analyzing complex financial transactions
  • Understanding relationships with contextual factors like the business industry
  • Processing unstructured text data from various sources
  • Adapting to new types of financial products and transactions

Operating in the financial services industry, Lumi needs to be sure of the accuracy of the model output to ensure an accurate risk assessment. As a result, Lumi implements a human-in-the-loop process that incorporates the expertise of their risk and compliance teams to review and correct a sample of classifications to ensure that the model remains accurate on an ongoing basis. This approach combines the efficiency of machine learning with human judgment in the following way:

  1. The ML model processes and classifies transactions rapidly.
  2. Results with low confidence are flagged and automatically routed to the appropriate team.
  3. Experienced risk analysts review these cases, providing an additional layer of scrutiny.
  4. The correctly classified data is incorporated into model retraining to help ensure ongoing accuracy.

This hybrid approach enables Lumi to maintain high standards of risk management while still delivering fast loan decisions. It also creates a feedback loop that continuously improves the ML model’s performance, because human insights are used to refine and update the system over time.

Challenge: Scaling ML inference for efficient, low latency, transaction classification and risk analysis

To deploy their model in a production environment, Lumi required an inference platform that meets their business needs, including:

  • High performance: The platform needed to handle large volumes of transactions quickly and efficiently.
  • Low latency: To maintain excellent customer experience and fast turnaround times to loan applications, the platform needed to provide fast results.
  • Cost-effectiveness at scale: Given the substantial transaction volumes processed daily and fast growth of the business, the solution needed to be economically viable as operations grew.
  • Adaptive scaling: The platform needed to dynamically adapt to fluctuating workloads, efficiently handling peak processing times without compromising performance, while also scaling down during periods of low activity. Crucially, it required the ability to scale to zero overnight, eliminating unnecessary costs when the system wasn’t actively processing transactions. This flexibility helps ensure optimal resource utilization and cost-efficiency across all levels of operational demand.
  • Observability: The platform needed to provide robust monitoring and logging capabilities, offering deep insights into model performance, resource utilization, and inference patterns. This level of observability is crucial for tracking model accuracy and drift over time, identifying potential bottlenecks, monitoring system health, and facilitating quick troubleshooting. It also helps ensure compliance with regulatory requirements through detailed audit trails and enables data-driven decisions for continuous improvement. By maintaining a clear view of the entire ML lifecycle in production, Lumi can proactively manage their models, optimize resource allocation, and uphold high standards of service quality and reliability.

After evaluating multiple ML model hosting providers and benchmarking them for cost-effectiveness and performance, Lumi chose Amazon SageMaker Asynchronous Inference as their solution.

Solution: Using asynchronous inference on Amazon SageMaker AI

Lumi used SageMaker Asynchronous Inference to host their machine learning model, taking advantage of several key benefits that align with their requirements.

Queuing mechanism: The managed queue of SageMaker Asynchronous Inference efficiently handles varying workloads, ensuring all inference requests are processed without system overload during peak times. This is crucial for Lumi, because requests typically range from 100 MB to 1 GB, comprising over 100,000 transactions within specific time windows, batched for multiple businesses applying for loans.

Scale-to-zero capability: The service automatically scales down to zero instances during inactive periods, significantly reducing costs. This feature is particularly beneficial for Lumi, because loan applications typically occur during business hours.

High performance and low latency: Designed for large payloads and long-running inference jobs, SageMaker Asynchronous Inference is ideal for processing complex financial transaction data. This capability enables Lumi to provide a fast customer experience, crucial for their risk and compliance teams’ review process.

Custom container optimization: Lumi created a lean custom container including only essential libraries such as MLFlow, Tensorflow, and MLServer. Being able to bring their own container meant that they were able to significantly reduce container size and improve cold start time, leading to faster overall processing.

Model deployment and governance: Lumi deployed their transaction classification models using SageMaker, using its model registry and versioning capabilities. This enables robust model governance, meeting compliance requirements and ensuring proper management of model iterations.

Integration with existing systems on AWS: Lumi seamlessly integrated SageMaker Asynchronous Inference endpoints with their existing loan processing pipeline. Using Databricks on AWS for model training, they built a pipeline to host the model in SageMaker AI, optimizing data flow and results retrieval. The pipeline leverages several AWS services familiar to Lumi’s team. When loan applications arrive, the application, hosted on Amazon Elastic Kubernetes Service (EKS), initiates asynchronous inference by calling InvokeEndpointAsync. Amazon Simple Storage Service (S3) stores both the batch data required for inference, as well as resulting output. Amazon Simple Notification Service (SNS) alerts relevant stakeholders job status updates.

Instance selection and performance benchmarking: To optimize their deployment, Lumi benchmarked latency, cost and scalability across multiple inference serving options including real-time endpoints and instance types. Lumi prepared a series of bank transaction inputs of varying sizes based on an analysis of the real data in production. They used JMeter to call the Asynchronous Inference endpoint to simulate real production load on the cluster. Results of their analysis showed that while real-time inference on larger instances provided lower latency for individual requests, the asynchronous inference approach with c5.xlarge instances offered the best balance of cost-efficiency and performance for Lumi’s batch-oriented workload. This analysis confirmed Lumi’s choice of SageMaker Asynchronous Inference and helped them select the optimal instance size for their needs. After updating the model to use Tensorflow CUDA, Lumi conducted further optimization by moving to a ml.g5.xlarge GPU enabled cluster which improved performance by 82% while reducing costs by 10%.

Best Practices and Recommendations

For businesses looking to implement similar solutions, consider the following best practices:

Optimize Your Container: Follow Lumi’s lead by creating a lean, custom container with only the necessary dependencies. This approach can significantly improve inference speed and reduce costs.

Leverage Asynchronous Processing: For workloads with variable volume or long processing times, asynchronous inference can provide substantial benefits in terms of scalability and cost-efficiency.

Plan for Scale: Design your ML infrastructure with future growth in mind. SageMaker AI’s flexibility allows you to easily add new models and capabilities as your needs evolve.

Model Observability and Governance: When evaluating an inference and hosting platform, consider observability and governance capabilities. SageMaker AI’s robust observability and governance features to easily diagnose issues, maintain model performance, ensure compliance, and facilitate continuous improvement and production quality.

Conclusion

By implementing SageMaker AI, Lumi has achieved significant improvements to their business. They have seen an increase of 56% transaction classification accuracy after moving to the new BERT based model. The ability to handle large batches of transactions asynchronously has dramatically reduced the overall processing time for loan applications by 53%. The auto-scaling and scale-to-zero feature has resulted in substantial cost savings during off-peak hours, improving the cost efficiency of the model by 47%. In addition, Lumi can now easily handle sudden spikes in loan applications without compromising on processing speed or accuracy.

“Amazon SageMaker AI has been a game-changer for our business. It’s allowed us to process loan applications faster, more efficiently and more accurately than ever before, while significantly reducing our operational costs. The ability to handle large volumes of transactions during peak times and scale to zero during quiet periods has given us the flexibility we need to grow rapidly without compromising on performance or customer experience. This solution has been instrumental in helping us achieve our goal of providing fast, reliable loan decisions to small businesses.”

says Paul Pagnan, Chief Technology Officer at Lumi

Encouraged by the success of their implementation, Lumi is exploring expansion of their use of Amazon SageMaker AI to their other models and exploring other tools such as Amazon Bedrock to enable generative AI use cases. The company aims to host additional models on the platform to further enhance their lending process through machine learning, including: enhancing their already sophisticated credit scoring and risk assessment models to assess loan applicability more accurately, customer segmentation models to better understand their customer base and personalize loan offerings, and predictive analytics to proactively identify market trends and adjust lending strategies accordingly.

Resources


About the Authors

Paul Pagnan is the Chief Technology Officer at Lumi. Paul drives Lumi’s technology strategy, having led the creation of its proprietary core lending platform from inception. With a diverse background in startups, Commonwealth Bank, and Deloitte, he ensures Lumi is at the forefront of technology while ensuring its systems are scalable and secure. Under Paul’s leadership, Lumi is setting new standards in FinTech. Follow him on LinkedIn.

Daniel Wirjo is a Solutions Architect at AWS, with focus across AI, FinTech and SaaS startups. As a former startup CTO, he enjoys collaborating with founders and engineering leaders to drive growth and innovation on AWS. Outside of work, Daniel enjoys taking walks with a coffee in hand, appreciating nature, and learning new ideas. Follow him on LinkedIn.

Melanie Li, PhD is is a Senior Generative AI Specialist Solutions Architect at AWS based in Sydney, Australia, where her focus is on working with customers to build solutions leveraging state-of-the-art AI and machine learning tools. She has been actively involved in multiple Generative AI initiatives across APJ, harnessing the power of Large Language Models (LLMs). Prior to joining AWS, Dr. Li held data science roles in the financial and retail industries. Follow her on LinkedIn.

Read More

SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators

Large Language Models (LLMs) have transformed natural language processing, but face significant challenges in widespread deployment due to their high runtime cost. In this paper, we introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random generator to encode and compress model weights. Specifically, for each block of weights, we
find a seed that is fed into a Linear Feedback Shift Register (LFSR) during inference to efficiently generate a random matrix. This matrix is then linearly combined with compressed coefficients to reconstruct the weight block…Apple Machine Learning Research