Reducing hallucinations in LLM agents with a verified semantic cache using Amazon Bedrock Knowledge Bases

Large language models (LLMs) excel at generating human-like text but face a critical challenge: hallucination—producing responses that sound convincing but are factually incorrect. While these models are trained on vast amounts of generic data, they often lack the organization-specific context and up-to-date information needed for accurate responses in business settings. Retrieval Augmented Generation (RAG) techniques help address this by grounding LLMs in relevant data during inference, but these models can still generate non-deterministic outputs and occasionally fabricate information even when given accurate source material. For organizations deploying LLMs in production applications—particularly in critical domains such as healthcare, finance, or legal services—these residual hallucinations pose serious risks, potentially leading to misinformation, liability issues, and loss of user trust.

To address these challenges, we introduce a practical solution that combines the flexibility of LLMs with the reliability of drafted, curated, verified answers. Our solution uses two key Amazon Bedrock services: Amazon Bedrock Knowledge Bases, a fully managed service that you can use to store, search, and retrieve organization-specific information for use with LLMs; and Amazon Bedrock Agents, a fully managed service that you can use to build, test, and deploy AI assistants that can understand user requests, break them down into steps, and execute actions. Similar to how a customer service team maintains a bank of carefully crafted answers to frequently asked questions (FAQs), our solution first checks if a user’s question matches curated and verified responses before letting the LLM generate a new answer. This approach helps prevent hallucinations by using trusted information whenever possible, while still allowing the LLM to handle new or unique questions. By implementing this technique, organizations can improve response accuracy, reduce response times, and lower costs. Whether you’re new to AI development or an experienced practitioner, this post provides step-by-step guidance and code examples to help you build more reliable AI applications.

Solution overview

Our solution implements a verified semantic cache using the Amazon Bedrock Knowledge Bases Retrieve API to reduce hallucinations in LLM responses while simultaneously improving latency and reducing costs. This read-only semantic cache acts as an intelligent intermediary layer between the user and Amazon Bedrock Agents, storing curated and verified question-answer pairs.

When a user submits a query, the solution first evaluates its semantic similarity with existing verified questions in the knowledge base. For highly similar queries (greater than 80% match), the solution bypasses the LLM completely and returns the curated and verified answer directly. When partial matches (60–80% similarity) are found, the solution uses the verified answers as few-shot examples to guide the LLM’s response, significantly improving accuracy and consistency. For queries with low similarity (less than 60%) or no match, the solution falls back to standard LLM processing, making sure that user questions receive appropriate responses.

This approach offers several key benefits:

  • Reduced costs: By minimizing unnecessary LLM invocations for frequently answered questions, the solution significantly reduces operational costs at scale.
  • Improved accuracy: Curated and verified answers minimize the possibility of hallucinations for known user queries, while few-shot prompting enhances accuracy for similar questions.
  • Lower latency: Direct retrieval of cached answers provides near-instantaneous responses for known queries, improving the overall user experience.

The semantic cache serves as a growing repository of trusted responses, continuously improving the solution’s reliability while maintaining efficiency in handling user queries.

Solution architecture

(Figure: solution architecture showing the AWS services used in this solution)

The solution architecture in the preceding figure consists of the following components and workflow. Assume that the question “What date will AWS re:invent 2024 occur?” is in the verified semantic cache, with the corresponding verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024.” Let’s walk through how the solution handles a user’s question.

1. Query processing:

a. User submits a question “When is re:Invent happening this year?”, which is received by the Invoke Agent function.

b. The function checks the semantic cache (Amazon Bedrock Knowledge Bases) using the Retrieve API (see the code sketch after this workflow).

c. Amazon Bedrock Knowledge Bases performs a semantic search and finds a similar question with an 85% similarity score.

2. Response paths: (Based on the 85% similarity score in step 1.c, our solution follows the strong match path)

a. Strong match (similarity score greater than 80%):

i. The Invoke Agent function returns the verified answer “AWS re:Invent 2024 takes place on December 2–6, 2024” directly from the Amazon Bedrock knowledge base, providing a deterministic response.

ii. No LLM invocation is needed, and the response is returned in less than 1 second.

b. Partial match (similarity score 60–80%):

i. The Invoke Agent function invokes the Amazon Bedrock agent and provides the cached answer as a few-shot example for the agent through Amazon Bedrock Agents promptSessionAttributes.

ii. If the question was “What’s the schedule for AWS events in December?”, our solution would provide the verified re:Invent dates to guide the Amazon Bedrock agent’s response with additional context.

iii. Providing the Amazon Bedrock agent with a curated and verified example might help increase accuracy.

c. No match (similarity score less than 60%):

i. If the user’s question isn’t similar to any of the curated and verified questions in the cache, the Invoke Agent function invokes the Amazon Bedrock agent without providing it any additional context from cache.

ii. For example, if the question was “What hotels are near re:Invent?”, our solution would invoke the Amazon Bedrock agent directly, and the agent would use the tools at its disposal to formulate a response.

3. Offline knowledge management:

a. Verified question-answer pairs are stored in an Amazon Simple Storage Service (Amazon S3) bucket, and must be reviewed and updated periodically to make sure that the cache contains the most recent and accurate information.

b. The S3 bucket is periodically synchronized with the Amazon Bedrock knowledge base. This offline batch process makes sure that the semantic cache remains up-to-date without impacting real-time operations.
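
The following is a minimal sketch of the cache lookup in steps 1b and 1c, using the Amazon Bedrock Knowledge Bases Retrieve API through the AWS SDK for Python (Boto3). The cache_kb_id variable and the single-result configuration are illustrative assumptions; the repository’s implementation may differ.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

# Look up the user question in the verified semantic cache (cache_kb_id is assumed)
response = bedrock_agent_runtime.retrieve(
    knowledgeBaseId=cache_kb_id,
    retrievalQuery={"text": "When is re:Invent happening this year?"},
    retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
)

results = response["retrievalResults"]
if results:
    similarity_score = results[0]["score"]         # for example, 0.85
    cached_answer = results[0]["content"]["text"]  # the curated and verified answer

The similarity score of the top result is what the Invoke Agent function compares against the 80% and 60% thresholds described in the response paths.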

Solution walkthrough

After you have the necessary prerequisites in place, use the following steps to set up the solution in your AWS account.

Step 0: Set up the necessary infrastructure

Follow the “Getting started” instructions in the README of the Git repository to set up the infrastructure for this solution. All the following code samples are extracted from the Jupyter notebook in this repository.

Step 1: Set up two Amazon Bedrock knowledge bases

This step creates two Amazon Bedrock knowledge bases. The agent knowledge base stores Amazon Bedrock service documentation, while the cache knowledge base contains curated and verified question-answer pairs. This setup uses the AWS SDK for Python (Boto3) to interact with AWS services.

agent_knowledge_base = BedrockKnowledgeBase(
    kb_name=agent_knowledge_base_name,
    kb_description="Knowledge base used by Bedrock Agent",
    data_bucket_name=agent_bucket_name,
    chunking_strategy="FIXED_SIZE",
    suffix=f'{agent_unique_id}-f'
)

cache_knowledge_base = BedrockKnowledgeBase(
    kb_name=cache_knowledge_base_name,
    kb_description="Verified cache for Bedrock Agent System",
    data_bucket_name=cache_bucket_name,
    chunking_strategy="NONE",  # We do not want to chunk our question-answer pairs
    suffix=f'{cache_unique_id}-f'
)

This establishes the foundation for your semantic caching solution, setting up the AWS resources to store the agent’s knowledge and verified cache entries.

Step 2: Populate the agent knowledge base and associate it with an Amazon Bedrock agent

For this walkthrough, you create an Amazon Bedrock agent specialized in answering questions about Amazon Bedrock. You ingest Amazon Bedrock documentation, in the form of the User Guide PDF, into the agent knowledge base as the primary dataset. After ingesting the data, you create an agent with specific instructions:

agent_instruction = """You are the Amazon Bedrock Agent. You have access to a 
knowledge base with information about the Amazon Bedrock service on AWS. 
Use it to answer questions."""

agent_id = agents_handler.create_agent(
    agent_name,
    agent_description,
    agent_instruction,
    [agent_foundation_model],
    kb_arns=[agent_kb_arn] # Associate agent with our Agent knowledge base
)

This setup enables the Amazon Bedrock agent to use the ingested knowledge to answer questions about Amazon Bedrock. To test it, you can ask a question that isn’t present in the agent’s knowledge base, which causes the LLM to either refuse to answer or hallucinate.

invoke_agent("What are the dates for reinvent 2024?", session_id="test")
# Response: Unfortunately, the dates for the AWS re:Invent 2024 conference have not 
# been announced yet by Amazon. The re:Invent conference is typically held in late 
# November or early December each year, but the specific dates for 2024 are not 
# available at this time. AWS usually announces the dates for their upcoming 
# re:Invent event around 6-9 months in advance.

Step 3: Create a cache dataset with known question-answer pairs and populate the cache knowledge base

In this step, you create a raw dataset of verified question-answer pairs that aren’t present in the agent knowledge base. These curated and verified answers serve as our semantic cache to prevent hallucinations on known topics. Good candidates for inclusion in this cache are:

  1. Frequently asked questions (FAQs): Common queries that users often ask, which can be answered consistently and accurately.
  2. Critical questions requiring deterministic answers: Topics where precision is crucial, such as pricing information, service limits, or compliance details.
  3. Time-sensitive information: Recent updates, announcements, or temporary changes that might not be reflected in the main RAG knowledge base.

By carefully curating this cache with high-quality, verified answers to such questions, you can significantly improve the accuracy and reliability of your solution’s responses. For this walkthrough, use the following example pairs for the cache:

Q: 'What are the dates for reinvent 2024?'
A: 'The AWS re:Invent conference was held from December 2-6 in 2024.'

Q: 'What was the biggest new feature announcement for Bedrock Agents during reinvent 2024?'
A: 'During re:Invent 2024, one of the headline new feature announcements for Bedrock Agents was the custom orchestrator. This key feature allows users to implement their own orchestration strategies through AWS Lambda functions, providing granular control over task planning, completion, and verification while enabling real-time adjustments and reusability across multiple agents.'

You then format these pairs as individual text files with corresponding metadata JSON files, upload them to an S3 bucket, and ingest them into your cache knowledge base. This process makes sure that your semantic cache is populated with accurate, curated, and verified information that can be quickly retrieved to answer user queries or guide the agent’s responses.
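
The following is a minimal sketch of this formatting and upload step. The qa_pairs list and the file naming are illustrative assumptions (reusing the cache_bucket_name variable from Step 1); the exact layout used in the repository may differ.

import json
import boto3

s3 = boto3.client("s3")

qa_pairs = [
    {
        "question": "What are the dates for reinvent 2024?",
        "answer": "The AWS re:Invent conference was held from December 2-6 in 2024.",
    },
]

for i, pair in enumerate(qa_pairs):
    doc_key = f"verified_qa_{i}.txt"
    # Each text file holds one verified question-answer pair as a single, unchunked document
    body = f"Question: {pair['question']}\nAnswer: {pair['answer']}"
    s3.put_object(Bucket=cache_bucket_name, Key=doc_key, Body=body.encode("utf-8"))

    # The companion metadata file follows the <document>.metadata.json naming convention
    metadata = {"metadataAttributes": {"question": pair["question"]}}
    s3.put_object(
        Bucket=cache_bucket_name,
        Key=f"{doc_key}.metadata.json",
        Body=json.dumps(metadata).encode("utf-8"),
    )

After the upload, start an ingestion job on the cache knowledge base so the new pairs become searchable.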

Step 4: Implement the verified semantic cache logic

In this step, you implement the core logic of your verified semantic cache solution. You create a function that integrates the semantic cache with your Amazon Bedrock agent, enhancing its ability to provide accurate and consistent responses. The function performs the following steps (see the sketch after this list):

  1. Queries the cache knowledge base for similar entries to the user question.
  2. If a high similarity match is found (greater than 80%), it returns the cached answer directly.
  3. For partial matches (60–80%), it uses the cached answer as a few-shot example for the agent.
  4. For low similarity (less than 60%), it falls back to standard agent processing.
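
The following is a simplified sketch of that function. The cache_kb_id, agent_id, and agent_alias_id variables and the response parsing are illustrative assumptions; the notebook in the repository may structure this differently.

import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")

def invoke_agent_with_verified_cache(question, session_id="demo-session"):
    # Step 1: Query the cache knowledge base for the closest verified question
    results = bedrock_agent_runtime.retrieve(
        knowledgeBaseId=cache_kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": 1}},
    )["retrievalResults"]
    score = results[0]["score"] if results else 0.0
    cached_answer = results[0]["content"]["text"] if results else ""

    # Step 2: Strong match - return the verified cache entry without invoking the LLM
    if score > 0.80:
        return cached_answer

    # Step 3: Partial match - pass the verified entry as a few-shot example
    agent_kwargs = {}
    if score > 0.60:
        agent_kwargs["sessionState"] = {
            "promptSessionAttributes": {"verified_example": cached_answer}
        }

    # Step 4: Invoke the Amazon Bedrock agent (with or without the cached context)
    response = bedrock_agent_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=question,
        **agent_kwargs,
    )
    # The agent returns a stream of completion events
    return "".join(
        event["chunk"]["bytes"].decode("utf-8")
        for event in response["completion"]
        if "chunk" in event
    )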

This simplified logic forms the core of the semantic caching solution, efficiently using curated and verified information to improve response accuracy and reduce unnecessary LLM invocations.

Step 5: Evaluate results and performance

This step demonstrates the effectiveness of the verified semantic cache solution by testing it with different scenarios and comparing the results and latency. You’ll use three test cases to showcase the solution’s behavior:

  1. Strong semantic match (greater than 80% similarity)
  2. Partial semantic match (60-80% similarity)
  3. No semantic match (less than 60% similarity)

Here are the results:

  1. Strong semantic match (greater than 80% similarity) provides the exact curated and verified answer in less than 1 second.
    %%time
    invoke_agent_with_verified_cache("What were some new features announced for Bedrock Agents during reinvent 2024?")
    
    # Output:
    # Cache semantic similarity log: Strong match with score 0.9176399
    # CPU times: user 20.7 ms, sys: 442 μs, total: 21.1 ms
    # Wall time: 440 ms
    
    # During re:Invent 2024, one of the headline new feature announcements for Bedrock 
    # Agents was the custom orchestrator. This key feature allows users to implement 
    # their own orchestration strategies through AWS Lambda functions, providing 
    # granular control over task planning, completion, and verification while enabling 
    # real-time adjustments and reusability across multiple agents.

  2. Partial semantic match (60–80% similarity) passes the verified answer to the LLM during the invocation. The Amazon Bedrock agent answers the question correctly using the cached answer even though the information is not present in the agent knowledge base.
    %%time
    invoke_agent_with_verified_cache("What are the newest features for Bedrock Agents?") 
    
    # Output:
    # Cache semantic similarity log: Partial match with score 0.6443664
    # CPU times: user 10.4 ms, sys: 0 ns, total: 10.4 ms
    # Wall time: 12.8 s
    
    # One of the newest and most significant features for Amazon Bedrock Agents 
    # announced during re:Invent 2024 was the custom orchestrator. This feature 
    # allows users to implement their own orchestration strategies through AWS 
    # Lambda functions, providing granular control over task planning, completion, 
    # and verification. It enables real-time adjustments and reusability across 
    # multiple agents, enhancing the flexibility and power of Bedrock Agents.

  3. No semantic match (less than 60% similarity) invokes the Amazon Bedrock agent as usual. For this query, the LLM will either refuse to provide the information because it’s not present in the agent’s knowledge base, or will hallucinate and provide a response that is plausible but incorrect.
    %%time
    invoke_agent_with_verified_cache("Tell me about a new feature for Amazon Bedrock Agents")
    
    # Output:
    # Cache semantic similarity log: No match with score 0.532105
    # CPU times: user 22.3 ms, sys: 579 μs, total: 22.9 ms
    # Wall time: 13.6 s
    
    # Amazon Bedrock is a service that provides secure and scalable compute capacity 
    # for running applications on AWS. As for new features for the Bedrock Agents 
    # component, I do not have any specific information on recent or upcoming new 
    # features. However, AWS services are frequently updated with new capabilities, 
    # so it's possible there could be new agent features released in the future to 
    # enhance security, scalability, or integration with other AWS services. Without 
    # being able to consult the Knowledge Base, I cannot provide details on any 
    # particular new Bedrock Agent features at this time.

These results demonstrate the effectiveness of the semantic caching solution:

  1. Strong matches provide near-instant, accurate, and deterministic responses without invoking an LLM.
  2. Partial matches guide the LLM agent to provide a more relevant or accurate answer.
  3. No matches fall back to standard LLM agent processing, maintaining flexibility.

The semantic cache significantly reduces latency for known questions and improves accuracy for similar queries, while still allowing the agent to handle unique questions when necessary.

Step 6: Resource clean up

Make sure to delete the Amazon Bedrock knowledge bases that you created, along with the underlying Amazon OpenSearch Serverless collections, to avoid incurring unnecessary costs.

Production readiness considerations

Before deploying this solution in production, address these key considerations:

  1. Similarity threshold optimization: Experiment with different thresholds to balance cache hit rates and accuracy. This directly impacts the solution’s effectiveness in preventing hallucinations while maintaining relevance.
  2. Feedback loop implementation: Create a mechanism to continuously update the verified cache with new, accurate responses. This helps prevent cache staleness and maintains the solution’s integrity as a source of truth for the LLM.
  3. Cache management and update strategy: Regularly refresh the semantic cache with current, frequently asked questions to maintain relevance and improve hit rates. Implement a systematic process for reviewing, validating, and incorporating new entries to help ensure cache quality and alignment with evolving user needs.
  4. Ongoing tuning: Adjust similarity thresholds as your dataset evolves. Treat the semantic cache as a dynamic component, requiring continuous optimization for your specific use case.

Conclusion

This verified semantic cache approach offers a powerful solution to reduce hallucinations in LLM responses while improving latency and reducing costs. By using Amazon Bedrock Knowledge Bases, you can implement a solution that can efficiently serve curated and verified answers, guide LLM responses with few-shot examples, and gracefully fall back to full LLM processing when needed.


About the Authors

Dheer Toprani (author photo)Dheer Toprani is a System Development Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in large language models, cloud infrastructure, and scalable data systems, focusing on building intelligent solutions that enhance automation and data accessibility across Amazon’s operations. Previously, he was a Data & Machine Learning Engineer at AWS, where he worked closely with customers to develop enterprise-scale data infrastructure, including data lakes, analytics dashboards, and ETL pipelines.

Chaithanya Maisagoni Author PhotoChaithanya Maisagoni is a Senior Software Development Engineer (AI/ML) in Amazon’s Worldwide Returns and ReCommerce organization. He specializes in building scalable machine learning infrastructure, distributed systems, and containerization technologies. His expertise lies in developing robust solutions that enhance monitoring, streamline inference processes, and strengthen audit capabilities to support and optimize Amazon’s global operations.

Rajesh Nedunuri Author PhotoRajesh Nedunuri is a Senior Data Engineer within the Amazon Worldwide Returns and ReCommerce Data Services team. He specializes in designing, building, and optimizing large-scale data solutions. At Amazon, he plays a key role in developing scalable data pipelines, improving data quality, and enabling actionable insights for reverse logistics and ReCommerce operations. He is deeply passionate about generative AI and consistently seeks opportunities to implement AI into solving complex customer challenges.

Karam Muppidi Author PhotoKaram Muppidi is a Senior Engineering Manager at Amazon Retail, where he leads data engineering, infrastructure and analytics for the Worldwide Returns and ReCommerce organization. He has extensive experience developing enterprise-scale data architectures and governance strategies using both proprietary and native AWS platforms, as well as third-party tools. Previously, Karam developed big-data analytics applications and SOX compliance solutions for Amazon’s Fintech and Merchant Technologies divisions.

Read More

LLM continuous self-instruct fine-tuning framework powered by a compound AI system on Amazon SageMaker

Fine-tuning a pre-trained large language model (LLM) allows users to customize the model to perform better on domain-specific tasks or align more closely with human preferences. It is a continuous process to keep the fine-tuned model accurate and effective in changing environments, to adapt to the data distribution shift (concept drift) and prevent performance degradation over time. Continuous fine-tuning also enables models to integrate human feedback, address errors, and tailor to real-world applications. You can use supervised fine-tuning (SFT) and instruction tuning to train the LLM to perform better on specific tasks using human-annotated datasets and instructions. When you have user feedback to the model responses, you can also use reinforcement learning from human feedback (RLHF) to guide the LLM’s response by rewarding the outputs that align with human preferences.

Precise and responsible outputs from fine-tuned LLMs require big efforts from subject matter experts (SMEs). The manual annotation of extensive training data for fine-tuning by human SMEs and collecting user feedback to align LLM responses with human preferences are both resource-heavy and time-intensive. Also, the continuous fine-tuning process requires orchestrating the multiple steps of data generation, LLM training, feedback collection, and preference alignments with scalability, resiliency, and resource efficiency. To address these challenges, we present an innovative continuous self-instruct fine-tuning framework that streamlines the LLM fine-tuning process of training data generation and annotation, model training and evaluation, human feedback collection, and alignment with human preference. This framework is designed as a compound AI system to drive the fine-tuning workflow for performance improvement, versatility, and reusability.

In this post, we introduce the continuous self-instruct fine-tuning framework and its pipeline, and present how to drive the continuous fine-tuning process for a question-answer task as a compound AI system. We use DSPy (Declarative Self-improving Python) to demonstrate the workflow of Retrieval Augmented Generation (RAG) optimization, LLM fine-tuning and evaluation, and human preference alignment for performance improvement.

Overview of the continuous self-instruct fine-tuning framework

The continuous self-instruct fine-tuning framework drives a workflow to customize the foundation model (FM) using human-labeled training samples and human feedback after model inference. This workflow runs on a continuous basis to be adaptive to a changing environment. The following diagram illustrates the workflow.

(Figure: continuous self-instruct fine-tuning workflow)

The workflow consists of the following steps:

  1. Self-instruct supervised fine-tuning – First, we use a human-labeled training dataset to adapt the FM to tasks in a specific domain. Instruction tuning is a popular approach in domain-specific LLM fine-tuning, which trains the FM to follow instructions for a specific task rather than generating the next texts. To address the challenges of the lack of human efforts for data labeling, annotation, and validation, we designed a self-instruct fine-tuning method to synthetically generate training labels by the LLM from a small volume of high-quality human-annotated samples. This process scales up the training dataset used for fine-tuning the FM into a custom LLM.
  2. Human preference alignment – After the model is deployed in the production environment, the process moves into the human-in-the-loop workflow, in which we collect user feedback including satisfaction scores and comments on model response. The human feedback data is not only used for model performance and hallucination measurement, but is also used to further fine-tune the custom model in Step 1 through RLHF. Likewise, to address the challenges of lack of human feedback data, we use LLMs to generate AI grades and feedback that scale up the dataset for reinforcement learning from AI feedback (RLAIF). There are various techniques of preference alignment, including proximal policy optimization (PPO), direct preference optimization (DPO), odds ratio policy optimization (ORPO), group relative policy optimization (GRPO), and other algorithms, that can be used in this process.
  3. Evaluation and continuous learning – The model customization and preference alignment is not a one-time effort. We need to keep monitoring and evaluating the model performance, and restart the process in case of concept shift or model decay.

The overall workflow consists of multiple steps of synthetic data generation, LLM training, feedback collection, preference alignment, and evaluation that involves multiple components and multiple LLMs. In the next section, we discuss using a compound AI system to implement this framework to achieve high versatility and reusability.

Compound AI system and the DSPy framework

With the rise of generative AI, scientists and engineers face a much more complex scenario to develop and maintain AI solutions, compared to classic predictive AI. The paper The Shift from Models to Compound AI Systems highlights that state-of-the-art AI results are increasingly obtained by compound systems with multiple components, not just monolithic models. Compound AI systems are systems that implement AI tasks by combining multiple interacting components. These components can include multiple calls to models, retrievers, or external tools. The following diagram compares predictive AI to generative AI.

(Figure: comparison of predictive AI and compound generative AI systems)

The concept of a compound AI system enables data scientists and ML engineers to design sophisticated generative AI systems consisting of multiple models and components. You can use a module to incorporate prompt engineering and in-context learning to improve RAG performance, and also design a data architecture with tools to gather external data. You can also build an agentic architecture with multiple LLMs, fine-tune the model to achieve higher performance, and orchestrate the LLM access. Besides the efficiency in system design, the compound AI system also enables you to optimize complex generative AI systems, using a comprehensive evaluation module based on multiple metrics, benchmarking data, and even judgements from other LLMs. The optimization is on the holistic end-to-end solution, rather than on each component separately.

To efficiently build and optimize compound AI systems, we introduce DSPy, an open source Python framework for developers to build LLM applications using modular and declarative programming, whether you’re building simple classifiers, sophisticated RAG pipelines, or agentic workflows. It provides algorithms for optimizing LLMs’ prompts and weights, and automates the prompt tuning process, as opposed to the trial-and-error approach performed by humans. DSPy supports iteratively optimizing all prompts involved against defined metrics for the end-to-end compound AI solution.

The DSPy lifecycle is presented in the following diagram in seven steps. It separates the flow of your program (modules) from the parameters (language model prompts and weights) of each step. These modules define the system behavior in a portable, declarative way. The first four steps cover the DSPy programming stage, including defining your task and its constraints, exploring a few examples, and using that to inform your initial pipeline design. When your system works reasonably well, you can run the DSPy evaluation stage (Steps 5 and 6) to collect an initial development set, define your DSPy metric, and use these to iterate on your system more systematically. Afterwards, DSPy introduces new optimizers (compilers) in Step 7, with language model-driven algorithms to tune LLM prompts and weights, based on predefined evaluation metrics.

(Figure: the seven steps of the DSPy lifecycle)

RAG pipeline with continuous fine-tuning in a compound AI system

In this post, we provide an example of a question-answer task, using a RAG pipeline along with the continuous self-instruct fine-tuning framework. We build this as a compound AI system and use DSPy to drive the RAG inference, prompt optimization, LLM fine-tuning, and performance evaluation. The overall workflow is shown in the following diagram.

(Figure: RAG pipeline with continuous fine-tuning in a compound AI system)

The flow starts from a standard RAG pipeline, followed by a few optimizations on the prompts and the RAG retriever. Then we generate the synthetic training dataset from the RAG knowledge base to fine-tune the generator LLM using RAG for performance improvement. Lastly, we use a separate LLM to generate feedback on the fine-tuned model responses, and use it to conduct the preference alignment training by DPO and PPO. The question-answer outputs from each step are measured by the underlying LLM-as-a-judge evaluation module. In this way, we demonstrate the effectiveness of the compound AI system for the continuous optimizing of the pipeline through RAG optimization and the fine-tuning framework.

In the next sections, we demonstrate how to build this workflow, including the RAG pipeline, optimization, instruction fine-tuning, preference alignment, and model evaluation, into a compound AI system using an Amazon SageMaker notebook instance with the DSPy framework and LLMs on Amazon Bedrock. The code from this post and more examples are available in the GitHub repository.

Prerequisites

To create and run this compound AI system in your AWS account, complete the following prerequisites:

  1. Create an AWS account if you don’t already have one.
  2. Set up a SageMaker notebook instance.
  3. Open JupyterLab in this newly created instance.
  4. Clone the GitHub repository and follow the steps explained in the README.
  5. Navigate to the cloned repository and open the notebook folder.
  6. Enable access to models hosted on Amazon Bedrock. For this post, we enable Anthropic’s Claude 3 Sonnet, Mistral 7B, and Meta Llama 3 8B.

Dataset

For the question-answering task, we use the Contract Understanding Atticus Dataset (CUAD), an open legal contract review dataset created with dozens of legal experts from The Atticus Project, which consists of over 13,000 annotations. The synthetic data generation notebook automatically downloads the CUAD_v1 ZIP file and places it in the required folder named cuad_data.

In case of any issues, you can alternatively download the dataset yourself by following the steps in the README file, store it in a folder within the SageMaker notebook instance, and use it to perform the steps in the next section.

Prepare question-answer pairs

The first step is to prepare question-answer pairs from the CUAD document by running synthetic data generation.

We use Anthropic’s Claude 3 Sonnet on Amazon Bedrock to synthetically generate question-answer pairs that are used to query the RAG pipeline in the compound AI system, to demonstrate the improved accuracy after RAG optimization and model fine-tuning. The generated dataset consists of question-answer pairs along with the context [context, question, answer] from the document. We use the question to query the RAG pipeline and the answer as ground truth to evaluate the inference accuracy. Additionally, the question-answer pairs are used as training samples for the model fine-tuning. The following is a sample dataset triplet with context and a question-answer pair.

Context (snippet from the PDF file): THIS STRATEGIC ALLIANCE AGREEMENT (“Agreement”) is made and entered into as of November 6, 2016 (the “Effective Date”) by and between Dialog Semiconductor (UK) Ltd., a corporation organized under the laws of England and Wales, having its principal office at 100 Longwater Avenue, Green Park, Reading, RG2 6GP, United Kingdom (“DIALOG”) and Energous Corporation, a Delaware corporation, having its principal office at 3590 North First Street, Suite 210, San Jose, CA 95134 (“ENERGOUS”)

Question: What is the date of the contract?

Answer: November 6, 2016
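
As a minimal sketch, such a triplet can be represented as a DSPy Example, with the question marked as the input field (the context string below is shortened from the sample above):

import dspy

sample = dspy.Example(
    context="THIS STRATEGIC ALLIANCE AGREEMENT (\"Agreement\") is made and entered into as of November 6, 2016 ...",
    question="What is the date of the contract?",
    answer="November 6, 2016",
).with_inputs("question")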

Create a RAG pipeline

We implement a standard RAG pipeline with DSPy using the following components to create the vector database, set up context retrieval, and generate the answer:

  1. Configure DSPy to use LLMs on Amazon Bedrock as the RAG generator model:
dsp_bedrock = dspy.Bedrock(region_name='us-west-2')
claude_sonnet_model_id = "anthropic.claude-3-sonnet-20240229-v1:0"
bedrock_sonnet = dspy.AWSAnthropic(aws_provider=dsp_bedrock,
                                   model=claude_sonnet_model_id,
                                   max_new_tokens=4096,
                                   max_tokens=4096)
  2. Process the dataset to generate logical and syntactically readable chunks. The size and overlap percentage can be empirically determined based on the dataset (see the chunking sketch after this list). For more flexibility, you can generate multiple files from the dataset file and make each file one chunk.
  3. To set up a RAG retriever, we select ChromaDB as a vector store, and use DSPy’s ChromadbRM module as the retriever model:
titan_embed_model_id = "amazon.titan-embed-text-v2:0"
bedrock_ef = AmazonBedrockEmbeddingFunction(session=session, 
                                            model_name=titan_embed_model_id)
collection_name = "contexts"
persist_dir = "cuad_db/"
rm = ChromadbRM(collection_name=collection_name,
                persist_directory=persist_dir,
                embedding_function=bedrock_ef,
                k=3) 
  4. Using these components, we orchestrate a DSPy RAG pipeline to clean the context, generate the answer, and use the LLM-as-a-judge to score the generated answer with respect to the ground truth:
class GenerateAnswer(dspy.Signature):
   """Answer questions with short factoid answers."""
   context = dspy.InputField(desc="may contain relevant facts")
   question = dspy.InputField()
   answer = dspy.OutputField(desc="often between 1 and 5 words")

class RAG(dspy.Module):
   def __init__(self, num_passages=3):
      super().__init__()
      self.retrieve = ChromadbRM("contexts", "./chroma", k=num_passages)
      self.generate_answer = dspy.ChainOfThought(GenerateAnswer)
   def forward(self, question):
      # Retrieve once, then normalize the passages before answer generation
      passages = self.retrieve(question).passages
      context = [unicodedata.normalize("NFKD", r) for r in passages]
      prediction = self.generate_answer(context=context, question=question)
      return dspy.Prediction(context=context, answer=prediction.answer)
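
The following is a minimal sketch of the chunking referenced in step 2, using illustrative size and overlap values; cuad_document_text is an assumed variable holding the raw contract text, and the repository may chunk the dataset differently.

def chunk_text(text, chunk_size=1000, overlap_pct=0.1):
    """Split text into fixed-size character chunks with a percentage overlap."""
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text(cuad_document_text)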

RAG optimization with DSPy

The next step is to perform RAG optimization with DSPy. DSPy provides the Optimizer module, an algorithm that can tune the parameters of a DSPy program (the prompts and language model weights) to maximize the metrics you specify. It takes in a training set to bootstrap the selective training examples, and is based on a metric function that measures proximity to or matches against the ground truth. With these, we can compile the RAG pipeline module with a defined optimizer instance to conduct the optimization.

In this post, we use DSPy Optimizer to learn how to generate the prompt to improve the RAG response accuracy. Because our dataset size is low (fewer than 100 examples), we select the BootstrapFewShot teleprompter to compile the RAG prompts and overall pipeline, and use the synthetic dataset with ground truth and the LLM-as-a-judge metric function we defined in the previous sections:

def validate_context_and_answer(example, pred, trace=None):
   answer_EM = dspy.evaluate.answer_exact_match(example, pred)
   answer_PM = dspy.evaluate.answer_passage_match(example, pred)
   answer_LLMJudge = factuality_metric(example, pred)
   return answer_LLMJudge or answer_EM or answer_PM

rag_lm = RAG()
teleprompter = BootstrapFewShot(metric=validate_context_and_answer)
compiled_rag = teleprompter.compile(rag_lm, trainset=trainset)

The context retrieval is crucial to the overall RAG accuracy. To evaluate the RAG optimization we’ve described, we create a retriever evaluation by the LLM-as-a-judge to understand how well the retriever is able to pull out the relevant chunks for the incoming user question. The LLM judge is defined in the RetrievalJudge class:

class RetrievalJudge(dspy.Signature):
   """Judge given the question to be answered, check if the groundtruth answer can be derived from the predicted context.  Answer either Retrieved[True] or Retrieved[False]"""
   context = dspy.InputField(desc="Context for the prediction")
   question = dspy.InputField(desc="Question to be answered")
   groundtruth_answer = dspy.InputField(desc="groundtruth answer for the question")
   retrieval_correctness = dspy.OutputField(desc="Can the groundtruth answer be derived from the predicted context?", prefix="Retrieved[True/False]:")

retrieval_judge = dspy.ChainOfThought(RetrievalJudge)

Then we define the metric to measure the retrieval by using the RetrievalJudge, and use the DSPy Evaluate module to generate the accuracy score for retrieval:

def retrieval_metric(example, pred):
   retrieval = retrieval_judge(question=example.question, groundtruth_answer=example.answer, context=pred.context)
   llm_retriever_ans = bool("Retrieved[True]" in retrieval.retrieval_correctness
                            or '100% True' in retrieval.retrieval_correctness
                            or '100% retrieved correct' in retrieval.retrieval_correctness
                            or 'True.' in retrieval.retrieval_correctness)
   return llm_retriever_ans

rag_retrieval_score = Evaluate(compiled_rag, num_threads = 1, metric=retrieval_metric)

Configure the continuous fine-tuning framework

After the RAG optimization, the compound AI system has the instruction tuning and preference alignment modules, driven by the continuous fine-tuning framework. This includes using the synthetically generated dataset to train the LLM to follow question-answer instructions by SFT, and generating feedback of RAG responses by AI (another LLM) used for RLAIF with PPO and preference alignment with DPO and ORPO. In this step, we use Parameter Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) to reduce the requirement of compute resources and accelerate the training process.
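
The following is a minimal sketch of such a PEFT LoRA setup with the Hugging Face peft and transformers libraries. The hyperparameters and the meta-llama/Meta-Llama-3-8B model ID are illustrative assumptions; the training scripts in the repository may use different values.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative LoRA hyperparameters
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the LoRA updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()   # only the LoRA adapter weights are trainable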

At the time of writing, the DSPy Optimization module supports distillation of a prompt-based DSPy program into LLM weight updates using BootstrapFinetune, and does not yet support the fine-tuning methods we defined in the compound AI system. Therefore, we conducted the fine-tuning (instruction tuning and preference alignment) on a Meta Llama 3 8B model separately; refer to the following GitHub repository for more details. With the compound AI system design, we are able to take the fine-tuning results back into the DSPy pipeline, use the LLM-as-a-judge evaluation function to generate the accuracy scores, and benchmark with the standard and optimized RAG inferences. This demonstrates the flexibility and interoperability of the compound AI system, which allows us to seamlessly replace one module with an external component without requiring changes to the entire pipeline.

The following diagram illustrates the workflow.

(Figure: evaluation workflow for the fine-tuned models)

Define an evaluation approach with DSPy

DSPy provides an Evaluate module for evaluating the compound AI system output by using user-defined metrics. In this post, we use LLM-as-a-judge to evaluate the system output and create the corresponding metrics for benchmarking the accuracy of standard RAG, optimized RAG, and fine-tuned models. Complete the following steps:

  1. Load the dataset for evaluation in the Example data type. Examples are similar to Python dictionaries but with added utilities such as the dspy.Prediction as a return value. For example:
gt_answer = <ground truth of the answer>
pred_answer = <answer from RAG and/or fine-tuned model>
dspy_data = dspy.Example(gt_answer=gt_answer, pred_answer=pred_answer).with_inputs("gt_answer", "pred_answer")
  2. Define the LLM-as-a-judge class to adjudicate whether the predicted answer semantically matches the ground truth of the answer. For example, the following FactualityJudge_1 class provides a score between 0 and 1; 0 means a complete mismatch and 1 means a perfect match.
class FactualityJudge_1(dspy.Signature):
   """Judge if the predicted answer is semantically match the groundtruth answer. Provide a score between 0 and 1, 0 means completely mismatch and 1 means perfectly match. In the response, only present the score, DO NOT add any preambles."""
   groundtruth_answer = dspy.InputField(desc="groundtruth answer")
   predicted_answer = dspy.InputField(desc="predicted answer")
   factually_correct = dspy.OutputField(desc="Is the predicted answer factually correct and semantically similar to the groundtruth answer?")
  3. Define the evaluation metrics from the LLM judge, using DSPy metrics, to mark whether the predicted answer is true or not. For example, the following function returns the accuracy score based on the output of FactualityJudge_1:
factualityJudge_1 = dspy.ChainOfThought(FactualityJudge_1)

def factuality_metric_1(gt_answer, pred_answer):
   pred_answer = gt_answer.pred_answer
   gt_answer = gt_answer.gt_answer
   factual_metric = factualityJudge_1(groundtruth_answer=gt_answer, predicted_answer=pred_answer)
   llm_judge_ans = float(factual_metric[0].factually_correct)
   print(f"llm_judge_ans = {llm_judge_ans}")
   return llm_judge_ans

metric_LLM_1 = factuality_metric_1
  4. Use the dspy.Evaluate module to generate an accuracy score using the LLM-as-a-judge metrics defined in the previous step:
evaluate_llm_judge = Evaluate(devset= dspy_data, metric=metric_LLM_1, num_threads=1)

This evaluation process should be conducted on a continuous basis in the compound AI system driven by self-instruct fine-tuning, to make sure the overall performance remains stable despite the changes in the environment or the introduction of new data.

Benchmark RAG and LLM fine-tuning with DSPy

We benchmark the approaches presented in this post using the LLM-as-a-judge evaluation function defined in the previous section with the following settings.

The benchmarking covers five methods: standard RAG, optimized RAG, an LLM fine-tuned by instruction tuning (SFT), and LLMs fine-tuned by DPO and ORPO based on AI feedback. For each method, the LLM judge provides a decimal accuracy score in the range of 0 to 1.

The standard RAG uses Amazon Titan Text Embeddings V2 as the embedding model and Anthropic’s Claude 3 Haiku as the generator model. The RAG compilation uses 32 question-answer pairs to optimize the prompts, and the same dataset is used for inference. The fine-tuning by SFT, DPO, and ORPO is performed on the Meta Llama 3 8B FM, using training samples synthetically generated from the CUAD document.

The results are presented below, along with a comparison chart. The different methods demonstrate different levels of improvement. The improvement is calculated as a percentage: (accuracy of new method – accuracy of standard RAG) / (accuracy of standard RAG) × 100%.
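
For example, the following minimal calculation reproduces the first improvement figure reported for the DSPy-optimized RAG with Claude 3 Haiku:

standard_rag_accuracy = 0.3969
optimized_rag_accuracy = 0.6656
improvement = (optimized_rag_accuracy - standard_rag_accuracy) / standard_rag_accuracy * 100
print(f"{improvement:.2f}%")  # 67.70%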

The RAG pipeline optimized by DSPy improved the accuracy and reduced hallucinations.

Accuracy by LLM Judge (0–1):

  • Standard RAG with Claude 3 Haiku: 0.3969
  • RAG with Claude 3 Haiku optimized by DSPy: 0.6656 (67.70% improvement)
  • Standard RAG with Claude 3 Sonnet: 0.3031
  • RAG with Claude 3 Sonnet optimized by DSPy: 0.6375 (110.33% improvement)

The custom LLM trained by SFT yielded higher accuracy than the standard RAG.

Accuracy by LLM Judge (0–1):

  • Standard RAG with Claude 3 Haiku: 0.3969
  • Standard RAG with Claude 3 Sonnet: 0.3031
  • SFT-tuned Meta Llama 3 8B: 0.4813 (21.26% improvement over the Haiku RAG, 58.79% over the Sonnet RAG)

The custom LLM through preference alignment from human and AI feedback (DPO and ORPO) further improved the model performance. The fine-tuned small model (Meta Llama 3 8B) outperformed the standard RAG pipeline with the medium-sized (Anthropic’s Claude 3 Haiku) and larger (Anthropic’s Claude 3 Sonnet) generator models, and was comparable with the prompt-optimized RAG using ground truth data.

Accuracy by LLM Judge (0–1):

  • Standard RAG with Claude 3 Haiku: 0.3969
  • Standard RAG with Claude 3 Sonnet: 0.3031
  • DPO-tuned Meta Llama 3 8B: 0.6719 (69.29% improvement over the Haiku RAG, 121.68% over the Sonnet RAG)
  • ORPO-tuned Meta Llama 3 8B: 0.6812 (71.63% improvement over the Haiku RAG, 124.74% over the Sonnet RAG)

The following charts compare the accuracy across all tested methods.

(Figure: accuracy comparison across all tested methods)

The preceding results were generated from a small dataset (32 question-answer pairs). You can use a larger sample set with more question-answer pairs to conduct the benchmarking and compare your own results.

Clean up

Make sure to clean up the following resources to avoid incurring additional costs:

  1. Delete Amazon Simple Storage Service (Amazon S3) buckets created for data storage and resource sharing.
  2. Back up the Jupyter notebooks in the SageMaker notebook instance.
  3. Shut down and delete the SageMaker notebook instance.

Cost considerations

Consider the following costs from the solution deployed on AWS:

  • You will incur charges for LLM inference on Amazon Bedrock. For more details, refer to Amazon Bedrock pricing.
  • You will incur charges for storing files in S3 buckets. For more details, refer to Amazon S3 pricing.
  • You will incur charges for your SageMaker notebook instance. For more details, refer to Amazon SageMaker pricing.

Conclusion

In this post, we presented the continuous self-instruct fine-tuning framework as a compound AI system implemented by the DSPy framework. The framework first generates a synthetic dataset from the domain knowledge base and documents for self-instruction, then drives model fine-tuning through SFT, and introduces the human-in-the-loop workflow to collect human and AI feedback to the model response, which is used to further improve the model performance by aligning human preference through reinforcement learning (RLHF/RLAIF).

We demonstrated the framework for a question-answer task with a RAG pipeline, which improved the end-to-end response accuracy. The workflow is implemented by the DSPy framework; the overall strategy is to use the dspy.Module to connect all the components (RAG pipeline, prompt optimization, LLMs fine-tuned by SFT and RLHF/RLAIF, performance evaluation) together into a compound AI system. Each module can be seamlessly maintained, updated, and replaced without affecting other components in the system. This robust and versatile system design strengthens control and trust through modular design, and increases flexibility and adaptability to changing environments and data sources.

You can implement this continuous fine-tuning framework for LLM performance improvement for your own business use cases, with a compound AI system that provides high flexibility and interoperability. For more details, follow the examples in our GitHub repository.


About the Authors

YunfeiYunfei Bai is a Principal Solutions Architect at AWS. With a background in AI/ML, data science, and analytics, Yunfei helps customers adopt AWS services to deliver business results. He designs AI/ML and data analytics solutions that overcome complex technical challenges and drive strategic objectives. Yunfei has a PhD in Electronic and Electrical Engineering. Outside of work, Yunfei enjoys reading and music.

Shayan Ray is an Applied Scientist at Amazon Web Services. His area of research is all things natural language (like NLP, NLU, and NLG). His work has been focused on conversational AI, task-oriented dialogue systems, and LLM-based agents. His research publications are on natural language processing, personalization, and reinforcement learning.

Jose Cassio dos Santos Junior is a Senior Data Scientist member of the MLU team. He is responsible for Curriculum Development for Advanced Modules. As a previous Senior Data Scientist on the AWS LATAM Professional Services Data Science team, he has over 20 years of experience working as a software engineer and more than 10 years of teaching experience at colleges and as an instructor for Linux certification preparation and Microsoft Innovation Center bootcamps. As a business process management expert, he participated in BPO projects for more than 7 years. He holds a Master’s degree in Computer Engineering, a Bachelor’s degree in Physics, and a Bachelor’s degree in Business Administration, specialized in IT Quantitative Methods.

Read More

Maximize your file server data’s potential by using Amazon Q Business on Amazon FSx for Windows

Organizations need efficient ways to access and analyze their enterprise data. Amazon Q Business addresses this need as a fully managed generative AI-powered assistant that helps you find information, generate content, and complete tasks using enterprise data. It provides immediate, relevant information while streamlining tasks and accelerating problem-solving.

Amazon FSx for Windows File Server is a fully managed Windows file system that provides high-performance file storage for Windows-based applications. You can use Amazon FSx to lift and shift your on-premises Windows file server workloads to the cloud, taking advantage of the scalability, durability, and cost-effectiveness of AWS while maintaining full compatibility with your existing Windows applications and tooling.

Amazon Q Business is designed to be secure and private, seamlessly integrating with your existing identity provider (IdP). It works directly with your identities, roles, and permission sets, making sure users can’t access data they are not authorized to. Additionally, Amazon Q Business seamlessly integrates with multiple enterprise data stores, including FSx for Windows File Server, enabling you to index documents from file server systems and perform tasks such as summarization, Q&A, or data analysis of large numbers of files effortlessly.

In this post, we demonstrate how to use the Amazon Q connector for FSx for Windows File Server, explore a practical use case, and provide step-by-step instructions to help you get started and gain insights out of your data stored in FSx for Windows File Server.

Overview of the Amazon Q data source connector

A data source connector is a mechanism for integrating and synchronizing data from multiple repositories, including Microsoft SharePoint, Salesforce, Amazon Simple Storage Service (Amazon S3) buckets, and even your internal FSx for Windows File Server, into a single index. Amazon Q Business offers multiple data source connectors that can connect to your data sources and help you create your generative AI solution with minimal configuration. For a list of supported connectors, see Supported connectors.

Supported document types

Amazon Q boasts impressive versatility, supporting a wide range of document types stored in various places in your environment, including Windows shares (FSx for Windows File Server). Amazon Q can ingest and understand everything from common formats like plaintext, PDF, HTML, XML, and JSON to Microsoft formats like Excel, Word, and PowerPoint. This provides a comprehensive search experience for your enterprise users.

Secure access with supported authentication types

Security is job zero at AWS, and Amazon Q has been built keeping that in mind. It supports a variety of authentication types, seamlessly integrating with your existing identity management systems. Whether you use single sign-on (SSO) or a custom authentication solution, Amazon Q can adapt to your specific needs.

Fine-grained control with ACLs and identity crawling

For organizations with highly sensitive data, Amazon Q offers an extra layer of security. Amazon Q Business supports crawling access control lists (ACLs) for document security by default. When you connect an Amazon FSx (Windows) data source to Amazon Q Business, it crawls ACL information attached to a document (user and group information) from the directory service of the Amazon FSx instance.

Overview of solution

The following diagram shows a high-level architecture of how AWS Managed Active Directory users, through AWS IAM Identity Center, can access and interact with an Amazon Q Business application. This enables an authenticated user to securely and privately interact with the application and gain insights from the enterprise data stored in FSx for Windows File Server, using the Amazon Q Business web experience from their web browser.

In this post, we walk you through the process of integrating Amazon Q Business with FSx for Windows File Server to extract meaningful insights from your file system using natural language processing (NLP). This solution enables you to interact with your file system data using conversational AI, making information discovery more intuitive and efficient.

To set up your Amazon Q Business application, complete the following high-level steps:

  1. Create a new Amazon Q application.
  2. Select the retriever.
  3. Add a data source (FSx for Windows File Server).
  4. Synchronize your file system data.

Lastly, we demonstrate the application functionality by testing its access for two different users.

Prerequisites

To implement this solution, you should have an AWS account with administrative privileges.

Follow the instructions in the GitHub repository’s README file to provision the infrastructure required for exploring the Amazon Q connector for FSx for Windows File Server.

Create an Amazon Q Business application

Complete the following steps to create a new Amazon Q Business application:

  1. On the Amazon Q Business console, choose Applications in the navigation pane.
  2. Choose Create application.

  3. For Application name, enter a name (for example, anycompany-filesystem-knowledgebase).
  4. For Access management method, select AWS IAM Identity Center.

If you completed the prerequisites, then IAM Identity Center is already enabled, and you should see the instance ARN listed.

  5. Under Quick start user, for Select user, choose your users.
  6. Leave Select subscription as Q Business Pro.
  7. For Application details, use the default values.
  8. Choose Create.

In the next step, you will select the data source to retrieve and index the data.

Select the retriever

In this step, you select the retriever to connect data sources to the application. There are two options: use a native retriever or use Amazon Kendra. For this example, we use a native retriever.

  1. On the application details page, under Q Recommendations, choose Data sources.

  2. Choose Select retriever.

  3. For Retrievers, select Native.
  4. For Index provisioning, select Enterprise.
  5. For Number of units, enter 1.
  6. Choose Confirm.

Add a data source

Complete the following steps to add a data source:

  1. On the application details page, choose Add data source.
  2. Search for Amazon FSx and choose the plus sign next to Amazon FSx (Windows).

  3. In the Name and description section, enter a name (for example, anycompany-filesystem-source) and an optional description.
  4. In the Source section, for Amazon FSx file system ID, choose the file system ID you created as a prerequisite.
  5. In the Authorization section, leave as default (ACLs are enabled for the connector).

  6. In the Authentication section, for AWS Secrets Manager secret, choose the AWS Secrets Manager secret that holds the active directory credentials used to communicate with Amazon FSx and crawl the file system (QBusiness-fsx-creds).
  7. In the Configure VPC and security group section, provide the following information:
    • For Virtual Private Cloud (VPC), choose the virtual private cloud (VPC) created as a prerequisite (amazon-connector-for-win-fsx-blog-vpc).
    • For Subnets, choose the private subnets that hold the FSx for Windows File Server and active directory instance.
    • For VPC security groups, choose your security group (<stack-name>-DefaultSecurityGroup).

  8. In the IAM role section, provide the following information:
    1. For IAM role, choose Create a new service role.
    2. For Role name, enter a name for the role.
  9. In the Sync scope section, provide the following information:
    1. For Maximum file size, use the default option of 50 MB.
    2. Under Regex patterns, you can add inclusion and exclusion patterns. For this post, we add an inclusion pattern for the PDF file type, so the Amazon Q crawler will include PDF files.

  10. In the Sync mode section, select Full sync.

Full sync is preferable for the first sync; for subsequent runs, you can choose only the modified data.

  11. In the Sync run schedule section, for Frequency, choose Run on demand.

You also have the option to run the sync on a recurring basis like hourly or daily.

  12. In the Tags section, you can optionally add tags.

  13. In the Field mappings section, keep the default field mappings.

The Amazon Q connector offers seven fields. Modifying field mappings and adding custom fields will be available after you create the application and retriever. For more information on the field mappings, refer to Amazon FSx (Windows) data source connector field mappings.

  14. Choose Add data source.

Synchronize your file system data

When the data source is successfully created, a banner message appears. In the banner message (or on the data source details page), choose Sync now to sync your file system data.

You can monitor the status of the sync, which includes direct links to Amazon CloudWatch logs.

The sync can take a few minutes to a few hours to complete. Sync speeds are limited by factors such as remote repository throughput and throttling, network bandwidth, and the size of documents.
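
If you want to trigger and monitor the sync programmatically rather than from the console, the following is a minimal boto3 sketch; the IDs are placeholders, and the response fields (such as history and status) are assumptions based on the Amazon Q Business ListDataSourceSyncJobs API, so check the API reference for your boto3 version.

import time

import boto3

qbusiness = boto3.client("qbusiness", region_name="us-east-1")

# Placeholder IDs for the application, index, and data source created earlier
application_id = "your-application-id"
index_id = "your-index-id"
data_source_id = "your-data-source-id"

# Start a sync job for the FSx data source
qbusiness.start_data_source_sync_job(
    applicationId=application_id,
    indexId=index_id,
    dataSourceId=data_source_id,
)

# Poll the most recent sync job until it finishes
while True:
    jobs = qbusiness.list_data_source_sync_jobs(
        applicationId=application_id,
        indexId=index_id,
        dataSourceId=data_source_id,
    )
    history = jobs.get("history", [])
    status = history[0]["status"] if history else "UNKNOWN"
    print(f"Sync status: {status}")
    if status in ("SUCCEEDED", "FAILED", "ABORTED"):
        break
    time.sleep(60)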

When the sync is complete, you should see the stats on the scan, which includes the number of items scanned and failed.

For this post, we have two Active Directory groups, ml-engineers and security-engineers. Each group has one user (John Doe and Jane Smith, respectively), and each user has access to only one whitepaper based on their group (Choosing a generative AI service and AWS Security Incident Response Guide, respectively). The following diagram illustrates this access.

Validate the Amazon Q application functionality

Now that you have completed the setup, you can validate the application functionality by testing the access controls. We test the access of two users, John Doe and Jane Smith, who are users of the ml-engineers group and security-engineers group, respectively. You can retrieve the user name and password for each user from Secrets Manager. The secret name for John Doe is jdoe, and for Jane Smith, it’s jsmith.

  1. On the application details page, in the Web experience settings section, choose the link for the deployed URL.

  2. Sign in as John Doe.

A successful login directs you to the Amazon Q Business chat interface. This window serves as the main workspace where users interact with the application, as shown in the following screenshot.

With the test configuration, John Doe has access to only one document: generative-ai-on-aws-how-to-choose.pdf. You can test the access controls by asking questions about this whitepaper through the chat interface. This restricted access demonstrates the effective implementation of document-level permissions.

  3. For our first question, we ask What are the key factors to consider when choosing a generative AI service?

The following screenshot shows the response.

  4. Next, we ask Does Amazon Bedrock provide an option to customize the model?

The response includes citations from Amazon Q with reference to the source data.

Testing confirms that John Doe successfully receives responses to questions about content from generative-ai-on-aws-how-to-choose.pdf. You can ask additional questions about generative AI services, such as:

  • What are the generative AI service offerings from AWS?
  • What is Amazon Q optimized for?
  • What are critical factors to consider when choosing an appropriate foundational model?

Next, we test access to the security incident response guide.

  5. We ask What are the four phases of the AWS security incident response process?

When asking questions about security topics from aws-security-incident-response-guide.pdf, the system returns no results. This behavior validates that document indexing respects the configured access permissions, and users can only access content they’re authorized to view.

  6. To validate access controls for the security-engineers user group, log in as Jane Smith.

You can test with questions about security incident response:

  • What are the key objectives of an AWS security incident response plan?
  • What are the four phases of the AWS security incident response process?
  • What are the recommended steps for containing and eradicating a security incident in AWS?
  • What types of data should be collected during an AWS security incident investigation?
  • What are the key considerations for recovering from an AWS security incident?

Troubleshooting

If you encounter issues during the setup or operation of your Amazon Q Business application with FSx for Windows File Server, refer to the detailed troubleshooting guide in the README file. The guide provides solutions for common configuration challenges and operational issues you might experience.

Clean up

To avoid ongoing charges, we recommend cleaning up the resources you created while following this guide. For step-by-step cleanup instructions, refer to the README file.

Conclusion

In this post, we provided an overview of the Amazon Q connector for FSx for Windows File Server and how you can use it for safe and seamless integration of generative AI assistance with your enterprise data source. By using Amazon Q Business in your organization, you can enable employees to be more data-driven, efficient, prepared, and productive. Lastly, we demonstrated how simple NLP search through Amazon Q Business enhances your ability to discover insights from your enterprise data more quickly and respond to your needs faster.

The Amazon Q Business application offers a compelling solution for organizations seeking to enhance their data-driven capabilities. By using its NLP and secure data source integration features, you can unlock the true value of your data and empower your teams to be more productive and efficient in their work.

To learn more about the Amazon Q connector for FSx for Windows File Server, refer to Connecting Amazon FSx (Windows) to Amazon Q Business.


About the Authors

Manjunath Arakere is a Senior Solutions Architect on the Worldwide Public Sector team at AWS, based in Atlanta, Georgia. He partners with AWS customers to design and scale well-architected solutions, supporting their cloud migrations and modernization initiatives. With extensive experience in the field, Manjunath specializes in migration strategies, application modernization, serverless, and Generative AI (GenAI). He is passionate about helping organizations leverage the full potential of cloud computing to drive innovation and operational efficiency. Outside of work, Manjunath enjoys outdoor runs, tennis, volleyball, and challenging his son in PlayStation soccer games.

Imtranur Rahman is a Sr. Solutions Architect on the WWPS team with more than 14 years of experience. Imtranur works with large AWS Global SI partners and helps them build their cloud strategy and drive broad adoption of Amazon’s cloud computing platform. Imtranur specializes in containers, DevSecOps, GitOps, microservices-based applications, hybrid application solutions, and application modernization, and loves innovating on behalf of his customers. He is highly customer obsessed and takes pride in providing the best solutions through his extensive expertise.

Read More

Generate synthetic counterparty (CR) risk data with generative AI using Amazon Bedrock LLMs and RAG

Generate synthetic counterparty (CR) risk data with generative AI using Amazon Bedrock LLMs and RAG

Data is the lifeblood of modern applications, driving everything from application testing to machine learning (ML) model training and evaluation. As data demands continue to surge, the emergence of generative AI models presents an innovative solution. These large language models (LLMs), trained on expansive data corpora, possess the remarkable capability to generate new content across multiple media formats—text, audio, and video—and across various business domains, based on provided prompts and inputs.

In this post, we explore how you can use these LLMs with advanced Retrieval Augmented Generation (RAG) to generate high-quality synthetic data for a finance domain use case. You can use the same technique for synthetic data for other business domain use cases as well. For this post, we demonstrate how to generate counterparty risk (CR) data, which would be beneficial for over-the-counter (OTC) derivatives that are traded directly between two parties, without going through a formal exchange.

Solution overview

OTC derivatives are typically customized contracts between counterparties and include a variety of financial instruments, such as forwards, options, swaps, and other structured products. A counterparty is the other party involved in a financial transaction. In the context of OTC derivatives, the counterparty refers to the entity (such as a bank, financial institution, corporation, or individual) with whom a derivative contract is made.

For example, in an OTC swap or option contract, one entity agrees to terms with another party, and each entity becomes the counterparty to the other. The responsibilities, obligations, and risks (such as credit risk) are shared between these two entities according to the contract.

As financial institutions continue to navigate the complex landscape of CR, the need for accurate and reliable risk assessment models has become paramount. For our use case, ABC Bank, a fictional financial services organization, has taken on the challenge of developing an ML model to assess the risk of a given counterparty based on their exposure to OTC derivative data.

Building such a model presents numerous challenges. Although ABC Bank has gathered a large dataset from various sources and in different formats, the data may be biased, skewed, or lack the diversity needed to train a highly accurate model. The primary challenge lies in collecting and preprocessing the data to make it suitable for training an ML model. Deploying a poorly suited model could result in misinformed decisions and significant financial losses.

We propose a generative AI solution that uses the RAG approach. RAG is a widely used approach that enhances LLMs by supplying extra information from external data sources not included in their original training. The entire solution can be broadly divided into three steps: indexing, data generation, and validation.

Data indexing

In the indexing step, we parse, chunk, and convert the representative CR data into vector format using the Amazon Titan Text Embeddings V2 model and store this information in a Chroma vector database. Chroma is an open source vector database known for its ease of use, efficient similarity search, and support for multimodal data and metadata. It offers both in-memory and persistent storage options, integrates well with popular ML frameworks, and is suitable for a wide range of AI applications. It is particularly beneficial for smaller to medium-sized datasets and projects requiring local deployment or low resource usage. The following diagram illustrates this architecture.

Here are the steps for data indexing:

  • The sample CR data is segmented into smaller, manageable chunks to optimize it for embedding generation.
  • These segmented data chunks are then passed to a method responsible for both generating embeddings and storing them efficiently.
  • The Amazon Titan Text Embeddings V2 API is called upon to generate high-quality embeddings from the prepared data chunks.
  • The resulting embeddings are then stored in the Chroma vector database, providing efficient retrieval and similarity searches for future use.

Data generation

When the user requests data for a certain scenario, the request is converted into vector format and then looked up in the Chroma database to find matches with the stored data. The retrieved data is augmented with the user request and additional prompts to Anthropic’s Claude Haiku on Amazon Bedrock. Anthropic’s Claude Haiku was chosen primarily for its speed, processing over 21,000 tokens per second, which significantly outpaces its peers. Moreover, Anthropic’s Claude Haiku’s efficiency in data generation is remarkable, with a 1:5 input-to-output token ratio. This means it can generate a large volume of data from a relatively small amount of input or context. This capability not only enhances the model’s effectiveness, but also makes it cost-efficient for our application, where we need to generate numerous data samples from a limited set of examples. Anthropic’s Claude Haiku LLM is invoked iteratively to efficiently manage token consumption and help prevent reaching the maximum token limit. The following diagram illustrates this workflow.

Here are the steps for data generation:

  • The user initiates a request to generate new synthetic counterparty risk data based on specific criteria.
  • The Amazon Titan Text Embeddings V2 LLM is employed to create embeddings for the user’s request prompts, transforming them into a machine-interpretable format.
  • These newly generated embeddings are then forwarded to a specialized module designed to identify matching stored data.
  • The Chroma vector database, which houses previously stored embeddings, is queried to find data that closely matches the user’s request.
  • The identified matching data and the original user prompts are then passed to a module responsible for generating new synthetic data.
  • Anthropic’s Claude Haiku 3.0 model is invoked, using both the matching embeddings and user prompts as input to create high-quality synthetic data.
  • The generated synthetic data is then parsed and formatted into a .csv file using the Pydantic library, providing a structured and validated output.
  • To confirm the quality of the generated data, several statistical methods are applied, including quantile-quantile (Q-Q) plots and correlation heat maps of key attributes, providing a comprehensive validation process.

Data validation

When validating the synthetic CR data generated by the LLM, we employed Q-Q plots and correlation heat maps focusing on key attributes such as cp_exposure, cp_replacement_cost, and cp_settlement_risk. These statistical tools serve crucial roles in promoting the quality and representativeness of the synthetic data. By using the Q-Q plots, we can assess whether these attributes follow a normal distribution, which is often expected for many financial variables. By comparing the quantiles of our synthetic data against theoretical normal distributions, we can identify significant deviations that might indicate bias or unrealistic data generation.

Simultaneously, the correlation heat maps provide a visual representation of the relationships between these attributes and others in the dataset. This is particularly important because it helps verify that the LLM has maintained the complex interdependencies typically observed in real CR data. For instance, we would expect certain correlations between exposure and replacement cost, or between replacement cost and settlement risk. By making sure these correlations are preserved in our synthetic data, we can be more confident that analyses or models built on this data will yield insights that are applicable to real-world scenarios. This rigorous validation process helps to mitigate the risk of introducing artificial patterns or biases, thereby enhancing the reliability and utility of our synthetic CR dataset for subsequent research or modeling tasks.

We’ve created a Jupyter notebook containing three parts that implement the key components of the solution. We provide code snippets from the notebook for better understanding.

Prerequisites

To set up the solution and generate test data, you should have the following prerequisites:

  • Python 3 must be installed on your machine
  • We recommend installing an integrated development environment (IDE) that can run Jupyter notebooks; alternatively, you can create a Jupyter notebook instance using Amazon SageMaker from the AWS Management Console and develop the code there
  • You need an AWS account with access to Amazon Bedrock and the following models enabled (be careful not to share your AWS account credentials):
    • Amazon Titan Text Embeddings V2
    • Anthropic’s Claude 3 Haiku

Setup

Run the following command to set up the environment and install the required dependencies:

import sys
!{sys.executable} -m pip install -r requirements.txt

The requirements.txt file contains the following dependencies:

boto3
langchain
langchain-community
streamlit
chromadb==0.4.15
numpy
jq
langchain-aws
seaborn
matplotlib
scipy

The following code snippet performs the imports needed for the data indexing step:

from pprint import pprint 
from uuid import uuid4 
import chromadb 
from langchain_community.document_loaders import JSONLoader 
from langchain_community.embeddings import BedrockEmbeddings
from langchain_community.vectorstores import Chroma 
from langchain_text_splitters import RecursiveCharacterTextSplitter

Index data in the Chroma database

In this section, we show how indexing of data is done in a Chroma database as a locally maintained open source vector store. This index data is used as context for generating data.

The following code snippet shows the preprocessing steps of loading the JSON data from a file and splitting it into smaller chunks:

def load_using_jsonloader(path):
    # Load the JSON file; each top-level array element becomes one document
    loader = JSONLoader(path,
                        jq_schema=".[]",
                        text_content=False)
    documents = loader.load()
    return documents

def split_documents(documents):
    doc_list = [item for item in documents]
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=0)
    texts = text_splitter.split_documents(doc_list)
    return texts

The following snippet shows how an Amazon Bedrock embedding instance is created. We used the Amazon Titan Embeddings V2 model:

def get_bedrock_embeddings():
    aws_region = "us-east-1"
    model_id = "amazon.titan-embed-text-v2:0" #look for latest version of model
    bedrock_embeddings = BedrockEmbeddings(model_id=model_id, region_name=aws_region)
    return bedrock_embeddings

The following code shows how the embeddings are created and then loaded in the Chroma database:

persistent_client = chromadb.PersistentClient(path="../data/chroma_index")
collection = persistent_client.get_or_create_collection("test_124")
print(collection)

# Create the Chroma vector store backed by the persistent client
vector_store_with_persistent_client = Chroma(collection_name="test_124",
                                              persist_directory="../data/chroma_index",
                                              embedding_function=get_bedrock_embeddings(),
                                              client=persistent_client)
load_json_and_index(vector_store_with_persistent_client)

Generate data

The following code snippet shows the configuration used during the LLM invocation using Amazon Bedrock APIs. The LLM used is Anthropic’s Claude 3 Haiku:

import boto3
from botocore.config import Config
from langchain_aws import ChatBedrock

config = Config(
    region_name='us-east-1',
    signature_version='v4',
    retries={
        'max_attempts': 2,
        'mode': 'standard'
    }
)
bedrock_runtime = boto3.client('bedrock-runtime', config=config)
model_id = "anthropic.claude-3-haiku-20240307-v1:0"  # look for the latest version of the model
model_kwargs = {
    "temperature": 0,
    "max_tokens": 8000,
    "top_p": 1.0,
    "top_k": 25,
    "stop_sequences": ["company-1000"],
}
# Initialize the language model
llm = ChatBedrock(
    model_id=model_id,
    model_kwargs=model_kwargs,
    client=bedrock_runtime,
)

The following code shows how the context is fetched by looking up the Chroma database (where data was indexed) for matching embeddings. We use the same Amazon Titan model to generate the embeddings:

import json

def get_context(scenario):
    region_name = 'us-east-1'
    credential_profile_name = "default"
    titan_model_id = "amazon.titan-embed-text-v2:0"
    kb_context = []
    be = BedrockEmbeddings(region_name=region_name,
                           credentials_profile_name=credential_profile_name,
                           model_id=titan_model_id)

    vector_store = Chroma(collection_name="test_124", persist_directory="../data/chroma_index",
                      embedding_function=be)
    search_results = vector_store.similarity_search(scenario, k=3)
    for doc in search_results:
        kb_context.append(doc.page_content)
    return json.dumps(kb_context)

The following snippet shows how we formulated the detailed prompt that was passed to the LLM. We provided examples for the context, scenario, start index, end index, records count, and other parameters. The prompt is subjective and can be adjusted for experimentation.

from langchain_core.prompts import ChatPromptTemplate

# Create a prompt template
prompt_template = ChatPromptTemplate.from_template(
    "You are a financial data expert tasked with generating records "
    "representing company OTC derivative data and "
    "should be good enough for investor and lending ML model to take decisions "
    "and data should accurately represent the scenario: {scenario} \n "
    "and as per examples given in context: "
    "and context is {context} "
    "the examples given in context is for reference only, do not use same values while generating dataset."
    "generate dataset with the diverse set of samples but record should be able to represent the given scenario accurately."
    "Please ensure that the generated data meets the following criteria: "
    "The data should be diverse  and realistic, reflecting various industries, "
    "company sizes, financial metrics. "
    "Ensure that the generated data follows logical relationships and correlations between features "
    "(e.g., higher revenue typically corresponds to more employees, "
    "better credit ratings, and lower risk). "
    "And Generate {count} records starting from index {start_index}. "
    "generate just JSON as per schema and do not include any text or message before or after JSON. "
    "{format_instruction} n"
    "If continuing, start after this record: {last_record}n"
    "If stopping, do not include this record in the output."
    "Please ensure that the generated data is well-formatted and consistent."
)
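
The generate_records function shown next invokes a LangChain runnable chain that isn’t assembled in the snippets above. The following is a minimal sketch of how that chain might be composed, assuming the DataSet Pydantic model and the CustomPydanticOutputParser class defined later in this post:

# Sketch only: compose prompt, model, and parser into a LangChain (LCEL) chain.
# DataSet is the Pydantic model describing a batch of synthetic records, and
# CustomPydanticOutputParser is defined later in this post.
output_parser = CustomPydanticOutputParser(pydantic_object=DataSet)

chain = prompt_template | llm | output_parser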

The following code snippet shows the process for generating the synthetic data. You can call this method iteratively to generate more records. The input parameters include scenario, context, count, start_index, and last_record. The response is parsed and formatted according to the instructions provided by output_parser.get_format_instructions():

def generate_records(start_index, count, scenario, context, last_record=""):
    try:
        response = chain.invoke({
            "count": count,
            "start_index": start_index,
            "scenario": scenario,
            "context": context,
            "last_record": last_record,
            "format_instruction": output_parser.get_format_instructions(),
            "data_set_class_schema": DataSet.schema_json()
        })
        
        return response
    except Exception as e:
        print(f"Error in generate_records: {e}")
        raise e

Parsing the output generated by the LLM and representing it in CSV was quite challenging. We used a Pydantic parser to parse the JSON output generated by the LLM, as shown in the following code snippet:

import json

from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel

class CustomPydanticOutputParser(PydanticOutputParser):
    def parse(self, text: str) -> BaseModel:
        # Extract JSON from the text
        try:
            # Find the first occurrence of '{'
            start = text.index('{')
            # Find the last occurrence of '}'
            end = text.rindex('}') + 1
            json_str = text[start:end]

            # Parse the JSON string
            parsed_json = json.loads(json_str)

            # Use the parent class to convert to Pydantic object
            return super().parse_with_cls(parsed_json)
        except (ValueError, json.JSONDecodeError) as e:
            raise ValueError(f"Failed to parse output: {e}")

The following code snippet shows how the records are generated in an iterative manner with 10 records in each invocation to the LLM:

import time

def generate_full_dataset(total_records, batch_size, scenario, context):
    dataset = []
    total_generated = 0
    last_record = ""
    batch: DataSet = generate_records(total_generated,
                                      min(batch_size, total_records - total_generated),
                                      scenario, context, last_record)
    # print(f"batch: {type(batch)}")
    total_generated = len(batch.records)
    dataset.extend(batch.records)
    while total_generated < total_records:
        try:
            batch = generate_records(total_generated,
                                     min(batch_size, total_records - total_generated),
                                     scenario, context, batch.records[-1].json())
            processed_batch = batch.records

            if processed_batch:
                dataset.extend(processed_batch)
                total_generated += len(processed_batch)
                last_record = processed_batch[-1].start_index
                print(f"Generated {total_generated} records.")
            else:
                print("Generated an empty or invalid batch. Retrying...")
                time.sleep(10)
        except Exception as e:
            print(f"Error occurred: {e}. Retrying...")
            time.sleep(5)

    return dataset[:total_records]  # Ensure exactly the requested number of records

Verify the statistical properties of the generated data

We generated Q-Q plots for key attributes of the generated data: cp_exposure, cp_replacement_cost, and cp_settlement_risk, as shown in the following screenshots. The Q-Q plots compare the quantiles of the data distribution with the quantiles of a normal distribution. If the data isn’t skewed, the points should approximately follow the diagonal line.

As the next step of verification, we created a correlation heat map of the following attributes: cp_exposure, cp_replacement_cost, cp_settlement_risk, and risk. The heat map is symmetric, and the diagonal elements show a value of 1, indicating that each column is perfectly correlated with itself. The following screenshot shows the correlation heat map.
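
The plotting code isn’t included in the snippets above; the following is a minimal sketch of how the Q-Q plots and the correlation heat map could be produced with the scipy, seaborn, and matplotlib packages from requirements.txt. It assumes the generated records were written to a hypothetical synthetic_cr_data.csv file and uses pandas (not listed in requirements.txt) to load it.

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

# Load the generated synthetic data (hypothetical file name)
df = pd.read_csv("synthetic_cr_data.csv")
attributes = ["cp_exposure", "cp_replacement_cost", "cp_settlement_risk"]

# Q-Q plots: compare each attribute's quantiles against a normal distribution
fig, axes = plt.subplots(1, len(attributes), figsize=(15, 4))
for ax, column in zip(axes, attributes):
    stats.probplot(df[column], dist="norm", plot=ax)
    ax.set_title(f"Q-Q plot: {column}")
plt.tight_layout()
plt.show()

# Correlation heat map across the key attributes and the risk label
sns.heatmap(df[attributes + ["risk"]].corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heat map")
plt.show()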

Clean up

It’s a best practice to clean up the resources you created as part of this post to prevent unnecessary costs and potential security risks from leaving resources running. If you created the Jupyter notebook instance in SageMaker, complete the following steps:

  1. Save and shut down the notebook:
    # First save your work
    # Then close all open notebooks by clicking File -> Close and Halt 
  2. Clear the output (if needed before saving):
    # Option 1: Using notebook menu
    # Kernel -> Restart & Clear Output
    
    # Option 2: Using code
    from IPython.display import clear_output
    clear_output()
  3. Stop and delete the Jupyter notebook instance created in SageMaker:
    # Option 1: Using aws cli
    # Stop the notebook instance when not in use
    aws sagemaker stop-notebook-instance --notebook-instance-name <your-notebook-name>
    
    # If you no longer need the notebook instance
    aws sagemaker delete-notebook-instance --notebook-instance-name <your-notebook-name>
    
    # Option 2: Using the SageMaker console
    # Amazon SageMaker -> Notebooks
    # Select the notebook, choose the Actions drop-down, and choose Stop.
    # After the instance stops, choose the Actions drop-down and choose Delete.

Responsible use of AI

Responsible AI use and data privacy are paramount when using AI in financial applications. Although synthetic data generation can be a powerful tool, it’s crucial to make sure that no real customer information is used without proper authorization and thorough anonymization. Organizations must prioritize data protection, implement robust security measures, and adhere to relevant regulations. Additionally, when developing and deploying AI models, it’s essential to consider ethical implications, potential biases, and the broader societal impact. Responsible AI practices include regular audits, transparency in decision-making processes, and ongoing monitoring to help prevent unintended consequences. By balancing innovation with ethical considerations, financial institutions can harness the benefits of AI while maintaining trust and protecting individual privacy.

Conclusion

In this post, we showed how to generate a well-balanced synthetic dataset representing various aspects of counterparty data, using RAG-based prompt engineering with LLMs. Counterparty data analysis is imperative for making OTC transactions between two counterparties. Because actual business data in this domain isn’t easily available, this approach lets you generate synthetic training data for your ML models at minimal cost, often within minutes. After you train the model, you can use it to make intelligent decisions before entering into an OTC derivative transaction.


About the Authors

Santosh Kulkarni is a Senior Modernization Architect with over 16 years of experience, specializing in developing serverless, container-based, and data architectures for clients across various domains. Santosh’s expertise extends to machine learning, and he is a certified AWS ML specialist. He is currently engaged in multiple initiatives using Amazon Bedrock and hosted foundation models.

Joyanta Banerjee is a Senior Modernization Architect with AWS ProServe and specializes in building secure and scalable cloud native applications for customers from different industry domains. He has developed an interest in the AI/ML space, particularly in the generative AI capabilities available on Amazon Bedrock.

Mallik Panchumarthy is a Senior Specialist Solutions Architect for generative AI and machine learning at AWS. Mallik works with customers to help them architect efficient, secure and scalable AI and machine learning applications. Mallik specializes in generative AI services Amazon Bedrock and Amazon SageMaker.

Read More

Turbocharging premium audit capabilities with the power of generative AI: Verisk’s journey toward a sophisticated conversational chat platform to enhance customer support

Turbocharging premium audit capabilities with the power of generative AI: Verisk’s journey toward a sophisticated conversational chat platform to enhance customer support

This post is co-written with Sajin Jacob, Jerry Chen, Siddarth Mohanram, Luis Barbier, Kristen Chenowith, and Michelle Stahl from Verisk.

Verisk (Nasdaq: VRSK) is a leading data analytics and technology partner for the global insurance industry. Through advanced analytics, software, research, and industry expertise across more than 20 countries, Verisk helps build resilience for individuals, communities, and businesses. The company is committed to ethical and responsible AI development with human oversight and transparency. Verisk is using generative AI to enhance operational efficiencies and profitability for insurance clients while adhering to its ethical AI principles.

Verisk’s Premium Audit Advisory Service (PAAS®) is the leading source of technical information and training for premium auditors and underwriters. PAAS helps users classify exposure for commercial casualty insurance, including general liability, commercial auto, and workers’ compensation. PAAS offers a wide range of essential services, including more than 40,000 classification guides and more than 500 bulletins. PAAS now includes PAAS AI, the first commercially available interactive generative AI chat specifically developed for premium audit, which reduces research time and empowers users to make informed decisions by answering questions and quickly retrieving and summarizing multiple PAAS documents such as class guides, bulletins, and rating cards.

In this post, we describe the development of the customer support process in PAAS, incorporating generative AI, the data, the architecture, and the evaluation of the results. Conversational AI assistants are rapidly transforming customer and employee support. Verisk has embraced this technology and developed its own PAAS AI, which provides an enhanced self-service capability to the PAAS platform.

The opportunity

The Verisk PAAS platform houses a vast array of documents—including class guides, advisory content, and bulletins—that aid Verisk’s customers in determining the appropriate rules and classifications for workers’ compensation, general liability, and commercial auto business. When premium auditors need accurate answers within this extensive document repository, the challenges they face are:

  • Overwhelming volume – The sheer volume of documents (advisories, bulletins, and so on) makes manual searching time-consuming and inefficient
  • Slow response times – Finding accurate information within this vast repository can be slow, hindering timely decision-making
  • Inconsistent quality of responses – Manual searches might yield irrelevant or incomplete results, leading to uncertainty and potential errors

To address these challenges, Verisk PAAS AI is designed to alleviate the burden by providing round-the-clock support for business processing and delivering precise and quick responses to customer queries. This technology is deeply integrated into Verisk’s newly reimagined PAAS platform, using all of Verisk’s documentation, training materials, and collective expertise. It employs a Retrieval Augmented Generation (RAG) approach and a combination of AWS services alongside proprietary evaluations to promptly answer most user questions about the capabilities of the Verisk PAAS platform.

When deployed at scale, this PAAS AI will enable Verisk staff to dedicate more time to complex issues, critical projects, and innovation, thereby enhancing the overall customer experience. Throughout the development process, Verisk encountered several considerations, key findings, and decisions that provide valuable insights for any enterprise looking to explore the potential of generative AI.

The approach

When creating an interactive agent using large language models (LLMs), two common approaches are RAG and model fine-tuning. The choice between these methods depends on the specific use case and available data. Verisk PAAS began developing a RAG pipeline for its PAAS AI and has progressively improved this solution. Here are some reasons why continuing with a RAG architecture was beneficial for Verisk:

  • Dynamic data access – The PAAS platform is constantly evolving, adding new business functions and technical capabilities. Verisk needed to make sure its responses are based on the most current information. The RAG approach allows access to continuously updated data, providing responses with the latest information without frequently retraining the model.
  • Multiple data sources – Besides data recency, another crucial aspect is the ability to draw from multiple PAAS resources to acquire relevant context. The ease of expanding the knowledge base without the need for fine-tuning new data sources makes the solution adaptable.
  • Reduced hallucinations – Retrieval minimizes the risk of hallucinations compared with free-form text generation because responses come directly from the provided excerpts. Verisk developed an evaluation tool to enhance response quality.
  • LLM linguistics – Although appropriate context can be retrieved from enterprise data sources, the underlying LLM manages the linguistics and fluency.
  • Transparency – Verisk aimed to consistently improve the PAAS AI’s response generation ability. A RAG architecture offered the transparency required in the context retrieval process, which would ultimately be used to generate user responses. This transparency helped Verisk identify areas where document restructuring was needed.
  • Data governance – With diverse users accessing the platform and differing data access permissions, data governance and isolation were critical. Verisk implemented controls within the RAG pipeline to restrict data access based on user permissions, helping to ensure that responses are delivered only to authorized users.

Although both RAG and fine-tuning have their pros and cons, RAG is the best approach for building a PAAS AI on the PAAS platform, given Verisk’s needs for real-time accuracy, explainability, and configurability. The pipeline architecture supports iterative enhancement as the use cases for the Verisk PAAS platform develop.

Solution overview

The following diagram showcases a high-level architectural data flow that highlights various AWS services used in constructing the solution. Verisk’s system demonstrates a complex AI setup, where multiple components interact and frequently call on the LLM to provide user responses. Employing the PAAS platform to manage these varied components was an intuitive decision.

Premium Audit Advisory Service AI Pipeline

The key components are as follows:

Amazon ElastiCache

Verisk’s PAAS team determined that ElastiCache is the ideal solution for storing all chat history. This storage approach allows for seamless integration in conversational chats and enables the display of recent conversations on the website, providing an efficient and responsive user experience.
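
Verisk hasn’t published its caching code, but the pattern is straightforward. The following is a minimal, hypothetical sketch of storing and retrieving chat history in ElastiCache (Redis) with the redis-py client; the endpoint, key names, and session_id parameter are illustrative only.

import json

import redis

# Connect to the ElastiCache (Redis) endpoint; the hostname is a placeholder.
cache = redis.Redis(host="my-elasticache-endpoint.amazonaws.com", port=6379, ssl=True)

def append_message(session_id, role, text):
    """Append one chat turn to the session's history and expire it after 24 hours."""
    key = f"chat_history:{session_id}"
    cache.rpush(key, json.dumps({"role": role, "text": text}))
    cache.expire(key, 24 * 60 * 60)

def get_history(session_id, last_n=20):
    """Return the most recent chat turns for a session."""
    key = f"chat_history:{session_id}"
    return [json.loads(item) for item in cache.lrange(key, -last_n, -1)]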

Amazon Bedrock

Anthropic’s Claude, available in Amazon Bedrock, played various roles within Verisk’s solution:

  • Response generation – When building their PAAS AI, Verisk conducted a comprehensive evaluation of leading LLMs, using their extensive dataset to test each model’s capabilities. Through Amazon Bedrock, Verisk gained streamlined access to multiple best-in-class foundation models (FMs), enabling efficient testing and comparison across key performance criteria. The Amazon Bedrock unified API and robust infrastructure provided the ideal platform to develop, test, and deploy LLM solutions at scale. After this extensive testing, Verisk found Anthropic’s Claude model consistently outperformed across key criteria. Anthropic’s Claude demonstrated superior language understanding in Verisk’s complex business domain, allowing more pertinent responses to user questions. Given the model’s standout results across Verisk PAAS platform use cases, it was the clear choice to power the PAAS AI’s natural language capabilities.
  • Conversation summarization – When a user asks a follow-up question, the PAAS AI can continue the conversational thread. To enable this, Verisk used Claude to summarize the dialogue to update the context from ElastiCache. The full conversation summary and new excerpts are input to the LLM to generate the next response. This conversational flow allows the PAAS AI to answer user follow-up questions and have a more natural, contextual dialogue, bringing Verisk PAAS closer to having a true AI assistant that can engage in useful, back-and-forth conversations with users.
  • Keyword extraction – Keywords are extracted from user questions and previous conversations to be used for creating the new summarized prompt and to be input to Verisk’s knowledge base retrievers to perform vector similarity search.

Amazon OpenSearch Service

Primarily used for the storage of text embeddings, OpenSearch facilitates efficient document retrieval by enabling rapid access to indexed data. These embeddings serve as semantic representations of documents, allowing for advanced search capabilities that go beyond simple keyword matching. This semantic search functionality enhances the system’s ability to retrieve relevant documents that are contextually similar to the search queries, thereby improving the overall accuracy and speed of data queries. Additionally, OpenSearch functions as a semantic cache for similarity searches, optimizing performance by reducing the computational load and improving response times during data retrieval operations. This makes it an indispensable tool in the larger PAAS ecosystem, where the need for quick and precise information access is paramount.

Snowflake in Amazon

The integration of Snowflake in the PAAS AI ecosystem helps provide scalable and real-time access to data, allowing Verisk to promptly address customer concerns and improve its services. By using Snowflake’s capabilities, Verisk can perform advanced analytics, including sentiment analysis and predictive modeling, to better understand customer needs and enhance user experiences. This continuous feedback loop is vital for refining the PAAS AI and making sure it remains responsive and relevant to user demands.

Structuring and retrieving the data

An essential element in developing the PAAS AI’s knowledge base was properly structuring and effectively querying the data to deliver accurate answers. Verisk explored various techniques to optimize both the organization of the content and the methods to extract the most relevant information:

  • Chunking – A key step in preparing the accumulated questions and answers was splitting the data into individual documents to facilitate indexing into OpenSearch Service. Rather than uploading large files containing multiple pages of content, Verisk chunked the data into smaller segments by document section and character lengths. By splitting the data into small, modular chunks focused on a single section of a document, Verisk could more easily index each document and had greater success in pulling back the correct context. Chunking the data also enabled straightforward updating and reindexing of the knowledge base over time.
  • Hybrid query – When querying the knowledge base, Verisk found that standard vector search alone wasn’t enough to retrieve all the relevant contexts pertaining to a question. Therefore, a sparse BM25 search was combined with the dense vector search to create a hybrid search approach, which yielded much better context retrieval results.
  • Data separation and filters – Another issue Verisk ran into was that, because of the vast number of documents and the overlapping content within certain topics, incorrect documents were being retrieved for questions about topics that appear across multiple sources, and some of those documents weren’t needed or appropriate in the context of the user’s question. Therefore, data separation was implemented to split the documents based on document type and filter by line of business, improving context retrieval within the application.

By thoroughly experimenting and optimizing both the knowledge base powering the PAAS AI and the queries to extract answers from it, Verisk was able to achieve very high answer accuracy during the proof of concept, paving the way for further development. The techniques explored—hybrid querying, HTML section chunking, and index filtering—became core elements of Verisk’s approach for extracting quality contexts.
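
To make the hybrid retrieval idea concrete, here is a minimal, hypothetical sketch of an OpenSearch query that combines a BM25 match with a k-NN vector match and applies a line-of-business filter. The index name, field names, and client setup are illustrative rather than Verisk’s actual configuration, and the knn clause assumes the OpenSearch k-NN plugin is enabled on the index.

from opensearchpy import OpenSearch

# Placeholder endpoint; in practice this would be the Amazon OpenSearch Service domain.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def hybrid_search(query_text, query_vector, line_of_business, k=5):
    body = {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # Sparse BM25 match on the chunk text
                    {"match": {"content": query_text}},
                    # Dense vector match on the chunk embedding
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ],
                # Data separation: only return chunks for the user's line of business
                "filter": [{"term": {"line_of_business": line_of_business}}],
            }
        },
    }
    return client.search(index="paas-documents", body=body)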

LLM parameters and models

Experimenting with prompt structure, length, temperature, role-playing, and context was key to improving the quality and accuracy of the PAAS AI’s Claude-powered responses. The prompt design guidelines provided by Anthropic were incredibly helpful.

Verisk crafted prompts that provided Anthropic’s Claude with clear context and set roles for answering user questions. Setting the temperature to 0 helped reduce the randomness and indeterministic nature of LLM-generated responses.

Verisk also experimented with different models to improve the efficiency of the overall solution. For scenarios where latency was more important and less reasoning was required, Anthropic’s Claude Haiku was the perfect solution. For other scenarios such as question answering using provided contexts where it was more important for the LLM to be able to understand every detail given in the prompt, Anthropic’s Claude Sonnet was the better choice to balance latency, performance, and cost.

Guardrails

LLM guardrails were implemented in the PAAS AI project using both the guardrails provided by Amazon Bedrock and specialized sections within the prompt to detect unrelated questions and prompt attack attempts. Amazon Bedrock guardrails can be attached to any Amazon Bedrock model invocation call and automatically detect if the given model input and output are in violation of the language filters that are set (violence, misconduct, sexual, and so on), which helps with screening user inputs. The specialized prompts further improve LLM security by creating a second net that uses the power of the LLMs to catch any inappropriate inputs from the users.

This allows Verisk to be confident that the model will only answer to its intended purpose surrounding premium auditing services and will not be misused by threat actors.
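
Amazon Bedrock guardrails are attached at invocation time. The following is a minimal sketch using the Amazon Bedrock Converse API through boto3, with a placeholder guardrail ID, version, and question; verify the exact guardrailConfig fields against the current Amazon Bedrock documentation.

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

user_question = "How should I classify this exposure for workers' compensation?"

# Sketch only: invoke a model with a guardrail attached (IDs are placeholders)
response = bedrock_runtime.converse(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",
    messages=[{"role": "user", "content": [{"text": user_question}]}],
    guardrailConfig={
        "guardrailIdentifier": "your-guardrail-id",
        "guardrailVersion": "1",
    },
)
print(response["output"]["message"]["content"][0]["text"])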

PAAS Evaluation API Pipeline

After evaluating several tools such as DeepEval, Ragas, and TruLens, the Verisk PAAS team realized that these tools had certain limitations for their specific use case. Consequently, the team decided to develop its own evaluation API, shown in the following figure.

This custom API evaluates the answers based on three major metrics:

  • Answer relevancy score – Using LLMs, the process assesses whether the answers provided are relevant to the customer’s prompt. This helps make sure that the responses are directly addressing the questions posed.
  • Context relevancy score – By using LLMs, the process evaluates whether the context retrieved is appropriate and aligns well with the question. This helps make sure that the LLM has the appropriate and accurate contexts to generate a response.
  • Faithfulness score – Using LLMs, the process checks if the responses are generated based on their retrieved context or if they are hallucinated. This is crucial for maintaining the integrity and reliability of the information provided.

This custom evaluation approach helps make sure that the answers generated are not only relevant and contextually appropriate but also faithful to the established generative AI knowledge base, minimizing the risk of misinformation. By incorporating these metrics, Verisk has enhanced the robustness and reliability of their PAAS AI, providing customers with accurate and trustworthy responses.
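
The evaluation API itself is proprietary, but an LLM-as-judge metric of this kind can be sketched in a few lines. The following hypothetical example scores faithfulness by asking a Bedrock-hosted model whether an answer is supported by its retrieved context; the prompt wording, score scale, and model choice are illustrative only.

import json

import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def faithfulness_score(question, context, answer):
    """Ask an LLM judge to rate (0.0-1.0) whether the answer is grounded in the context."""
    judge_prompt = (
        "You are evaluating a RAG system. Given the context and the answer, "
        'return only a JSON object like {"score": 0.0}, where score is 1.0 if every '
        "claim in the answer is supported by the context and 0.0 if none are.\n\n"
        f"Question: {question}\n\nContext: {context}\n\nAnswer: {answer}"
    )
    response = bedrock_runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": judge_prompt}]}],
        inferenceConfig={"temperature": 0},
    )
    return json.loads(response["output"]["message"]["content"][0]["text"])["score"]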

Feedback loop of PAAS AI platform

The Verisk PAAS team has implemented a comprehensive feedback loop mechanism, shown in the following figure, to support continuous improvement and address any issues that might arise.

This feedback loop is structured around the following key components:

  • Customer feedback analysis – The team actively collects and analyzes feedback from customers to identify potential data issues or problems with the generative AI responses. This analysis helps pinpoint specific areas that need improvement.
  • Issue categorization – After an issue is identified, it’s categorized based on its nature. If it’s a data-related issue, it’s assigned to the internal business team for resolution. If it’s an application issue, a Jira ticket is automatically created for the PAAS IT team to address and fix the problem.
  • QA test case updates – The system provides an option to update QA test cases based on the feedback received. This helps make sure that the test scenarios remain relevant and comprehensive, covering a wide range of potential issues.
  • Ground truth agreements – Ground truth agreements, which serve as the benchmark for evaluating LLM response quality, are periodically reviewed and updated. This helps make sure that the evaluation metrics remain accurate and reflective of the desired standards.
  • Ongoing evaluations – Regular evaluations of the LLM responses are conducted using the updated QA test cases and ground truth agreements. This helps in maintaining high-quality responses and quickly addressing any deviations from the expected standards.

This robust feedback loop mechanism enables Verisk to continuously fine-tune the PAAS AI, making sure that it delivers precise, relevant, and contextually appropriate answers to customer queries. By integrating customer feedback, categorizing issues efficiently, updating test scenarios, and adhering to stringent evaluation protocols, Verisk maintains a high standard of service and drives continuous improvement in its generative AI capabilities.

Business impact

Verisk initially rolled out the PAAS AI to one beta customer to demonstrate real-world performance and impact. Supporting a customer in this way is a stark contrast to how Verisk has historically engaged with customers, where a dedicated team would typically interact with the customer directly. Verisk’s PAAS AI has revolutionized the way subject matter experts (SMEs) work and scales cost-effectively while still providing high-quality assistance. What previously took hours of manual review can now be accomplished in minutes, resulting in an extraordinary 96–98% reduction in processing time per specialist. This dramatic improvement in efficiency not only streamlines operations but also allows Verisk’s experts to focus on more strategic initiatives that drive greater value for the organization.

In analyzing this early usage data, Verisk uncovered additional areas where it can drive business value for its customers. As Verisk collects additional information, this data will help uncover what will be needed to improve results and prepare to roll out to a wider customer base of approximately 15,000 users.

Ongoing development will focus on expanding these capabilities, prioritized based on the collected questions. Most exciting, though, are the new possibilities on the horizon with generative AI. Verisk knows this technology is rapidly advancing and is eager to harness innovations to bring even more value to customers. As new models and techniques emerge, Verisk plans to adapt the PAAS AI to take advantage of the latest capabilities. Although the PAAS AI currently focuses on responding to user questions, this is only the starting point. Verisk plans to quickly improve its capabilities to proactively make suggestions and configure functionality directly in the system itself. The Verisk PAAS team is inspired by the challenge of pushing the boundaries of what’s possible with generative AI and is excited to test those boundaries.

Conclusion

Verisk’s development of a PAAS AI for its PAAS platform demonstrates the transformative power of generative AI in customer support and operational efficiency. Through careful data harvesting, structuring, retrieval, and the use of LLMs, semantic search functionalities, and stringent evaluation protocols, Verisk has crafted a robust system that delivers accurate, real-time answers to user questions. By continuing to enhance the PAAS AI’s features while maintaining ethical and responsible AI practices, Verisk is set to provide increased value to its customers, enable staff to concentrate on innovation, and establish new benchmarks for customer service in the insurance sector.


About the Authors

Sajin Jacob is the Director of Software Engineering at Verisk, where he leads the Premium Audit Advisory Service (PAAS) development team. In this role, Sajin plays a crucial part in designing the architecture and providing strategic guidance to eight development teams, optimizing their efficiency and ensuring the maintainability of all solutions. He holds an MS in Software Engineering from Periyar University, India.

Jerry Chen is a Lead Software Developer at Verisk, based in Jersey City. He leads the GenAi development team, working on solutions for projects within the Verisk Underwriting department to enhance application functionalities and accessibility. Within PAAS, he has worked on the implementation of the conversational RAG architecture with enhancements such as hybrid search, guardrails, and response evaluations. Jerry holds a degree in Computer Science from Stevens Institute of Technology.

Sid Mohanram is the Senior Vice President of Core Lines Technology at Verisk. His area of expertise includes data strategy, analytics engineering, and digital transformation. Sid is head of the technology organization with global teams across five countries. He is also responsible for leading the technology transformation for the multi-year Core Lines Reimagine initiative. Sid holds an MS in Information Systems from Stevens Institute of Technology.

Luis Barbier is the Chief Technology Officer (CTO) of Verisk Underwriting at Verisk. He provides guidance to the development teams’ architectures to maximize efficiency and maintainability for all underwriting solutions. Luis holds an MBA from Iona University.

Kristen Chenowith, MSMSL, CPCU, WCP, APA, CIPA, AIS, is PAAS Product Manager at Verisk. She is currently the product owner for the Premium Audit Advisory Service (PAAS) product suite, including PAAS AI, a first to market generative AI chat tool for premium audit that accelerates research for many consultative questions by 98% compared to traditional methods. Kristen holds an MS in Management, Strategy and Leadership at Michigan State University and a BS in Business Administration at Valparaiso University. She has been in the commercial insurance industry and premium audit field since 2006.

Michelle Stahl, MBA, CPCU, AIM, API, AIS, is a Digital Product Manager with Verisk. She has over 20 years of experience building and transforming technology initiatives for the insurance industry. She has worked as a software developer, project manager, and product manager throughout her career.

Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build, and reinvent. He is creative, fast-paced, deeply customer-obsessed, and uses the working backward process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.

Ryan Doty is a Solutions Architect Manager at AWS, based out of New York. He helps financial services customers accelerate their adoption of the AWS Cloud by providing architectural guidelines to design innovative and scalable solutions. Coming from a software development and sales engineering background, the possibilities that the cloud can bring to the world excite him.

Apoorva Kiran, PhD, is a Senior Solutions Architect at AWS, based out of New York. He is aligned with the financial service industry, and is responsible for providing architectural guidelines to design innovative and scalable fintech solutions. He specializes in developing and commercializing artificial intelligence and machine learning products. Connect with him on LinkedIn.

Read More

Exploring the structural changes driving protein function with BioEmu-1

Exploring the structural changes driving protein function with BioEmu-1


From forming muscle fibers to protecting us from disease, proteins play an essential role in almost all biological processes in humans and other life forms alike. There has been extraordinary progress in recent years toward better understanding protein structures using deep learning, enabling the accurate prediction of protein structures from their amino acid sequences. However, predicting a single protein structure from its amino acid sequence is like looking at a single frame of a movie—it offers only a snapshot of a highly flexible molecule. Biomolecular Emulator-1 (BioEmu-1) is a deep-learning model that provides scientists with a glimpse into the rich world of different structures each protein can adopt, or structural ensembles, bringing us a step closer to understanding how proteins work. A deeper understanding of proteins enables us to design more effective drugs, as many medications work by influencing protein structures to boost their function or prevent them from causing harm.

One way to model different protein structures is through molecular dynamics (MD) simulations. These tools simulate how proteins move and deform over time and are widely used in academia and industry. However, in order to simulate functionally important changes in structure, MD simulations must be run for a long time. This is a computationally demanding task and significant effort has been put into accelerating simulations, going as far as designing custom computer architectures. Yet, even with these improvements, many proteins remain beyond what is currently possible to simulate and would require simulation times of years or even decades.

Enter BioEmu-1—a deep learning model that can generate thousands of protein structures per hour on a single graphics processing unit. Today, we are making BioEmu-1 open source, following our preprint from last December, to empower protein scientists in studying structural ensembles with our model. It provides orders-of-magnitude greater computational efficiency than classical MD simulations, thereby opening the door to insights that have, until now, been out of reach. BioEmu-1 is featured in Azure AI Foundry Labs, a hub for developers, startups, and enterprises to explore groundbreaking innovations from research at Microsoft.
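
To make this concrete, the open-source release can be driven from a short Python snippet. The sketch below is a minimal example of how sampling is invoked; the exact import path, argument names, and example sequence are assumptions and should be checked against the repository's README before use.

```python
# A minimal sketch of sampling structural ensembles with the open-source
# BioEmu-1 release. The import path and argument names follow the project's
# documented example but are assumptions; check the repository README.
from bioemu.sample import main as sample

# Sample 10 independent structures for chignolin, a small fast-folding protein,
# and write the outputs (structure and trajectory files) to a local directory.
sample(
    sequence="GYDPETGTWG",        # amino acid sequence (chignolin)
    num_samples=10,               # number of structures to generate
    output_dir="./chignolin_samples",
)
```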

We have enabled this by training BioEmu-1 on three types of datasets: (1) AlphaFold Database (AFDB) structures, (2) an extensive MD simulation dataset, and (3) an experimental protein folding stability dataset. Training BioEmu-1 on the AFDB structures is like mapping distinct islands in a vast ocean of possible structures. When preparing this dataset, we clustered similar protein sequences so that BioEmu-1 can recognize that a protein sequence maps to multiple distinct structures. The MD simulation dataset helps BioEmu-1 predict physically plausible structural changes around these islands, mapping out the plethora of possible structures that a single protein can adopt. Finally, through fine-tuning on the protein folding stability dataset, BioEmu-1 learns to sample folded and unfolded structures with the right probabilities.

Figure 1: BioEmu-1 predicts diverse structures of LapD protein unseen during training. We sampled structures independently and reordered the samples to create a movie connecting two experimentally known structures.

Combining these advances, BioEmu-1 successfully generalizes to unseen protein sequences and predicts multiple structures. In Figure 1, we show that BioEmu-1 can predict structures of the LapD protein from Vibrio cholerae, the bacterium that causes cholera. BioEmu-1 predicts structures of LapD when it is bound and unbound with c-di-GMP molecules, both of which are experimentally known but not in the training set. Furthermore, our model offers a view on intermediate structures, which have never been experimentally observed, providing viable hypotheses about how this protein functions. Insights into how proteins function pave the way for further advancements in areas like drug development.

Figure 2: BioEmu-1 reproduces the D. E. Shaw Research (DESRES) simulation of Protein G accurately at a fraction of the computational cost. On the top, we compare the distributions of structures obtained by extensive MD simulation (left) and independent sampling from BioEmu-1 (right); the 2D projections of the two distributions are nearly identical. Three representative sample structures are shown at the bottom.

Moreover, BioEmu-1 reproduces MD equilibrium distributions accurately with a tiny fraction of the computational cost. In Figure 2, we compare 2D projections of the structural distribution of the D. E. Shaw Research (DESRES) simulation of Protein G and samples from BioEmu-1. BioEmu-1 reproduces the MD distribution accurately while requiring 10,000–100,000 times fewer GPU hours.

Figure 3: BioEmu-1 accurately predicts protein stability. On the left, we plot the experimentally measured folding free energies ΔG against those predicted by BioEmu-1; the two correlate well. On the right, we show a protein in folded and unfolded structures.

Furthermore, BioEmu-1 accurately predicts protein stability, which we measure by computing folding free energies—a way to quantify the ratio between the folded and unfolded states of a protein. Protein stability is an important factor when designing proteins, e.g., for therapeutic purposes. Figure 3 shows the folding free energies predicted by BioEmu-1, obtained by sampling structures and counting how many are folded versus unfolded, compared against experimental folding free energy measurements. Even on sequences that BioEmu-1 has never seen during training, the predicted free energy values correlate well with experimental values.
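
To illustrate the bookkeeping behind this kind of estimate (this is not code from the paper), the folding free energy follows from the sampled populations as ΔG_fold = -RT ln(p_folded / p_unfolded). Below is a minimal sketch, assuming some structural criterion has already classified each sample as folded or unfolded and assuming an arbitrary temperature.

```python
# A minimal sketch (not code from the paper): estimate the folding free energy
# from counts of folded vs. unfolded structures in a sampled ensemble.
# The classification criterion and temperature are assumptions for illustration.
import math

R = 1.987e-3   # gas constant, kcal/(mol*K)
T = 295.0      # assumed temperature, K

def folding_free_energy(n_folded: int, n_unfolded: int) -> float:
    """Return dG_fold = -R*T*ln(p_folded / p_unfolded) in kcal/mol."""
    if n_folded == 0 or n_unfolded == 0:
        raise ValueError("need at least one sample in each state")
    return -R * T * math.log(n_folded / n_unfolded)

# Example: 800 of 1,000 sampled structures classified as folded.
print(f"{folding_free_energy(800, 200):.2f} kcal/mol")  # ~ -0.81 (folding favored)
```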

Professor Martin Steinegger of Seoul National University, who was not part of the study, says “With highly accurate structure prediction, protein dynamics is the next frontier in discovery. BioEmu marks a significant step in this direction by enabling blazing-fast sampling of the free-energy landscape of proteins through generative deep learning.”

We believe that BioEmu-1 is a first step toward generating the full ensemble of structures that a protein can take. In these early days, we are also aware of its limitations. With this open-source release, we hope scientists will start experimenting with BioEmu-1, helping us map out its potential and shortcomings so we can improve it in the future. We look forward to hearing how it performs on the proteins you care about.

Acknowledgements

BioEmu-1 is the result of a highly collaborative team effort at Microsoft Research AI for Science. The full author list: Sarah Lewis, Tim Hempel, José Jiménez-Luna, Michael Gastegger, Yu Xie, Andrew Y. K. Foong, Victor García Satorras, Osama Abdin, Bastiaan S. Veeling, Iryna Zaporozhets, Yaoyi Chen, Soojung Yang, Arne Schneuing, Jigyasa Nigam, Federico Barbero, Vincent Stimper, Andrew Campbell, Jason Yim, Marten Lienen, Yu Shi, Shuxin Zheng, Hannes Schulz, Usman Munir, Ryota Tomioka, Cecilia Clementi, Frank Noé

The post Exploring the structural changes driving protein function with BioEmu-1 appeared first on Microsoft Research.

Read More

It’s a Sign: AI Platform for Teaching American Sign Language Aims to Bridge Communication Gaps

It’s a Sign: AI Platform for Teaching American Sign Language Aims to Bridge Communication Gaps

American Sign Language is the third most prevalent language in the United States — but vastly fewer AI tools have been developed with ASL data than with data representing the country’s most common languages, English and Spanish.

NVIDIA, the American Society for Deaf Children and creative agency Hello Monday are helping close this gap with Signs, an interactive web platform built to support ASL learning and the development of accessible AI applications.

Sign language learners can access the platform’s validated library of ASL signs to expand their vocabulary with the help of a 3D avatar that demonstrates signs — and use an AI tool that analyzes webcam footage to receive real-time feedback on their signing. Signers of any skill level can contribute by signing specific words to help build an open-source video dataset for ASL.

The dataset — which NVIDIA aims to grow to 400,000 video clips representing 1,000 signed words — is being validated by fluent ASL users and interpreters to ensure the accuracy of each sign, resulting in a high-quality visual dictionary and teaching tool.

“Most deaf children are born to hearing parents. Giving family members accessible tools like Signs to start learning ASL early enables them to open an effective communication channel with children as young as six to eight months old,” said Cheri Dowling, executive director of the American Society for Deaf Children. “And knowing that professional ASL teachers have validated all the vocabulary on the platform, users can be confident in what they’re learning.”

NVIDIA teams plan to use this dataset to further develop AI applications that break down communication barriers between the deaf and hearing communities. The data is slated to be available to the public as a resource for building accessible technologies including AI agents, digital human applications and video conferencing tools. It could also be used to enhance Signs and enable ASL platforms across the ecosystem with real-time, AI-powered support and feedback.

Whether novice or expert, volunteers can record themselves signing to contribute to the ASL dataset.

Supporting ASL Education, Exploring Language Nuance

During the data collection phase, Signs already provides a powerful platform for ASL language acquisition, offering opportunities for individuals to learn and practice an initial set of 100 signs so they can more effectively communicate with friends or family members who use ASL.

“The Signs learning platform could help families with deaf children quickly search for a specific word and see how to make the corresponding sign. It’s a tool that can help support their everyday use of ASL outside of a more formal class,” Dowling said. “I see both kids and parents exploring it — and I think they could play with it together.”

A person signs the word “vegetable” using the Signs AI platform.

While Signs currently focuses on hand movements and finger positions for each sign, ASL also incorporates facial expressions and head movements to convey meaning. The team behind Signs is exploring how these non-manual signals can be tracked and integrated in future versions of the platform.

They’re also investigating how other nuances, like regional variations and slang terms, can be represented in Signs to enrich its ASL database — and working with researchers at the Rochester Institute of Technology’s Center for Accessibility and Inclusion Research to evaluate and further improve the user experience of the Signs platform for deaf and hard-of-hearing users.

“Improving ASL accessibility is an ongoing effort,” said Anders Jessen, founding partner of Hello Monday/DEPT, which built the Signs web platform and previously worked with the American Society for Deaf Children on Fingerspelling.xyz, an application that taught users the ASL alphabet. “Signs can serve the need for advanced AI tools that help transcend communication barriers between the deaf and hearing communities.”

The dataset behind Signs is planned for release later this year.

Start learning or contributing with Signs at signs-ai.com, and learn more about NVIDIA’s trustworthy AI initiatives. Attendees of NVIDIA GTC, a global AI conference taking place March 17-21 in San Jose, will be able to participate in Signs live at the event.

Read More

Step Into the World of ‘Avowed’ on GeForce NOW

Step Into the World of ‘Avowed’ on GeForce NOW

Wield magic and steel as GeForce NOW’s fifth-anniversary celebration summons Obsidian Entertainment’s highly anticipated Avowed to the cloud.

This first-person fantasy role-playing game is ready to enchant cloud gamers, leading the charge of six titles joining the over 2,000 games in the cloud gaming library.

GeForce NOW day passes are available to purchase again, in limited quantities each day. Members can currently purchase one day at a time, based on available capacity. Day pass users get 24-hour access to powerful cloud gaming with all the benefits of a GeForce NOW Ultimate or Performance membership. Stay tuned for updates as more membership options become available.

Choose Your Own Adventure

Avowed on GeForce NOW
Cloudy with a chance of dragons.

Embark on a thrilling adventure in Avowed, set in the captivating world of Eora. As an envoy of Aedyr, explore the mysterious Living Lands, an island teeming with ancient magic and shifting secrets, as a dire threat looms over the realm: a mysterious plague that defies nature and reason, spreading chaos across the sprawling wilderness.

The Living Lands offer a diverse array of environments to explore, each with a unique ecosystem. Engage in visceral combat by mixing and matching swords, spells, guns and shields. Companions of various species, each with their own abilities and quests, will join the adventure, their fates intertwined with the players’ choices. As the story unfolds, every decision will ripple across the Living Lands, shaping the future of its inhabitants and testing the players’ resolve in the face of intrigue, power and danger.

GeForce NOW members can dive into this immersive fantasy world with the power of GeForce RTX-powered gaming rigs in the cloud. Ultimate members can stream the game at up to 4K resolution and 60 frames per second with high dynamic range on supported devices. These members enjoy additional benefits like NVIDIA DLSS 3 technology for enhanced frame rates and NVIDIA Reflex for ultra-low latency, delivering a seamless and visually stunning adventure through the Living Lands.

Time to Play

Lost Records: Bloom & Rage on GeForce NOW
Some mixtapes are better left unplayed.

Lost Records: Bloom & Rage is the recently released narrative-adventure game by Don’t Nod, the creators of Life Is Strange. Set in the fictional Michigan town of Velvet Cove, the game follows four friends — Swann, Nora, Autumn and Kat — during the summer of 1995, as well as 27 years later in 2022.

Explore Swann’s world through a nostalgic 90s lens, complete with a camcorder for capturing and reliving memories. The story unfolds across two timelines, delving into themes of friendship and identity, as well as a mysterious secret that tore the group apart. With its immersive storytelling, interactive environments and choice-driven gameplay, Lost Records: Bloom & Rage promises a captivating journey through time, nostalgia and the complexities of lifelong friendships.

Look for the following games available to stream in the cloud this week:

  • Avowed (New release on Steam, Battle.net and Xbox, available on PC Game Pass, Feb. 18)
  • Warhammer 40,000: Rogue Trader (New release on Xbox, available on PC Game Pass, Feb. 20)
  • Lost Records: Bloom & Rage (New release on Steam, Feb. 18)
  • Abiotic Factor (Steam)
  • HUMANITY (Steam)
  • Songs of Silence (Steam)

What are you planning to play this weekend? Let us know on X or in the comments below.

Read More

Into the Omniverse: How OpenUSD and Synthetic Data Are Shaping the Future for Humanoid Robots

Into the Omniverse: How OpenUSD and Synthetic Data Are Shaping the Future for Humanoid Robots

Editor’s note: This post is part of Into the Omniverse, a series focused on how developers, 3D practitioners and enterprises can transform their workflows using the latest advances in OpenUSD and NVIDIA Omniverse.

Humanoid robots are rapidly becoming a reality. Those built on NVIDIA Isaac GR00T are already learning to walk, manipulate objects and otherwise interact with the real world.

Gathering diverse and large datasets to train these sophisticated machines can be time-consuming and costly. Using synthetic data generation (SDG) with physically accurate digital twins, researchers and developers can train and validate their AI models in simulation before deploying them in the real world.

Universal Scene Description, aka OpenUSD, is a powerful framework that makes it easy to build these physically accurate virtual environments. Once 3D environments are built, OpenUSD allows teams to develop detailed, scalable simulations along with lifelike scenarios where robots can practice, learn and improve their skills.

This synthetic data is essential for humanoid robots to learn humanlike behaviors such as walking, grasping objects and navigating complex environments. OpenUSD is enhancing the development of humanoid robots and paving the way for a future where these machines can seamlessly integrate into people’s daily lives.

The NVIDIA Omniverse platform, powered by OpenUSD, provides developers with a way to unify 3D assets from disparate sources such as 3D CAD and digital content creation (DCC) tools. This allows them to build large-scale 3D virtual environments and run complex simulations to train their robots, streamlining the entire process and delivering faster, more cost-effective ways to collaborate and develop physical AI.
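
For readers curious what this unification looks like in practice, here is a minimal sketch using the OpenUSD Python API. The file names are placeholders rather than assets from this post: each source asset is pulled onto a single stage via references, so layout edits live on the aggregating stage while the original files stay untouched.

```python
# A minimal, hypothetical sketch of unifying assets from different tools on one
# USD stage via references; the asset file paths below are placeholders.
from pxr import Usd, UsdGeom, Gf

# Create a new stage that will aggregate assets from disparate sources.
stage = Usd.Stage.CreateNew("warehouse_scene.usda")
world = UsdGeom.Xform.Define(stage, "/World")
stage.SetDefaultPrim(world.GetPrim())

# Reference a robot asset (e.g., converted from a CAD export) ...
robot = UsdGeom.Xform.Define(stage, "/World/Robot")
robot.GetPrim().GetReferences().AddReference("./assets/robot.usd")

# ... and an environment asset authored in a DCC tool.
env = UsdGeom.Xform.Define(stage, "/World/FactoryCell")
env.GetPrim().GetReferences().AddReference("./assets/factory_cell.usd")

# Layout edits stay on this stage, not in the referenced source files
# (assumes the referenced asset uses a compatible transform stack).
UsdGeom.XformCommonAPI(robot.GetPrim()).SetTranslate(Gf.Vec3d(1.0, 0.0, 0.0))

stage.GetRootLayer().Save()
```

Because the composition is non-destructive, the same stage can then be opened in Omniverse applications for simulation without re-exporting the source assets.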

Advancing Robot Training With Synthetic Motion Data

At CES last month, NVIDIA announced the Isaac GR00T Blueprint for synthetic motion generation to help developers generate exponentially larger synthetic motion datasets to train humanoids using imitation learning.

Highlights of the release include:

  • Large-Scale Motion Data Generation: Uses simulation as well as generative AI techniques to generate exponentially larger and more diverse datasets of humanlike movements, speeding up the data collection process.
  • Faster Data Augmentation: NVIDIA Cosmos world foundation models generate photorealistic videos at scale using ground-truth simulation from Omniverse. This equips developers to augment synthetic datasets faster for training physical AI models, reducing the simulation-to-real gap.
  • Simulation-First Training: Instead of relying solely on real-world testing, developers can train robots in virtual environments, making the process faster and more cost-effective.
  • Bridging Virtual to Reality: The combination of real and synthetic data, along with simulation-based training and testing, allows developers to seamlessly transfer the skills robots learn in the virtual world to the real world.

Simulating the Future of Robotics

Humanoid robots are enhancing efficiency, safety and adaptability across industries like manufacturing, warehouse and logistics, and healthcare by automating complex tasks and improving safety conditions for human workers.

Major robotics companies including Boston Dynamics and Figure have already started adopting and demonstrating results with Isaac GR00T.

Get Plugged Into the World of OpenUSD

Learn more about OpenUSD, humanoid robots and the latest AI advancements at NVIDIA GTC, a global AI conference running March 17-21 in San Jose, California.

Don’t miss NVIDIA founder and CEO Jensen Huang’s GTC keynote on Tuesday, March 18 — in person at the SAP Center or online. He’ll share the latest technologies driving the next wave in AI, digital twins, cloud technologies and sustainable computing.

The inaugural GTC Humanoid Developer Day will take place on Wednesday, March 19. Following the sessions, join the Physical AI Developer Meetup to network with developers and researchers at NVIDIA GTC. Discuss the latest breakthroughs in OpenUSD and generative AI-powered simulation and digital twins, as well as innovations in generalist robotics for the next frontier of industries.

Learn how to use USD and continue to optimize 3D workflows with the new self-paced “Learn OpenUSD” curriculum for 3D developers and practitioners, available for free through the NVIDIA Deep Learning Institute. For more resources on OpenUSD, explore the Alliance for OpenUSD forum and the AOUSD website.

Stay up to date by subscribing to NVIDIA news, joining the community and following NVIDIA Omniverse on Instagram, LinkedIn, Medium and X.

Featured image courtesy of Fourier.

Read More